[gnu.gcc] More technical nonsense

tiemann@YAHI.STANFORD.EDU (Michael Tiemann) (06/18/89)

   action of the scheduling algorithms.  Tiemann's first cut would work well
   on a Cyber 205 which has really long pipeline delays and lots of registers.
   Tiemann's last cut clearly works well on the SPARC and will probably work
   well on the MIPS and other chips with short pipes.  The pipeline delays
   of the 88K are longer, for instance three clocks for a load and 6 for
   floating point, and the first cut might have different performance results
   for this chip.

To those who have not read the paper, my "first cut" was to schedule
for maximum parallelism.  I had to revise this because code with
unrolled loops would quickly run out of registers, even on the SPARC.
My second cut used a heuristic that proved effective at limiting
register lifetimes, at some cost in potential parallelism (cutting
down from 8-way to 2-way).
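
Roughly, the tradeoff looks like this (an illustrative sketch only,
not code from the paper; the names `a', `t0'..`t7', and `s' are just
for the example):

	/* first cut: expose maximum parallelism by issuing all the
	   loads up front--eight values live at once, eight registers
	   tied up until the adds drain them */
	t0 = a[0]; t1 = a[1]; t2 = a[2]; t3 = a[3];
	t4 = a[4]; t5 = a[5]; t6 = a[6]; t7 = a[7];
	s  = t0 + t1 + t2 + t3 + t4 + t5 + t6 + t7;

	/* second cut: keep each value's lifetime short by pairing
	   loads with their uses, at the cost of less overlap */
	s  = a[0] + a[1];
	s += a[2] + a[3];
	s += a[4] + a[5];
	s += a[6] + a[7];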

Actually, I would not have run out of registers had GCC been smarter
about induction variables.  In code that looks like this,

	a[i][j] = ...;
	a[i][j+1] = ...;
	a[i][j+2] = ...;
	a[i][j+3] = ...;

not atypical for an unrolled loop, GCC tries to create 4 induction
variables, one each for `i*size + j', `i*size+j+1', `i*size+j+2', and
`i*size+j+3'.  The loop optimization code rejects these as not worth
introducing, but it still goes ahead and computes each of these
addresses longhand, when by introducing one induction variable,
`i*size+j', and using it to compute the +1, +2, and +3 addresses,
really tight, well-scheduled code could be generated.  Any loop.c
hackers want to try to fix this?  If so, I can send you my matrix
multiply inner loop, with details about its lossages.
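
What I would like to see generated is, roughly, something like the
following (a sketch only; it assumes `a' holds doubles, and the names
`p' and `v0'..`v3' are just for illustration):

	/* one induction variable for the whole group of stores */
	double *p = &a[i][j];	/* i*size + j, computed once        */
	p[0] = v0;		/* +1, +2, +3 are constant offsets  */
	p[1] = v1;		/*   from the same base, so the     */
	p[2] = v2;		/*   addressing is cheap and the    */
	p[3] = v3;		/*   stores schedule tightly        */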

Speaking of inner loops...now that there is an instruction scheduler,
it would be very nice if GNU C provided some loop unrolling.  I see
this as a great opportunity for some enterprising hacker...
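
For example, something along these lines (purely illustrative; the
unroll factor of 4 and the assumption that `n' is a multiple of 4 are
mine):

	/* original loop */
	for (i = 0; i < n; i++)
		c[i] = a[i] * b[i];

	/* unrolled by 4, giving the scheduler four independent
	   multiplies to interleave within each iteration */
	for (i = 0; i < n; i += 4) {
		c[i]   = a[i]   * b[i];
		c[i+1] = a[i+1] * b[i+1];
		c[i+2] = a[i+2] * b[i+2];
		c[i+3] = a[i+3] * b[i+3];
	}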

Here's my two cents about RISC architectures: you can't use what you
don't have.

The SPARC has a fair number of floating point registers (32 single
precision, or 16 double precision), and a longish fp multiply latency,
say 4 cycles.  This means one can have 4 fp multiplies executing
concurrently, hence the need for 4*3 = 12 fp registers (three per
multiply: two operands and a result).  Thus, the
SPARC can make good use of a pipelined floating point processor, a
crucial assumption of the scheduler.

The 88k has about 28 available registers, which are shared between
integer and floating point values.  With a latency of 6 cycles (and
10 for double precision), this could tie up 6*3 = 18 fp registers, or
36 registers in double precision, since each double occupies a
register pair--more than the 88k has on board!
Hence, the 88k is too deeply pipelined, and my scheduler may generate
very bad schedules by introducing many register spills.

Michael