tiemann@YAHI.STANFORD.EDU (Michael Tiemann) (06/18/89)
> action of the scheduling algorithms.  Tiemann's first cut would
> work well on a Cyber 205, which has really long pipeline delays
> and lots of registers.  Tiemann's last cut clearly works well on
> the SPARC and will probably work well on the MIPS and other chips
> with short pipes.  The pipeline delays of the 88K are longer, for
> instance three clocks for a load and six for floating point, and
> the first cut might have different performance results for this
> chip.

To those who have not read the paper, my "first cut" was to schedule
for maximum parallelism.  I had to revise this because code with
unrolled loops would quickly run out of registers, even on the SPARC.
My second cut used a heuristic which proved effective at limiting
register lifetimes, at some expense of potential parallelism (cutting
down from 8-way to 2-way).

Actually, I would not have run out of registers had GCC been smarter
about induction variables.  In code that looks like this,

	a[i][j]   = ...;
	a[i][j+1] = ...;
	a[i][j+2] = ...;
	a[i][j+3] = ...;

not atypical for an unrolled loop, GCC tries to create 4 induction
variables, one each for `i*size+j', `i*size+j+1', `i*size+j+2', and
`i*size+j+3'.  The loop optimization code rejects these as not worth
introducing, but it still goes along and computes each of these
addresses longhand, when by introducing one induction variable,
`i*size+j', and using it to compute the +1, +2, and +3 addresses,
really tight, well-scheduled code could be generated.  (A sketch of
the transformation I have in mind appears after my signature.)  Any
loop.c hackers want to try to fix this?  If so, I can send you my
matrix multiply inner loop, with details about its lossages.

Speaking of inner loops... now that there is an instruction
scheduler, it would be very nice if GNU C provided some loop
unrolling.  I see this as a great opportunity for some enterprising
hacker (a hand-unrolled example also appears below)...

Here's my two cents about RISC architectures: you can't use what you
don't have.  The SPARC has a fair number of floating point registers
(32 single precision, or 16 double precision) and a longish fp
multiply latency, say 4 cycles.  This means one can have 4 fp
multiplies executing concurrently, hence the need for 4*3 = 12 fp
registers (each multiply holds three registers live: two sources and
a destination).  Thus, the SPARC can make good use of a pipelined
floating point processor, a crucial assumption of the scheduler.  The
88k has about 28 available registers, which are shared between
integer and floating point values.  With a latency of 6 cycles (and
10 for double precision), this could tie up 6*3 = 18 fp registers, or
36 registers in double precision--more than the 88k has on board!
Hence, the 88k is too deeply pipelined, and my scheduler may generate
very bad schedules by introducing many register spills.

Michael
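P.S.  Here is roughly the induction-variable transformation I am
asking for, written out at the source level just to show the idea.
In the compiler it would of course happen on RTL inside loop.c; the
names N, store4, and x0..x3 are made up for illustration.

	#define N 64

	/* The four stores of the unrolled loop body, written with
	   one induction variable instead of four longhand address
	   computations: form i*N + j once, then reach the other
	   three elements at small constant offsets from it.
	   Assumes j+3 < N.  */
	void
	store4 (double a[N][N], int i, int j,
	        double x0, double x1, double x2, double x3)
	{
	  double *p = &a[i][j];	/* the one induction variable */

	  p[0] = x0;		/* a[i][j]   */
	  p[1] = x1;		/* a[i][j+1] */
	  p[2] = x2;		/* a[i][j+2] */
	  p[3] = x3;		/* a[i][j+3] */
	}

The four stores then need one add for the base address instead of
four multiplies and four adds, and all four become independent work
for the scheduler.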
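P.P.S.  For any enterprising hacker eyeing the unrolling project,
this is the flavor of the transformation, done by hand on a dot
product.  Again just a source-level sketch with made-up names; note
that the four partial sums reassociate the addition, which can change
fp rounding.

	/* Dot product unrolled by 4, with a cleanup loop for the
	   leftover iterations.  The four independent multiplies
	   give the scheduler something to overlap in the fp pipe. */
	double
	dot4 (const double *x, const double *y, int n)
	{
	  double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
	  int i;

	  for (i = 0; i + 3 < n; i += 4)
	    {
	      s0 += x[i]   * y[i];
	      s1 += x[i+1] * y[i+1];
	      s2 += x[i+2] * y[i+2];
	      s3 += x[i+3] * y[i+3];
	    }
	  for (; i < n; i++)	/* cleanup */
	    s0 += x[i] * y[i];

	  return (s0 + s1) + (s2 + s3);
	}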
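The register-pressure arithmetic above is just this rule of thumb (my
framing of it, anyway): to keep a pipelined unit saturated you issue
one operation per cycle, each stays in flight for `latency' cycles,
and each holds about three registers live (two sources, one result).
So:

	registers needed ~= latency * 3
	                    (* 2 for double precision values on a
	                       32-bit register file)

	SPARC fp multiply:  4 * 3  = 12   fits in 16 dp registers
	88k single:         6 * 3  = 18
	88k double:         18 * 2 = 36   more than the ~28 available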