rik@june.cs.washington.edu (Rik Littlefield) (09/01/88)
All of the examples of loop unfolding recently discussed on the net have implemented just copying. Several authors have suggested improving on the loop unfolding method by using a (pseudo-) standard routine like 'memcpy', or by declaring large structures that the compiler can generate good code for moving. I have seen at least three cases where loop unfolding was very productive but neither of the above suggestions seems to apply. All were in time- critical production applications. 1. For an ultrasonic inspection program, the inner loop contained a summation along the lines of s += *p++; 2. In an image processing program, the inner loop was an indexed move like *p++ = *(*q++); 3. A driver for a memory-mapped I/O device used multiple stores into a single address: *q = *p++; As I said, unrolling was a very effective way of removing virtually all the overhead from these loops. Can anyone suggest other solutions analogous to the alternatives mentioned above, or for that matter, any better solution other than assembly language? --Rik
pardo@june.cs.washington.edu (David Keppel) (09/01/88)
rik@june.cs.washington.edu (Rik Littlefield) writes: >[ code not handled by library routines and helped by unrolling] >[ loop unrolling, structure copy, ... how else? ] Perhaps I'm dreaming, but we've got all the tools if we want: * Strong compiler technology. * Programmer-supplied compiler hints. Somebody (Chris Torek?) picked up the idea of making assertions on the code to give ``noalias'' a run for the money. This same kind of assertion-driven technology is available to solve data bound problems, and pragmas can give the compiler a hint that a particular section of code needs to be worked on. Those hints and assertions firmly in hand, the code can be passed off to an optimizing compiler. Loop unrolling, code hoisting, ... are all well-understood technologies. A restricted "best fit" technique may be used to try *several different* optimizations, generate code for each of them, count the cost of each generated fragment, then select the one that does (appears to do) the best job. That's what compilers are for, isn't it? ;-D on ( My technology is bigger than your technology ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
earl@mips.COM (Earl Killian) (09/02/88)
In article <5612@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
...
As I said, unrolling was a very effective way of removing virtually
all the overhead from these loops. Can anyone suggest other solutions
analogous to the alternatives mentioned above, or for that matter,
any better solution other than assembly language?
I think unrolling is the right answer, but I would suggest just
writing straight-forward C and using a good compiler instead of doing
it in the source. I did that for your examples. Here are the
inner loops:
for (i = 0; i < n; i += 1) s += p[i];
lw t2,0(a1)
lw t3,4(a1)
lw t4,8(a1)
addu v1,v1,t2
lw t5,12(a1)
addu v1,v1,t3
addiu a1,a1,16
addu v1,v1,t4
bne a1,a2,0x68
addu v1,v1,t5
for (i = 0; i < n; i += 1) p[i] = *q[i];
lw t7,0(a2)
lw t9,4(a2)
lw t8,0(t7)
lw t0,8(a2)
sw t8,0(a1)
lw t1,0(t9)
lw t3,12(a2)
sw t1,4(a1)
lw t2,0(t0)
addiu a2,a2,16
sw t2,8(a1)
lw t4,0(t3)
addiu a1,a1,16
bne a2,a3,0x118
sw t4,-4(a1)
volatile int *r;
for (i = 0; i < n; i += 1) *r = p[i];
lw t0,0(a1)
addiu a1,a1,16
sw t0,0(a2)
lw t2,-12(a1)
nop
sw t2,0(a2)
lw t3,-8(a1)
nop
sw t3,0(a2)
lw t4,-4(a1)
bne a1,a3,0x1c4
sw t4,0(a2)
For the last an assembly language coder could do better because he can
know that r and p aren't aliased and get rid of the nops. For the
others, the compiler's code is hard to beat.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
sl@van-bc.UUCP (pri=-10 Stuart Lynne) (09/02/88)
In article <5612@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes: >All of the examples of loop unfolding recently discussed on the net have >implemented just copying. Several authors have suggested improving on >the loop unfolding method by using a (pseudo-) standard routine like >'memcpy', or by declaring large structures that the compiler can generate >good code for moving. > >I have seen at least three cases where loop unfolding was very productive >but neither of the above suggestions seems to apply. All were in time- >critical production applications. In the original Macintosh SCSI driver I was able to go from 3:1 interleave on the SCSI disks to 2:1 simply by replacing Apple's SCSI manager with a block copy with an unrolled loop, similiar to the Duff Device (only in 68000 asm). A recent example from a device driver. By unrolling the loop to check which device had produced the interrupt we achieved the following: - could use absolute memory addresses instead of a variable - saved time not having to save a register to store the address - saved time not having to load the address into the var - no loop control variables required - no need to allocate a register, init or test it - loop overhead eliminated The compiled code actually turned out to be almost the same size as the orignal looped implementation, but significantly faster. int (*cdsvint[4])() = { XXXXINT, XXXMINT, XXXFRINT, XXXEINT }; cdsintr( achan ) { register int intvec; intvec = RMSPC( SUNBASE2+4, RWR2 ); /* required, grab intvec here */ if ( (RMSPC( SUNBASE2, RWR0 ) & INTPEND ) ) { /* check pending */ (*cdsvint[(intvec >> 2)&0x3])( intvec&CHANA?2:3 ); WMSPC( SUNBASE2, RWR0, EOIRST ); /* End of Interrupt Service */ return( 1 ); } intvec = RMSPC( SUNBASE+4, RWR2 ); /* required, grab intvec here */ if ( (RMSPC( SUNBASE, RWR0 ) & INTPEND ) ) { /* check pending */ (*cdsvint[(intvec >> 2)&0x3])( intvec&CHANA?0:1 ); WMSPC( SUNBASE, RWR0, EOIRST ); /* End of Interrupt Service */ return( 1 ); } return( 0 ); } -- Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl Vancouver,BC,604-937-7532
samperi@marob.MASA.COM (Dominick Samperi) (09/07/88)
In article <3010@wright.mips.COM> earl@mips.COM (Earl Killian) writes: |> As I said, unrolling was a very effective way of removing virtually |> all the overhead from these loops. Can anyone suggest other solutions Sorry for jumping into this discussion rather late, but could someone please give a brief definition of "loop unrolling." -- Dominick Samperi, NYC samperi@acf8.NYU.EDU samperi@marob.MASA.COM cmcl2!phri!marob uunet!hombre!samperi (^ ell)