[comp.lang.c] Loop unfolding

rik@june.cs.washington.edu (Rik Littlefield) (09/01/88)

All of the examples of loop unfolding recently discussed on the net have
implemented just copying.  Several authors have suggested improving on
the loop unfolding method by using a (pseudo-) standard routine like 
'memcpy', or by declaring large structures that the compiler can generate
good code for moving.

I have seen at least three cases where loop unfolding was very productive
but neither of the above suggestions seems to apply.  All were in time-
critical production applications.

   1. For an ultrasonic inspection program, the inner loop contained a
      summation along the lines of

         s += *p++;

   2. In an image processing program, the inner loop was an indexed move
      like

         *p++ = *(*q++);

   3. A driver for a memory-mapped I/O device used multiple stores
      into a single address:

         *q = *p++;

As I said, unrolling was a very effective way of removing virtually
all the overhead from these loops.  Can anyone suggest other solutions
analogous to the alternatives mentioned above, or for that matter,
any better solution other than assembly language?

--Rik

pardo@june.cs.washington.edu (David Keppel) (09/01/88)

rik@june.cs.washington.edu (Rik Littlefield) writes:
>[ code not handled by library routines and helped by unrolling]
>[ loop unrolling, structure copy, ... how else? ]

Perhaps I'm dreaming, but we've got all the tools if we want:
* Strong compiler technology.
* Programmer-supplied compiler hints.

Somebody (Chris Torek?) picked up the idea of making assertions on the
code to give ``noalias'' a run for the money.  This same kind of
assertion-driven technology is available to solve data bound problems,
and pragmas can give the compiler a hint that a particular section of
code needs to be worked on.

Those hints and assertions firmly in hand, the code can be passed off
to an optimizing compiler.  Loop unrolling, code hoisting, ... are all
well-understood technologies.  A restricted "best fit" technique may
be used to try *several different* optimizations, generate code for
each of them, count the cost of each generated fragment, then select
the one that does (appears to do) the best job.

That's what compilers are for, isn't it?

    ;-D on  ( My technology is bigger than your technology )  Pardo
-- 
		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

earl@mips.COM (Earl Killian) (09/02/88)

In article <5612@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
   ...
   As I said, unrolling was a very effective way of removing virtually
   all the overhead from these loops.  Can anyone suggest other solutions
   analogous to the alternatives mentioned above, or for that matter,
   any better solution other than assembly language?

I think unrolling is the right answer, but I would suggest just
writing straight-forward C and using a good compiler instead of doing
it in the source.  I did that for your examples.  Here are the
inner loops:

  for (i = 0; i < n; i += 1) s += p[i];
	lw	t2,0(a1)
	lw	t3,4(a1)
	lw	t4,8(a1)
	addu	v1,v1,t2
	lw	t5,12(a1)
	addu	v1,v1,t3
	addiu	a1,a1,16
	addu	v1,v1,t4
	bne	a1,a2,0x68
	 addu	v1,v1,t5

  for (i = 0; i < n; i += 1) p[i] = *q[i];
	lw	t7,0(a2)
	lw	t9,4(a2)
	lw	t8,0(t7)
	lw	t0,8(a2)
	sw	t8,0(a1)
	lw	t1,0(t9)
	lw	t3,12(a2)
	sw	t1,4(a1)
	lw	t2,0(t0)
	addiu	a2,a2,16
	sw	t2,8(a1)
	lw	t4,0(t3)
	addiu	a1,a1,16
	bne	a2,a3,0x118
	 sw	 t4,-4(a1)

  volatile int *r;
  for (i = 0; i < n; i += 1) *r = p[i];
	lw	t0,0(a1)
	addiu	a1,a1,16
	sw	t0,0(a2)
	lw	t2,-12(a1)
	nop
	sw	t2,0(a2)
	lw	t3,-8(a1)
	nop
	sw	t3,0(a2)
	lw	t4,-4(a1)
	bne	a1,a3,0x1c4
	 sw	t4,0(a2)

For the last an assembly language coder could do better because he can
know that r and p aren't aliased and get rid of the nops.  For the
others, the compiler's code is hard to beat.
-- 
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086

sl@van-bc.UUCP (pri=-10 Stuart Lynne) (09/02/88)

In article <5612@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes:
>All of the examples of loop unfolding recently discussed on the net have
>implemented just copying.  Several authors have suggested improving on
>the loop unfolding method by using a (pseudo-) standard routine like 
>'memcpy', or by declaring large structures that the compiler can generate
>good code for moving.
>
>I have seen at least three cases where loop unfolding was very productive
>but neither of the above suggestions seems to apply.  All were in time-
>critical production applications.

In the original Macintosh SCSI driver I was able to go from 3:1 interleave
on the SCSI disks to 2:1 simply by replacing Apple's SCSI manager with a
block copy with an unrolled loop, similiar to the Duff Device (only in 68000
asm). 

A recent example from a device driver. By unrolling the loop to check which
device had produced the interrupt we achieved the following:

  - could use absolute memory addresses instead of a variable
    - saved time not having to save a register to store the address
    - saved time not having to load the address into the var
  - no loop control variables required
    - no need to allocate a register, init or test it
  - loop overhead eliminated

The compiled code actually turned out to be almost the same size as the
orignal looped implementation, but significantly faster.


int (*cdsvint[4])() = { XXXXINT, XXXMINT, XXXFRINT, XXXEINT };
cdsintr( achan )
{
  register int intvec;   
  intvec = RMSPC( SUNBASE2+4, RWR2 );	/* required, grab intvec here	*/
  if ( (RMSPC( SUNBASE2, RWR0 ) & INTPEND ) ) { /* check pending 	*/
    (*cdsvint[(intvec >> 2)&0x3])( intvec&CHANA?2:3 );
    WMSPC( SUNBASE2, RWR0, EOIRST );	/* End of Interrupt Service	*/
    return( 1 );
  }
  intvec = RMSPC( SUNBASE+4, RWR2 );	/* required, grab intvec here	*/
  if ( (RMSPC( SUNBASE, RWR0 ) & INTPEND ) ) { /* check pending 	*/
    (*cdsvint[(intvec >> 2)&0x3])( intvec&CHANA?0:1 );
    WMSPC( SUNBASE, RWR0, EOIRST );	/* End of Interrupt Service	*/
    return( 1 );
  } 
  return( 0 );
}

-- 
Stuart.Lynne@wimsey.bc.ca {ubc-cs,uunet}!van-bc!sl     Vancouver,BC,604-937-7532

samperi@marob.MASA.COM (Dominick Samperi) (09/07/88)

In article <3010@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
|>   As I said, unrolling was a very effective way of removing virtually
|>   all the overhead from these loops.  Can anyone suggest other solutions

Sorry for jumping into this discussion rather late, but could someone
please give a brief definition of "loop unrolling."
-- 
Dominick Samperi, NYC
    samperi@acf8.NYU.EDU	samperi@marob.MASA.COM
    cmcl2!phri!marob        	uunet!hombre!samperi
      (^ ell)