[comp.arch] Loop unrolling and icache

rcg@lpi.liant.com (Rick Gorton) (11/02/90)

In article <42415@mips.mips.COM> cprice@mips.mips.COM (Charlie Price) writes:
>
>There are overhead costs for loop unrolling due to I-cache
>considerations and this *can* make things slower -- or at least
>change the ideas of good limits.
>
>Consider the following:
>
>1)  It costs some overhead to get instructions into cache.
>2)  A more-unrolled loop, being bigger, displaces more instructions
>    from the I-cache than a less-unrolled loop.
>3)  Additional "real" memory traffic is nearly always bad
>    in some limit.  It makes easy-to-program machines like
>    bus-based shared-memory machines poop out at a lower number
>    of processors.
>
>Charlie Price    cprice@mips.mips.com        (408) 720-1700
>MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

Loop unrolling decisions also have to take into account the pipelining
characteristics of the architecture and its calling conventions.
Consider the following FORTRAN subroutine:
	SUBROUTINE foo(N, X, Y)
	IMPLICIT REAL*8(A-H,O-Z),INTEGER*4(I-N)
	DIMENSION X(N), Y(N)
	DO 10 I=1,N
		Y(I) = X(I)*N
10	CONTINUE
	RETURN
	END

The main body of the code would look something like this on an 88K:
	ld	r10, r2, 0	; r10 is N
	flt.ds	d6, r10		; r6/r7 is real*8 (N)
	or	r9, r0, 0	; r9 is I-1 (Arrays start at 1)
LOOP:	mak	r8, r9,0<3>	; r8 = r9*8 = byte index into REAL*8 array
	ldd	d14, r3, r8	; r14/r15 is X(I)
	fmul.ddd d14, d14, d6	; X(I)*N
	std	d14, r4, r8	; Y(I) assigned
	add	r9, r9, 1
	cmp	r5, r9, r10
	bb1	lt, r5, LOOP	; While (I-1) < N continue loop

The processor stalls a lot here: ldd, fmul.ddd and std each take
multiple cycles, and each depends on the result of the one before it.
However, if the loop is unrolled and then scheduled, several
fmul.ddd's can be in flight at once, as can several std's
(ldd's are trickier, but can be scheduled as well).
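
To make the unrolling concrete, here is a rough source-level sketch of
the same loop unrolled by 4.  The factor of 4, the names FOO4, DN and M,
and the little driver at the end are my own choices for illustration;
a compiler would do this on its intermediate code, not on the source,
and the best factor depends on the machine.

```fortran
      SUBROUTINE FOO4(N, X, Y)
C     Hypothetical unrolled-by-4 version of foo; same results.
      IMPLICIT REAL*8(A-H,O-Z),INTEGER*4(I-N)
      DIMENSION X(N), Y(N)
C     Convert N to REAL*8 once, outside the loop.
      DN = N
C     M is the largest multiple of 4 that is <= N.
      M = N - MOD(N, 4)
C     Four multiplies per trip.  None of them uses a result of
C     another, so a scheduler is free to overlap them.
      DO 10 I = 1, M, 4
         Y(I)   = X(I)   * DN
         Y(I+1) = X(I+1) * DN
         Y(I+2) = X(I+2) * DN
         Y(I+3) = X(I+3) * DN
   10 CONTINUE
C     Cleanup loop for the 0 to 3 leftover elements.
      DO 20 I = M + 1, N
         Y(I) = X(I) * DN
   20 CONTINUE
      RETURN
      END

C     Small driver, only so the sketch can be compiled and run.
      PROGRAM DRIVER
      REAL*8 X(10), Y(10)
      DO 30 I = 1, 10
         X(I) = I
   30 CONTINUE
      CALL FOO4(10, X, Y)
      PRINT *, Y(1), Y(10)
      END
```

The unrolled body also pays the add/cmp/bb1 loop overhead once per
four elements instead of once per element, at the price of a larger
loop body and the extra cleanup loop.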

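In this situation, the cycles saved by removing stalls have to be
weighed against the cycles lost to any additional I-cache misses.
And if the statement had been Y(I) = X(I)/N, unrolling would be even
more advantageous, since a divide takes longer than a multiply and
leaves more latency to hide.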

-- 
Richard Gorton               rcg@lpi.liant.com  (508) 626-0006
Language Processors, Inc.    Framingham, MA 01760
Hey!  This is MY opinion.  Opinions have little to do with corporate policy.