[net.lang.c] Casting Pointers -- fast *portable* copy

tjt@kobold.UUCP (02/07/84)

uiucsb!grunwald described how to use the asm construct to do a fast
non-portable copy for a 68010.  Actually, it is fairly easy to write a
copy routine that is essentially just as fast (within ~10%) and does
*not* depend on the two-word instruction cache for the 68010.

Unroll those loops!  The usual code generated for:

	while (--count >= 0)
		*to++ = *from++;

is (where count, from and to have been declared as register variables):

		jra	L13
	L20001:
		movb	a5@+,a4@+
	L13:
		subql	#1,d7
		jge	L20001

The problem with this code is that 12 ticks are taken by the movb
instruction, 8 ticks for the subql and 10 ticks for the branch so that
less than 50% of the time is spent actually moving data.  Changing the
C code and the compiler to use a dbra instruction instead of the subql
and the branch would take 14 ticks, but still would spend less than
half the time moving data.  Note that this analysis assumes no wait
states and ignores what happens at the end of the loop (i.e. when the
branch falls through).

If we take this loop and "unroll" it thusly:

	while (count >= 8) {
		count -= 8;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
		*to++ = *from++;
	}
	while (--count >= 0)
		*to++ = *from++;

the code becomes:
		jra	L13
	L20001:
		subql	#8,d7
		movb	a5@+,a4@+
		movb	a5@+,a4@+
		movb	a5@+,a4@+
		movb	a5@+,a4@+
		movb	a5@+,a4@+
		movb	a5@+,a4@+
		movb	a5@+,a4@+
		movb	a5@+,a4@+
	L13:
		moveq	#8,d0
		cmpl	d7,d0
		jgt	L15
		jra	L20001
	L20003:
		movb	a5@+,a4@+
	L15:
		subql	#1,d7
		jge	L20003

Note that for the last 7 or fewer bytes moved, we still incur the
50% penalty, but in the unrolled loop, we execute a moveq (4 ticks), a
cmpl (8 ticks), a branch (10 ticks) and a subql (8 ticks) of control
instructions for every 8 data movement instructions, or 30 ticks of
control for 96 ticks of data movement.  On the 68010, the cmpl only
takes 6 ticks.

If the two-instruction loop is used instead ("loop mode") each time
around the loop uses 14 ticks instead of 12 so the overhead for eight
move instructions is 16 ticks.  The ratio of these is:

	(30+96)
	------- = 1.125 or 12.5% slower
	(16+96)

If you also check for properly aligned from and to pointers, and count
a multiple of sizeof(int) (i.e. so that a movl instruction can be used
inside the loop) the ratio becomes

	(30+8*20)
	--------- = 1.0795 or about 8% slower
	 (8*22)

If we unroll the loop 16 times rather than 8 times, the ratio is:

	(30+16*20)
	--------- = .9943
	 (16*22)

This is about dead even, although it may be less likely to occur for
e.g. structure assignments, it is guaranteed to apply for e.g. copying
pages and disk blocks.  The actual break even point comes when the loop
is unrolled 15 times, for either movb or movl. The extra cost in space
is the size of the unrolled loop: 13 words for a loop unrolled 8 times,
or 21 words for one unrolled 16 times.

Of course, "loop mode" is more attractive for inline code, and actually
*is* faster for smaller factors of unrolling.  On the other hand,
the unrolled loop runs as fast on the 68000 as it does on the 68010,
an important consideration since the 68010 is only starting to be
available in production quantities and there are a lot of 68000's out
there.

Note that unrolling is a *general* method for speeding up loops,
regardless of how many instructions are on the inside.  Unrolling
doesn't save as much if there is more to the body of the loop.  It is
also true that a highly optimizing compiler would unroll the loop for
you.
-- 
	Tom Teixeira,  Massachusetts Computer Corporation.  Westford MA
	...!{ihnp4,harpo,decvax}!masscomp!tjt   (617) 692-6200 x275