joec@u1100a.UUCP (Joe Carfagno) (04/04/85)
>>>>
Having noticed a discussion of the benefit of loop unrolling on
string copy (and other functions), I thought I'd share a similar
experience here as it gave us BIG gains.
The Sperry 1100 mainframe, on which a version of the UNIX(tm) system
has been running since 1979, is a WORD ADDRESSABLE machine (and the
words are 36 bit 1's complement). Needless to say, implementing a
C compiler is somewhat interesting, especially in the area of char
pointer dereferencing. At run time, the 20-bit pseudo-byte pointer
is split into its word and "byte" components, and then the proper
partial word is loaded from memory. This multi-instruction sequence
is much more expensive than on your usual machine.
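
To make the split concrete, here is a rough C model of that
dereference. It is only a sketch: the pointer layout (low 2 bits
select the quarter word, the rest is the word address), the choice
of the top 9 bits as the "1st" quarter, and the names word36,
load_char and store_char are all assumptions for illustration, not
the actual compiler's scheme.

```c
#include <stdint.h>

/* Hypothetical model: one 36-bit word kept in the low bits of a uint64_t. */
typedef uint64_t word36;

/* Assumed pseudo-byte pointer layout (not taken from the post):
   bits 1..0 = quarter-word index, remaining bits = word address. */
static unsigned load_char(const word36 *mem, uint32_t bp)
{
    uint32_t word_addr = bp >> 2;            /* word component            */
    unsigned quarter   = bp & 3u;            /* "byte" component          */
    unsigned shift     = 27u - 9u * quarter; /* quarter 0 = top 9 bits    */
    return (unsigned)((mem[word_addr] >> shift) & 0x1FFu);
}

static void store_char(word36 *mem, uint32_t bp, unsigned c)
{
    uint32_t word_addr = bp >> 2;
    unsigned shift     = 27u - 9u * (bp & 3u);
    word36   mask      = (word36)0x1FFu << shift;
    /* Clear the target quarter, then merge in the new 9-bit value. */
    mem[word_addr] = (mem[word_addr] & ~mask) | ((word36)(c & 0x1FFu) << shift);
}
```

Every single *cp pays for the split, the shift computation and the
mask, which is the per-character overhead described above.
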
Enter loop unrolling. Our large project (over 1 million lines of C
code) was profiled and found to use lots of time in the str*()
functions. Since the str*() functions process their arguments
sequentially (char 0, then 1, ..., then n), you can determine the
starting partial word (1st 9 bits, 2nd, 3rd, or 4th) once and then
predict which 9 bits you will need next (2nd, 3rd, 4th, or 1st from
the next word). For strcpy, you create a 4 by 4 table of
entry points and away you go.
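
One way to picture the 4 by 4 table in C is sketched below. I am
assuming the two dimensions are the starting quarter of the source
and of the destination; the sketch gets its 16 entry points from 4
unrolled loops (one per source/destination offset) with 4 case
labels each. The word36 model, the pointer layout and every name
here (SHIFT, STEP, COPY_LOOP, qw_strcpy) are hypothetical, and the
real thing was presumably done in assembler, not with a C switch.

```c
#include <stdint.h>

typedef uint64_t word36;            /* hypothetical: 36-bit word in low bits */
#define QMASK 0x1FFu
#define SHIFT(q) (27u - 9u * (q))   /* assumed: quarter 0 is the top 9 bits  */

/* One copy step. sq and dq are compile-time constants, so the shifts
   are fixed and nothing is decoded per character. */
#define STEP(sq, dq)                                                  \
    c = (*s >> SHIFT(sq)) & QMASK;                                    \
    *d = (*d & ~((word36)QMASK << SHIFT(dq))) | (c << SHIFT(dq));     \
    if (c == 0) return;                                               \
    if ((sq) == 3) s++;                                               \
    if ((dq) == 3) d++;

/* An unrolled loop for one fixed offset k = (dst quarter - src quarter)
   mod 4. The switch jumps into the loop body at the starting source
   quarter; execution then falls through case to case and wraps around. */
#define COPY_LOOP(name, k)                                            \
    static void name(word36 *d, const word36 *s, unsigned sq)         \
    {                                                                 \
        word36 c;                                                     \
        switch (sq) {                                                 \
            for (;;) {                                                \
        case 0: STEP(0, ((k) + 0) & 3)                                \
        case 1: STEP(1, ((k) + 1) & 3)                                \
        case 2: STEP(2, ((k) + 2) & 3)                                \
        case 3: STEP(3, ((k) + 3) & 3)                                \
            }                                                         \
        }                                                             \
    }

COPY_LOOP(copy_k0, 0)
COPY_LOOP(copy_k1, 1)
COPY_LOOP(copy_k2, 2)
COPY_LOOP(copy_k3, 3)

static void (*const copy_loops[4])(word36 *, const word36 *, unsigned) = {
    copy_k0, copy_k1, copy_k2, copy_k3
};

/* dst and src are pseudo-byte pointers (word address << 2 | quarter). */
void qw_strcpy(word36 *mem, uint32_t dst, uint32_t src)
{
    unsigned sq = src & 3u, dq = dst & 3u;   /* split the pointers once */
    copy_loops[(dq - sq) & 3u](mem + (dst >> 2), mem + (src >> 2), sq);
}
```

The pointer split happens once in qw_strcpy; after that each
character costs a fixed-offset load, a fixed-offset store and a
test for the terminating zero.
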
Moral of the story - this technique cut the cpu cost of the str*()
functions by 90% (they had been quite expensive), and they were
never seen again on our cpu profiles. Loop unrolling will also work
on other, normal machines, since you process *cp, *(cp+1), *(cp+2),
etc. in sequence. The cost is a few extra words of memory, because
you're duplicating the load/store sequence with different offsets
from your original cp pointer, which you put in a register
beforehand.
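
For a byte-addressable machine, the same idea might look like the
sketch below. The name unrolled_strcpy and the unroll factor of 4
are arbitrary choices for illustration, and I make no claim about
how much it buys on any particular compiler or cpu.

```c
#include <stddef.h>

/* strcpy unrolled by 4: the load/store pair is repeated with fixed
   offsets from the base pointers, which are bumped once per pass. */
char *unrolled_strcpy(char *dst, const char *src)
{
    char *d = dst;
    const char *s = src;

    for (;;) {
        if ((d[0] = s[0]) == '\0') break;
        if ((d[1] = s[1]) == '\0') break;
        if ((d[2] = s[2]) == '\0') break;
        if ((d[3] = s[3]) == '\0') break;
        d += 4;     /* one pointer update for every four characters */
        s += 4;
    }
    return dst;
}
```
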