[comp.lang.c] Hand optimized copying

g-rh@XAIT.Xerox.COM (Richard Harter) (10/08/88)

Several recent threads inspired me to take a close look at optimized
copying.  I tested a number of variations, doing both timing runs and
inspection of the assembly language code produced.  All tests were run
on a Sun 3; results may be different on other machines.  Here are my
observations and results:

Optimization:  It is critical to have the optimizer turned on; the
code generated for loop control is much tighter.

Count loops:  The following for a count loop (a loop which does something
n times) is optimal:

	for (i = n+1; --i > 0;) {loop body}

The >0 test does not cost anything extra.  Alternative forms for the loop
produce more expensive code.

Switches and Duff's device:  All variants using a switch were inferior.
The problem with using a switch statement is that the initial code to
check the range is a (relatively) expensive overhead.  If one does use
Duff's device the do-while loop conditional should be --count rather than
count--.

Functions versus macros:  A function call adds a fixed overhead that dominates
execution times for short copies.  

Loop unrolling:  There is a gain of about 20-25% to be gained by loop
unrolling.  The best size appears to be 8; gains by using a larger factor
are nominal, if any.

Memcpy:  Memcpy appears to use word copying if the source and destination
fall on word boundaries and a tight byte copy otherwise.  A tightly coded
copy macro beats it if the source and destination are not on word boundaries.
If you wish to avoid memcpy for portability reasons, it is profitable to use
separate char, int, short, and long copy macros.

Subscripts versus pointers:  Pointers are superior.

Best code:  The following is a macro that appears to be optimal:

#define copy(dest,src,nb) {register char *a, *b; register int i;\
a = dest; b = src;\
for (i = (nb %7)+1; --i;) *a++ = *b++;\
for (i = (nb>>3)+1; --i>0;) {\
*a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\
*a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\
}}

Disclaimer 1:  This code is not guaranteed to be free of typos.
Disclaimer 2:  Results are for the Sun 3.
Disclaimer 3:  There may tricks that I haven't thought of.
-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.

renglish@hpisod1.HP.COM (Robert English) (10/18/88)

/ g-rh@XAIT.Xerox.COM (Richard Harter) /  2:38 pm  Oct  7, 1988 /

> #define copy(dest,src,nb) {register char *a, *b; register int i;\
> a = dest; b = src;\
> for (i = (nb %7)+1; --i;) *a++ = *b++;\
> for (i = (nb>>3)+1; --i>0;) {\
> *a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\
> *a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\
> }}

Methinks that line 3 should read:

for (i = (nb & 07) + 1; --i;)...

( or " % 8" )

--bob--

g-rh@XAIT.Xerox.COM (Richard Harter) (10/18/88)

In article <10130002@hpisod1.HP.COM> renglish@hpisod1.HP.COM (Robert English) writes:
]/ g-rh@XAIT.Xerox.COM (Richard Harter) /  2:38 pm  Oct  7, 1988 /

]> #define copy(dest,src,nb) {register char *a, *b; register int i;\
]> a = dest; b = src;\
]> for (i = (nb %7)+1; --i;) *a++ = *b++;\
]> for (i = (nb>>3)+1; --i>0;) {\
]> *a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\
]> *a++ = *b++; *a++ = *b++; *a++ = *b++; *a++ = *b++;\
]> }}

]Methinks that line 3 should read:
]for (i = (nb & 07) + 1; --i;)...
]( or " % 8" )

You are correct -- it should read &7 rather than %7.  %8 is slightly
more expensive.  The "%" is a typo.  Also as noted previously, the 
variable a, b, and i should by replaced names specific to the macro,
and dest, src, and nb should be in parenthesis.

-- 

In the fields of Hell where the grass grows high
Are the graves of dreams allowed to die.
	Richard Harter, SMDS  Inc.