[comp.lang.c] A study in code optimization in C

rh@smds.UUCP (Richard Harter) (07/26/90)

The macro shown below is an optimized memory to memory copy macro.
It is probably faster than memcopy on your machine -- I have checked
it on several machines and have always found it to be faster.

Some may find it instructive as an example of code optimization.
Techniques used are:  loop unrolling, localized use of register
variables, count down loops, and shift and mask operations in place
of division and modulo.

Some general notes:

There are two major ways to unroll a loop with a variable remainder.
The method used here is to use two loops, one for the remainder and
one for the bulk.  The other method is to jump into a loop inside a
switch (legal but bizarre).  The second method has the advantage of
not needing the "short" loop for the remainder.  However, timing studies
show no real difference between the two methods: the overhead of
setting up the switch is offset by the higher cost of the short loop.
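
For the curious, here is a minimal sketch of the switch-into-loop
method (the construct usually called Duff's device).  It is shown for
illustration only -- it is not part of the macro below -- and is
written as a hypothetical function, copy8_switch, rather than a macro:

    /* Illustration only: jump into an unrolled loop via a switch. */
    void copy8_switch(dest, src, nb)
    register char *dest, *src;
    register int nb;
    {
        register int loops = (nb + 7) >> 3;

        if (nb <= 0)
            return;
        switch (nb & 7) {       /* enter the loop at the remainder */
        case 0: do {    *dest++ = *src++;
        case 7:         *dest++ = *src++;
        case 6:         *dest++ = *src++;
        case 5:         *dest++ = *src++;
        case 4:         *dest++ = *src++;
        case 3:         *dest++ = *src++;
        case 2:         *dest++ = *src++;
        case 1:         *dest++ = *src++;
                } while (--loops > 0);
        }
    }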

Given two loops, one which is unrolled by a constant, and a short loop
handling the remainder, a loop unrolling factor of 8 is optimal under
most circumstances.

A count down loop saves several instructions on a CISC machine,
provided that (a) the loop variable is a register variable, and (b)
the optimization option is selected.  The savings come from three
things: (1) the comparison is against zero rather than against a
constant, (2) CISC machines typically have a branch-on-result-of-
comparison instruction, and (3) optimization plus the register
declaration avoids a store of the loop variable.  With some compilers
there is a further saving from placing the calculation of the limit
in the first clause of the loop when the limit is a complex
expression.  It should be noted that the savings are only
significant in tight loops.

A count down loop should have one of the following two forms:

	for (index=(expression)+1;--index>0;) {body}
	for (index=(expression);--index>=0;) {body}

Plum Hall uses the first form; I use and recommend the second as
being clearer.  The second form will fail (infinite loop) if the
index is unsigned.  However, the savings will be lost in any case
if the index is unsigned.  These forms should be used as is; the
loop optimization is lost if you don't predecrement in the test.
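
As an illustration (mine, not part of the original posting), here is
the same clearing loop written both ways; the count down version uses
the second form recommended above:

    /* Illustration only: count-up versus count-down.  With -O and
       register variables, the second loop usually compiles to a
       decrement-and-branch against zero on CISC machines. */
    void clear_both_ways(a, n)
    register int *a, n;
    {
        register int i;

        for (i = 0; i < n; i++)         /* count-up form   */
            a[i] = 0;

        for (i = n; --i >= 0;)          /* count-down form */
            a[i] = 0;
    }

(The two loops do the same work; they appear together only for
comparison.)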

The macros do not include or exploit an alignment test on the grounds
that (a) the resulting code is not portable, (b) the overhead of the
tests removes the savings in short and medium length copies, (c)
many aligned copies can be caught by passing the proper type, and
(d) I didn't feel like coding it.
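
For completeness, an alignment test of the sort just discussed might
look something like the sketch below.  It is NOT portable (it assumes
that casting a pointer to long preserves the low address bits), the
name aligned_copy is invented for illustration, and it is not part of
the macro:

    /* Sketch only: if both pointers happen to be long-aligned,
       move longs and finish the tail as chars; otherwise move
       chars.  Uses the gencopy macro defined below. */
    void aligned_copy(dest, src, nb)
    char *dest, *src;
    int nb;
    {
        int words, tail;

        if (((long)dest | (long)src) % (long)sizeof(long) == 0) {
            words = nb / (int)sizeof(long);
            tail  = nb % (int)sizeof(long);
            gencopy((long *)dest, (long *)src, words, long);
            dest += nb - tail;
            src  += nb - tail;
            nb    = tail;
        }
        gencopy(dest, src, nb, char);
    }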

The copyright notice is included solely on the grounds that I feel
that if you use somebody else's code you should acknowledge where you
got it from.

------------------------ CUT HERE -----------------------------------
/*
	Copyright (c) 1990, Software Maintenance and Development
	Systems Inc.  Permission to copy, to modify, and to distribute,
	whether for sale or not is granted, provided that a copy of
	this notice is included and is not altered.

	Macro name:  gencopy
	Function:    In memory copy
	Author:	     Richard Harter
	Arguments:
		dest	Pointer to destination
		src	Pointer to source
		nb	Number of items to copy
		type	Type of items
	Synopsis:

	This macro is an optimized, in-line, memory-to-memory copy
	routine.  The following restrictions apply: (a) if dest and
	src overlap, dest must have a lower address than src; (b)
	the item type must be one for which assignment is valid.

	Bugs:

	No check is made for negative length.  Results may be
	surprising.
*/

#define gencopy(dest,src,nb,type) {register type *cpy_a;\
    register type *cpy_b;\
    register int cpy_i,cpy_j;\
    cpy_a=(dest); cpy_b=(src); cpy_j=(nb);\
    /* short loop: the remaining (nb mod 8) items */\
    for (cpy_i=(cpy_j & 7); --cpy_i>=0;) *cpy_a++ = *cpy_b++;\
    /* bulk loop: (nb div 8) passes, unrolled by 8 */\
    for (cpy_i=(cpy_j>>3); --cpy_i>=0;) {\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++;\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++;\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++;\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++; }}

#define copy(dest,src,nb) gencopy(dest,src,nb,char)
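
/* Example usage (illustrative only, not part of the original posting):
 *
 *	char to[100], from[100];
 *	int  ib[50], ia[50];
 *
 *	copy(to, from, 100);		copies 100 chars
 *	gencopy(ib, ia, 50, int);	copies 50 ints
 */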
------------------------ CUT HERE -----------------------------------
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

mcdonald@aries.scs.uiuc.edu (Doug McDonald) (07/26/90)

In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>
>The macro shown below is an optimized memory to memory copy macro.
>It is probably faster than memcopy on your machine -- I have checked
>it on several machines and have always found it to be faster.
                                 !!!!!!
>
Oh My!.

Time on my computer, in seconds, for 1000 copies of a 20 kilobyte array:

                          His code                   library memcpy            
Compiler 1:
               (chars)     12.6                            2.7
               (ints)       6.9                            2.7


Compiler 2:
               (chars)     23.6                            1.3
               (ints)       6.9                            1.3

Results were the same for library memmove, aligned or unaligned.

Sure looks like his code loses. As I would expect, of course:  The
library can do the whole 20 kilobytes in one instruction, assuming
aligned operands. Unaligned, it takes a few more.
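
For anyone who wants to repeat the comparison, a harness along these
lines will do.  This is a sketch only (it uses the ANSI clock()
routine for portability; any decent timer works), and it assumes the
gencopy/copy macros from the original posting have been pasted in
above main():

    #include <stdio.h>
    #include <string.h>
    #include <time.h>

    #define N     20000     /* 20 kilobyte buffers   */
    #define REPS  1000      /* 1000 copies, as above */

    static char from[N], to[N];

    int main()
    {
        clock_t t0, t1;
        int i;

        t0 = clock();
        for (i = 0; i < REPS; i++)
            copy(to, from, N);          /* the posted macro */
        t1 = clock();
        printf("macro:  %.2f sec\n", (t1 - t0) / (double)CLOCKS_PER_SEC);

        t0 = clock();
        for (i = 0; i < REPS; i++)
            memcpy(to, from, N);        /* the library      */
        t1 = clock();
        printf("memcpy: %.2f sec\n", (t1 - t0) / (double)CLOCKS_PER_SEC);
        return 0;
    }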


When people post things like this, perhaps they could test them, at least
on the most common computers. My computer is an example of the next to
the top of the line of the most common processor line(*) the world has ever
known. At least, folks, test on one of these!!

Doug McDonald

(*) It is a 20 MHz Dell 310, 80386. The compilers were Microsoft C and
Microway NDPC.

andrew@alice.UUCP (Andrew Hume) (07/27/90)

i hate to interject experimental results into a computer
science discussion but i implemented mr harter's `always faster'
memcpy just to see. the machine was an SGI 4D/240 running
irix 3.2.1. the timing is for 100000 copies of n bytes
(n uniform between 0 and 999). results:
	memcpy 2.81s
	harter 9.02s

in my experience, memcpy is something vendors REALLY work on;
to suggest that a simple portable implementation would beat
the system's memcpy implies you have only tried this on batshit systems.

joe@proto.COM (Joe Huffman) (07/28/90)

In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
> >
> >The macro shown below is an optimized memory to memory copy macro.
> >It is probably faster than memcopy on your machine -- I have checked
> >it on several machines and have always found it to be faster.
>                                  !!!!!!
> >
> Oh My!.
> 
> Time on my computer, in seconds, for 1000 copies of a 20 kilobyte array:
> 
>                           His code                   library memcpy            
> Compiler 1:
>                (chars)     12.6                            2.7
>                (ints)       6.9                            2.7
> 
> 
> Compiler 2:
>                (chars)     23.6                            1.3
>                (ints)       6.9                            1.3
> 
[Stuff deleted... compilers were Microsoft and Microway NDPC, machine was
20 MHz 386]

I just ran it on a 20 MHz 386 running SCO UNIX.  The timings were done with 
5000 copies but then divided by 5 to make the numbers comparable.


			   His code		       library memcpy
SCO supplied MSC 5.1
		(chars)	    14.0		             2.05

Zortech
	     386 code generator not available		     1.80


The reason the macro is so much slower is (as Doug alluded to, and I
deleted) that the library routines were written in assembly language and
tweaked for near-optimum speed and/or size.  That is not something you usually
should do if you are trying to write portable code.  Hence let the compiler
writers do it -- usually they can be trusted to squeeze the most performance
out of the machine, especially in a competitive marketplace.
---
Zortech is my major source of income.  Statements about them or their 
competitors cannot be totally without bias.
-- 
joe@proto.com
uunet!proto!joe
FAX: 208-263-8772

rh@smds.UUCP (Richard Harter) (07/28/90)

In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:

> >The macro shown below is an optimized memory to memory copy macro.
> >It is probably faster than memcopy on your machine -- I have checked
> >it on several machines and have always found it to be faster.
>                                  !!!!!!

> Oh My!.

	[Superior timings for a 20KB move on a 386 by memmove given]

Ouch.  I should have phrased that more carefully.  Yes, the gentleman
is quite right.  As is to be expected, a hardware bulk move is always
going to beat code that uses item-by-item move instructions.  Furthermore,
there is nothing to stop a system implementor from using the tightest
code possible.  (My observation is that the quality of implementations
varies a great deal.)

In defense I have to point out that the quoted remark is accurate;
timings were made on 680x0 boxes, vaxes (!), and some risc boxes,
but not, obviously, on any 386 boxes.  Small comfort for the chap
who took my posting blindly and put it on his 386.  Apologies.

It should also be pointed out that the cited code is a template
for tight loops and does win when there isn't a hardware instruction
available.

Does it matter?  Yes, if you are coding for both speed and portability.
In this particular case, however, the gencopy macro should include a
machine-dependent #ifdef which switches to memcpy (or memmove) on the
appropriate machines.
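
A sketch of what I mean (HAVE_FAST_MEMCPY is a made-up symbol; a
port would define it wherever the library routine is known to win):

    #ifdef HAVE_FAST_MEMCPY
    #include <string.h>
    #define copy(dest,src,nb)   ((void)memcpy((dest),(src),(nb)))
    #else
    #define copy(dest,src,nb)   gencopy(dest,src,nb,char)
    #endif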

Thanks to Doug for catching this and posting.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

bruce@seismo.gps.caltech.edu (Bruce Worden) (07/29/90)

In article <1349@proto.COM> joe@proto.COM (Joe Huffman) writes:
>In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
>> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>> >
>> >The macro shown below is an optimized memory to memory copy macro.
>> >It is probably faster than memcopy on your machine -- I have checked
>> >it on several machines and have always found it to be faster.
>>                                  !!!!!!
>> Oh My!.
>> Time on my computer, in seconds, for 1000 copies of a 20 kilobyte array:
>>                           His code                   library memcpy       
>> Compiler 1:
>>                (chars)     12.6                            2.7
>>                (ints)       6.9                            2.7
>> Compiler 2:
>>                (chars)     23.6                            1.3
>>                (ints)       6.9                            1.3
>[Stuff deleted... compilers were Microsoft and Microway NDPC, machine was
>20 MHz 386]
>
>I just ran it on a 20 MHz 386 running SCO UNIX.  The timing were done with 
>5000 copies but then divided by 5 to make the numbers comparable.
>			   His code		       library memcpy
>SCO supplied MSC 5.1
>		(chars)	    14.0		             2.05
>Zortech
>	     386 code generator not available		     1.80

Here are the results on some machines I could find the other day.  The 
compilers are the native compilers unless otherwise stated.  I used 
whatever compiler optimizations I could.  20kbyte arrays, 1000 copies:

Sun Sparcstation 1+
        Him  		 memcpy    
chars: 7.6 		 2.0     
ints:  2.0 		 2.0     

Sun 4/280
	Him              memcpy    
chars: 9.8               2.8     
ints:  2.5               2.8

Sun Sparcstation SLC
        Him              memcpy    
chars: 9.9               2.6     
ints:  2.5               2.6

Sun 386i
	Him              memcpy    
chars: 9.5               2.6     
ints:  2.4               2.6

Sun 3/160
	Him              memcpy    
chars: 13.7              4.5     
ints:  3.4               4.5

Inmos T800 (Meiko, 25MHz, kind-of unfair because of block_copy instruction)
        Him              memcpy    
chars: 37.6              1.6     
ints:  8.4               1.6

i860 (Meiko, 40MHz, Green Hills C-I860 1.8.5, beta assembler 1.41, beta 
linker 1.2)
	Him              memcpy     
chars: 2.1               3.9      
ints:  0.9               3.9

Convex C120 (Vector--yes his code vectorizes nicely, memcpy not available, 
used bcopy)
	Him              memcpy    
chars: 3.0               1.0      
ints:  1.0               1.0

Convex C120 (Scalar, memcpy not available, used bcopy)
        Him              memcpy    
chars: 28.4              1.5     
ints:  7.5               1.5

BBN TC2000 (Motorola 88000-based, Green Hills C-88000 2.35(1.8.4))
        Him              memcpy    
chars: 10.3              12.0     
ints:  4.9               12.0

In general, I'd say Richard's code does a pretty good job when moving ints,
and also when compared against young machines (the BBN and the Meiko i860).
In addition, his code is about 20% faster than a simple "for" loop on my
Sparc 1+, so it illustrates a useful principle as well.  I intend to
use it in some selected applications -- thanks for posting it.

BIG TIME DISCLAIMER: I in no way intended this to be a comparison of 
different machines, but of the performance of a piece of C code on each of
several different machines.  There are a lot of ways to do timings, and most 
of them aren't very good, so please don't flame me if I didn't do justice to 
some machine's absolute performance; it is the relative timings that matter.
If I screwed that up, flame away (though a nice note explaining the error
might be more instructive).
						Bruce
P.S. For timing I used getusecclock() on the BBN, ticks() on the Meikos, and 
getrusage() on everything else.

dwho@nmt.edu (David Olix) (07/29/90)

In article <134@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
>> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>
>> >The macro shown below is an optimized memory to memory copy macro.
>> >It is probably faster than memcopy on your machine -- I have checked
>> >it on several machines and have always found it to be faster.
>>                                  !!!!!!
>
>> Oh My!.
>
>	[Superior timings for a 20KB move on a 386 by memmove given]
>

[Explanation for slower performance on 386 deleted]

>In defense I have to point out that the quoted remark is accurate;
>timings were made on 680x0 boxes, vaxes (!), and some risc boxes, [...]

   Just for kicks I did the following test on these machines:
   Sun 3/280 (68020) running SunOS 4.1,
   HP 9000/835 (HP Precision Architecture RISC) running HP-UX 7.0,
   and a VAX 750 running 4.3 BSD (don't laugh, it still runs!)

   Timings are in seconds based on 1000 copies of a 20K char array.
   Compile commands were:  cc -O and gcc -O -fstrength-reduce
   -finline-functions.

                Macro (copy)             Library (memcpy)
Compiler     Sun    HP    VAX           Sun    HP    VAX
   cc         7     7      52            3      1     95  <--
  gcc         8    ---     51            2     ---    96

Oddly, the macro *DOES* seem to work faster on the VAX (Gee, so I guess
all the world *is* a VAX :-) :-) :-)

>Richard Harter, Software Maintenance and Development Systems, Inc.
>Net address: jjmhome!smds!rh Phone: 508-369-7398 
>US Mail: SMDS Inc., PO Box 555, Concord MA 01742
>This sentence no verb.  This sentence short.  This signature done.

--David Olix
dwho@{nmtvax|nmtsun|minos}.nmt.edu  .sig under construction
Any connection between my views and NMIMT's is purely coincidental

tkacik@kyzyl.mi.org (Tom Tkacik) (07/29/90)

In article <134@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> > In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
 
> > >The macro shown below is an optimized memory to memory copy macro.
> > >It is probably faster than memcopy on your machine -- I have checked
> > >it on several machines and have always found it to be faster.
> >                                  !!!!!!
 
> > Oh My!.
 
> 	[Superior timings for a 20KB move on a 386 by memmove given]
 
> Ouch.  I should have phrased that more carefully.  Yes, the gentleman
> is quite right.  As to be expected, a hardware bulk move is always
> going to beat code that uses item by item move instructions.  Furthermore,
 
> In defense I have to point out that the quoted remark is accurate;
> timings were made on 680x0 boxes, vaxes (!), and some risc boxes,
> but not, obviously, on any 386 boxes.
 
Oh My!.

I also tried this using 1000 copies of 20K blocks.
This was performed on an AT&T UnixPC (68010 @ 10MHz).
Here are my times:
			gencopy			memcpy
     GCC  (char)20000	31.3sec			12.0
	  (int)  5000	12.2			12.0

     CC	  (char)20000	31.3sec			12.0
	  (int)  5000   12.3			12.0

Both CC and GCC used the library version of memcpy, so there is little
surprise that they both took the same time.
 
>                                        Small comfort for the chap
> who took my posting blindly and put it on his 386.  Apologies.

Small comfort for the chap who blindly puts it on a 680x0.
I think I will stick to the library version. :-)

-- 
Tom Tkacik		tkacik@kyzyl.mi.org
Speaking only for myself here in Royal Oak.

hascall@cs.iastate.edu (John Hascall) (07/30/90)

rh@smds.UUCP (Richard Harter) writes:
}mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
}> rh@smds.UUCP (Richard Harter) writes:

}> >The macro shown below is an optimized memory to memory copy macro.
}> >It is probably faster than memcopy on your machine -- I have checked
}> >it on several machines and have always found it to be faster.

}> Oh My!.

}	[Superior timings for a 20KB move on a 386 by memmove given]

}Ouch.  I should have phrased that more carefully.  Yes, the gentleman
}is quite right.  As to be expected, a hardware bulk move is always
}going to beat code that uses item by item move instructions.

}In defense I have to point out that the quoted remark is accurate;
}timings were made on 680x0 boxes, vaxes (!), and some risc boxes,
				   ~~~~~
 Hmmm, sorry, but here's a little more rain on your parade:

 VAXstation 3200 (admittedly, not the most modern VAX), 1000 20KB copies:

		memcpy: 3.18       your macro: 15.43

John Hascall  /  john@iastate.edu  /  hascall@atanasoff.cs.iastate.edu

john@frog.UUCP (John Woods) (07/31/90)

In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
> >The macro shown below is an optimized memory to memory copy macro.
> >It is probably faster than memcopy on your machine
> Oh My!.
> Sure looks like his code loses. As I would expect, of course:  The
> library can do the whole 20 kilobytes in one instruction, assuming
> aligned operands. Unaligned, it takes a few more.

Well, it turns out that even without being able to do memory-to-memory moves
with a single SPIN_CYCLE instruction, a poor little garden variety 68020 can
do better than the supplied macro.  I checked our movebytes() subroutine,
and it did about three times better than the macro.  It, of course, uses
all of the valuable optimization tricks mentioned in Richard's article
(reason enough to stash it away), and also takes care of minor little problems
like odd-address alignment (which is a problem on the 68000 though not on
the '20).  movebytes() is, as it happens, written in C, using a compiler
(GreenHills) which translates it into really good assembly code.
-- 
John Woods, Charles River Data Systems, Framingham MA, (508) 626-1101
...!decvax!frog!john, john@frog.UUCP, ...!mit-eddie!jfw, jfw@eddie.mit.edu

steve@thelake.mn.org (Steve Yelvington) (08/01/90)

[In article <1990Jul28.203800.17258@laguna.ccsf.caltech.edu>,
     bruce@seismo.gps.caltech.edu (Bruce Worden) writes ... ]

> In general, I'd say Richard's code does a pretty good job when moving ints,
> and also when compared against young machines (the BBN and the Meiko i860).
> In addition, his code is about 20% faster than a simple "for" loop on my
> Sparc 1+, so it illustrates a useful principle as well.  I intend to
> use it in some selected applications -- thanks for posting it.

Bruce is one of the few people who seems to have seen the point -- which
(to me, anyway) was just an illustration of C coding technique, not a
claim that it's possible to beat Brand X compiler's mondo-optimized
assembler memcpy().

For collectors of useless numbers, here are results from an 8 MHz
16-bit Motorola 68000 (Atari ST), 1000 iterations, 20K buffers:

library memcpy:           17.125 seconds
Richard's gencopy char:   40.755 seconds
Richard's gencopy int:    20.385 seconds
Richard's gencopy long:   15.460 seconds

Details: Sozobon C compiler, dLibs public-domain C library.
Optimizer turned on.  ints are 16 bits; longs are 32 bits.

The dLibs memcpy is coded in assembler, moves 16-bit words when possible,
and DOES check for overlaps (as in memmove). The copy is a simple loop.
Loading and dumping registers with movem.l might be faster; I have not tried.
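
For readers who have not seen it, the overlap check amounts to
choosing the copy direction.  A minimal sketch (mine, not the dLibs
code, with an invented name) follows; like memmove, it copies forward
unless the destination starts inside the source:

    void safe_copy(dest, src, nb)
    register char *dest, *src;
    register int nb;
    {
        if (dest <= src || dest >= src + nb) {
            while (--nb >= 0)       /* no harmful overlap: copy forward */
                *dest++ = *src++;
        } else {
            dest += nb;             /* dest inside src: copy backward   */
            src  += nb;
            while (--nb >= 0)
                *--dest = *--src;
        }
    }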
-- 
   Steve Yelvington at the (rain-replenished) lake in Minnesota
   steve@thelake.mn.org

boyd@necisa.ho.necisa.oz (Boyd Roberts) (08/01/90)

When will these exercises in futility end?  How many times have we
seen the `my hack beats the library routine' postings?  And how
many times have we seen such misguided claims proved to be false?

Too many times!

It should be obvious that there are too many variables in these
situations and that generic statements will be found to be false
for some large number of cases.  Apples and oranges...


Boyd Roberts			boyd@necisa.ho.necisa.oz.au

``When the going gets weird, the weird turn pro...''