[comp.lang.c] Memory copy timings

rh@smds.UUCP (Richard Harter) (08/02/90)

A number of memcpy versus "his macro" results have been posted.  As far
as I can recall all were large block moves (5K - 20K).  None of the
postings mentioned checking unaligned moves, i.e. situations where
the destination and/or source do not lie on even word boundaries.
This is not entirely realistic.  If you are going to use a memory
copy routine (and you should) you are going to use it to copy short
blocks as well as long ones; if you are going to copy character strings
it will often be the case that they are unaligned.

I ran an experiment on four machines: a generic 386, a SUN 3/50,
a Mac IIcx running AUX, and a Tektronix XD88/10.  On each of the
four machines I copied 100,000,000 bytes.  I set up three cases:
(a) aligned character moves, (b) unaligned character moves, and
(c) integer moves.  In all cases I used the maximum optimization
available (none of the postings mentioned whether they had optimization
turned on).

In each case I used six different block sizes ranging from 10 to 1000.
(Larger block sizes are dominated by the inner loop; shorter
blocks have sundry overhead costs.)  Block sizes for ints were
4 times larger than block sizes for characters (an experiment design
flaw.)
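
For reference, a minimal sketch of the sort of timing harness involved
is below.  It is illustrative rather than the exact program I ran -- it
only exercises memcpy, since the macro from the earlier postings isn't
reproduced here, and the array sizes are assumptions.

/* Sketch of a timing harness for aligned, unaligned, and integer copies.
 * Illustrative only: it exercises memcpy alone, and TOTAL_BYTES and the
 * array sizes are assumptions, not the exact test program.
 */
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TOTAL_BYTES 100000000L

static char src[1001], dst[1001];       /* one spare byte so we can offset */
static int  isrc[1000], idst[1000];

static double time_char_copy(long blk, int offset)
{
    long done;
    clock_t t0 = clock();
    for (done = 0; done < TOTAL_BYTES; done += blk)
        memcpy(dst + offset, src + offset, (size_t)blk);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

static double time_int_copy(long nints)
{
    long done, bytes = nints * (long)sizeof(int);
    clock_t t0 = clock();
    for (done = 0; done < TOTAL_BYTES; done += bytes)
        memcpy(idst, isrc, (size_t)bytes);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    static const long blocks[] = { 10, 25, 50, 100, 250, 1000 };
    int i;

    for (i = 0; i < 6; i++)
        printf("%5ld  aligned %.1fs  unaligned %.1fs  int %.1fs\n",
               blocks[i],
               time_char_copy(blocks[i], 0),  /* word-aligned source/dest */
               time_char_copy(blocks[i], 1),  /* deliberately misaligned  */
               time_int_copy(blocks[i]));     /* same count, 4x the bytes */
    return 0;
}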

The following tables have six lines, one for each block size.
Column 1 is the block size, columns 2 and 3 are the times for
aligned character moves for the macro and memcpy respectively,
columns 4 and 5 are the times for unaligned character moves,
and columns 6 and 7 are the times for integer moves.  In each
case the figure given is the time to move 100,000,000 bytes.  Results
and comments follow:

	386 Timings -- Esix operating system
	
	  10     187     88    188     88     51     31
	  25     131     40    134     42     39     16
	  50     121     30    120     30     36     14
	 100     115     16    117     29     27     10
	 250     106     13    107     22     27      7
	1000      90      9     86     19     26      6
	
Comments:  The 386 has a hardware block move instruction.  Hardware
beats software hands down.  Clearly one wants to use memcpy, even
for very short copies.  Enough said.

	SUN 3/50       OS 3.5 
	
	  10     154    199    154    219     39     62
	  25     115     93    115    149     29     40
	  50     105     60    105    127     26     33
	 100      99     42     99    116     25     29
	 250      98     38     98    110     24     28
	1000      93     28     93    105     23     26

Comments:  The SUN memcpy apparently checks for alignment and switches
to word moves when alignment is right; however, it apparently
doesn't use loop unrolling.  The macro doesn't check alignment.  You
could add code to check alignment, but I don't see a clean, portable
way to do it.
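
For what it's worth, the usual trick is to look at the low bits of the
pointers, along the lines of the sketch below.  The cast from pointer
to integer is exactly the part that isn't cleanly portable, though it
does the right thing on the machines tested here.

/* Sketch: test whether both pointers are word aligned before switching
 * to word-sized moves.  Casting a pointer to an integer type is the
 * non-portable part; it does the right thing on the machines above.
 */
#include <stddef.h>

#define WORD_ALIGNED(p, q) \
    ((((unsigned long)(p) | (unsigned long)(q)) & (sizeof(long) - 1)) == 0)

void copy_bytes(char *dst, const char *src, size_t n)
{
    if (WORD_ALIGNED(dst, src)) {
        /* both aligned: move longs, then fall through for the tail */
        long *ld = (long *)dst;
        const long *ls = (const long *)src;
        while (n >= sizeof(long)) {
            *ld++ = *ls++;
            n -= sizeof(long);
        }
        dst = (char *)ld;
        src = (const char *)ls;
    }
    while (n-- > 0)                 /* unaligned case, or leftover bytes */
        *dst++ = *src++;
}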

Is it worth using the macro?  It's debatable.  If you use it for
ints, short char moves, and known unaligned moves you buy 10-40%.
On the other hand, using memcpy saves thought and maintenance costs,
and it will be superior when and if SUN optimizes the routine.
This is a tradeoff situation.
	
	MACINTOSH IICX  AUX 2.0
	
	  10     134    186    130    183     33    127
	  25     101    137     97    134     25    114
	  50      92    121     89    117     23    107
	 100      88    113     84    110     23    107
	 250      84    109     80    105     21    106
	1000      82    106     79    102     20    106

Comments:  AUX is a young OS.  One suspects that its memcpy is two
lines of C.  If performance is an issue, you might well consider
rolling your own copy routine.
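
If you do roll your own, the obvious starting point is word moves with
some unrolling, e.g. something in the style of Duff's device as sketched
below.  Whether the unrolling actually pays off depends on the compiler
and the machine, so measure before committing to it.

/* Sketch: an unrolled copy in the style of Duff's device, with int-sized
 * elements.  Whether the unrolling pays off depends on the compiler and
 * the machine, so this is a starting point, not a recommendation.
 */
#include <stddef.h>

void copy_ints(int *to, const int *from, size_t count)
{
    size_t n = (count + 7) / 8;     /* number of times through the loop */

    if (count == 0)
        return;
    switch (count % 8) {            /* jump into the middle of the loop */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--n > 0);
    }
}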

	Tektronix XD88/10 -- Greenhills C-88000 1.8.4
	
	  10      44     21     44     37     11      8
	  25      39     12     39     32     10      5
	  50      38      8     38     30     10      4
	 100      38      6     38     28     10      4
	 250      38      4     38     28      9      4
	1000      37      4     37     28      9      4

Comments:  Greenhills has a very good reputation; from these
timings it appears warranted.  Memcpy is the winner here by
a clear margin.  An interesting point is that optimization
in a compiled language depends in part on helping the compiler
produce efficient code.  The arrangement of code gives the
compiler information.  The cited macro is basically a CISC
optimization; compilers for RISC machines probably need information
that the macro does not supply.
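
To illustrate the point (this is a guess about typical compiler
behaviour, not something I measured): a plain indexed loop leaves the
optimizer free to do its own unrolling and scheduling, whereas a
hand-unrolled pointer loop has already made those decisions for it.

/* Two equivalent copies.  The first leaves the loop structure visible, so
 * an optimizing compiler (particularly for a RISC target) is free to do
 * its own unrolling and scheduling; the second has already made those
 * decisions for it.  Which one wins is entirely compiler-dependent.
 */
#include <stddef.h>

void copy_plain(int *dst, const int *src, size_t n)
{
    size_t i;
    for (i = 0; i < n; i++)
        dst[i] = src[i];
}

void copy_unrolled(int *dst, const int *src, size_t n)
{
    while (n >= 4) {                /* hand-unrolled, CISC-style */
        dst[0] = src[0]; dst[1] = src[1];
        dst[2] = src[2]; dst[3] = src[3];
        dst += 4; src += 4; n -= 4;
    }
    while (n-- > 0)
        *dst++ = *src++;
}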

----

Conclusions:  Memcpy is safe, portable (mostly), and doesn't involve
any maintenance issues.  On many machines it will be faster than
anything you can code.  It should be; the systems people can do anything that
you can do, plus machine-specific optimizations that you don't
have access to.  However, it is clear that the quality of the implementation
of system utilities varies a great deal.  If performance is an important
issue (or you have a system without memcpy or equivalent) you may want
to write your own.  Enough on this topic.
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (08/04/90)

In article <144@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> A number of memcpy versus "his macro" results have been posted.  ...
> None of the postings mentioned checking unaligned moves, ...
> This is not entirely realistic.

I don't see why.  Whatever method you are using for doing bulk
moves (BLT instruction, TRT instruction, bcopy(), memcpy(),
memmove(), ...), having things aligned is likely to help.  Any
programmer who cares enough about the performance of block transfer
to be wondering whether memcpy() is fast enough should really have
taken care of alignment first.

This also applies to fread(), fwrite(), and (on systems with a POSIX
interface, such as DEC are promising RSN for VMS) read() and write().
On all the UNIX systems where I've tried the comparison, I've found
that making read/write buffers be "well" aligned (the alignment that
malloc() guarantees is fine) was usefully faster than having them be
misaligned.
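
The comparison I did looked roughly like the sketch below; the file
name, buffer size, and the omission of the actual timing calls are all
simplifications.

/* Sketch: read the same file into a well-aligned buffer, then into a
 * deliberately misaligned one, and time the two loops.  The file name,
 * buffer size, and missing timing calls are all simplifications.
 */
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define BUFSIZE 8192

int main(void)
{
    char *base = malloc(BUFSIZE + 1);   /* malloc() alignment is fine */
    int fd = open("testfile", O_RDONLY);
    long n;

    if (base == NULL || fd < 0)
        return 1;

    /* time this loop with the well-aligned buffer ... */
    while ((n = read(fd, base, BUFSIZE)) > 0)
        ;
    lseek(fd, 0L, SEEK_SET);

    /* ... then with an odd address, and compare */
    while ((n = read(fd, base + 1, BUFSIZE)) > 0)
        ;

    close(fd);
    free(base);
    return 0;
}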

Most of the time, the best way to speed up block transfer is not to
do it at all, but to twiddle your pointers around.  That is why
MVS and VMS offer "locate" mode transput as well as "move" mode, a
distinction reflected at the PL/I language level.  This is also the
point of <fio.h>; does anyone have a PD implementation of fio?
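
In C the locate-mode idea comes down to something like the following
sketch: hand the caller a pointer into the I/O buffer instead of copying
the record out.  The function name and the assumption that no record
straddles a refill are mine.

/* Sketch of "locate mode" in C: hand the caller a pointer into the I/O
 * buffer instead of copying the record out.  The record is only valid
 * until the next call; that is the price of not copying.
 */
#include <stdio.h>
#include <string.h>

static char iobuf[8192];
static size_t filled, pos;

/* Return a pointer to the next newline-terminated record, or NULL at
 * end of file; *len gets the record length (newline not included).
 */
const char *next_record(FILE *fp, size_t *len)
{
    const char *rec;
    char *nl;

    if (pos >= filled) {                    /* refill the buffer */
        filled = fread(iobuf, 1, sizeof iobuf, fp);
        pos = 0;
        if (filled == 0)
            return NULL;
    }
    nl = memchr(iobuf + pos, '\n', filled - pos);
    *len = nl ? (size_t)(nl - (iobuf + pos)) : filled - pos;
    rec = iobuf + pos;
    pos += *len + (nl != NULL);             /* step past the newline */
    return rec;                             /* no copy made */
}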

-- 
Distinguishing between a work written in Hebrew and one written in Aramaic
when we have only a Latin version made from a Greek translation is not easy.
(D.J.Harrington, discussing pseudo-Philo)

rh@smds.UUCP (Richard Harter) (08/05/90)

In article <3510@goanna.cs.rmit.oz.au>, ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> In article <144@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > A number of memcpy versus "his macro" results have been posted.  ...
> > None of the postings mentioned checking unaligned moves, ...
> > This is not entirely realistic.

> I don't see why.  Whatever method you are using for doing bulk
> moves (BLT instruction, TRT instruction, bcopy(), memcpy(),
> memmove(), ...), having things aligned is likely to help.  Any
> programmer who cares enough about the performance of block transfer
> to be wondering whether memcpy() is fast enough should really have
> taken care of alignment first.

Agreed -- alignment helps and anyone concerned about performance should
worry about alignment.  The point is that if one is copying a block of
characters as such, e.g. strings, one quite regularly hits unaligned
moves.  An example would be copying a substring out of an array (or into
an array).
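
A concrete example (the record layout is invented): pulling a fixed-width
field out of the middle of a character array almost never starts on a
word boundary, however carefully the arrays themselves are aligned.

/* Example: pulling a field out of the middle of a record gives memcpy
 * an unaligned source even though the arrays themselves are aligned.
 */
#include <string.h>

char record[] = "date=19900802;subject=memory copy timings";
char date[9];

void get_date(void)
{
    /* record itself is aligned, but record + 5 is not, so this copy
     * starts one byte past a word boundary on a 32-bit machine.     */
    memcpy(date, record + 5, 8);
    date[8] = '\0';
}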

> This also applies to fread(), fwrite(), and (on systems with a POSIX
> interface, such as DEC are promising RSN for VMS) read() and write().
> On all the UNIX systems where I've tried the comparison, I've found
> that making read/write buffers be "well" aligned (the alignment that
> malloc() guarantees is fine) was usefully faster than having them be
> misaligned.

That is almost guaranteed to be the case.  If I am not mistaken, almost
all systems read fixed block sizes into internal buffers and copy
the results from the buffers to your specified destination.  (Waiting
to be told that I am wrong. :-))  There is a rumour to the effect that
your I/O will be faster if you read and write in fixed block sizes that
are integral multiples of the system block size.  Does anyone have
opinions or information on the trade-offs involved?  For example, is there
a performance gain one way or another in doing one read into a buffer of
1024 bytes and then copying out a series of items, versus doing a separate
read for each item?  Does anyone have any data on this?  Does anyone
care? :-)
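
To make the question concrete, the two shapes I have in mind look
roughly like this; item size, counts, and error handling are all
simplified.

/* Sketch of the two shapes: one read() per item versus one large read()
 * followed by copying items out of the buffer.
 */
#include <string.h>
#include <unistd.h>

#define ITEM_SIZE 32
#define NITEMS    (1024 / ITEM_SIZE)

/* one system call per item */
int read_per_item(int fd, char items[NITEMS][ITEM_SIZE])
{
    int i;
    for (i = 0; i < NITEMS; i++)
        if (read(fd, items[i], ITEM_SIZE) != ITEM_SIZE)
            return i;
    return NITEMS;
}

/* one system call, then memcpy each item out of the buffer */
int read_then_copy(int fd, char items[NITEMS][ITEM_SIZE])
{
    char buf[1024];
    int i;

    if (read(fd, buf, sizeof buf) != (long)sizeof buf)
        return 0;
    for (i = 0; i < NITEMS; i++)
        memcpy(items[i], buf + i * ITEM_SIZE, ITEM_SIZE);
    return NITEMS;
}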

> Most of the time, the best way to speed up block transfer is not to
> do it at all, but to twiddle your pointers around...

Yep, no argument here.  There are cases where that doesn't apply, e.g.
the data you want is in a transient area, you're changing object size,
you're going to modify the copied data later on, etc.  Another area where
there are large potential gains is the allocation and deallocation of objects
of the same type.  One often wins big by maintaining a free list of objects.
And so on...
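
For anyone who hasn't run into it, the free list trick is just the
following (a minimal sketch; "struct thing" is an invented example and
there is no error handling to speak of):

/* Minimal free-list allocator for fixed-size objects: freed nodes are
 * chained through their own storage and handed back before malloc() is
 * called again.
 */
#include <stdlib.h>

struct thing {
    struct thing *next;     /* link field, reused for the free list */
    int payload[8];
};

static struct thing *free_list;

struct thing *new_thing(void)
{
    struct thing *t;

    if (free_list != NULL) {            /* reuse a freed object */
        t = free_list;
        free_list = t->next;
        return t;
    }
    return malloc(sizeof *t);           /* fall back to the allocator */
}

void free_thing(struct thing *t)
{
    t->next = free_list;                /* push onto the free list */
    free_list = t;
}
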
-- 
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398 
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.