[comp.unix.wizards] fastest way to copy hunks of memory

gwyn@smoke.BRL.MIL (Doug Gwyn) (05/02/90)

In article <5531@helios.ee.lbl.gov> tierney@ux1.lbl.gov (Brian Tierney) writes:
>Which method is fastest?
>1.  char *p1, *p2;
>    for (j = 0; j < size; j++)
>	    *p1++ = *p2++;
>2.  bcopy(p2,p1,size);
>3.  memcpy(p1,p2,size);
>In general, system calls are slower, right? (ie, 1 faster that 2 and 3)
>BTW, what's the difference between bcopy and memcpy anyway??

bcopy() and memcpy() are never (to my knowledge) system calls, but are
functions in the system's C library.  While there is certainly
considerable overhead in making a system call, there is much less
overhead making a library function call.

bcopy() is provided on 4BSD-based systems, while memcpy() is provided
on System V-based systems.  The ANSI C standard requires support for
memcpy() AND memmove(); the difference is that memmove() is guaranteed
to "do the right thing" when the source and destination buffers overlap,
whereas memcpy() doesn't necessarily follow such a bit-blit model (but
this allows it to be slightly faster, sometimes).  bcopy()'s behavior
in such circumstances is not as well defined, although there seems to
be some sentiment that it is supposed to "do the right thing" also.

Generally, bcopy() and memcpy() are implemented to exploit whatever
"block move" instructions the processor might support, which makes
them much faster when `size' is fairly large.  For small values of
`size', the in-line loop code might be faster, but unless it's in a
bottleneck section of code, why bother.  Indeed, instead of copying
one byte at a time, you might copy a word at a time, and change the
loop test to be against 0, and use do..while so that a subtract-one-
and-branch-if-nonzero instruction is generated, and ...

Since this is neither a UNIX-specific nor a Wizardly question, I've
directed follow-ups to comp.lang.c (INFO-C).

clewis@eci386.uucp (Chris Lewis) (05/03/90)

In article <5531@helios.ee.lbl.gov> tierney@ux1.lbl.gov (Brian Tierney) writes:
 
> Which method is fastest?

It depends.  What kind of processor are you?  How good is your compiler?
Does your vendor program in assembler?  How long is the string?  Are the
operands word aligned?  What's the phase of the moon?  etc.

Perhaps 

    while(size--)
	*p1++ = *p2++;

would be faster.

Yes, library routines are usually slower than in-line code because
of the subroutine call overhead.  But, what if the subroutine is
written in assembler and uses a block move instruction?  Or, the
compiler automatically in-lines small functions?  etc.  Or, the
library routine writer does fancy things in C (eg: "Duff's device")?

There are many fancy short-cuts that can be used even in something as
simple as a block move.

[I'll let someone else describe Duff's device, a topic that comes
up every couple of weeks ;-)]

Generally, it's not really worth bothering about the performance of
simple pieces of code such as block moves and so on.  Use whatever's
easiest to code and support.  Only get fancy when you program is *not*
fast enough and it's spending a significant amount of its time there - 
and you find that out later...

There's nothing worse than finding really obfuscated C code in something, 
where the obfuscation was just to gain a cycle or two out of millions.  
Especially when your compiler's optimizer is so good that it doesn't
matter how "dumb" you code it.

Well, there is one thing worse, puzzle-coded hyper-speed programs that
get the wrong answer.

> In general, system calls are slower, right? (ie, 1 faster that 2 and 3)

Very generally, *function* calls are slower than in-line code.  But chances
are your in-line code would be slower than a vendor's library routine.
You'll never know until you build the program and measure it....
 
> BTW, what's the difference between bcopy and memcpy anyway??

Who wrote it....  memcpy is a Unix System V routine (introduced in SIII
I think), and bcopy is a routine that first appeared as a generally
available function in BSD (it may have been in V7, I disremember), but
since that time, most newer versions of UNIX have both....  (bcopy is 
guaranteed to work even if the source and target overlap, whereas
memcpy isn't.  That's the main difference other than the source and
target operands being backwards)
-- 
Chris Lewis, Elegant Communications Inc, {uunet!attcan,utzoo}!lsuc!eci386!clewis
Ferret mailing list: eci386!ferret-list, psroff mailing list: eci386!psroff-list

guy@auspex.auspex.com (Guy Harris) (05/04/90)

>Who wrote it....  memcpy is a Unix System V routine (introduced in SIII
>I think),

Nope, introduced in S5; which release, I don't remember.

>and bcopy is a routine that first appeared as a generally
>available function in BSD (it may have been in V7, I disremember),

It wasn't in V7.  I don't think it was in 4.1BSD, either; as I remember,
it first showed up in 4.2BSD.

>but since that time, most newer versions of UNIX have both....

I don't think S5 has "bcopy()" until, possibly, S5R4.  4.2BSD has only
"bcopy()"; 4.3BSD has a "memcpy()", but for some unfathomable reason
they whipped up a quick-and-dirty C version that copies a byte at a
time, rather than just tweaking the assembler-language "bcopy()" to flip
the first two arguments.

mercer@ncrstp.StPaul.NCR.COM (Dan Mercer) (05/04/90)

In article <5531@helios.ee.lbl.gov> tierney@ux1.lbl.gov (Brian Tierney) writes:
:
:Which method is fastest?
:
:1.  char *p1, *p2;
:
:    for (j = 0; j < size; j++)
:       *p1++ = *p2++;
:
:2.  bcopy(p2,p1,size);
:
:3.  memcpy(p1,p2,size);
:
:In general, system calls are slower, right? (ie, 1 faster that 2 and 3)
:
:BTW, what's the difference between bcopy and memcpy anyway??
:
:THANKS.  
:
:
:/---------------------------------------v-------------------------------------\
:| Brian Tierney, Computer Graphics Lab  | internet:   tierney@ux1.lbl.gov   |
:| Lawrence Berkeley Laboratory          |    or       BLTierney@lbl.gov      |
:|      "I am her hero, and she is my heroin": Rocky Ericson                  |


It's c language implementation independent.  Bcopy or memcpy may be
implemented as inline assembly (as strcpy is on the NCR Tower's SVr3
machines if <istring.h> is included).  They may also be implemented
in other fashions that are quicker than standard C function calls.
If the hardware architecture supports move instructions the savings can
be significant.  

for (j = 0; j < size; j++) *p1++ = *p2++; 

At the minimal,  this breaks down to:

    exclusive or register x against register x - 1 cycle
    load register a with size - 1 cycle
    label:
    move character based at value in x plus base y to position at x plus 
        base z - 1 cycle
    increment x - 1 cycle
    compare x to a - 1 cycle
    branch not equal to label - 1 cycle

Cost = 2 + (size * 4) cycles

    In fact,  if the architecture supports a branch and decrement
instruction,  the following c code would be faster -

for (j=size;j;j--) *p1++ = *p2++; 

    load register x with size - 1 cycle
    label:
    move character based at value in x plus base y to position at x plus 
        base z - 1 cycle
    decrement x and branch to label if not zero - 1 cycle

Cost = 2 + (size * 2) cycles

However,  if the bcopy or memcpy is expanded inline and the architecture
support move character instructions,  the savings may be far greater.

For instance,  we have a processor that does an MVC - move characters.
While the bytes being moved are not on full words,  or the length to
be moved is less that a full word,  one cycle is used for each byte.
But if full words can be moved,  only one cycle is required to move
each 4 byte chunk.

Socrates once came upon a bunch of philosophers arguing about how many
teeth a horse had.  One group argued for a certain number based on
the length of the horses snout,  a second upon the diet the horse
ate,  a third group based upon number theory.  Socrates disappeared
for a minute and came back and told them the exact number.  They were
incensed.  How could he be so sure of his number,  when he hadn't even
offered a theory to explain it.  Simple,  he replied,  while they were
arguing,  he had gone back to the stables and counted.

So,  if you want the real answer,  I suggest you do the same.  If
you can't decipher the answer from the assembly generated,  put
together a nice looping package and try that.

-- 

Dan Mercer
Reply-To: mercer@ncrstp.StPaul.NCR.COM (Dan Mercer)

ag@cbmvax.commodore.com (Keith Gabryelski) (05/04/90)

In article <3296@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
>I don't think S5 has "bcopy()" until, possibly, S5R4.

A lot of vendors added bcopy() to S5R3 when they hacked in TCP/IP
support, just for ease of porting.

S5R4 has support for bcopy() in libucb.a.

Pax, Keith

c60c-3cf@e260-3f.berkeley.edu (Dan Kogai) (05/05/90)

In article <1990May2.200732.11851@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
>Perhaps 
>
>    while(size--)
>	*p1++ = *p2++;

or even

void *memcpy(void *to, void *from, size_t size){
	register int 	size_l = size / 4,	/* or (size >> log2(sizeof int)) */
					tail = size % 4;	/* or (size & log2(sizeof int)) */
	void			*result = to;				
	while(size_l--) (int *)to++ = (int *)from++;
	while(tail--) (char *)p1++ = (char *)p2++;
	return result;
}

	This shold work almost 4 times as fast compared to just inclementing
by bytes--it uses full length of register.  The problem is that it doesn't
work if either (void *to) and (void *from) is not aligned and the macine
architecure doesn't allow unaligned assignment.  Such functions as
memcpy() should be written in assembler, I think...

---
##################  Dan The "ioccc contestant this year" Man
+ ____  __  __   +  (Aka Dan Kogai)
+     ||__||__|  +  E-mail:     dankg@ocf.berkeley.edu
+ ____| ______   +  Voice:      415-549-6111
+ |     |__|__|  +  USnail:     1730 Laloma Berkeley, CA 94709
+ |___  |__|__|  +              U.S.A
+     |____|____ +  Disclaimer: I'd rather be blackmailed for my long .sig
+   \_|    |     +              than give up my cool name in Kanji. And my
+ <- THE MAN  -> +              citizenship of People's Republic o' Berkeley
##################              has nothing 2 do w/ whatever I post, ThanQ.

meissner@osf.org (Michael Meissner) (05/07/90)

In article <11311@cbmvax.commodore.com> ag@cbmvax.commodore.com (Keith
Gabryelski) writes:

| In article <3296@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes:
| >I don't think S5 has "bcopy()" until, possibly, S5R4.
| 
| A lot of vendors added bcopy() to S5R3 when they hacked in TCP/IP
| support, just for ease of porting.
| 
| S5R4 has support for bcopy() in libucb.a.
| 
| Pax, Keith

But of course, the question still is which routine has your vendor
spent time optimizing, and for which is it just a byte-by-byte copy?
The answers are different for each vendor (some unfortunately don't
optimize any of them, even if they provide both).
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

meissner@osf.org (Michael Meissner) (05/07/90)

In article <1990May4.172145.4085@agate.berkeley.edu>
c60c-3cf@e260-3f.berkeley.edu (Dan Kogai) writes:

| In article <1990May2.200732.11851@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes:
| >Perhaps 
| >
| >    while(size--)
| >	*p1++ = *p2++;
| 
| or even
| 
| void *memcpy(void *to, void *from, size_t size){
| 	register int 	size_l = size / 4,	/* or (size >> log2(sizeof int)) */
| 					tail = size % 4;	/* or (size & log2(sizeof int)) */
| 	void			*result = to;				
| 	while(size_l--) (int *)to++ = (int *)from++;
| 	while(tail--) (char *)p1++ = (char *)p2++;
| 	return result;
| }
| 
| 	This shold work almost 4 times as fast compared to just inclementing
| by bytes--it uses full length of register.  The problem is that it doesn't
| work if either (void *to) and (void *from) is not aligned and the macine
| architecure doesn't allow unaligned assignment.  Such functions as
| memcpy() should be written in assembler, I think...

The above code will not work on machines with strict alignment
requirements (ie, RISC machines) if either the 'to' or 'from' pointers
are not aligned on input, since the user could certainly do something
like:

	memcpy (to+1, from, size);

It also will not work under ANSI C compilers, since the construction:

	(int *)to++ = ...

is illegal ANSI C.  Finally, to get the most of the performance on
RISC machines, you have to know about the underlying machine
characteristics.  For example, on the 88k, there is a 2 cycle delay
after the load instruction has been initiated, and before it is in a
register (there are hardware interlocks, so that even naive code will
work).  Thus on the 88k, after dealing with any initial unaligned
pointers, and such, the main loop would look like:

	...

	{
		register int word1, word2, *word_to, *word_from;

		word_to = (int *) to;
		word_from = (int *) from;

		do {
			word1 = word_from[0];
			word2 = word_from[1];
			word_from += 2;
			size -= 2 * sizeof (int);
			word_to[0] = word1;
			word_to[1] = word2;
			word_to += 2;
		} while ( size > 2 * sizeof(int) );
	}

Optimizing bcopy/memcpy/memmove is not as simple as it looks.  It
takes a lot of skull sweat, and worrying about unusual cases.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so