gwyn@smoke.BRL.MIL (Doug Gwyn) (05/02/90)
In article <5531@helios.ee.lbl.gov> tierney@ux1.lbl.gov (Brian Tierney) writes: >Which method is fastest? >1. char *p1, *p2; > for (j = 0; j < size; j++) > *p1++ = *p2++; >2. bcopy(p2,p1,size); >3. memcpy(p1,p2,size); >In general, system calls are slower, right? (ie, 1 faster that 2 and 3) >BTW, what's the difference between bcopy and memcpy anyway?? bcopy() and memcpy() are never (to my knowledge) system calls, but are functions in the system's C library. While there is certainly considerable overhead in making a system call, there is much less overhead making a library function call. bcopy() is provided on 4BSD-based systems, while memcpy() is provided on System V-based systems. The ANSI C standard requires support for memcpy() AND memmove(); the difference is that memmove() is guaranteed to "do the right thing" when the source and destination buffers overlap, whereas memcpy() doesn't necessarily follow such a bit-blit model (but this allows it to be slightly faster, sometimes). bcopy()'s behavior in such circumstances is not as well defined, although there seems to be some sentiment that it is supposed to "do the right thing" also. Generally, bcopy() and memcpy() are implemented to exploit whatever "block move" instructions the processor might support, which makes them much faster when `size' is fairly large. For small values of `size', the in-line loop code might be faster, but unless it's in a bottleneck section of code, why bother. Indeed, instead of copying one byte at a time, you might copy a word at a time, and change the loop test to be against 0, and use do..while so that a subtract-one- and-branch-if-nonzero instruction is generated, and ... Since this is neither a UNIX-specific nor a Wizardly question, I've directed follow-ups to comp.lang.c (INFO-C).
clewis@eci386.uucp (Chris Lewis) (05/03/90)
In article <5531@helios.ee.lbl.gov> tierney@ux1.lbl.gov (Brian Tierney) writes: > Which method is fastest? It depends. What kind of processor are you? How good is your compiler? Does your vendor program in assembler? How long is the string? Are the operands word aligned? What's the phase of the moon? etc. Perhaps while(size--) *p1++ = *p2++; would be faster. Yes, library routines are usually slower than in-line code because of the subroutine call overhead. But, what if the subroutine is written in assembler and uses a block move instruction? Or, the compiler automatically in-lines small functions? etc. Or, the library routine writer does fancy things in C (eg: "Duff's device")? There are many fancy short-cuts that can be used even in something as simple as a block move. [I'll let someone else describe Duff's device, a topic that comes up every couple of weeks ;-)] Generally, it's not really worth bothering about the performance of simple pieces of code such as block moves and so on. Use whatever's easiest to code and support. Only get fancy when you program is *not* fast enough and it's spending a significant amount of its time there - and you find that out later... There's nothing worse than finding really obfuscated C code in something, where the obfuscation was just to gain a cycle or two out of millions. Especially when your compiler's optimizer is so good that it doesn't matter how "dumb" you code it. Well, there is one thing worse, puzzle-coded hyper-speed programs that get the wrong answer. > In general, system calls are slower, right? (ie, 1 faster that 2 and 3) Very generally, *function* calls are slower than in-line code. But chances are your in-line code would be slower than a vendor's library routine. You'll never know until you build the program and measure it.... > BTW, what's the difference between bcopy and memcpy anyway?? Who wrote it.... memcpy is a Unix System V routine (introduced in SIII I think), and bcopy is a routine that first appeared as a generally available function in BSD (it may have been in V7, I disremember), but since that time, most newer versions of UNIX have both.... (bcopy is guaranteed to work even if the source and target overlap, whereas memcpy isn't. That's the main difference other than the source and target operands being backwards) -- Chris Lewis, Elegant Communications Inc, {uunet!attcan,utzoo}!lsuc!eci386!clewis Ferret mailing list: eci386!ferret-list, psroff mailing list: eci386!psroff-list
guy@auspex.auspex.com (Guy Harris) (05/04/90)
>Who wrote it.... memcpy is a Unix System V routine (introduced in SIII >I think), Nope, introduced in S5; which release, I don't remember. >and bcopy is a routine that first appeared as a generally >available function in BSD (it may have been in V7, I disremember), It wasn't in V7. I don't think it was in 4.1BSD, either; as I remember, it first showed up in 4.2BSD. >but since that time, most newer versions of UNIX have both.... I don't think S5 has "bcopy()" until, possibly, S5R4. 4.2BSD has only "bcopy()"; 4.3BSD has a "memcpy()", but for some unfathomable reason they whipped up a quick-and-dirty C version that copies a byte at a time, rather than just tweaking the assembler-language "bcopy()" to flip the first two arguments.
mercer@ncrstp.StPaul.NCR.COM (Dan Mercer) (05/04/90)
In article <5531@helios.ee.lbl.gov> tierney@ux1.lbl.gov (Brian Tierney) writes:
:
:Which method is fastest?
:
:1. char *p1, *p2;
:
: for (j = 0; j < size; j++)
: *p1++ = *p2++;
:
:2. bcopy(p2,p1,size);
:
:3. memcpy(p1,p2,size);
:
:In general, system calls are slower, right? (ie, 1 faster that 2 and 3)
:
:BTW, what's the difference between bcopy and memcpy anyway??
:
:THANKS.
:
:
:/---------------------------------------v-------------------------------------\
:| Brian Tierney, Computer Graphics Lab | internet: tierney@ux1.lbl.gov |
:| Lawrence Berkeley Laboratory | or BLTierney@lbl.gov |
:| "I am her hero, and she is my heroin": Rocky Ericson |
It's c language implementation independent. Bcopy or memcpy may be
implemented as inline assembly (as strcpy is on the NCR Tower's SVr3
machines if <istring.h> is included). They may also be implemented
in other fashions that are quicker than standard C function calls.
If the hardware architecture supports move instructions the savings can
be significant.
for (j = 0; j < size; j++) *p1++ = *p2++;
At the minimal, this breaks down to:
exclusive or register x against register x - 1 cycle
load register a with size - 1 cycle
label:
move character based at value in x plus base y to position at x plus
base z - 1 cycle
increment x - 1 cycle
compare x to a - 1 cycle
branch not equal to label - 1 cycle
Cost = 2 + (size * 4) cycles
In fact, if the architecture supports a branch and decrement
instruction, the following c code would be faster -
for (j=size;j;j--) *p1++ = *p2++;
load register x with size - 1 cycle
label:
move character based at value in x plus base y to position at x plus
base z - 1 cycle
decrement x and branch to label if not zero - 1 cycle
Cost = 2 + (size * 2) cycles
However, if the bcopy or memcpy is expanded inline and the architecture
support move character instructions, the savings may be far greater.
For instance, we have a processor that does an MVC - move characters.
While the bytes being moved are not on full words, or the length to
be moved is less that a full word, one cycle is used for each byte.
But if full words can be moved, only one cycle is required to move
each 4 byte chunk.
Socrates once came upon a bunch of philosophers arguing about how many
teeth a horse had. One group argued for a certain number based on
the length of the horses snout, a second upon the diet the horse
ate, a third group based upon number theory. Socrates disappeared
for a minute and came back and told them the exact number. They were
incensed. How could he be so sure of his number, when he hadn't even
offered a theory to explain it. Simple, he replied, while they were
arguing, he had gone back to the stables and counted.
So, if you want the real answer, I suggest you do the same. If
you can't decipher the answer from the assembly generated, put
together a nice looping package and try that.
--
Dan Mercer
Reply-To: mercer@ncrstp.StPaul.NCR.COM (Dan Mercer)
ag@cbmvax.commodore.com (Keith Gabryelski) (05/04/90)
In article <3296@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes: >I don't think S5 has "bcopy()" until, possibly, S5R4. A lot of vendors added bcopy() to S5R3 when they hacked in TCP/IP support, just for ease of porting. S5R4 has support for bcopy() in libucb.a. Pax, Keith
c60c-3cf@e260-3f.berkeley.edu (Dan Kogai) (05/05/90)
In article <1990May2.200732.11851@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes: >Perhaps > > while(size--) > *p1++ = *p2++; or even void *memcpy(void *to, void *from, size_t size){ register int size_l = size / 4, /* or (size >> log2(sizeof int)) */ tail = size % 4; /* or (size & log2(sizeof int)) */ void *result = to; while(size_l--) (int *)to++ = (int *)from++; while(tail--) (char *)p1++ = (char *)p2++; return result; } This shold work almost 4 times as fast compared to just inclementing by bytes--it uses full length of register. The problem is that it doesn't work if either (void *to) and (void *from) is not aligned and the macine architecure doesn't allow unaligned assignment. Such functions as memcpy() should be written in assembler, I think... --- ################## Dan The "ioccc contestant this year" Man + ____ __ __ + (Aka Dan Kogai) + ||__||__| + E-mail: dankg@ocf.berkeley.edu + ____| ______ + Voice: 415-549-6111 + | |__|__| + USnail: 1730 Laloma Berkeley, CA 94709 + |___ |__|__| + U.S.A + |____|____ + Disclaimer: I'd rather be blackmailed for my long .sig + \_| | + than give up my cool name in Kanji. And my + <- THE MAN -> + citizenship of People's Republic o' Berkeley ################## has nothing 2 do w/ whatever I post, ThanQ.
meissner@osf.org (Michael Meissner) (05/07/90)
In article <11311@cbmvax.commodore.com> ag@cbmvax.commodore.com (Keith Gabryelski) writes: | In article <3296@auspex.auspex.com> guy@auspex.auspex.com (Guy Harris) writes: | >I don't think S5 has "bcopy()" until, possibly, S5R4. | | A lot of vendors added bcopy() to S5R3 when they hacked in TCP/IP | support, just for ease of porting. | | S5R4 has support for bcopy() in libucb.a. | | Pax, Keith But of course, the question still is which routine has your vendor spent time optimizing, and for which is it just a byte-by-byte copy? The answers are different for each vendor (some unfortunately don't optimize any of them, even if they provide both). -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA Catproof is an oxymoron, Childproof is nearly so
meissner@osf.org (Michael Meissner) (05/07/90)
In article <1990May4.172145.4085@agate.berkeley.edu> c60c-3cf@e260-3f.berkeley.edu (Dan Kogai) writes: | In article <1990May2.200732.11851@eci386.uucp> clewis@eci386.UUCP (Chris Lewis) writes: | >Perhaps | > | > while(size--) | > *p1++ = *p2++; | | or even | | void *memcpy(void *to, void *from, size_t size){ | register int size_l = size / 4, /* or (size >> log2(sizeof int)) */ | tail = size % 4; /* or (size & log2(sizeof int)) */ | void *result = to; | while(size_l--) (int *)to++ = (int *)from++; | while(tail--) (char *)p1++ = (char *)p2++; | return result; | } | | This shold work almost 4 times as fast compared to just inclementing | by bytes--it uses full length of register. The problem is that it doesn't | work if either (void *to) and (void *from) is not aligned and the macine | architecure doesn't allow unaligned assignment. Such functions as | memcpy() should be written in assembler, I think... The above code will not work on machines with strict alignment requirements (ie, RISC machines) if either the 'to' or 'from' pointers are not aligned on input, since the user could certainly do something like: memcpy (to+1, from, size); It also will not work under ANSI C compilers, since the construction: (int *)to++ = ... is illegal ANSI C. Finally, to get the most of the performance on RISC machines, you have to know about the underlying machine characteristics. For example, on the 88k, there is a 2 cycle delay after the load instruction has been initiated, and before it is in a register (there are hardware interlocks, so that even naive code will work). Thus on the 88k, after dealing with any initial unaligned pointers, and such, the main loop would look like: ... { register int word1, word2, *word_to, *word_from; word_to = (int *) to; word_from = (int *) from; do { word1 = word_from[0]; word2 = word_from[1]; word_from += 2; size -= 2 * sizeof (int); word_to[0] = word1; word_to[1] = word2; word_to += 2; } while ( size > 2 * sizeof(int) ); } Optimizing bcopy/memcpy/memmove is not as simple as it looks. It takes a lot of skull sweat, and worrying about unusual cases. -- Michael Meissner email: meissner@osf.org phone: 617-621-8861 Open Software Foundation, 11 Cambridge Center, Cambridge, MA Catproof is an oxymoron, Childproof is nearly so