chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) (12/27/89)
In several books I've seen that assignment of structures is usually more efficient than using memcpy(), at least on most modern processors. I did a few experiments to see if this is true...using the following short program, I attempted to extract the machine code produced on different machines.

    #include <string.h>   /* memcpy */

    struct bozo {
        int one;
        char two;
        long three;
    } foo, bar;

    main()
    {
        foo = bar;
        (void)memcpy((char *)&foo, (char *)&bar, sizeof(struct bozo));
    }

On an 8086 CPU, the compiler - MSC5.1 (yuck!) - produces the following code for the assignment when full optimization is on:

    ; foo = bar
        lea   di, WORD PTR[bp-8]    ; foo
        lea   si, WORD PTR[bp-16]   ; bar
        push  ss
        pop   es
        movsw        ; the four movsw statements are more
        movsw        ; space/speed efficient than a
        movsw        ; mov cx,sizeof(foo)/2
        movsw        ; rep movsw combination....

On a VAX using gcc, the following code is produced:

    ; foo = bar;
        subl3  $76,fp,sp
        movab  -64(fp),r1
        movab  -76(fp),r0
        movl   $12,r2
        movblk

The VAX naturally produces the more efficient code, but I would imagine the 8086 would do just as good a job with larger structures, where a

        mov   cx, sizeof(struct bozo)/2
        rep   movsw

could be used under appropriate circumstances.

However, this is only half the question. Does the assignment win over memcpy? On the 8086, the following code is produced:

    ; (void)memcpy((char *)&foo, (char *)&bar, sizeof(struct bozo));
        lea   ax, WORD PTR[bp-16]   ; foo
        mov   WORD PTR[bp-18], ax
        mov   cx, 8
        lea   di, WORD PTR[bp-8]    ; foo
        lea   si, WORD PTR[bp-16]   ; bar
        mov   ax, ss
        shr   cx, 1
        rep   movsw
        adc   cx, cx
        rep   movsb

The compiler is smart enough to make memcpy an intrinsic function, so as to avoid a costly call statement. On the VAX, a call to memcpy (or in this case bcopy(), which is the same thing) was produced, so I wasn't able to analyze the code directly.
However, using gcc on bcopy.c produces the following code:

            .globl  _bcopy
    _bcopy:
            .word   0x0
            movl    4(fp),r4
            movl    8(fp),r3
            movl    12(fp),r2
            tstl    r2
            jeql    L1
            cmpl    r4,r3
            jeql    L1
    L2:
            decl    r2
            tstl    r2
            jneq    L2
    L4:
            movl    r3,r0
            addl2   $4,r3
            movl    r4,r1
            addl2   $4,r4
            movl    (r1),(r0)
            decl    r2
            tstl    r2
            jneq    L4
            ret

Which seems like quite a bit compared to the assignment. However, in almost all C code I have seen, comments state something along the lines of "/* use memcpy for structures larger than int */", which seems to go against the results shown above.

In _general_ what is the rule for the assignment of two large structures? memcpy vs. assignment? Which is generally better?
chris@mimsy.umd.edu (Chris Torek) (12/27/89)
In article <1657@uwm.edu> chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) writes:
>On a VAX using gcc, the following code is produced:
>
>; foo = bar;
>       subl3  $76,fp,sp
>       movab  -64(fp),r1
>       movab  -76(fp),r0
>       movl   $12,r2
>       movblk

Your `VAX' GCC is producing Tahoe instructions. (The Tahoe movblk instruction corresponds fairly well to movc3 on the VAX.)

As to the original question: if

    #define pointer_t char *    /* or void * */
    struct foo src, dst;
    (void) memcpy((pointer_t)&dst, (pointer_t)&src, len);

is faster than

    dst = src;

you have a really stupid compiler, since the assignment could be treated internally as a call to memcpy. (In analysing an assignment statement, compilers could replace the tree

    (assign (name dst) (name src))

with the tree

    (cast void
      (call (name memcpy)
        (cast pointer_t (addressof (name dst)))
        (cast pointer_t (addressof (name src)))
        (constant (sizeof (structtype foo)))))

which is exactly what it would have built for the memcpy line above.)
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu    Path: uunet!mimsy!chris
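
The equivalence argued above can be checked directly. The following is a minimal sketch, not code from the thread: the function names are hypothetical, and it reuses the struct bozo layout from the original post. The comparison is done member by member because padding bytes are not guaranteed to match after an assignment.

```c
#include <assert.h>
#include <string.h>

struct bozo {
    int  one;
    char two;
    long three;
};

/* Copy by plain structure assignment. */
void copy_by_assignment(struct bozo *dst, const struct bozo *src)
{
    *dst = *src;
}

/* Copy by memcpy, the form a compiler could substitute internally. */
void copy_by_memcpy(struct bozo *dst, const struct bozo *src)
{
    (void) memcpy(dst, src, sizeof *dst);
}

/* Hypothetical helper: compare members only, ignoring padding bytes. */
int same_members(const struct bozo *a, const struct bozo *b)
{
    return a->one == b->one && a->two == b->two && a->three == b->three;
}

/* Aborts (via assert) if the two forms ever produce different members. */
void demo_copies(void)
{
    struct bozo src = { 1, 'x', 123456L };
    struct bozo a = { 0, 0, 0 };
    struct bozo b = { 0, 0, 0 };

    copy_by_assignment(&a, &src);
    copy_by_memcpy(&b, &src);

    assert(same_members(&a, &src));
    assert(same_members(&b, &src));
    assert(same_members(&a, &b));
}
```

Calling demo_copies() from main() is enough to exercise both forms. Which of the two is faster still depends on the compiler, which is the point of the thread.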
tim@nucleus.amd.com (Tim Olson) (12/30/89)
In article <1657@uwm.edu> chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) writes:
| In several books I've seen that assignment of structures is usually
| more efficient than using memcpy(), at least on most modern
| processors. I did a few experiments to see if this is true...using
| the following short program, I attempted to extract the machine
| code produced on different machines.

[ code examples deleted ]

| However, this is only half the question. Does the assignment win
| over memcpy? On the 8086, the following code is produced:

| In _general_ what is the rule for the assignment of two large
| structures? memcpy vs. assignment? Which is generally better?

Assignment should *always* be better, or at least equal to a call to memcpy, in terms of performance. The compiler knows (at compile time) all of the sizes and alignments of the structures being copied, so it can choose the most efficient assignment method for a given processor. For example, if the structures are aligned and padded to 4-byte boundaries on a 32-bit processor, then 32-bit load/store instructions can be used to copy the structure 4 bytes at a time.

A general-purpose runtime routine such as memcpy must perform the copy a byte at a time, or perform runtime checks on the size and alignments of the memory areas being copied.

If the compiler recognizes and inlines library routines, then a call to memcpy() may be as fast as structure assignment, but you are better off using the assignment, because:

    1) it will be more efficient on many machine/compiler combinations

    2) it expresses the programmer's intent more clearly

--
Tim Olson
Advanced Micro Devices
(tim@amd.com)
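
To make the compile-time versus run-time distinction concrete, here is a minimal sketch in C11. It is an illustration only: choose_plan, demo_plans and the enum names are hypothetical, and the helper shows only the checks a general-purpose routine would make, not the copy itself. It assumes unsigned long has an alignment greater than 1 and that uintptr_t is available, which holds on common platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct bozo {
    int  one;
    char two;
    long three;
};

/* What the compiler already knows when it translates `foo = bar`:
 * both values are constants, so no test is needed at run time. */
enum {
    BOZO_SIZE  = sizeof(struct bozo),
    BOZO_ALIGN = _Alignof(struct bozo)
};

enum copy_plan { COPY_BYTES, COPY_WORDS };

/* Hypothetical helper: the checks a general-purpose copy routine has to
 * repeat on every call, because it only sees two pointers and a length.
 * Simplified: word copying is chosen only when everything lines up. */
enum copy_plan choose_plan(const void *dst, const void *src, size_t n)
{
    const size_t align = _Alignof(unsigned long);

    if ((uintptr_t)dst % align != 0)
        return COPY_BYTES;          /* destination is misaligned */
    if ((uintptr_t)src % align != 0)
        return COPY_BYTES;          /* source is misaligned */
    if (n % sizeof(unsigned long) != 0)
        return COPY_BYTES;          /* length is not a whole number of words */
    return COPY_WORDS;
}

/* Aborts (via assert) if the checks do not behave as described. */
void demo_plans(void)
{
    unsigned long words[4] = { 0, 0, 0, 0 };
    unsigned char *bytes = (unsigned char *)words;

    /* Aligned pointers, whole number of words. */
    assert(choose_plan(&words[0], &words[2], 2 * sizeof words[0]) == COPY_WORDS);
    /* Destination off by one byte. */
    assert(choose_plan(bytes + 1, &words[2], sizeof words[0]) == COPY_BYTES);
    /* Length leaves a one-byte tail. */
    assert(choose_plan(&words[0], &words[2], sizeof words[0] + 1) == COPY_BYTES);
}
```

A real library routine would typically copy the aligned middle in words and handle the head and tail in bytes; the sketch keeps only the decision, since that is the cost a structure assignment can skip.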
henry@utzoo.uucp (Henry Spencer) (12/31/89)
In article <1657@uwm.edu> chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) writes:
>In several books I've seen that assignment of structures is usually
>more efficient than using memcpy(), at least on most modern
>processors...
>In _general_ what is the rule for the assignment of two large
>structures? memcpy vs. assignment? Which is generally better?

With a good and fully modern compiler, there is no reason why there should be any difference in efficiency. It's the same operation, a block copy. An ANSI C implementation is entitled to recognize memcpy() and produce inline code, although it might have to invest significant effort to be sure that certain helpful constraints which are implicit in assignment are being observed by the memcpy.

With poor or old compilers, it's an open question. Many such compilers will not inline memcpy(), and the function-call overhead will hurt. On the other hand, many such compilers will generate simple rather than optimal copy code for the assignment, while the memcpy() may be well optimized once you get past the startup overhead. (In particular, there is a naive belief that hardware provisions for fast copy -- string/block instructions, "loop mode", etc. -- are always the fastest way to do such operations, which is often untrue. The clever tricks that can get you a factor of 2 or more over hardware instructions are more often found in library routines, because modifying compilers is harder and few benchmarks use struct assignment much.)

(Lest anyone think I'm kidding about the factor of 2, I've got an experimental memchr() which, on long strings, beats every manufacturer's implementation we've tested by a factor of at least 2 and usually 3-4... despite being written in portable C rather than assembler.)

Unless efficiency is crucial, in which case you're probably tuning to match the characteristics of a specific compiler anyway, you should use the form which expresses your intent better and communicates it more clearly to the compiler. I.e., if you want to assign a structure, use assignment.
--
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1989: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (01/03/90)
In article <1989Dec31.005904.1910@utzoo.uucp> I wrote:
>... (Lest anyone think I'm kidding
>about the factor of 2, I've got an experimental memchr() which, on long
>strings, beats every manufacturer's implementation we've tested by a
>factor of at least 2 and usually 3-4... despite being written in portable
>C rather than assembler.)

Several people have written asking about memchr. It, along with similar speedups for a lot of the other string functions, will be in the second release of my freely-redistributable string library. The release date is somewhat uncertain, although "sometime in spring" would be a safe guess. I need to get the C News to-do list under control before I can spare much time for strings.

The stuff currently isn't in particularly good shape for distribution. It works, but the source is a mess, and I'm still experimenting with further optimizations. There is nothing proprietary about it, but I'd prefer not to release it widely until it's cleaned up.

The crucial trick, incidentally, is that one does the search a word at a time rather than a byte at a time.
--
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1990: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
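
For readers who have not seen the word-at-a-time trick, the following is a minimal sketch of one well-known way to do it. It is not Henry Spencer's code, which the post does not show; the name word_memchr is hypothetical, and the sketch assumes size_t has no padding bits. Each word is XORed with a pattern holding the target byte in every position, so a match becomes a zero byte, and a standard bit test then reports whether the word contains any zero byte.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical word-at-a-time memchr: same contract as memchr(). */
void *word_memchr(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const unsigned char target = (unsigned char)c;
    const size_t ones    = (size_t)-1 / UCHAR_MAX;      /* 0x0101...01 */
    const size_t highs   = ones * (UCHAR_MAX / 2 + 1);  /* 0x8080...80 */
    const size_t pattern = ones * target;               /* target in every byte */

    assert(s != NULL || n == 0);

    while (n >= sizeof(size_t)) {
        size_t w;

        /* Fixed-size memcpy is an alignment-safe load; compilers
         * commonly turn it into a single word load. */
        memcpy(&w, p, sizeof w);
        w ^= pattern;                   /* matching bytes become zero */
        if (((w - ones) & ~w & highs) != 0)
            break;                      /* this word holds a match */
        p += sizeof w;
        n -= sizeof w;
    }

    /* Locate the exact byte, or scan the short tail, bytewise. */
    for (; n > 0; p++, n--) {
        if (*p == target)
            return (void *)p;
    }
    return NULL;
}
```

The word loop only answers "is there a match somewhere in this word", which keeps it independent of byte order; the final byte loop pins down the position. Whether this beats a given vendor's memchr() depends on the machine and compiler, so it should be measured rather than assumed.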
ruediger@ramz.UUCP (Ruediger Helsch) (01/04/90)
In article <1989Dec31.005904.1910@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>On the other hand, many such compilers will generate simple rather than
>optimal copy code for the assignment, while the memcpy() may be well
>optimized once you get past the startup overhead. (In particular, there
>is a naive belief that hardware provisions for fast copy -- string/block
>instructions, "loop mode", etc. -- are always the fastest way to do such
>operations, which is often untrue.

On the other hand, you cannot be sure whether your computer's memcpy() uses the fast hardware instructions. Here is the Ultrix (3.0) implementation of memcpy():

    /*
     * Copy s2 to s1, always copy n bytes.
     * Return s1
     */
    char *
    memcpy(s1, s2, n)
    register char *s1, *s2;
    register int n;
    {
        register char *os1 = s1;

        while (--n >= 0)
            *s1++ = *s2++;
        return (os1);
    }

VAXen do have fast copy instructions, but even copying wordwise would surely be faster than byte after byte.

P.S.: I hope I didn't break any copyrights!
scott@bbxsda.UUCP (Scott Amspoker) (01/04/90)
I recall a '286 C compiler that generated a series of MOVS instructions to move structures. I figured that for small structures this was faster than the overhead required to call a memcpy() routine. Just for yuks one day I defined a structure with a 1000-element int array in it. The C compiler generated tons and tons of MOVS instructions (talk about un-rolling a loop :-).
--
Scott Amspoker
Basis International, Albuquerque, NM
(505) 345-5232
unmvax.cs.unm.edu!bbx!bbxsda!scott