[comp.lang.c] memcpy versus assignment

chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) (12/27/89)

In several books I've seen that assignment of structures is usually
more efficient than using memcpy(), at least on most modern
processors.  I did a few experiments to see if this is true...using
the following short program, I attempted to extract the machine
code produced on different machines.

#include <string.h>	/* declares memcpy() */

struct bozo {
    int one;
    char two;
    long three;
} foo, bar;

int main(void)
{
    foo = bar;		/* structure assignment */
    (void)memcpy((char *)&foo, (char *)&bar, sizeof(struct bozo));
    return 0;
}

On an 8086 CPU, the compiler - MSC5.1 (yuck!) - produces the
following code for the assignment when full optimization is on:

; foo = bar
	lea	di, WORD PTR[bp-8]	; foo
	lea	si, WORD PTR[bp-16]	; bar
	push	ss
	pop	es
	movsw				; the four movsw instructions are more
	movsw				; space/speed efficient than a 
	movsw				; mov cx,sizeof(foo)/2
	movsw				; rep movsw combination....

On a VAX using gcc, the following code is produced:

; foo = bar;
	subl3 $76,fp,sp
	movab -64(fp),r1
	movab -76(fp),r0
	movl $12,r2
	movblk

The VAX naturally produces the more efficient code, but I would
imagine the 8086 would do just as good a job with larger
structures, so that a
	mov cx, sizeof(struct bozo)/2 
	rep movsw 
could be used under appropriate circumstances.

However, this is only half the question.  Does the assignment win
over memcpy?   On the 8086, the following code is produced:

; (void)memcpy((char *)&foo, (char *)&bar, sizeof(struct bozo));
	lea	ax, WORD PTR[bp-16]	; foo
	mov	WORD PTR[bp-18], ax
	mov	cx, 8
	lea	di, WORD PTR[bp-8]	; foo
	lea	si, WORD PTR[bp-16]	; bar
	mov	ax, ss
	shr	cx, 1
	rep	movsw
	adc	cx, cx
	rep	movsb

The compiler is smart enough to make memcpy an intrinsic function,
so as to avoid a costly call instruction.  On the VAX, a call to
memcpy (or in this case bcopy(), which is the same thing) was
produced, so I wasn't able to analyze the code directly. However,
using gcc on bcopy.c produces the following code:

.globl _bcopy
_bcopy:
	.word 0x0
	movl 4(fp),r4
	movl 8(fp),r3
	movl 12(fp),r2
	tstl r2
	jeql L1
	cmpl r4,r3
	jeql L1
L2:
	decl r2
	tstl r2
	jneq L2
L4:
	movl r3,r0
	addl2 $4,r3
	movl r4,r1
	addl2 $4,r4
	movl (r1),(r0)
	decl r2
	tstl r2
	jneq L4
	ret

That seems like quite a bit of code compared to the assignment.  However,
in almost all C code I have seen written, comments always state
something along the lines of "/* use memcpy for structures larger
than int */" which seems to go against the results shown above.
In _general_ what is the rule for the assignment of two large
structures?  memcpy vs. assignment?  Which is generally better?

chris@mimsy.umd.edu (Chris Torek) (12/27/89)

In article <1657@uwm.edu> chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) writes:
>On a VAX using gcc, the following code is produced:
>
>; foo = bar;
>	subl3 $76,fp,sp
>	movab -64(fp),r1
>	movab -76(fp),r0
>	movl $12,r2
>	movblk

Your `VAX' GCC is producing Tahoe instructions.  (The Tahoe movblk
instruction corresponds fairly well to movc3 on the VAX.)

As to the original question: if

	#define pointer_t char * /* or void * */
	struct foo src, dst;

	(void) memcpy((pointer_t)&dst, (pointer_t)&src, len);

is faster than

	dst = src;

you have a really stupid compiler, since the assignment could be
treated internally as a call to memcpy.  (In analysing an assignment
statement, compilers could replace the tree

	(assign (name dst) (name src))

with the tree

	(cast void
	      (call (name memcpy)
		    (cast pointer_t (addressof (name dst)))
		    (cast pointer_t (addressof (name src)))
		    (constant (sizeof (structtype foo)))))

which is exactly what it would have built for the memcpy line above.)
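
A minimal sketch of that equivalence, assuming the struct bozo from the
original article.  The function names copy_by_assignment, copy_by_memcpy,
and same_members are hypothetical, chosen only for illustration; whether
a given compiler emits the same code for both is up to that compiler.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The structure from the original article. */
struct bozo {
    int one;
    char two;
    long three;
};

/* Copy using plain structure assignment. */
void copy_by_assignment(struct bozo *dst, const struct bozo *src)
{
    assert(dst != NULL && src != NULL);
    *dst = *src;
}

/* Copy using memcpy(); dst and src must not overlap. */
void copy_by_memcpy(struct bozo *dst, const struct bozo *src)
{
    assert(dst != NULL && src != NULL);
    memcpy(dst, src, sizeof *dst);
}

/* Compare member by member; padding bytes are deliberately ignored,
 * since assignment is not required to copy them. */
int same_members(const struct bozo *a, const struct bozo *b)
{
    return a->one == b->one && a->two == b->two && a->three == b->three;
}
```

Either function leaves the destination with the same member values; the
difference, if any, is only in the code generated for the copy.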
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

tim@nucleus.amd.com (Tim Olson) (12/30/89)

In article <1657@uwm.edu> chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) writes:
| In several books I've seen that assignment of structures is usually
| more efficient than using memcpy(), at least on most modern
| processors.  I did a few experiments to see if this is true...using
| the following short program, I attempted to extract the machine
| code produced on different machines.

  [ code examples deleted ]

| However, this is only half the question.  Does the assignment win
| over memcpy?   On the 8086, the following code is produced:
| In _general_ what is the rule for the assignment of two large
| structures?  memcpy vs. assignment?  Which is generally better?

Assignment should *always* perform at least as well as a call to
memcpy.  The compiler knows (at compile time)
all of the sizes and alignments of the structures being copied, so it
can choose the most efficient assignment method for a given processor.
For example, if the structures are aligned and padded to 4-byte
boundaries on a 32-bit processor, then 32-bit load/store instructions
can be used to copy the structure 4 bytes at a time.  A
general-purpose runtime routine such as memcpy must perform the copy a
byte at a time, or perform runtime checks on the size and alignments
of the memory areas being copied.
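
The runtime checks such a routine needs might look something like the
sketch below.  The names is_word_aligned and pick_copy_strategy, and the
choice of a 4-byte word, are assumptions made for illustration and do not
come from any real library.  For a structure assignment the compiler can
answer the same questions at compile time and emit only the chosen path.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed word size for this sketch: 4 bytes, as on a 32-bit processor. */
#define WORD_SIZE (sizeof(uint32_t))

enum copy_strategy { COPY_BYTES, COPY_WORDS };

/* Nonzero when p sits on a word boundary.  The pointer-to-integer
 * conversion is implementation-defined, but on common flat-address
 * machines it yields the address. */
int is_word_aligned(const void *p)
{
    return (uintptr_t)p % WORD_SIZE == 0;
}

/* Decide, at run time, whether a copy of n bytes could proceed a word
 * at a time.  A general-purpose routine sees only two pointers and a
 * length, so it has to test all three on every call. */
enum copy_strategy pick_copy_strategy(const void *dst, const void *src,
                                      size_t n)
{
    assert(dst != NULL && src != NULL);
    if (n >= WORD_SIZE && n % WORD_SIZE == 0 &&
        is_word_aligned(dst) && is_word_aligned(src))
        return COPY_WORDS;
    return COPY_BYTES;
}
```

A real routine would usually copy a misaligned head and tail bytewise and
the middle wordwise rather than falling back entirely; the sketch shows
only the decision that has to be made at run time.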

If the compiler recognizes and inlines library routines, then a call
to memcpy() may be as fast as structure assignment, but you are better
off using the assignment, because:

    1) it will be more efficient on many machine/compiler combinations

    2) it expresses the programmer's intent more clearly


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

henry@utzoo.uucp (Henry Spencer) (12/31/89)

In article <1657@uwm.edu> chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) writes:
>In several books I've seen that assignment of structures is usually
>more efficient than using memcpy(), at least on most modern
>processors...
>In _general_ what is the rule for the assignment of two large
>structures?  memcpy vs. assignment?  Which is generally better?

With a good and fully modern compiler, there is no reason why there should
be any difference in efficiency.  It's the same operation, a block copy.
An ANSI C implementation is entitled to recognize memcpy() and produce 
inline code, although it might have to invest significant effort to be sure
that certain helpful constraints which are implicit in assignment are being
observed by the memcpy.

With poor or old compilers, it's an open question.  Many such compilers
will not inline memcpy(), and the function-call overhead will hurt.  On
the other hand, many such compilers will generate simple rather than
optimal copy code for the assignment, while the memcpy() may be well
optimized once you get past the startup overhead.  (In particular, there
is a naive belief that hardware provisions for fast copy -- string/block
instructions, "loop mode", etc. -- are always the fastest way to do such
operations, which is often untrue.  The clever tricks that can get you
a factor of 2 or more over hardware instructions are more often found
in library routines, because modifying compilers is harder and few
benchmarks use struct assignment much.)  (Lest anyone think I'm kidding
about the factor of 2, I've got an experimental memchr() which, on long
strings, beats every manufacturer's implementation we've tested by a
factor of at least 2 and usually 3-4... despite being written in portable
C rather than assembler.)

Unless efficiency is crucial, in which case you're probably tuning to match
the characteristics of a specific compiler anyway, you should use the form
which expresses your intent better and communicates it more clearly to the
compiler.  I.e., if you want to assign a structure, use assignment.
-- 
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1989: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

henry@utzoo.uucp (Henry Spencer) (01/03/90)

In article <1989Dec31.005904.1910@utzoo.uucp> I wrote:
>... (Lest anyone think I'm kidding
>about the factor of 2, I've got an experimental memchr() which, on long
>strings, beats every manufacturer's implementation we've tested by a
>factor of at least 2 and usually 3-4... despite being written in portable
>C rather than assembler.)

Several people have written asking about memchr.  It, along with similar
speedups for a lot of the other string functions, will be in the second
release of my freely-redistributable string library.  The release date is
somewhat uncertain, although "sometime in spring" would be a safe guess.
I need to get the C News to-do list under control before I can spare much
time for strings.

The stuff currently isn't in particularly good shape for distribution.
It works, but the source is a mess, and I'm still experimenting with
further optimizations.  There is nothing proprietary about it, but I'd
prefer not to release it widely until it's cleaned up.

The crucial trick, incidentally, is that one does the search a word at a
time rather than a byte at a time.
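
One way a word-at-a-time search can be written in portable C is sketched
below.  This is not the library code described above, which was not
posted; the name memchr_wordwise is hypothetical, and the sketch assumes
8-bit bytes and an available uint32_t.  It uses the well-known test that
(w - 0x01010101) & ~w & 0x80808080 is nonzero exactly when some byte of
w is zero.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical word-at-a-time memchr(); assumes 8-bit bytes. */
void *memchr_wordwise(const void *s, int c, size_t n)
{
    const unsigned char *p = s;
    const unsigned char target = (unsigned char)c;
    const uint32_t ones = 0x01010101u;      /* 0x01 in every byte */
    const uint32_t highs = 0x80808080u;     /* 0x80 in every byte */
    const uint32_t pattern = ones * target; /* target in every byte */

    assert(s != NULL || n == 0);

    /* Examine four bytes per iteration. */
    while (n >= sizeof(uint32_t)) {
        uint32_t w;
        memcpy(&w, p, sizeof w);  /* load that is safe for any alignment */
        w ^= pattern;             /* bytes equal to target become zero */
        if ((w - ones) & ~w & highs)
            break;                /* some byte in this word matched */
        p += sizeof w;
        n -= sizeof w;
    }

    /* Locate the exact byte in the flagged word, or scan the tail. */
    while (n > 0) {
        if (*p == target)
            return (void *)p;
        p++;
        n--;
    }
    return NULL;
}
```

The fixed-size memcpy() into w is there only to keep the load well
defined for unaligned input; many compilers turn it into a single load.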
-- 
1972: Saturn V #15 flight-ready|     Henry Spencer at U of Toronto Zoology
1990: birds nesting in engines | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

ruediger@ramz.UUCP (Ruediger Helsch) (01/04/90)

In article <1989Dec31.005904.1910@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>On the other hand, many such compilers will generate simple rather than
>optimal copy code for the assignment, while the memcpy() may be well
>optimized once you get past the startup overhead.  (In particular, there
>is a naive belief that hardware provisions for fast copy -- string/block
>instructions, "loop mode", etc. -- are always the fastest way to do such
>operations, which is often untrue. 

On the other hand, you cannot be sure whether your computer's memcpy() uses
the fast hardware instructions. Here follows the Ultrix (3.0) implementation
of memcpy(): 

/*
 * Copy s2 to s1, always copy n bytes.
 * Return s1
 */
char *
memcpy(s1, s2, n)
register char *s1, *s2;
register int n;
{
	register char *os1 = s1;

	while (--n >= 0)
		*s1++ = *s2++;
	return (os1);
}

VAXen do have fast copy instructions, but even copying wordwise would surely 
be faster than byte after byte. 

P.S.: I hope I didn't break any copyrights!

scott@bbxsda.UUCP (Scott Amspoker) (01/04/90)

I recall a '286 C compiler that generated a series of MOVS instructions
to move structures.  I figured that for small structures this was
faster than the overhead required to call a memcpy() routine.  Just
for yuks one day I defined a structure with a 1000-element int array
in it.  The C compiler generated tons and tons of MOVS instructions (talk
about un-rolling a loop :-).

-- 
Scott Amspoker
Basis International, Albuquerque, NM
(505) 345-5232
unmvax.cs.unm.edu!bbx!bbxsda!scott