[comp.lang.c] faster bcopy using duffs device

stergios@Jessica.stanford.edu (stergios marinopoulos) (09/08/89)

I wanted a faster bcopy, so I used duffs device as a basis for it.  In
addition, it copies ints at a time instead of chars, and the loop is
unrolled  a little too.  Its been working well for me today, so it has
to be perfect right?

I have been seeing 4X speed ups, so I thought I would pass it along.

A potential problem is the char*'s not being alligned, but I have not
run into it.  Also, this probably will not copy strings smaller than
32 bytes (no problem for me, I wanted to copy megs-o-stuff.)

Let me know what you think.  Of the code or anything else for that
matter.

sm

**********************************************************************


#define IFACTOR 4

dcopy(chardest, charsrc, size)
	char *chardest, *charsrc ;
	int size ;
{
	register int *src, *dest, intcount ;
	int startcharcpy, intoffset, numints2cpy, i ;

	numints2cpy = size >> 2 ;
	startcharcpy = numints2cpy << 2 ;
	intcount = numints2cpy & ~(IFACTOR-1) ;
	intoffset = numints2cpy - intcount ;

	src = (int *)(((int) charsrc) + intcount*sizeof(int*)) ;
	dest = (int *)(((int) chardest) + intcount*sizeof(int*)) ;

	/* copy the ints */
	switch(intoffset)
		do {
		case 0: dest[3] = src[3] ;
		case 3: dest[2] = src[2] ;
		case 2: dest[1] = src[1] ;
		case 1: dest[0] = src[0] ;
			intcount -= IFACTOR ;
			dest -= IFACTOR ;
			src -= IFACTOR ;
		} while (intcount >= 0) ;

	/* copy the chars left over by the int copy at the end */
	for(i=startcharcpy ; i<size ; i++)
		chardest[i] = charsrc[i] ;
}

chris@mimsy.UUCP (Chris Torek) (09/08/89)

In article <5180@portia.Stanford.EDU> stergios@Jessica.stanford.edu
(stergios marinopoulos) writes:
>I wanted a faster bcopy, so I used duffs device as a basis for it.

bcopy() should be written in assembly (on most processors), put in
a library, and forgotten about, because---for instance---a dbra loop
beats a Duff loop on a 68010, every time.  (And on a 68000, a loop
using movml is best.  68020s have an I-cache, so a hand-coded `Duffish'
loop is a good bet.  Some VAXen have a special instruction which does
a good job.  [movc3 is done in software on the 610.]  `rep movsb' [or
is there a `movsw'?] is best on an 80x86.  LDIR is best on a Z80.  A
Duff-style loop is probably best on a PDP-11.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

mike@thor.acc.stolaf.edu (Mike Haertel) (09/09/89)

In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>bcopy() should be written in assembly (on most processors), put in
>a library, and forgotten about, because---for instance---a dbra loop
>beats a Duff loop on a 68010, every time.  (And on a 68000, a loop
>using movml is best.  68020s have an I-cache, so a hand-coded `Duffish'
>loop is a good bet.  Some VAXen have a special instruction which does
>a good job. [ . . . ]

I just tried the obvious bcopy "while (n--) *s++ = *d++;"
on a 68010 using gcc.  It produced a dbra loop that beat
the sh*t out of the supposedly carefully handcoded one
in the C library.  (Which is a Duffish sort of thing that
tries to copy fullwords at a time.  Not Duff's device,
but structurally similar.)

If you have a halfway decent compiler, I bet a lot of
the string routines will compile to excellent code using
just the obvious C implementations.
-- 
Mike Haertel <mike@stolaf.edu>
``There's nothing remarkable about it.  All one has to do is hit the right
  keys at the right time and the instrument plays itself.'' -- J. S. Bach

poffen@lookitthat (Russ Poffenberger) (09/09/89)

In article <5180@portia.Stanford.EDU> stergios@Jessica.Stanford.EDU (stergios marinopoulos) writes:
>I wanted a faster bcopy, so I used duffs device as a basis for it.  In
>addition, it copies ints at a time instead of chars, and the loop is
>unrolled  a little too.  Its been working well for me today, so it has
>to be perfect right?
>
>A potential problem is the char*'s not being alligned, but I have not
>run into it.  Also, this probably will not copy strings smaller than

This will definitely NOT work on some risc machines (namely SPARC for one)
that have strict alignment requirements.

Russ

karl@haddock.ima.isc.com (Karl Heuer) (09/09/89)

In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>bcopy() should be written in assembly (on most processors), put in
>a library, and forgotten about...

Yes.  (And it should be called memcpy(), that being the standard name; of
course bcopy() can be an alternate synonym if it's done properly.)  But if you
really want to blow the socks off the benchmarks, you should have the compiler
recognize memcpy() as a builtin.  This not only allows it to be inlined, but
also makes possible certain compile-time optimizations such as constant length
move (for which unrolling might be a win) and known-alignment pointers (for
which it could safely copy larger-sized chunks).

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

bvs@light.uucp (Bakul Shah) (09/09/89)

In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>In article <5180@portia.Stanford.EDU> stergios@Jessica.stanford.edu
>(stergios marinopoulos) writes:
>>I wanted a faster bcopy, so I used duffs device as a basis for it.
>
>bcopy() should be written in assembly (on most processors), put in
>a library, and forgotten about, because---for instance---a dbra loop
>beats a Duff loop on a 68010, every time.

A couple more points.

Even on a single processor different trade-offs exist for different
amount of copying (e.g. use of movems on a 68000 for large copies) or
different alignments (e.g. word copies when src,dst are word aligned,
something else when they are not).  Vendors providing stdlib should mess
with such details.

It is preferable to use standard functions whenever possible (memcpy
instead of bcopy), since ANSI compilers can optimize them much better.
For instance, on a particular machine a compiler may choose to inline
something like

	memcpy(void * dst, void * src, unsigned count)
	{
		/* copy right here for small counts */
		if (count < BREAKEVENCOUNT) {
			char * d = (char *)dst;
			char * s = (char *)src;
			unsigned c = count;
			while (c-- != 0)
				*d++ = *s++;
			return dst;
		}
		/* call a function depending on relative alignment */
		return
			(((unsigned)dst&3) == ((unsigned)src&3) ?
			 __alignedcpy : __unalignedcpy
			) (dst, src, count);
	}

Ain't that Disgusting!  Even more can be done if the count happens
to be a constant -- though this case can only be handled in a compiler
(as there is no preprocessor equiv. of #if defined(xxx) for detecting
constants).  Inlining is especially useful when small amounts have to
be copied.  Anyway, it is best to hide this in a compiler or stdlib.h.

To give you a datapoint, on a AMD29000 such tricks cut time down from
about 10 cycles/byte in C code to under 0.7 cycles/byte for aligned src,
dst and 0.9 cycle/byte for unaligned src, dst (for copying about 100
bytes).  For very large copies it is possible to approach 29k's limit of
0.5 cycles/bytes within 5% -- assuming data memory can stream.

-- Bakul Shah <..!{ames,sun,ucbvax,uunet}!amdcad!light!bvs>

chris@mimsy.UUCP (Chris Torek) (09/09/89)

>In article <19473@mimsy.UUCP> I suggest that
>>bcopy() should be written in assembly (on most processors), put in
>>a library, and forgotten about ....

In article <5603@thor.acc.stolaf.edu> mike@thor.acc.stolaf.edu (Mike Haertel)
writes:
>I just tried the obvious bcopy "while (n--) *s++ = *d++;"
>on a 68010 using gcc.  It produced a dbra loop that beat
>the sh*t out of the supposedly carefully handcoded one
>in the C library.  (Which is a Duffish sort of thing that
>tries to copy fullwords at a time.  Not Duff's device,
>but structurally similar.)

Alas, gcc is rather the exception, although this is changing (more
vendors are discovering gcc!), and it is disappointing how many
supposedly carefully handcoded simple-but-time-consuming routines
(bcopy, bzero, bcmp; or on SysV, memmove, memset, memcmp) were actually
thrown together by someone who had no idea what was optimal.

Just for fun, here is my bcopy-in-C for a 68010, assuming you have a
compiler that is as smart as gcc.  (In fact, it has been tested with
gcc, and, at least with the gcc on our Suns, compilers very nicely,
aside from one #ifdef.  Curiously, the gcc-specific hack is necessary
only for one of the two loops.)

/*
 * Copyright (c) 1989 by the University of Maryland Computer Science
 * Department.  All rights reserved.  Permission to copy for any
 * purpose is hereby granted so long as the copyright notice remains
 * intact.
 */

#ifdef test
typedef unsigned int size_t;
#else
#include <stddef.h>
#endif

/*
 * Block copy from src to dst, len bytes.
 * Allows misaligned copies (with attendant performance penalty)
 * and allows overlapping regions.
 *
 * We make the asumption that <read,write,read,write> cycles are
 * as fast as <read,read,write,write> cycles so that we can move
 * shortwords at a time.  If this is not true (e.g., if writes are
 * done `instantly' through a write buffer), this is suboptimal,
 * and the loop should be changed to move longwords at a time.
 */
void bcopy(register char *src, register char *dst, register size_t len) {
	register size_t l2;

	if (len == 0)
		return;
	/* check for potential overlap */
	if ((unsigned long)src + len > (unsigned long)dst) {
		/* copy backwards */
		src += len;
		dst += len;
		/* get operands word-aligned */
		if ((long)src & 1)
			*--dst = *--src;
		if ((long)dst & 1) {
			/* cannot word-align both operands */
			while (--len != (size_t)-1)
				*--dst = *--src;
			return;
		}
		/* copy words until no words left */
		l2 = len >> 1;
		while (--l2 != (size_t)-1) {
#ifndef __GNUC__
			/* for some reason, gcc does not optimise this */
			*(short *)dst = *(short *)src;
			dst -= sizeof (short);
			src -= sizeof (short);
#else
			/* so we use a gcc extension */
			*--(short *)dst = *--(short *)src;
#endif
		}
		/* copy last byte, if any */
		if (len & 1)
			*--dst = *--src;
	} else {
		/* copy forwards, otherwise identical */
		if ((long)src & 1)
			*dst++ = *src++;
		if ((long)dst & 1) {
			while (--len != (size_t)-1)
				*dst++ = *src++;
			return;
		}
		l2 = len >> 1;
		while (--l2 != (size_t)-1) {
			*(short *)dst = *(short *)src;
			dst += sizeof (short);
			src += sizeof (short);
		}
		if (len & 1)
			*dst = *src;
	}
}
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

scs@hstbme.mit.edu (Steve Summit) (09/09/89)

In article <14558@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>bcopy() should be written in assembly (on most processors), put in
>>a library, and forgotten about...
>
>Yes.  (And it should be called memcpy(), that being the standard name;

Actually, when ANSIifying code containing calls to bcopy, memmove
would be a safer replacement, unless or until you inspect the
code to see if the usage needs an overlap guarantee.  Under 4bsd,
at least (some versions, anyway), bcopy was guaranteed to work
correctly if the source and destination overlapped.  memmove has
this guarantee; memcpy does not.  (It seems it would have been
nice to add this guarantee to memcpy rather than cluttering the
library with another call...)

                                           Steve Summit

gwyn@smoke.BRL.MIL (Doug Gwyn) (09/09/89)

In article <14174@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes:
>(It seems it would have been nice to add this guarantee to memcpy
>rather than cluttering the library with another call...)

There are computers for which memcpy() can be compiled into very efficient
in-line code, while memmove() behavior required considerably more code.
Since there were valid arguments both ways, the committee decided to let
the programmer specify the preferred behavior instead of forcing one of
them on him.

henry@utzoo.uucp (Henry Spencer) (09/10/89)

In article <5603@thor.acc.stolaf.edu> mike@thor.stolaf.edu () writes:
>I just tried the obvious bcopy "while (n--) *s++ = *d++;"
>on a 68010 using gcc.  It produced a dbra loop that beat
>the sh*t out of the supposedly carefully handcoded one
>in the C library.  (Which is a Duffish sort of thing...

The odds are pretty good that the library one was built for a 68020 or 68000.
The various 68XXXs are more or less the same architecture, but performance
tradeoffs are *very* different.  The simple tight loop with autoincrement
and decrement usually wins on the 010 because of the "loop mode" feature
of the hardware.  On a 68000 or 68020, not so.

>If you have a halfway decent compiler, I bet a lot of
>the string routines will compile to excellent code using
>just the obvious C implementations.

This depends *a lot* on exactly what hardware you've got.  And cleverness
is a win regardless, if it's the right cleverness.  Copying a word at a
time rather than a byte at a time, size and alignment permitting, in a
"while (n--) *s++ = *d++;" loop will be a big win even on a 68010.
-- 
V7 /bin/mail source: 554 lines.|     Henry Spencer at U of Toronto Zoology
1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

bright@Data-IO.COM (Walter Bright) (09/12/89)

In article <5603@thor.acc.stolaf.edu> mike@thor.stolaf.edu () writes:
<In article <19473@mimsy.UUCP< chris@mimsy.UUCP (Chris Torek) writes:
<<bcopy() should be written in assembly (on most processors), put in
<<a library, and forgotten about
<I just tried the obvious bcopy "while (n--) *s++ = *d++;"
<on a 68010 using gcc.  It produced a dbra loop that beat
<the sh*t out of the supposedly carefully handcoded one
<in the C library.

Why not contribute your improved version to GNU? That way, *users* of GNU
C will get the advantage of your faster code, instead of just netters.
If the compiler library benefits from successive waves of programmers
improving on it, eventually it will get to the point where it can be
just used with the confidence that it's the best it can be.

tkacik@rphroy.UUCP (Tom Tkacik) (09/12/89)

In article <19491@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>
>Alas, gcc is rather the exception, although this is changing (more
>vendors are discovering gcc!), and it is disappointing how many ...
>
>Just for fun, here is my bcopy-in-C for a 68010, assuming you have a
>compiler that is as smart as gcc.  (In fact, it has been tested with
>gcc, and, at least with the gcc on our Suns, compilers very nicely,
>aside from one #ifdef.  Curiously, the gcc-specific hack is necessary
>only for one of the two loops.)
>
>#ifndef __GNUC__
>			/* for some reason, gcc does not optimise this */
>			*(short *)dst = *(short *)src;
>			dst -= sizeof (short);
>			src -= sizeof (short);
>#else
>			/* so we use a gcc extension */
>			*--(short *)dst = *--(short *)src;
>#endif

Do these two program fragments really do the same thing?
Does not the non-GCC code decrement dst and src after doing the move,
while the GCC code decrements before doing the move.

Could this be why gcc does not optimize the former.
The 68010 does not have a post autoincrement addressing mode.

I think that the gcc code should be
			*(short *)dst-- = *(short *)src--;

which is still not optimizable (at least it should not generate the
code that Chris would like it to :-)).

Or is that gcc-extension doing something I do not see?
What am I missing?  Perhaps a RTFM is in order!
	
I just did that and do not see what extenstion Chris is referring to.
Chris, which is it?
~

-- 
---
Tom Tkacik		GM Research Labs,   Warren MI  48090
uunet!edsews!rphroy!megatron!tkacik
"If you can't stand the bugs, stay out of the roach-motel."  Ron Guilmette

chris@mimsy.UUCP (Chris Torek) (09/12/89)

>In article <19491@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>>Just for fun, here is my bcopy-in-C for (gcc+68010)
>>
>>#ifndef __GNUC__
>>			/* for some reason, gcc does not optimise this */
>>			*(short *)dst = *(short *)src;
>>			dst -= sizeof (short);
>>			src -= sizeof (short);
>>#else
>>			/* so we use a gcc extension */
>>			*--(short *)dst = *--(short *)src;
>>#endif

In article <16902@rphroy.UUCP> tkacik@rphroy.UUCP (Tom Tkacik) writes:
>Do these two program fragments really do the same thing?

OOPS!  I did not need the gcc-specific grungy hack after all.  The
first three (#ifndef __GNUC__) lines *should* have read:

	dst -= sizeof (short);
	src -= sizeof (short);
	*(short *)dst = *(short *)src;

which gcc is perfectly willing to optimise.

>I think that the gcc code should be
>			*(short *)dst-- = *(short *)src--;

No, this is legal C, and means to:

	evaluate dst, evaluate src;
	sometime in the near future, assign dst-1 to dst and
		src-1 to src;
	fetch a short from the location named by the
		conversion of the original value of the
		character pointer `src' to a pointer-to-short
		(a machine-dependent operation);
	store that short to the location named by the
		conversion of ... `dst' (another machine-dependent
		operation);
	<sequence point>: finish side effects on dst and src.

The gcc-specific hack is not legal C; it means to

	pretend dst and src are pointers to short, rather than
		pointers to char (this operation is not only
		machine-dependent, it is hopeless on some machines,
		e.g., those where `char *' is 48 bits long but
		`short *' is 32 bits long);
	assign dst-sizeof(short) to dst and src-sizeof(short)
		to src, sometime in the near future;
	fetch a short from the location named by the result of
		the decrement of the (pretended word pointer) src;
	store that short in the location named by the result ... dst;
	<sequence point>: finish side effects on dst and src.

>I just did that and do not see what extenstion Chris is referring to.
>Chris, which is it?

The one that says that a cast on a pointer is a type pun, rather than
a conversion.  This, if nothing else, tells why gcc cannot be used to
compile C code for a Prime (32 bit word pointers, 48 bit char pointers).
(In fact, gcc assumes internally that sizeof(any-pointer) == sizeof(int)
and that all pointers are `byte pointers', as far as I can tell.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

BEAR@S34.Prime.COM (09/12/89)

>In article <19491@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>                 This, if nothing else, tells why gcc cannot be used to
>compile C code for a Prime (32 bit word pointers, 48 bit char pointers).
>(In fact, gcc assumes internally that sizeof(any-pointer) == sizeof(int)
>and that all pointers are `byte pointers', as far as I can tell.)

FYI, this is not necessarily true. Without going into lots of details,
our 32IX mode C compiler uses 32 bit pointers for *everything*. This
compiler has been available for about 6 years and can be used on any
of our current 50-series processors. Older machines (shipped before
1983, non-IX mode) do indeed require 48 bit pointers to address anything
smaller than a halfword (the 48 bit pointers are actually bit pointers!).

Bob Beckwith
Prime Computer, Inc.
(508)879-2960 x 4209
bear@s34.prime.com