stergios@Jessica.stanford.edu (stergios marinopoulos) (09/08/89)
I wanted a faster bcopy, so I used duffs device as a basis for it. In addition, it copies ints at a time instead of chars, and the loop is unrolled a little too. Its been working well for me today, so it has to be perfect right? I have been seeing 4X speed ups, so I thought I would pass it along. A potential problem is the char*'s not being alligned, but I have not run into it. Also, this probably will not copy strings smaller than 32 bytes (no problem for me, I wanted to copy megs-o-stuff.) Let me know what you think. Of the code or anything else for that matter. sm ********************************************************************** #define IFACTOR 4 dcopy(chardest, charsrc, size) char *chardest, *charsrc ; int size ; { register int *src, *dest, intcount ; int startcharcpy, intoffset, numints2cpy, i ; numints2cpy = size >> 2 ; startcharcpy = numints2cpy << 2 ; intcount = numints2cpy & ~(IFACTOR-1) ; intoffset = numints2cpy - intcount ; src = (int *)(((int) charsrc) + intcount*sizeof(int*)) ; dest = (int *)(((int) chardest) + intcount*sizeof(int*)) ; /* copy the ints */ switch(intoffset) do { case 0: dest[3] = src[3] ; case 3: dest[2] = src[2] ; case 2: dest[1] = src[1] ; case 1: dest[0] = src[0] ; intcount -= IFACTOR ; dest -= IFACTOR ; src -= IFACTOR ; } while (intcount >= 0) ; /* copy the chars left over by the int copy at the end */ for(i=startcharcpy ; i<size ; i++) chardest[i] = charsrc[i] ; }
chris@mimsy.UUCP (Chris Torek) (09/08/89)
In article <5180@portia.Stanford.EDU> stergios@Jessica.stanford.edu (stergios marinopoulos) writes: >I wanted a faster bcopy, so I used duffs device as a basis for it. bcopy() should be written in assembly (on most processors), put in a library, and forgotten about, because---for instance---a dbra loop beats a Duff loop on a 68010, every time. (And on a 68000, a loop using movml is best. 68020s have an I-cache, so a hand-coded `Duffish' loop is a good bet. Some VAXen have a special instruction which does a good job. [movc3 is done in software on the 610.] `rep movsb' [or is there a `movsw'?] is best on an 80x86. LDIR is best on a Z80. A Duff-style loop is probably best on a PDP-11.) -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
mike@thor.acc.stolaf.edu (Mike Haertel) (09/09/89)
In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes: >bcopy() should be written in assembly (on most processors), put in >a library, and forgotten about, because---for instance---a dbra loop >beats a Duff loop on a 68010, every time. (And on a 68000, a loop >using movml is best. 68020s have an I-cache, so a hand-coded `Duffish' >loop is a good bet. Some VAXen have a special instruction which does >a good job. [ . . . ] I just tried the obvious bcopy "while (n--) *s++ = *d++;" on a 68010 using gcc. It produced a dbra loop that beat the sh*t out of the supposedly carefully handcoded one in the C library. (Which is a Duffish sort of thing that tries to copy fullwords at a time. Not Duff's device, but structurally similar.) If you have a halfway decent compiler, I bet a lot of the string routines will compile to excellent code using just the obvious C implementations. -- Mike Haertel <mike@stolaf.edu> ``There's nothing remarkable about it. All one has to do is hit the right keys at the right time and the instrument plays itself.'' -- J. S. Bach
poffen@lookitthat (Russ Poffenberger) (09/09/89)
In article <5180@portia.Stanford.EDU> stergios@Jessica.Stanford.EDU (stergios marinopoulos) writes: >I wanted a faster bcopy, so I used duffs device as a basis for it. In >addition, it copies ints at a time instead of chars, and the loop is >unrolled a little too. Its been working well for me today, so it has >to be perfect right? > >A potential problem is the char*'s not being alligned, but I have not >run into it. Also, this probably will not copy strings smaller than This will definitely NOT work on some risc machines (namely SPARC for one) that have strict alignment requirements. Russ
karl@haddock.ima.isc.com (Karl Heuer) (09/09/89)
In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes: >bcopy() should be written in assembly (on most processors), put in >a library, and forgotten about... Yes. (And it should be called memcpy(), that being the standard name; of course bcopy() can be an alternate synonym if it's done properly.) But if you really want to blow the socks off the benchmarks, you should have the compiler recognize memcpy() as a builtin. This not only allows it to be inlined, but also makes possible certain compile-time optimizations such as constant length move (for which unrolling might be a win) and known-alignment pointers (for which it could safely copy larger-sized chunks). Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
bvs@light.uucp (Bakul Shah) (09/09/89)
In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes: >In article <5180@portia.Stanford.EDU> stergios@Jessica.stanford.edu >(stergios marinopoulos) writes: >>I wanted a faster bcopy, so I used duffs device as a basis for it. > >bcopy() should be written in assembly (on most processors), put in >a library, and forgotten about, because---for instance---a dbra loop >beats a Duff loop on a 68010, every time. A couple more points. Even on a single processor different trade-offs exist for different amount of copying (e.g. use of movems on a 68000 for large copies) or different alignments (e.g. word copies when src,dst are word aligned, something else when they are not). Vendors providing stdlib should mess with such details. It is preferable to use standard functions whenever possible (memcpy instead of bcopy), since ANSI compilers can optimize them much better. For instance, on a particular machine a compiler may choose to inline something like memcpy(void * dst, void * src, unsigned count) { /* copy right here for small counts */ if (count < BREAKEVENCOUNT) { char * d = (char *)dst; char * s = (char *)src; unsigned c = count; while (c-- != 0) *d++ = *s++; return dst; } /* call a function depending on relative alignment */ return (((unsigned)dst&3) == ((unsigned)src&3) ? __alignedcpy : __unalignedcpy ) (dst, src, count); } Ain't that Disgusting! Even more can be done if the count happens to be a constant -- though this case can only be handled in a compiler (as there is no preprocessor equiv. of #if defined(xxx) for detecting constants). Inlining is especially useful when small amounts have to be copied. Anyway, it is best to hide this in a compiler or stdlib.h. To give you a datapoint, on a AMD29000 such tricks cut time down from about 10 cycles/byte in C code to under 0.7 cycles/byte for aligned src, dst and 0.9 cycle/byte for unaligned src, dst (for copying about 100 bytes). For very large copies it is possible to approach 29k's limit of 0.5 cycles/bytes within 5% -- assuming data memory can stream. -- Bakul Shah <..!{ames,sun,ucbvax,uunet}!amdcad!light!bvs>
chris@mimsy.UUCP (Chris Torek) (09/09/89)
>In article <19473@mimsy.UUCP> I suggest that >>bcopy() should be written in assembly (on most processors), put in >>a library, and forgotten about .... In article <5603@thor.acc.stolaf.edu> mike@thor.acc.stolaf.edu (Mike Haertel) writes: >I just tried the obvious bcopy "while (n--) *s++ = *d++;" >on a 68010 using gcc. It produced a dbra loop that beat >the sh*t out of the supposedly carefully handcoded one >in the C library. (Which is a Duffish sort of thing that >tries to copy fullwords at a time. Not Duff's device, >but structurally similar.) Alas, gcc is rather the exception, although this is changing (more vendors are discovering gcc!), and it is disappointing how many supposedly carefully handcoded simple-but-time-consuming routines (bcopy, bzero, bcmp; or on SysV, memmove, memset, memcmp) were actually thrown together by someone who had no idea what was optimal. Just for fun, here is my bcopy-in-C for a 68010, assuming you have a compiler that is as smart as gcc. (In fact, it has been tested with gcc, and, at least with the gcc on our Suns, compilers very nicely, aside from one #ifdef. Curiously, the gcc-specific hack is necessary only for one of the two loops.) /* * Copyright (c) 1989 by the University of Maryland Computer Science * Department. All rights reserved. Permission to copy for any * purpose is hereby granted so long as the copyright notice remains * intact. */ #ifdef test typedef unsigned int size_t; #else #include <stddef.h> #endif /* * Block copy from src to dst, len bytes. * Allows misaligned copies (with attendant performance penalty) * and allows overlapping regions. * * We make the asumption that <read,write,read,write> cycles are * as fast as <read,read,write,write> cycles so that we can move * shortwords at a time. If this is not true (e.g., if writes are * done `instantly' through a write buffer), this is suboptimal, * and the loop should be changed to move longwords at a time. */ void bcopy(register char *src, register char *dst, register size_t len) { register size_t l2; if (len == 0) return; /* check for potential overlap */ if ((unsigned long)src + len > (unsigned long)dst) { /* copy backwards */ src += len; dst += len; /* get operands word-aligned */ if ((long)src & 1) *--dst = *--src; if ((long)dst & 1) { /* cannot word-align both operands */ while (--len != (size_t)-1) *--dst = *--src; return; } /* copy words until no words left */ l2 = len >> 1; while (--l2 != (size_t)-1) { #ifndef __GNUC__ /* for some reason, gcc does not optimise this */ *(short *)dst = *(short *)src; dst -= sizeof (short); src -= sizeof (short); #else /* so we use a gcc extension */ *--(short *)dst = *--(short *)src; #endif } /* copy last byte, if any */ if (len & 1) *--dst = *--src; } else { /* copy forwards, otherwise identical */ if ((long)src & 1) *dst++ = *src++; if ((long)dst & 1) { while (--len != (size_t)-1) *dst++ = *src++; return; } l2 = len >> 1; while (--l2 != (size_t)-1) { *(short *)dst = *(short *)src; dst += sizeof (short); src += sizeof (short); } if (len & 1) *dst = *src; } } -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
scs@hstbme.mit.edu (Steve Summit) (09/09/89)
In article <14558@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes: >In article <19473@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes: >>bcopy() should be written in assembly (on most processors), put in >>a library, and forgotten about... > >Yes. (And it should be called memcpy(), that being the standard name; Actually, when ANSIifying code containing calls to bcopy, memmove would be a safer replacement, unless or until you inspect the code to see if the usage needs an overlap guarantee. Under 4bsd, at least (some versions, anyway), bcopy was guaranteed to work correctly if the source and destination overlapped. memmove has this guarantee; memcpy does not. (It seems it would have been nice to add this guarantee to memcpy rather than cluttering the library with another call...) Steve Summit
gwyn@smoke.BRL.MIL (Doug Gwyn) (09/09/89)
In article <14174@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes: >(It seems it would have been nice to add this guarantee to memcpy >rather than cluttering the library with another call...) There are computers for which memcpy() can be compiled into very efficient in-line code, while memmove() behavior required considerably more code. Since there were valid arguments both ways, the committee decided to let the programmer specify the preferred behavior instead of forcing one of them on him.
henry@utzoo.uucp (Henry Spencer) (09/10/89)
In article <5603@thor.acc.stolaf.edu> mike@thor.stolaf.edu () writes: >I just tried the obvious bcopy "while (n--) *s++ = *d++;" >on a 68010 using gcc. It produced a dbra loop that beat >the sh*t out of the supposedly carefully handcoded one >in the C library. (Which is a Duffish sort of thing... The odds are pretty good that the library one was built for a 68020 or 68000. The various 68XXXs are more or less the same architecture, but performance tradeoffs are *very* different. The simple tight loop with autoincrement and decrement usually wins on the 010 because of the "loop mode" feature of the hardware. On a 68000 or 68020, not so. >If you have a halfway decent compiler, I bet a lot of >the string routines will compile to excellent code using >just the obvious C implementations. This depends *a lot* on exactly what hardware you've got. And cleverness is a win regardless, if it's the right cleverness. Copying a word at a time rather than a byte at a time, size and alignment permitting, in a "while (n--) *s++ = *d++;" loop will be a big win even on a 68010. -- V7 /bin/mail source: 554 lines.| Henry Spencer at U of Toronto Zoology 1989 X.400 specs: 2200+ pages. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
bright@Data-IO.COM (Walter Bright) (09/12/89)
In article <5603@thor.acc.stolaf.edu> mike@thor.stolaf.edu () writes: <In article <19473@mimsy.UUCP< chris@mimsy.UUCP (Chris Torek) writes: <<bcopy() should be written in assembly (on most processors), put in <<a library, and forgotten about <I just tried the obvious bcopy "while (n--) *s++ = *d++;" <on a 68010 using gcc. It produced a dbra loop that beat <the sh*t out of the supposedly carefully handcoded one <in the C library. Why not contribute your improved version to GNU? That way, *users* of GNU C will get the advantage of your faster code, instead of just netters. If the compiler library benefits from successive waves of programmers improving on it, eventually it will get to the point where it can be just used with the confidence that it's the best it can be.
tkacik@rphroy.UUCP (Tom Tkacik) (09/12/89)
In article <19491@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes: > >Alas, gcc is rather the exception, although this is changing (more >vendors are discovering gcc!), and it is disappointing how many ... > >Just for fun, here is my bcopy-in-C for a 68010, assuming you have a >compiler that is as smart as gcc. (In fact, it has been tested with >gcc, and, at least with the gcc on our Suns, compilers very nicely, >aside from one #ifdef. Curiously, the gcc-specific hack is necessary >only for one of the two loops.) > >#ifndef __GNUC__ > /* for some reason, gcc does not optimise this */ > *(short *)dst = *(short *)src; > dst -= sizeof (short); > src -= sizeof (short); >#else > /* so we use a gcc extension */ > *--(short *)dst = *--(short *)src; >#endif Do these two program fragments really do the same thing? Does not the non-GCC code decrement dst and src after doing the move, while the GCC code decrements before doing the move. Could this be why gcc does not optimize the former. The 68010 does not have a post autoincrement addressing mode. I think that the gcc code should be *(short *)dst-- = *(short *)src--; which is still not optimizable (at least it should not generate the code that Chris would like it to :-)). Or is that gcc-extension doing something I do not see? What am I missing? Perhaps a RTFM is in order! I just did that and do not see what extenstion Chris is referring to. Chris, which is it? ~ -- --- Tom Tkacik GM Research Labs, Warren MI 48090 uunet!edsews!rphroy!megatron!tkacik "If you can't stand the bugs, stay out of the roach-motel." Ron Guilmette
chris@mimsy.UUCP (Chris Torek) (09/12/89)
>In article <19491@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes: >>Just for fun, here is my bcopy-in-C for (gcc+68010) >> >>#ifndef __GNUC__ >> /* for some reason, gcc does not optimise this */ >> *(short *)dst = *(short *)src; >> dst -= sizeof (short); >> src -= sizeof (short); >>#else >> /* so we use a gcc extension */ >> *--(short *)dst = *--(short *)src; >>#endif In article <16902@rphroy.UUCP> tkacik@rphroy.UUCP (Tom Tkacik) writes: >Do these two program fragments really do the same thing? OOPS! I did not need the gcc-specific grungy hack after all. The first three (#ifndef __GNUC__) lines *should* have read: dst -= sizeof (short); src -= sizeof (short); *(short *)dst = *(short *)src; which gcc is perfectly willing to optimise. >I think that the gcc code should be > *(short *)dst-- = *(short *)src--; No, this is legal C, and means to: evaluate dst, evaluate src; sometime in the near future, assign dst-1 to dst and src-1 to src; fetch a short from the location named by the conversion of the original value of the character pointer `src' to a pointer-to-short (a machine-dependent operation); store that short to the location named by the conversion of ... `dst' (another machine-dependent operation); <sequence point>: finish side effects on dst and src. The gcc-specific hack is not legal C; it means to pretend dst and src are pointers to short, rather than pointers to char (this operation is not only machine-dependent, it is hopeless on some machines, e.g., those where `char *' is 48 bits long but `short *' is 32 bits long); assign dst-sizeof(short) to dst and src-sizeof(short) to src, sometime in the near future; fetch a short from the location named by the result of the decrement of the (pretended word pointer) src; store that short in the location named by the result ... dst; <sequence point>: finish side effects on dst and src. >I just did that and do not see what extenstion Chris is referring to. >Chris, which is it? The one that says that a cast on a pointer is a type pun, rather than a conversion. This, if nothing else, tells why gcc cannot be used to compile C code for a Prime (32 bit word pointers, 48 bit char pointers). (In fact, gcc assumes internally that sizeof(any-pointer) == sizeof(int) and that all pointers are `byte pointers', as far as I can tell.) -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
BEAR@S34.Prime.COM (09/12/89)
>In article <19491@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes: > This, if nothing else, tells why gcc cannot be used to >compile C code for a Prime (32 bit word pointers, 48 bit char pointers). >(In fact, gcc assumes internally that sizeof(any-pointer) == sizeof(int) >and that all pointers are `byte pointers', as far as I can tell.) FYI, this is not necessarily true. Without going into lots of details, our 32IX mode C compiler uses 32 bit pointers for *everything*. This compiler has been available for about 6 years and can be used on any of our current 50-series processors. Older machines (shipped before 1983, non-IX mode) do indeed require 48 bit pointers to address anything smaller than a halfword (the 48 bit pointers are actually bit pointers!). Bob Beckwith Prime Computer, Inc. (508)879-2960 x 4209 bear@s34.prime.com