rh@smds.UUCP (Richard Harter) (07/26/90)
The macro shown below is an optimized memory-to-memory copy macro. It is probably faster than memcopy on your machine -- I have checked it on several machines and have always found it to be faster. Some may find it instructive as an example of code optimization. Techniques used are: loop unrolling, localized use of register variables, count-down loops, and shift and modulo operators instead of arithmetic operators.

Some general notes: There are two major ways to unroll a loop with a variable remainder. The method used here is to use two loops, one for the remainder and one for the bulk. The other method is to jump into a loop inside a switch (legal but bizarre). The second method has the advantage of not needing the "short" loop for the remainder. However, timing studies show no real difference between the two methods; the overhead of setting up the switch balances off against the higher cost of the short loop. Given two loops, one unrolled by a constant and a short loop handling the remainder, a loop unrolling factor of 8 is optimal under most circumstances.

A count-down loop saves several instructions on a CISC machine, provided that (a) the loop variable is a register variable, and (b) the optimization option is selected. The savings are due to (1) the comparison being against 0 rather than a constant, (2) CISC machines typically having a branch on the result of a comparison, and (3) optimization plus the register declaration avoiding a store of the loop variable. For some compilers there is also a saving from placing the calculation of the limit in the first clause of the loop when the limit is a complex expression. It should be noted that the savings are only significant in tight loops. A count-down loop should have one of the following two forms:

    for (index = (expression)+1; --index > 0;)  {body}
    for (index = (expression);   --index >= 0;) {body}

Plum Hall uses the first form; I use and recommend the second as being clearer.
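As a concrete illustration of the second form, here is a small example of my own (not from the posting) that sums an array with a count-down loop:

```c
/* Illustration of the recommended count-down loop form:
 *     for (index = (expression); --index >= 0;)  {body}
 * The index must be signed: with an unsigned index, --index >= 0 is
 * always true and the loop never terminates. */
static long sum_array(const int *a, int n)
{
    register int i;
    register long total = 0;

    for (i = n; --i >= 0;)   /* compare against 0, predecrement in the test */
        total += a[i];
    return total;
}
```

The predecrement must stay in the test expression; moving it into the loop body reintroduces the compare-against-limit that the form is designed to avoid.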
The second form will fail (infinite loop) if the index is unsigned. However, the savings will be lost in any case if the index is unsigned. These forms should be used as is; the loop optimization is lost if you don't predecrement in the test.

The macros do not include or exploit an alignment test on the grounds that (a) the resulting code is not portable, (b) the overhead of the tests removes the savings in short and medium length copies, (c) many aligned copies can be caught by passing the proper type, and (d) I didn't feel like coding it.

The copyright notice is included solely on the grounds that I feel that if you use somebody else's code you should acknowledge where you got it from.

------------------------ CUT HERE -----------------------------------
/* Copyright (c) 1990, Software Maintenance and Development Systems Inc.
   Permission to copy, to modify, and to distribute, whether for sale
   or not, is granted, provided that a copy of this notice is included
   and is not altered.

   Macro name: gencopy
   Function:   In-memory copy
   Author:     Richard Harter

   Arguments:  dest    Pointer to destination
               src     Pointer to source
               nb      Number of items to copy
               type    Type of items

   Synopsis:   This macro is an optimized in-line memory-to-memory copy
               routine.  The following restrictions apply: (a) if dest
               and src overlap, dest must have a lower address than
               src; (b) the item type must be one for which assignment
               is valid.

   Bugs:       No check is made for negative length.  Results may be
               surprising.
*/
#define gencopy(dest,src,nb,type) {register type *cpy_a;\
    register type *cpy_b;\
    register int cpy_i, cpy_j;\
    cpy_a = (dest); cpy_b = (src); cpy_j = (nb);\
    for (cpy_i = (cpy_j & 7); --cpy_i >= 0;) *cpy_a++ = *cpy_b++;\
    for (cpy_i = (cpy_j >> 3); --cpy_i >= 0;) {\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++;\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++;\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++;\
        *cpy_a++ = *cpy_b++; *cpy_a++ = *cpy_b++; }}

#define copy(dest,src,nb) gencopy(dest,src,nb,char)
------------------------ CUT HERE -----------------------------------
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.
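The "jump into a loop inside a switch" alternative mentioned in the article is better known as Duff's device. Here is a minimal sketch of it in C; the name copy8 is my own, and the guard against a zero count is needed because the device otherwise makes one full pass:

```c
#include <stddef.h>

/* Sketch of the second unrolling method: the switch jumps into the
 * middle of an unrolled do/while loop to handle the remainder, so no
 * separate "short" loop is needed.  Legal, but bizarre. */
static void copy8(char *to, const char *from, size_t count)
{
    size_t passes;

    if (count == 0)
        return;
    passes = (count + 7) / 8;      /* number of trips through the body */
    switch (count % 8) {           /* enter mid-loop for the remainder */
    case 0: do { *to++ = *from++;
    case 7:      *to++ = *from++;
    case 6:      *to++ = *from++;
    case 5:      *to++ = *from++;
    case 4:      *to++ = *from++;
    case 3:      *to++ = *from++;
    case 2:      *to++ = *from++;
    case 1:      *to++ = *from++;
            } while (--passes > 0);
    }
}
```

When the count is known to be positive, the initial test can be dropped, which is how the device usually appears.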
mcdonald@aries.scs.uiuc.edu (Doug McDonald) (07/26/90)
In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>
>The macro shown below is an optimized memory to memory copy macro.
>It is probably faster than memcopy on your machine -- I have checked
>it on several machines and have always found it to be faster.
 !!!!!!

Oh My!

Time on my computer, in seconds, for 1000 copies of a 20 kilobyte array:

                His code    library memcpy
Compiler 1:
  (chars)         12.6           2.7
  (ints)           6.9           2.7

Compiler 2:
  (chars)         23.6           1.3
  (ints)           6.9           1.3

Results were the same for library memmove, aligned or unaligned. Sure looks like his code loses. As I would expect, of course: the library can do the whole 20 kilobytes in one instruction, assuming aligned operands. Unaligned, it takes a few more.

When people post things like this, perhaps they could test them, at least on the most common computers. My computer is an example of the next to the top of the line of the most common processor line(*) the world has ever known. At least, folks, test on one of these!!

Doug McDonald

(*) It is a 20 MHz Dell 310, 80386. The compilers were Microsoft C and Microway NDPC.
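Doug's methodology is easy to reproduce. The sketch below is a minimal harness in the same spirit; the clock()-based timing and all names are my own assumptions (he doesn't say how he timed), and a plain byte loop stands in for the posted macro:

```c
#include <stdio.h>
#include <string.h>
#include <time.h>

#define NBYTES 20000   /* 20 kilobyte array, as in the test above */
#define NREPS  1000    /* 1000 copies, as in the test above */

static char src[NBYTES], dst[NBYTES];

/* Hypothetical stand-in for the posted macro: a plain byte loop. */
static void byte_copy(char *d, const char *s, int n)
{
    while (--n >= 0)
        *d++ = *s++;
}

/* Time NREPS copies of the array both ways and report the totals.
 * Returns 0 if the final copy is byte-for-byte correct. */
static int run_benchmark(void)
{
    clock_t t0, t1, t2;
    int i;

    for (i = 0; i < NBYTES; i++)
        src[i] = (char)i;

    t0 = clock();
    for (i = 0; i < NREPS; i++)
        byte_copy(dst, src, NBYTES);
    t1 = clock();

    for (i = 0; i < NREPS; i++)
        memcpy(dst, src, NBYTES);
    t2 = clock();

    printf("byte loop: %.2fs   library memcpy: %.2fs\n",
           (double)(t1 - t0) / CLOCKS_PER_SEC,
           (double)(t2 - t1) / CLOCKS_PER_SEC);

    return memcmp(dst, src, NBYTES);
}
```

Call run_benchmark() from main(). Note that an optimizer clever enough to notice the repeated identical copies can hollow out either loop, so comparisons like these should be read with that caveat in mind.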
andrew@alice.UUCP (Andrew Hume) (07/27/90)
i hate to interject experimental results into a computer science discussion but i implemented mr harter's `always faster' memcpy just to see. the machine was an SGI 4D/240 running irix 3.2.1. the timing is for 100000 copies of n bytes (n uniform between 0 and 999). results:

    memcpy   2.81s
    harter   9.02s

in my experience, memcpy is something vendors REALLY work on; to suggest that a simple portable implementation would beat the system's memcpy implies you have only tried this on batshit systems.
joe@proto.COM (Joe Huffman) (07/28/90)
In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
> >The macro shown below is an optimized memory to memory copy macro.
> >It is probably faster than memcopy on your machine -- I have checked
> >it on several machines and have always found it to be faster.
> !!!!!!
>
> Oh My!
>
> Time on my computer, in seconds, for 1000 copies of a 20 kilobyte array:
>
>                 His code    library memcpy
> Compiler 1:
>   (chars)         12.6           2.7
>   (ints)           6.9           2.7
>
> Compiler 2:
>   (chars)         23.6           1.3
>   (ints)           6.9           1.3
[Stuff deleted... compilers were Microsoft and Microway NDPC, machine was a 20 MHz 386]

I just ran it on a 20 MHz 386 running SCO UNIX. The timings were done with 5000 copies but then divided by 5 to make the numbers comparable.

                        His code    library memcpy
SCO supplied MSC 5.1
  (chars)                 14.0          2.05
Zortech
  386 code generator   not available    1.80

The reason the macro is so much slower (as Doug alluded to and I deleted) is that the library routines were written in assembly language and tweaked for near optimum speed and/or size. Not something you usually should do if you are trying to write portable code. Hence let the compiler writers do it -- usually they can be trusted to squeeze the most performance out of the machine, especially in a competitive marketplace.

---
Zortech is my major source of income. Statements about them or their competitors cannot be totally without bias.
--
joe@proto.com    uunet!proto!joe    FAX: 208-263-8772
rh@smds.UUCP (Richard Harter) (07/28/90)
In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
> >The macro shown below is an optimized memory to memory copy macro.
> >It is probably faster than memcopy on your machine -- I have checked
> >it on several machines and have always found it to be faster.
> !!!!!!
> Oh My!
[Superior timings for a 20KB move on a 386 by memmove given]

Ouch. I should have phrased that more carefully. Yes, the gentleman is quite right. As is to be expected, a hardware bulk move is always going to beat code that uses item-by-item move instructions. Furthermore, there is nothing to stop a system implementor from using the tightest code possible. (My observation is that the quality of implementations varies a great deal.)

In defense I have to point out that the quoted remark is accurate; timings were made on 680x0 boxes, vaxes (!), and some risc boxes, but not, obviously, on any 386 boxes. Small comfort for the chap who took my posting blindly and put it on his 386. Apologies.

It should also be pointed out that the cited code is a template for tight loops and does win when there isn't a hardware instruction available. Does it matter? Yes, if you are coding for both speed and portability. In this particular case, however, the gencopy macro should include a machine-dependent ifdef which switches to memcpy (memmove) on the appropriate machines. Thanks to Doug for catching this and posting.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb.  This sentence short.  This signature done.
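The machine-dependent ifdef Richard suggests could look something like the sketch below. HAVE_FAST_MEMCPY is a hypothetical build flag of my own invention, and a simple count-down byte loop stands in for the full unrolled gencopy:

```c
#include <string.h>

/* Sketch of the suggested dispatch: use the library routine on
 * machines whose memcpy is known to win (e.g. the 386), and fall back
 * to a portable loop elsewhere.  HAVE_FAST_MEMCPY is a hypothetical
 * flag to be set by the build; it is not part of the original macro. */
#ifdef HAVE_FAST_MEMCPY
#define copybytes(dest, src, nb) memcpy((dest), (src), (nb))
#else
#define copybytes(dest, src, nb) \
    do { \
        register char *cb_d = (char *)(dest); \
        register const char *cb_s = (const char *)(src); \
        register int cb_n = (nb); \
        while (--cb_n >= 0) \
            *cb_d++ = *cb_s++; \
    } while (0)
#endif
```

Either way the caller sees one name with one set of semantics; only the build configuration decides which body is used.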
bruce@seismo.gps.caltech.edu (Bruce Worden) (07/29/90)
In article <1349@proto.COM> joe@proto.COM (Joe Huffman) writes:
>In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
>> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>> >The macro shown below is an optimized memory to memory copy macro.
>> >It is probably faster than memcopy on your machine -- I have checked
>> >it on several machines and have always found it to be faster.
>> !!!!!!
>> Oh My!
>> Time on my computer, in seconds, for 1000 copies of a 20 kilobyte array:
>>                 His code    library memcpy
>> Compiler 1:
>>   (chars)         12.6           2.7
>>   (ints)           6.9           2.7
>> Compiler 2:
>>   (chars)         23.6           1.3
>>   (ints)           6.9           1.3
>[Stuff deleted... compilers were Microsoft and Microway NDPC, machine was
>20 MHz 386]
>
>I just ran it on a 20 MHz 386 running SCO UNIX. The timings were done with
>5000 copies but then divided by 5 to make the numbers comparable.
>                        His code    library memcpy
>SCO supplied MSC 5.1
>  (chars)                 14.0          2.05
>Zortech
>  386 code generator   not available    1.80

Here are the results on some machines I could find the other day. The compilers are the native compilers unless otherwise stated. I used whatever compiler optimizations I could.
20 kbyte arrays, 1000 copies:

                                            Him    memcpy
Sun Sparcstation 1+           chars:        7.6      2.0
                              ints:         2.0      2.0
Sun 4/280                     chars:        9.8      2.8
                              ints:         2.5      2.8
Sun Sparcstation SLC          chars:        9.9      2.6
                              ints:         2.5      2.6
Sun 386i                      chars:        9.5      2.6
                              ints:         2.4      2.6
Sun 3/160                     chars:       13.7      4.5
                              ints:         3.4      4.5
Inmos T800 (Meiko, 25MHz; kind of unfair because of the block_copy instruction)
                              chars:       37.6      1.6
                              ints:         8.4      1.6
i860 (Meiko, 40MHz, Green Hills C-I860 1.8.5, beta assembler 1.41, beta linker 1.2)
                              chars:        2.1      3.9
                              ints:         0.9      3.9
Convex C120 (vector -- yes, his code vectorizes nicely; memcpy not available, used bcopy)
                              chars:        3.0      1.0
                              ints:         1.0      1.0
Convex C120 (scalar; memcpy not available, used bcopy)
                              chars:       28.4      1.5
                              ints:         7.5      1.5
BBN TC2000 (Motorola 88000-based, Green Hills C-88000 2.35(1.8.4))
                              chars:       10.3     12.0
                              ints:         4.9     12.0

In general, I'd say Richard's code does a pretty good job when moving ints, and also when compared to young machines (the BBN and the Meiko i860). In addition, his code is about 20% faster than a simple "for" loop on my Sparc 1+, so it illustrates a useful principle as well. I intend to use it in some selected applications; thanks for posting it.

BIG TIME DISCLAIMER: I in no way intended this to be a comparison of different machines, but of the performance of a piece of C code on each of several different machines. There are a lot of ways to do timings, and most of them aren't very good, so please don't flame me if I didn't do justice to some machine's absolute performance; it is the relative timings that matter. If I screwed that up, flame away (though a nice note explaining the error might be more instructive).

Bruce

P.S. For timing I used getusecclock() on the BBN, ticks() on the Meikos, and getrusage() on everything else.
dwho@nmt.edu (David Olix) (07/29/90)
In article <134@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
>> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>> >The macro shown below is an optimized memory to memory copy macro.
>> >It is probably faster than memcopy on your machine -- I have checked
>> >it on several machines and have always found it to be faster.
>> !!!!!!
>> Oh My!
> [Superior timings for a 20KB move on a 386 by memmove given]
> [Explanation for slower performance on 386 deleted]
>In defense I have to point out that the quoted remark is accurate;
>timings were made on 680x0 boxes, vaxes (!), and some risc boxes, [...]

Just for kicks I did the following test on these machines: a Sun 3/280 (68020) running SunOS 4.1, an HP 9000/835 (HP Precision Architecture RISC) running HP-UX 7.0, and a VAX 750 running 4.3 BSD (don't laugh, it still runs!). Timings are in seconds based on 1000 copies of a 20K char array. Compile commands were: cc -O and gcc -O -fstrength-reduce -finline-functions.

                Macro (copy)        Library (memcpy)
Compiler      Sun   HP   VAX        Sun   HP   VAX
cc             7     7    52         3     1    95  <--
gcc            8    ---   51         2    ---   96

Oddly, the macro *DOES* seem to work faster on the VAX (Gee, so I guess all the world *is* a VAX :-) :-) :-)

--David Olix
dwho@{nmtvax|nmtsun|minos}.nmt.edu
.sig under construction
Any connection between my views and NMIMT's is purely coincidental
tkacik@kyzyl.mi.org (Tom Tkacik) (07/29/90)
In article <134@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> > In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
> > >The macro shown below is an optimized memory to memory copy macro.
> > >It is probably faster than memcopy on your machine -- I have checked
> > >it on several machines and have always found it to be faster.
> > !!!!!!
> > Oh My!
> [Superior timings for a 20KB move on a 386 by memmove given]
> Ouch. I should have phrased that more carefully. Yes, the gentleman
> is quite right. As to be expected, a hardware bulk move is always
> going to beat code that uses item by item move instructions.
> In defense I have to point out that the quoted remark is accurate;
> timings were made on 680x0 boxes, vaxes (!), and some risc boxes,
> but not, obviously, on any 386 boxes.

Oh My! I also tried this, using 1000 copies of 20K blocks. This was performed on an AT&T UnixPC (68010 @ 10MHz). Here are my times:

                       gencopy    memcpy
GCC  (char) 20000     31.3 sec     12.0
     (int)   5000     12.2         12.0
CC   (char) 20000     31.3 sec     12.0
     (int)   5000     12.3         12.0

Both CC and GCC used the library version of memcpy, so there is little surprise that they both took the same time.

> Small comfort for the chap
> who took my posting blindly and put it on his 386. Apologies.

Small comfort for the chap who blindly puts it on a 680x0. I think I will stick to the library version. :-)
--
Tom Tkacik
tkacik@kyzyl.mi.org
Speaking only for myself here in Royal Oak.
hascall@cs.iastate.edu (John Hascall) (07/30/90)
rh@smds.UUCP (Richard Harter) writes:
}mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
}> rh@smds.UUCP (Richard Harter) writes:
}> >The macro shown below is an optimized memory to memory copy macro.
}> >It is probably faster than memcopy on your machine -- I have checked
}> >it on several machines and have always found it to be faster.
}> Oh My!
} [Superior timings for a 20KB move on a 386 by memmove given]
}Ouch. I should have phrased that more carefully. Yes, the gentleman
}is quite right. As to be expected, a hardware bulk move is always
}going to beat code that uses item by item move instructions.
}In defense I have to point out that the quoted remark is accurate;
}timings were made on 680x0 boxes, vaxes (!), and some risc boxes,
                                   ~~~~~

Hmmm, sorry, but here's a little more rain on your parade. VAXstation 3200 (admittedly not the most modern VAX), 1000 20KB copies:

    memcpy:       3.18
    your macro:  15.43

John Hascall / john@iastate.edu / hascall@atanasoff.cs.iastate.edu
john@frog.UUCP (John Woods) (07/31/90)
In article <1990Jul26.144134.16053@ux1.cso.uiuc.edu>, mcdonald@aries.scs.uiuc.edu (Doug McDonald) writes:
> In article <133@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
> >The macro shown below is an optimized memory to memory copy macro.
> >It is probably faster than memcopy on your machine
> Oh My!
> Sure looks like his code loses. As I would expect, of course: The
> library can do the whole 20 kilobytes in one instruction, assuming
> aligned operands. Unaligned, it takes a few more.

Well, it turns out that even without being able to do memory-to-memory moves with a single SPIN_CYCLE instruction, a poor little garden-variety 68020 can do better than the supplied macro. I checked our movebytes() subroutine, and it did about three times better than the macro. It, of course, uses all of the valuable optimization tricks mentioned in Richard's article (reason enough to stash it away), and also takes care of minor little problems like odd-address alignment (which is a problem on the 68000, though not on the '20). movebytes() is, as it happens, written in C, using a compiler (Green Hills) which translates it into really good assembly code.
--
John Woods, Charles River Data Systems, Framingham MA, (508) 626-1101
...!decvax!frog!john, john@frog.UUCP, ...!mit-eddie!jfw, jfw@eddie.mit.edu
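movebytes() itself isn't shown, but the alignment trick John describes can be sketched as below. This is a hypothetical reconstruction of my own, not the Charles River routine: copy leading bytes until the pointers are long-aligned, move long-sized chunks, then finish the tail. (The long accesses through char buffers violate strict aliasing in modern C; 1990 compilers didn't care.)

```c
#include <stddef.h>
#include <stdint.h>

/* Alignment-aware copy sketch.  The bulk word moves are attempted only
 * when dest and src share the same misalignment; otherwise the routine
 * degrades to a plain byte loop. */
static void aligned_copy(void *dest, const void *src, size_t n)
{
    char *d = (char *)dest;
    const char *s = (const char *)src;
    const size_t mask = sizeof(long) - 1;

    if (((uintptr_t)d & mask) == ((uintptr_t)s & mask)) {
        while (n > 0 && ((uintptr_t)d & mask) != 0) {   /* leading bytes */
            *d++ = *s++;
            n--;
        }
        while (n >= sizeof(long)) {                     /* bulk word moves */
            *(long *)(void *)d = *(const long *)(const void *)s;
            d += sizeof(long);
            s += sizeof(long);
            n -= sizeof(long);
        }
    }
    while (n > 0) {                                     /* tail bytes */
        *d++ = *s++;
        n--;
    }
}
```

On a 68000, skipping the alignment fix-up isn't just slow: an odd-address word access traps, which is why a routine like this has to check before moving anything wider than a byte.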
steve@thelake.mn.org (Steve Yelvington) (08/01/90)
[In article <1990Jul28.203800.17258@laguna.ccsf.caltech.edu>, bruce@seismo.gps.caltech.edu (Bruce Worden) writes ... ]
> In general, I'd say Richard's code does a pretty good job when moving int's,
> and also when compared to young machines (the BBN and the Meiko i860.)
> In addition, his code is about 20% faster than a simple "for" loop on my
> Sparc 1+, so it illustrates a useful principle as well. I intend to
> use it in some selected applications, thanks for posting it.

Bruce is one of the few people who seems to have seen the point -- which (to me, anyway) was just an illustration of C coding technique, not a claim that it's possible to beat Brand X compiler's mondo-optimized assembler memcpy().

For collectors of useless numbers, here are results from an 8-megahertz 16-bit Motorola 68000 (Atari ST), 1000 iterations, 20K buffers:

    library memcpy:            17.125 seconds
    Richard's gencopy char:    40.755 seconds
    Richard's gencopy int:     20.385 seconds
    Richard's gencopy long:    15.460 seconds

Details: Sozobon C compiler, dLibs public-domain C library, optimizer turned on. int is 16 bits; long is 32 bits. The dLibs memcpy is coded in assembler, moves 16-bit words when possible, and DOES check for overlaps (as in memmove). The copy is a simple loop. Loading and dumping registers with movem.l might be faster; I have not tried.
--
Steve Yelvington at the (rain-replenished) lake in Minnesota
steve@thelake.mn.org
boyd@necisa.ho.necisa.oz (Boyd Roberts) (08/01/90)
When will these exercises in futility end? How many times have we seen the `my hack beats the library routine' postings? And how many times have we seen such misguided claims proved to be false? Too many times!

It should be obvious that there are too many variables in these situations, and that generic statements will be found to be false for some large number of cases. Apples and oranges...

Boyd Roberts
boyd@necisa.ho.necisa.oz.au

``When the going gets wierd, the weird turn pro...''