eric@snark.UUCP (Eric S. Raymond) (08/27/88)
(Code below reproduced so that comp.arch people seeing this followup only
won't get terminally frustrated. This is *really neat*, gang...)

In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes:
>
> void send(int *to, int *from, int count) {
>    int n = (count + 7) / 8;
>
>    switch (count % 8) {
>    case 0: do { *to++ = *from++;
>    case 7:      *to++ = *from++;
>    case 6:      *to++ = *from++;
>    case 5:      *to++ = *from++;
>    case 4:      *to++ = *from++;
>    case 3:      *to++ = *from++;
>    case 2:      *to++ = *from++;
>    case 1:      *to++ = *from++;
>            } while (--n > 0);
>    }
> }
>
> Finally, Stroustrup asks the rhetorical question ``why would anyone
> want to write something like this.''  Any guesses?!

Yeah.  That's the most hackish way of trying to write a portable optimized
copy routine I've ever seen.  I gather the whole point of the shenanigans
is to get all the *from++ -> *to++ instructions in the generated code to
be adjacent.

This only makes sense if the author knows he's got a hardware instruction
pipeline or cache that's no less than 8 and no more than 9 byte-copy
instruction widths long, and stuff executing out of the pipeline is a lot
faster than if the copies are interleaved with control transfers.  Dollars
to doughnuts this code was written on a RISC machine.

(Gawrsh.  That sounded just like one of the big boys on comp.arch tawkin'.
I think I'll cross-post over there just to see if I get shot down in
flames...)
--
      Eric S. Raymond                  (the mad mastermind of TMN-Netnews)
      UUCP: ...!{uunet,att,rutgers}!snark!eric = eric@snark.UUCP
      Post: 22 S. Warren Avenue, Malvern, PA 19355   Phone: (215)-296-5718
schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) (08/27/88)
Hi,

Since I posted my original question there has been a great deal of
abstract discussion about the relative merits of the loop unrolling
scheme.  The topic has piqued my curiosity, so I went ahead and
implemented a short test program, included below, to test Duff's
device against the ``ordinary for loop w/index variable'' technique.
See for yourself....

After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
Sequent Balance was pretty heavily in favor of the regular (non-Duff)
loop.  Your mileage may vary.  I realize that there may be other
tests, and if anyone has a better version, I'd like to see it!

Doug Schmidt

----------------------------------------
#include <sys/time.h>
#include <sys/resource.h>

double Start_Timer();
double Return_Elapsed_Time();

#define MAX_NUM 100000

int array1[MAX_NUM];
int array2[MAX_NUM];
int *A = array1, *B = array2;

main(int argc, char *argv[])
{
   double Elapsed_Time;
   int Count = argc > 1 ? atoi(argv[1]) : MAX_NUM;
   int i;

   for (i = 0; i < Count; i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting Duff's device timing...\n");
   Start_Timer();
   {
      int n = (Count + 7) / 8;

      switch (Count % 8) {
      case 0: do { *A++ = *B++;
      case 7:      *A++ = *B++;
      case 6:      *A++ = *B++;
      case 5:      *A++ = *B++;
      case 4:      *A++ = *B++;
      case 3:      *A++ = *B++;
      case 2:      *A++ = *B++;
      case 1:      *A++ = *B++;
              } while (--n > 0);
      }
   }
   Elapsed_Time = Return_Elapsed_Time(0.0);
   printf("Elapsed time = %.3f\n", Elapsed_Time);

   for (i = 0; i < Count; i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n", i);
         break;
      }
   }

   for (i = 0; i < Count; i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting ordinary copy timing...\n");
   Start_Timer();
   for (i = 0; i < Count; i++) {
      array1[i] = array2[i];
   }
   Elapsed_Time = Return_Elapsed_Time(0.0);
   printf("Elapsed time = %.3f\n", Elapsed_Time);

   for (i = 0; i < Count; i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n", i);
         break;
      }
   }
}

/* no such thing as "negative time"! */
#define ERROR_VALUE -1.0

static struct rusage Old_Time;
static struct rusage New_Time;
static int Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
   Timer_Set = 1;
   getrusage(RUSAGE_SELF, &Old_Time);   /* set starting process time */
   return (Old_Time.ru_utime.tv_sec +
           (Old_Time.ru_utime.tv_usec / 1000000.0));
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0), otherwise, if parameter == DEFAULT_TIME then       */
/* the time since the Old_Time was set is returned.  Returns ERROR_VALUE   */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time)
double Last_Time;
#endif
{
   if (!Timer_Set) {
      return (ERROR_VALUE);
   }
   else {
      /* get process time */
      getrusage(RUSAGE_SELF, &New_Time);
      if (Last_Time == 0.0) {
         return ((New_Time.ru_utime.tv_sec - Old_Time.ru_utime.tv_sec) +
                 ((New_Time.ru_utime.tv_usec - Old_Time.ru_utime.tv_usec)
                  / 1000000.0));
      }
      else {
         return ((New_Time.ru_utime.tv_sec +
                  (New_Time.ru_utime.tv_usec / 1000000.0)) - Last_Time);
      }
   }
}
--
schmidt@bonnie.ics.uci.edu (ARPA)
"If our behavior is strict, we do not need fun."   -Zippy th' Pinhead
"If our behavior is struct, we do not need defun." -Anon
henry@utzoo.uucp (Henry Spencer) (08/28/88)
In article <653@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>Sequent Balance was pretty heavily in favor of the regular (non-Duff)
>loop...

The odds are excellent that the compilers are applying optimizations to
the regular loop that they can't do for the Duff loop as you've written
it.  In particular, your pointers aren't even register variables (and
they are externs, so the compiler can't safely promote them quietly),
whereas a good optimizing compiler will certainly be using register
pointers in the regular loop.

My partner in crime, Geoff Collyer, applied Duff's Device in a number of
places in the C News rnews; the performance improvements were
substantial.
--
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
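A minimal sketch of the change Henry is suggesting (the function and
variable names here are mine, not from the posted benchmark): the same
Duff loop, but with the pointers and counter as local register variables
so the compiler is free to keep them in registers instead of reloading
the extern A and B on every iteration.

```c
#include <assert.h>

/* Hypothetical rework of the benchmark's Duff copy: all state is local
 * (and declared register), so the compiler need not reload globals on
 * every iteration the way it must with extern pointers. */
void duff_copy(int *to, int *from, int count)
{
    register int *t = to;
    register int *f = from;
    register int n = (count + 7) / 8;

    if (count <= 0)             /* guard: Duff's device misbehaves on 0 */
        return;
    switch (count % 8) {
    case 0: do { *t++ = *f++;
    case 7:      *t++ = *f++;
    case 6:      *t++ = *f++;
    case 5:      *t++ = *f++;
    case 4:      *t++ = *f++;
    case 3:      *t++ = *f++;
    case 2:      *t++ = *f++;
    case 1:      *t++ = *f++;
            } while (--n > 0);
    }
}
```

Whether this actually beats the plain loop is exactly the machine- and
compiler-dependent question the thread is arguing about.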
levy@ttrdc.UUCP (Daniel R. Levy) (08/28/88)
In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
> Hi,
>
> Since I posted my original question there has been a great deal of
> abstract discussion about the relative merits of the loop unrolling
> scheme.  The topic has piqued my curiosity, so I went ahead and
> implemented a short test program, included below, to test Duff's
> device against the ``ordinary for loop w/index variable'' technique.
> See for yourself....
>
> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
> loop.  Your mileage may vary.  I realize that there may be other
> tests, and if anyone has a better version, I'd like to see it!

[schmidt's program deleted, please refer to parent article if you want it]

I modified this program to run under System V, changed the arrays to be
dynamically allocated, and changed both the Duff and ordinary copies to
use register pointers instead of global pointers (for the Duff copy) and
array indexing (for the ordinary copy).  I then tried it on a SVR2 3B20,
a SVR3 3B2, a Sun-3, and a Sun-4 both with and without -O optimization
(using the standard pcc-type C compiler on each system).  The result?
Duff wins by about 10%-20% on all machines tested.

Here is the code thus modified:

#include <sys/types.h>
#include <sys/times.h>
#include <sys/param.h>

double Start_Timer();
double Return_Elapsed_Time();

#define DEFAULT_NUM 100000

main(argc, argv)
char **argv;
{
   double Elapsed_Time_Duff;
   double Elapsed_Time_Ordinary;
   int Count = argc > 1 ? atoi(argv[1]) : DEFAULT_NUM;
   int i;
   register int *A, *B;
   register int n;
   register int *A_end;
   int *array1, *array2;
   char *malloc();

   if (Count <= 0) {
      printf("Bad Count\n");
      return 1;
   }
   else {
      printf("Count=%d\n", Count);
   }
   if (!(array1 = (int *) malloc(sizeof(int) * Count))) {
      printf("Can't malloc %u bytes for array1\n", sizeof(int) * Count);
      return 1;
   }
   else if (!(array2 = (int *) malloc(sizeof(int) * Count))) {
      printf("Can't malloc %u bytes for array2\n", sizeof(int) * Count);
      return 1;
   }

   for (i = 0; i < Count; i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting Duff's device timing...\n");
   Start_Timer();
   {
      n = (Count + 7) / 8;
      A = array1;
      B = array2;
      switch (Count & 0x7) {
      case 0: do { *A++ = *B++;
      case 7:      *A++ = *B++;
      case 6:      *A++ = *B++;
      case 5:      *A++ = *B++;
      case 4:      *A++ = *B++;
      case 3:      *A++ = *B++;
      case 2:      *A++ = *B++;
      case 1:      *A++ = *B++;
              } while (--n > 0);
      }
   }
   Elapsed_Time_Duff = Return_Elapsed_Time(0.0);
   printf("Elapsed time = %.3lf\n", Elapsed_Time_Duff);

   for (i = 0; i < Count; i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n", i);
         break;
      }
   }

   for (i = 0; i < Count; i++) {
      array1[i] = i + 1;
      array2[i] = i;
   }

   printf("Starting ordinary copy timing...\n");
   Start_Timer();
   {
      A = array1;
      B = array2;
      A_end = array1 + Count - 1;
      while (A <= A_end)
         *A++ = *B++;
   }
   Elapsed_Time_Ordinary = Return_Elapsed_Time(0.0);
   printf("Elapsed time = %.3lf\n", Elapsed_Time_Ordinary);

   for (i = 0; i < Count; i++) {
      if (array1[i] != array2[i]) {
         printf("Yow, problems at location %d!\n", i);
         break;
      }
   }

   printf("Duff/Ordinary copy time ratio = %.3lf\n",
          Elapsed_Time_Duff / Elapsed_Time_Ordinary);
}

/* no such thing as "negative time"! */
#define ERROR_VALUE -1.0

static struct tms Old_Time;
static struct tms New_Time;
static int Timer_Set = 0;

#ifdef __STDC__
double Start_Timer(void)
#else
double Start_Timer()
#endif
{
   Timer_Set = 1;
   times(&Old_Time);   /* set starting process time */
   return ((double) Old_Time.tms_utime) / (double) HZ;
}

/* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */
/* i.e., (double) 0.0), otherwise, if parameter == DEFAULT_TIME then       */
/* the time since the Old_Time was set is returned.  Returns ERROR_VALUE   */
/* if Start_Timer() is not called first */
#ifdef __STDC__
double Return_Elapsed_Time(double Last_Time)
#else
double Return_Elapsed_Time(Last_Time)
double Last_Time;
#endif
{
   if (!Timer_Set) {
      return (ERROR_VALUE);
   }
   else {
      /* get process time */
      times(&New_Time);
      if (Last_Time == 0.0) {
         return ((double) (New_Time.tms_utime - Old_Time.tms_utime))
                / (double) HZ;
      }
      else {
         return (((double) New_Time.tms_utime) / (double) HZ) - Last_Time;
      }
   }
}
--
|------------Dan Levy------------|  THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY
| Bell Labs Area 61 (R.I.P., TTY)|  AND ARE NOT TO BE IMPUTED TO AT&T.
|        Skokie, Illinois        |
|-----Path:  att!ttbcad!levy-----|
aglew@urbsdc.Urbana.Gould.COM (08/28/88)
>..> Duff's device
>
>Yeah. That's the most hackish way of trying to write a portable optimized
>copy routine I've ever seen. I gather the whole point of the shenanigans
>is to get all the *from++ -> *to++ instructions in the generated code to
>be adjacent.

Well, that's a start.  Duff's device does better on some machines (like,
mine) that don't have autoincrement addressing modes, and even on some
that do, if you use indexing instead of incrementing.  Unfortunately,
that tends to make you run the copy backwards, which louses up some
caches.

>This only makes sense if the author knows he's got a hardware instruction
>pipeline or cache that's no less than 8 and no more than 9 byte-copy
>instruction widths long, and stuff executing out of the pipeline is a lot
>faster than if the copies are interleaved with control transfers. Dollars
>to doughnuts this code was written on a RISC machine.
>
>      Eric S. Raymond (the mad mastermind of TMN-Netnews)

No doughnuts.  The canonical example of the Duff device originated on a
VAX - it's that stuff in the BSD source that produces diagnostics every
time you recompile a kernel.  Or did it originate on a PDP-11, or
earlier?  Gosh darn it, we don't have version 7 reference source on line
any more.

Andy "Krazy" Glew.  Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801
   aglew@gould.com     	- preferred, if you have MX records
   aglew@xenurus.gould.com - if you don't
   ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way

My opinions are my own, and are not the opinions of my employer, or any
other organisation.  I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
gwyn@smoke.ARPA (Doug Gwyn ) (08/28/88)
In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>Dollars to doughnuts this code was written on a RISC machine.

Send me my doughnuts!  (Or dollars, if you prefer.)  RISC machines were
practically non-existent when this C code was first written down (whether
or not by Tom Duff, I'm not sure).  It probably appeared on a PDP-11,
perhaps a VAX.

I suspect the first use of Duff's device was as an "efficiency hack", but
as you observe it is not universally the most efficient method.  The
modern VALID reason for such code is that it is a relatively efficient
way to do something portably that CANNOT BE DONE by using the "preferred
block copy routine" supplied by an implementation (be it bcopy() or
memcpy()).  Whenever the preferred routine will work correctly (in the
portable case, that means whenever the source and destination fields do
not overlap), then of course one should use it, especially since if the C
vendors are doing their job, the C library routine (which could also be
expanded in-line by the compiler) will be near-optimally tuned for each
specific architecture.

P.S.  Some implementations of bcopy() and memcpy() do guarantee proper
handling of overlapping fields.  But not all do, so one cannot count on
it if one's trying to write truly portable code.  ANSI C will specify a
new function memmove() with this property guaranteed.
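A minimal sketch of the overlap case Gwyn describes, using the memmove()
that he notes ANSI C will guarantee (the helper name is mine): sliding a
buffer right by two bytes makes source and destination overlap, which is
exactly the situation where a portable program cannot count on memcpy()
or bcopy().

```c
#include <string.h>
#include <assert.h>

/* Shift the first len-2 bytes of buf right by two positions.  The
 * regions overlap, so memmove() - which must behave as if it copied
 * through a temporary buffer - is required; memcpy()'s result here is
 * not guaranteed on all implementations. */
void shift_right_2(char *buf, size_t len)
{
    memmove(buf + 2, buf, len - 2);
}
```

This is the job Duff's device (or a hand-written loop running in the
right direction) was doing portably before memmove() existed.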
casey@admin.cognet.ucla.edu (Casey Leedom) (08/29/88)
In article <28200192@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
| > > Duff's device
| >
| > That's the most hackish way of trying to write a portable optimized
| > copy routine I've ever seen. I gather the whole point of the shenanigans
| > is to get all the *from++ -> *to++ instructions in the generated code to be
| > adjacent.
|
| Well, that's a start. Duff's device does better on some machines (like,
| mine) that don't have autoincrement addressing modes, and even on some
| that do, if you use indexing instead of incrementing. Unfortunately,
| that tends to make you run the copy backwards, which louses up some
| caches.
|
| > This only makes sense if the author knows he's got a hardware instruction
| > pipeline or cache that's no less than 8 and no more than 9 byte-copy
| > instruction widths long, and stuff executing out of the pipeline is a lot
| > faster than if the copies are interleaved with control transfers. Dollars
| > to doughnuts this code was written on a RISC machine.

I wrote the bcopy routine for 2.10BSD with some loop unrolling.  But
there the motivation wasn't pipelining, caching, or control/transfer
considerations.  Basically, it's very simple:

	1:	mov	(r0)+, (r1)+	; 1.50 usec
		sob	r2, 1b		; 0.60 usec

sob (subtract 1 and branch if non-zero) is 28% of total loop time.

vs:

	2:	mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		mov	(r0)+, (r1)+	; 1.50 usec
		sob	r2, 2b		; 0.60 usec

sob is 10% of total loop time.

The other big performance win was to optimize odd to odd, even to even
with odd length, etc. copies which previous versions of bcopy implemented
as byte copy loops.

Casey
henry@utzoo.uucp (Henry Spencer) (08/29/88)
In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>This only makes sense if the author knows he's got a hardware instruction
>pipeline or cache that's no less than 8 and no more than 9 byte-copy
>instruction widths long, and stuff executing out of the pipeline is a lot
>faster than if the copies are interleaved with control transfers. Dollars
>to doughnuts this code was written on a RISC machine.

Nope.  Bell Labs Research uses VAXen and 68Ks, mostly.  The key point is
not pipelining, but loop-control overhead.  There is in fact a tradeoff
here:  unrolling the loop further will reduce control overhead further,
but will increase code size.  That last is of some significance when
caching gets into the act:  cache-loading overhead favors short loops,
and small cache sizes very strongly favor short ones.  In general there
is an optimal point in there somewhere, and an unrolling factor of 8 or
16 is a pretty good guess at it on the machines I've looked at closely.
--
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
aglew@urbsdc.Urbana.Gould.COM (08/29/88)
..> Duff Device

My apologies.  In an earlier post I said that the Duff device was in VAX
BSD source, and caused diagnostics when recompiling the kernel.  Well, on
being challenged by Chris Torek, I found it in Gould specific UTX source,
specifically in sel/in_cksum.c.

But the point remains - the Duff device was found to be a performance win
on a Gould machine, which, while considerably simpler than a VAX, is
hardly a RISC.

Andy "Krazy" Glew.  Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801
   aglew@gould.com     	- preferred, if you have MX records
   aglew@xenurus.gould.com - if you don't
   ...!ihnp4!uiucuxc!ccvaxa!aglew  - paths may still be the only way

My opinions are my own, and are not the opinions of my employer, or any
other organisation.  I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
rick@seismo.CSS.GOV (Rick Adams) (08/30/88)
From harpo!ihnp4!we13!burl!ulysses!allegra!alice!td Mon May  7 10:19:21 1984
From: td@alice.UUCP (Tom Duff)
Newsgroups: net.lang.c
Subject: Unwinding loops
Message-ID: <2748@alice.UUCP>
Date: 7 May 84 14:19:21 GMT
Status: RO

Consider the following routine, abstracted from code which copies an
array of shorts into the Programmed IO data register of an Evans &
Sutherland Picture System II:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		do
			*to = *from++;
		while(--count>0);
	}

(Obviously, this fails if the count is zero.)

The VAX C compiler compiles the loop into 2 instructions (a movw and a
sobleq, I think.)  As it turns out, this loop was the bottleneck in a
real-time animation playback program which ran too slowly by about 50%.
The standard way to get more speed out of something like this is to
unwind the loop a few times, decreasing the number of sobleqs.  When you
do that, you wind up with a leftover partial loop.  I usually handle this
in C with a switch that indexes a list of copies of the original loop
body.  Of course, if I were writing assembly language code, I'd just
jump into the middle of the unwound loop to deal with the leftovers.
Thinking about this one day last October, the following implementation
occurred to me:

	send(to, from, count)
	register short *to, *from;
	register count;
	{
		register n=(count+7)/8;
		switch(count%8){
		case 0:	do{	*to = *from++;
		case 7:		*to = *from++;
		case 6:		*to = *from++;
		case 5:		*to = *from++;
		case 4:		*to = *from++;
		case 3:		*to = *from++;
		case 2:		*to = *from++;
		case 1:		*to = *from++;
			}while(--n>0);
		}
	}

Disgusting, no?  But it compiles and runs just fine on all known C
compilers.  Dennis Ritchie has endorsed it as legal C.  I feel a
combination of pride and revulsion at this discovery; I think I'll name
it after myself -- ``Duff's Device'' has a nice ring to it.  It amazes
me that after 10 years of writing C there are still little corners that
I haven't explored fully.
(Actually, I have another revolting way to use switches to implement
interrupt driven state machines but it's too horrid to go into.)

Many people (even Brian Kernighan?) have said that the worst feature of
C is that switches don't break automatically before each case label.
This code forms some sort of argument in that debate, but I'm not sure
whether it's for or against.

	Tom Duff
	{ucbvax,decvax,ihnp4,...}!alice!td

Note:  Work reported herein was done while the author was an employee of
Lucasfilm Ltd., San Rafael, CA.
chuck@amdahl.uts.amdahl.com (Charles Simmons) (08/30/88)
In article <2877@ttrdc.UUCP> levy@ttrdc.UUCP (Daniel R. Levy) writes:
>In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
>> Since I posted my original question there has been a great deal of
>> abstract discussion about the relative merits of the loop unrolling
>> scheme.  The topic has piqued my curiosity, so I went ahead and
>> implemented a short test program, included below, to test Duff's
>> device against the ``ordinary for loop w/index variable'' technique.
>> See for yourself....
>>
>> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a
>> Sequent Balance was pretty heavily in favor of the regular (non-Duff)
>> loop.  Your mileage may vary.  I realize that there may be other
>> tests, and if anyone has a better version, I'd like to see it!
>
>I modified this program to run under System V, changed the arrays to be
>dynamically allocated, and changed both the Duff and ordinary copies to
>use register pointers instead of global pointers (for the Duff copy) and
>array indexing (for the ordinary copy).  I then tried it on a SVR2 3B20,
>a SVR3 3B2, a Sun-3, and a Sun-4 both with and without -O optimization
>(using the standard pcc-type C compiler on each system).  The result?
>Duff wins by about 10%-20% on all machines tested.

I then added a piece to the program to use 'memcpy'.  The results?  Duff
beats a simple loop by 10%.  'memcpy' is 9 times faster than Duff.  So
why do people spend so much time avoiding standard subroutines?

-- Chuck
lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (08/30/88)
In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
[discussion of Duff copy deleted]
>I then added a piece to the program to use 'memcpy'.  The results?
>Duff beats a simple loop by 10%.  'memcpy' is 9 times faster than
>Duff.  So why do people spend so much time avoiding standard subroutines?

Sometimes the standard subroutines are implemented horribly.  I was
horrified when I saw that the machine dependent version of memcpy on the
AT&T 3Bs is nothing but a byte by byte transfer written in assembly
language.  It is tricky, but doable, to speed this up by roughly a
factor of sizeof(long).  In fact it already is done in the 3B
implementation of the UNIX(tm) operating system in the copyin (?)
routine.  Why wasn't it done in memcpy too?  Sigh.
--
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc
lvc@cbnews.ATT.COM
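A sketch of the factor-of-sizeof(long) speedup Cipriani is describing
(the function name is mine, and this simplified version assumes both
buffers are already long-aligned; a production memcpy would also handle
misaligned heads, as Goodwin's union trick later in the thread does):

```c
#include <stddef.h>
#include <assert.h>

/* Copy n bytes, moving a long at a time for the bulk of the data and a
 * byte at a time for the leftover tail.  ASSUMES dst and src are both
 * aligned suitably for long access. */
void word_copy(char *dst, const char *src, size_t n)
{
    long *d = (long *) dst;
    const long *s = (const long *) src;

    while (n >= sizeof(long)) {        /* bulk: one long per iteration */
        *d++ = *s++;
        n -= sizeof(long);
    }
    dst = (char *) d;                  /* tail: leftover bytes */
    src = (const char *) s;
    while (n-- > 0)
        *dst++ = *src++;
}
```

Each long-sized move replaces sizeof(long) byte moves, which is where
the rough factor-of-sizeof(long) estimate comes from.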
nat@bales.UUCP (Nathaniel Stitt) (08/31/88)
In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes:
>(Code below reproduced so that comp.arch people seeing this followup only
>won't get terminally frustrated. This is *really neat*, gang...)
>
>In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes:
>>
>> void send(int *to,int *from, int count) {
>>    int n = (count + 7) / 8;
>>
>>    switch(count % 8) {
>>    case 0: do { *to++ = *from++;
>>    case 7:      *to++ = *from++;
>>    case 6:      *to++ = *from++;
>>    case 5:      *to++ = *from++;
>>    case 4:      *to++ = *from++;
>>    case 3:      *to++ = *from++;
>>    case 2:      *to++ = *from++;
>>    case 1:      *to++ = *from++;
>>            } while (--n > 0);
>>    }
>> }
>>
>> Finally, Stroustrup asks the rhetorical question ``why would anyone
>> want to write something like this.''  Any guesses?!
>
>Yeah. That's the most hackish way of trying to write a portable optimized
>copy routine I've ever seen. I gather the whole point of the shenanigans
>is to get all the *from++ -> *to++ instructions in the generated code to
>be adjacent.

Here is my own personal version of the "Portable Optimized Copy"
routine.  It certainly seems more clear than the above example, and I
would expect it to be at least as fast on virtually any machine.  (Note
the use of division and mod in the above routine.)  Code does NOT have
to be ugly to be fast.

/* Copy 'count' ints from *from to *to. */
void nats_send(to, from, count)
register int *to;
register int *from;
register int count;
{
   /* Copy the bulk of the data */
   while (count >= 8) {
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      count -= 8;
   }

   /* Copy the rest. */
   if (count >= 4) {
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      *to++ = *from++;
      count -= 4;
   }
   if (count >= 2) {
      *to++ = *from++;
      *to++ = *from++;
      count -= 2;
   }
   if (count)
      *to++ = *from++;
}

Hope this helps.
--
Nathaniel Stitt            | This life is a test.  It is only a test.  Had
Guidelines Software, Inc.  | this been an actual life, you would have received
ucbvax!cogsci!z!nat        | further instructions as to what to do and where
(415) 376-1395             | to go.
bill@proxftl.UUCP (T. William Wells) (08/31/88)
In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
: I then added a piece to the program to use 'memcpy'. The results?
: Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
: Duff. So why do people spend so much time avoiding standard subroutines?
Try some history, bud; it's good for what ails you.
I doubt that memcpy even existed then; and it is *not* standard
now. Perhaps it will be several years after the ANSI standard is
adopted, but not till then.
---
Bill
novavax!proxftl!bill
earl@mips.COM (Earl Killian) (09/01/88)
In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
: I then added a piece to the program to use 'memcpy'. The results?
: Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
: Duff. So why do people spend so much time avoiding standard subroutines?
And bill@proxftl.UUCP (T. William Wells) replies:
: Try some history, bud; it's good for what ails you.
: I doubt that memcpy even existed then; and it is *not* standard
: now.  Perhaps it will be several years after the ANSI standard is
: adopted, but not till then.
Perhaps a better reason:
According to the author, that code was used for copying to a 16-bit IO
device.  It would have been illegal to use memcpy or bcopy because they
would make word references to the device.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (09/01/88)
In article <1002@cbnews.ATT.COM>, lvc@cbnews.ATT.COM (that's me) wrote:
> Sometimes the standard subroutines are implemented horribly.  I was
> horrified when I saw that the machine dependent version of memcpy
> on the AT&T 3Bs is nothing but a byte by byte transfer written in
> assembly language. ...

Oops.  I should have said this was on an early version (~1984) of a 3B5.
New 3Bs don't have this deficiency.
--
Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc
lvc@cbnews.ATT.COM
hankd@pur-ee.UUCP (Hank Dietz) (09/01/88)
In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
> Here is my own personal version of the "Portable Optimized Copy" routine.

.... then he gives a rather verbose, but structured, encoding....

As long as we're getting into structured, portable, hacks, let me
suggest the following two ways of doing block copy:

1.  If the number of items/bytes is known at compile time, then you can
    define a struct type of the appropriate size and use struct
    assignment with type casts to make it fly.  For example, suppose p
    and q are pointers to ints and I want to copy 601 ints from p to q.
    Then I can write the fast and surprisingly portable:

	struct t601 { int t[601]; };
	*((struct t601 *) q) = *((struct t601 *) p);

    Of course, you do have to watch out for alignment problems, but if
    your compiler doesn't generate very fast code for this....

2.  If the number of items/bytes is not known, then build a binary tree
    of such structs and copy half, then half of what remains, etc.  This
    is funny looking, but very fast also.  Suppose the number of ints
    (n) is not known at compile time, but can't be more than 601.  You
    can write:

	struct t512 { int t[512]; };
	struct t256 { int t[256]; };
	struct t128 { int t[128]; };
	struct t64 { int t[64]; };
	struct t32 { int t[32]; };
	struct t16 { int t[16]; };
	struct t8 { int t[8]; };
	struct t4 { int t[4]; };
	struct t2 { int t[2]; };

	if (n & 512) {
		*((struct t512 *) q) = *((struct t512 *) p); q += 512; p += 512;
	}
	if (n & 256) {
		*((struct t256 *) q) = *((struct t256 *) p); q += 256; p += 256;
	}
	if (n & 128) {
		*((struct t128 *) q) = *((struct t128 *) p); q += 128; p += 128;
	}
	if (n & 64) {
		*((struct t64 *) q) = *((struct t64 *) p); q += 64; p += 64;
	}
	if (n & 32) {
		*((struct t32 *) q) = *((struct t32 *) p); q += 32; p += 32;
	}
	if (n & 16) {
		*((struct t16 *) q) = *((struct t16 *) p); q += 16; p += 16;
	}
	if (n & 8) {
		*((struct t8 *) q) = *((struct t8 *) p); q += 8; p += 8;
	}
	if (n & 4) {
		*((struct t4 *) q) = *((struct t4 *) p); q += 4; p += 4;
	}
	if (n & 2) {
		*((struct t2 *) q) = *((struct t2 *) p); q += 2; p += 2;
	}
	if (n & 1)
		*q = *p;

    Notice that, in this case, n, p, and q should be declared as being
    register variables and that p and q are altered by this routine.  Of
    course, you can copy larger things by making larger power-of-2 sized
    structs.

Incidentally, this ran about 8x faster (on a VAX 11/780) than using the
usual copy loop.  Unfortunately, the above code should have been written
as:

	if (n & 512) {
		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
	}
	...

but, for some unknown reason, the VAX C compiler didn't like that.

Enjoy.

						hankd@ee.ecn.purdue.edu
gwyn@smoke.ARPA (Doug Gwyn ) (09/01/88)
In article <2945@wright.mips.COM> earl@mips.COM (Earl Killian) writes:
>According to the author, that code was used for copying to a 16-bit IO
>device.  It would have been illegal to use memcpy or bcopy because they
>would make word references to the device.

Wrong reason.  memcpy() etc. don't have the right behavior for stashing
a sequence of values into the same destination location.
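A sketch of the pattern Gwyn means (the names here are mine; in Duff's
original, the destination was a memory-mapped device register, for which
an ordinary variable stands in below): every value is stored to the SAME
location, so the destination pointer is never incremented.  No
memcpy()/bcopy() interface can express that; the Duff loop can.

```c
#include <assert.h>

/* Write count shorts, one after another, to a single fixed location.
 * The destination *port is deliberately not incremented - this is what
 * distinguishes Duff's original send() from a block copy. */
void send_to_port(volatile short *port, const short *from, int count)
{
    int n = (count + 7) / 8;

    if (count <= 0)
        return;
    switch (count % 8) {
    case 0: do { *port = *from++;
    case 7:      *port = *from++;
    case 6:      *port = *from++;
    case 5:      *port = *from++;
    case 4:      *port = *from++;
    case 3:      *port = *from++;
    case 2:      *port = *from++;
    case 1:      *port = *from++;
            } while (--n > 0);
    }
}
```

The volatile qualifier keeps a modern compiler from collapsing the eight
stores into one, which would defeat the purpose on a real device.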
loren@pixar.UUCP (Loren Carpenter) (09/01/88)
The Duff Loop (as far as I know) was first cast into C by Tom Duff when
he was at Lucasfilm in the early 1980's.  We used it at Lucasfilm
wherever we needed reasonable speed without resorting to assembly
language.  It obviously generalizes to more than memory copy & clear.

I and others have used this control construct for many years, but always
in assembly language.  I learned it from Howard Schmeising at Boeing in
1969, where we were writing optimal stack code for CDC 6600's (a $7M
RISC machine).  We could get 5+ 60-bit Mflops if we worked at it.

			Loren Carpenter
			...{ucbvax,sun}!pixar!loren
rodc@unet.unet.pacbell.COM (Rod Creason) (09/01/88)
Reply-To: rodc@unet.PacBell.COM (Rod Creason)
Organization: Network Equipment Technologies, Redwood City, CA
Keywords: byte copy

In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>2. If the number of items/bytes is not known, then build a binary tree of
>   such structs and copy half, then half of what remains, etc.  This is
>   funny looking, but very fast also.  Suppose the number of ints (n) is
>   not known at compile time, but can't be more than 601.  You can write:
>
>	struct t512 { int t[512]; };
>	struct t256 { int t[256]; };
>	struct t128 { int t[128]; };
>	struct t64 { int t[64]; };
>	struct t32 { int t[32]; };
>	struct t16 { int t[16]; };
>	struct t8 { int t[8]; };
>	struct t4 { int t[4]; };
>	struct t2 { int t[2]; };
>	if (n & 512) {
>		*((struct t512 *) q) = *((struct t512 *) p); q += 512; p += 512;
>	}
>	[ rest of the example ]

A key point is that this code is *not* portable unless you can
*guarantee* that q and p are ALWAYS aligned at a four-byte boundary.
Any 3b2 (WE32000), for example, will core dump if this is not the case.
Although 68000 based computers only require even alignment, they would
still fail in most cases where this routine would be used.  The key to a
fast memory-to-memory copy is getting to these boundaries.  (And in the
case where q or p is an odd address and the other is at an even address,
none of this will work.)

> hankd@ee.ecn.purdue.edu

Rod Creason
...!{ames,amdahl,oliveb,pacbell}!unet!rodc
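A small guard of the sort Creason's warning implies (the helper name is
mine): before using int-sized or struct assignments, check that both
pointers actually sit on the required boundary, and fall back to a byte
loop otherwise.

```c
#include <stddef.h>
#include <assert.h>

/* Returns nonzero only when both pointers are aligned for int access.
 * Inspecting a pointer's low bits via a cast to size_t is itself
 * implementation-defined behavior - which is part of why truly portable
 * code avoids these games altogether. */
int both_int_aligned(const void *p, const void *q)
{
    return (((size_t) p | (size_t) q) & (sizeof(int) - 1)) == 0;
}
```

Note that this also catches Creason's odd/even mismatch case: if one
pointer is odd and the other even, the OR of their low bits is nonzero
and the check fails, forcing the byte-copy path.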
pf@csc.ti.com (Paul Fuqua) (09/01/88)
     Date: Wednesday, August 31, 1988  9:13pm (CDT)
     From: loren at pixar.UUCP (Loren Carpenter)
  Subject: Re: Explanation, please!
Newsgroups: comp.lang.c,comp.arch

    The Duff Loop (as far as I know) was first cast into C by Tom Duff
    when he was at Lucasfilm in the early 1980's.  We used it at
    Lucasfilm wherever we needed reasonable speed without resorting to
    assembly language.  It obviously generalizes to more than memory
    copy & clear.

There was a discussion of the device in comp.protocols.tcp-ip not too
long ago, as a way to speed up the calculation of checksums.  The one
point that I haven't seen brought out here yet is that if the unrolling
factor is a power of two, the divide/remainder operations are simply a
shift and a mask.

                              pf

Paul Fuqua
Texas Instruments Computer Science Center, Dallas, Texas
CSNet:  pf@csc.ti.com (ARPA too, sometimes)
UUCP:   {smu, texsun, cs.utexas.edu, im4u, rice}!ti-csl!pf
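Fuqua's point, spelled out for the unrolling factor of 8 used throughout
this thread (Levy's modified benchmark above already applies the mask
half with `Count & 0x7`); the function names are mine:

```c
#include <assert.h>

/* For a power-of-two unrolling factor, the divide and the remainder in
 * Duff's device reduce to a shift and a mask. */
int chunks(int count)   { return (count + 7) >> 3; }  /* == (count + 7) / 8 */
int leftover(int count) { return count & 7; }         /* == count % 8      */
```

For non-negative counts these are exactly the values the Duff loop
computes; any compiler worth its salt does this strength reduction
itself, but writing it out costs nothing.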
klg@njsmu.UUCP (Kenneth Goodwin) (09/02/88)
In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes:
> In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes:
> > Here is my own personal version of the "Portable Optimized Copy" routine.
> 2. If the number of items/bytes is not known, then build a binary tree of
>    such structs and copy half, then half of what remains, etc.  This is
>	struct t512 { int t[512]; };
>	struct t256 { int t[256]; };
>	struct t128 { int t[128]; };
	.... etc .....
>	if (n & 512) {
>		*((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512;
>	}
>	if (n & 256) {
>		*((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256;
>	}
	... etc ...
> Incidentally, this ran about 8x faster (on a VAX 11/780) than using
> the usual copy loop.  Unfortunately, the above code should have been
> written as:
>
>	if (n & 512) {
>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>	}
>	...

BUT this is where UNIONS come in handy.  I used a similar, although briefer, technique for a faster version of a bmov() (byte move) subroutine on our PDP-11/70 a while ago, and subsequently ported it to memcpy when we updated from V6 to System V.  The basic idea is to create a union of long, int, (short), and char pointers, use the character pointer to achieve the needed alignments, and then use the largest available pointer to do the copy.  There is no reason why a structure copy could not be used, although I suspect that on non-VAX systems it may actually be detrimental in some cases.  The PDP-11 C compiler used to push registers onto the stack and create a 16-bit word-copy loop to do structure copies using the freed registers, restoring them when done.  So a structure copy would be the same as a word copy on that style of system (i.e., ones without block-move instructions).  So in the case of your example, a modified brief version would be:

	union ptr_types {
		struct t512 { int t512[512]; } *t512;
		....
		struct t32 { int t32[32]; } *t32;
		long *t_long;
		int *t_int;
		short *t_short;
		char *t_char;
	};

(probably could dispense with the long and short pointers and related tests)

	memcpy(a, b, len)
	char *a, *b;
	{
		register union ptr_types a_ptr, b_ptr;

		a_ptr.t_char = a;
		b_ptr.t_char = b;
		while (NOT ON A WORD BOUNDARY AND CHARS LEFT) {
			*a_ptr.t_char++ = *b_ptr.t_char++;
			len--;
		}
		if (len >= sizeof(int) * 512) {	/* if we can use a 512-int structure copy */
			*a_ptr.t512++ = *b_ptr.t512++;
			len -= (512 * sizeof(int));
		}
		/* the biggest win is that the pointers increment correctly;
		   len -= sizeof(*element pointer), i.e. 512 ints * sizeof(int),
		   is the correct form */
		.......

I guess the rest is obvious; some GLUE is needed that has not been shown.... :-)  Boundaries should be checked on source and destination addresses to avoid memory faults, as you may be given incompatible source and destination addresses that require a full char-by-char copy.  The first test loop sort of does this, but all the other copies should also check for proper address alignment before proceeding.

Ken Goodwin, NJSMU.
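Goodwin's outline can be made compilable.  A minimal sketch (the names are mine, and I've shrunk it to char/int pointers only): byte-copy up to an int boundary, int-copy the middle, byte-copy the tail.  It assumes source and destination have the same alignment modulo sizeof(int), exactly the caveat he raises; a modern compiler's strict-aliasing rules also frown on punning char storage through an int pointer, so treat this as 1988 practice, not current advice.

```c
#include <stddef.h>
#include <string.h>
#include <assert.h>

union copy_ptr { int *i; char *c; };

/* Align-then-word-copy, the core of Goodwin's bmov()/memcpy idea.
   Only safe when dst and src are congruent modulo sizeof(int);
   otherwise the int loop would trap on strict-alignment machines. */
void fast_copy(char *dst, const char *src, size_t len)
{
    union copy_ptr d, s;
    d.c = dst;
    s.c = (char *)src;
    while (len > 0 && ((size_t)d.c & (sizeof(int) - 1)) != 0) {
        *d.c++ = *s.c++;                /* byte-copy up to an int boundary */
        len--;
    }
    while (len >= sizeof(int)) {
        *d.i++ = *s.i++;                /* int-copy the aligned middle */
        len -= sizeof(int);
    }
    while (len-- > 0)
        *d.c++ = *s.c++;                /* byte-copy the tail */
}
```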
andrew@frip.gwd.tek.com (Andrew Klossner) (09/02/88)
Nathaniel Stitt writes: "Here is my own personal version of the "Portable Optimized Copy" routine. It certainly seems more clear than the above example, and I would expect it to be at least as fast on virtually any machine." then goes on to present a routine which uses follow-on code to handle the last few bytes after all octets have been copied. It's cleaner code, but it won't be quite as fast on many systems with instruction caches because it has fifteen byte-move instructions, replacing eight in the original, so more time is spent loading the loop into the i-cache. On systems with very small i-caches (my favorite example is the IBM 360/91 with 16 bytes), the bigger loop may not all fit into cache, and would be considerably slower. Several contributors have suggested that unrolling a byte-copy loop is a win. On some architectures it is, but on a good pipelined system it may not be. As an example, the program fragment while (count--) { to[i] = from[i]; ++i; } can be compiled to code on the M88k which copies memory as fast as a DMA controller could; the instructions to decrement, increment, and branch overlap with the data load/store requests. [If everything's in registers, indexing in this case is actually faster than keeping separate "to" and "from" pointers and incrementing both.] This assumes that "to" and "from" are pointers-to-ints or pointers-to-doubles. Copying less than a word at a time is slower. -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP] (andrew%tekecs.tek.com@relay.cs.net) [ARPA]
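Stitt's routine is not reproduced in this thread, so here is a reconstruction of the *shape* Klossner describes (my code, not Stitt's): a straight-line main loop moving eight ints per pass, followed by a plain trailer loop for the leftovers.  The trailer's extra move instructions are exactly what cost i-cache space relative to Duff's device, which reuses the unrolled body for the remainder.

```c
#include <stddef.h>

/* "Copy octets, then mop up": unrolled main loop plus follow-on
   tail loop.  Fifteen move sites in total versus Duff's eight,
   which is the i-cache argument made above. */
void copy_octets(int *to, const int *from, size_t count)
{
    size_t n = count / 8;
    while (n-- > 0) {
        to[0] = from[0]; to[1] = from[1];
        to[2] = from[2]; to[3] = from[3];
        to[4] = from[4]; to[5] = from[5];
        to[6] = from[6]; to[7] = from[7];
        to += 8; from += 8;
    }
    count %= 8;
    while (count-- > 0)     /* follow-on code for the last few ints */
        *to++ = *from++;
}
```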
jabir@quintus.uucp (Jabir Hussain) (09/02/88)
In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: > struct t512 { int t[512]; }; > struct t256 { int t[256]; }; [...] > if (n & 512) { > *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512; > } [...] > Unfortunately, the above code should have been > written as: > > if (n & 512) { > *(((struct t512 *) q)++) = *(((struct t512 *) p)++); > } > ... > > but, for some unknown reason, the VAX C compiler didn't like that. one portable way around that is to do something like: typedef char *caddr_t; typedef union { struct { int t[512]; } t512; struct { int t[256]; } t256; caddr_t caddr; } ustruct_t; { register ustruct_t src,dst; src.caddr = (caddr_t) p; dst.caddr = (caddr_t) q; if (n & 512) *dst.t512++ = *src.t512++; ... } jh
rick@pcrat.UUCP (Rick Richardson) (09/02/88)
I did a quick comparison of the times for "memcpy(,,n*4)", Duff's device, Nathaniel's "nat", and Hank's versions.  Here's the results:

	TEST	MACHINE				ATT cc	VENIX cc  GNU 1.26
	memcpy	8MHz 286, Venix SVR2	 	 3.35	 11.15	    -
	duff	8MHz 286, Venix SVR2		16.58	  4.56	    -
	nat	8MHz 286, Venix SVR2		16.73	  5.11	    -
	hank	8MHz 286, Venix SVR2		 1.83	ccbarf	    -
	memcpy	16MHz 386, 386/ix 1.0.4		  .82	    -	   .82
	duff	16MHz 386, 386/ix 1.0.4		 2.01	    -	  2.26
	nat	16MHz 386, 386/ix 1.0.4		 1.95	    -	  2.21
	hank	16MHz 386, 386/ix 1.0.4		  .86	    -	   .89

	* Everything was compiled with -O
	* The test moved 2000, 1999, ..., 2, 1 ints

I learned this: Venix obviously doesn't have a good memcpy, and either Duff or nat makes a good replacement.  AT&T uses the block move in memcpy, and it is a definite win on both machines.  The AT&T 286 C compiler is terrible for these programs on the 286.  AT&T still has a slight edge over gcc *for these programs*.  Hank's version, while not portable, is a clear win on the 286 (disassemble memcpy and Hank's to see why), and virtually as good as memcpy on the 386.  Conclusion: you can't draw any.  It all depends on your computer and compiler.
--
Rick Richardson, PC Research, Inc. rick%pcrat.uucp@uunet.uu.net (INTERNET) uunet!pcrat!rick (UUCP, Personal Mail) ..!pcrat!jetroff (JetRoff Info) ..!pcrat!dry2 (Dhrystone Submissions)
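Rick doesn't show his driver, so here is a minimal sketch of the kind of harness behind a table like that (names and structure are my guesses): time repeated copies with the portable clock() interface.  Resolution and call overhead vary wildly by system, so the absolute numbers are only good for comparing variants on the same machine, which is all the table above claims.

```c
#include <string.h>
#include <time.h>

int bench_src[2000], bench_dst[2000];

/* Time `reps` memcpy calls over the whole array; returns seconds.
   clock() measures CPU time, which suits a tight copy loop better
   than wall time would. */
double time_memcpy(int reps)
{
    clock_t t0 = clock();
    int r;
    for (r = 0; r < reps; r++)
        memcpy(bench_dst, bench_src, sizeof bench_src);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}
```

Swapping in duff/nat/hank for the memcpy call gives the other rows of the table.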
mouse@mcgill-vision.UUCP (der Mouse) (09/02/88)
In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes: > In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes: >> ["Portable Optimized Copy" routine] > As long as we're getting into structured, portable, hacks, let me > suggest the following two ways of doing block copy: > 1. If the number of items/bytes is known at compile time, then you > can define a struct type of the appropriate size and use struct > assignment with type casts to make it fly. For example, [ int *p; int *q; /* want to copy 601 ints from p to q */ ] > struct t601 { int t[601]; }; > *((struct t601 *) q) = *((struct t601 *) p); Surprise! Your compiler has sizeof(int)=2, sizeof(long)=4, and aligns structs on long boundaries. You just copied 602 ints! > Unfortunately, the above code should have been written as: > if (n & 512) { > *(((struct t512 *) q)++) = *(((struct t512 *) p)++); > } > but, for some unknown reason, the VAX C compiler didn't like > that. Perhaps the "unknown reason" is that it's illegal! An expression formed by casting another expression is not an lvalue and never has been. Any compiler that accepts that is broken. (Yes, there exist such broken compilers.) What would you expect from double d; ((int)d) ++; Perhaps you'd want d+=1? Perhaps d=1+(int)d? Perhaps *(int*)&d=1+(int)d even? der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu
chris@mimsy.UUCP (Chris Torek) (09/02/88)
-In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: -> *(((struct t512 *) q)++) = *(((struct t512 *) p)++); -> but, for some unknown reason, the VAX C compiler didn't like that. In article <339@quintus.UUCP> jabir@quintus.uucp (Jabir Hussain) writes: -one portable way around that is to do something like: - - typedef char *caddr_t; - typedef union { - struct { int t[512]; } t512; - struct { int t[256]; } t256; - caddr_t caddr; - } ustruct_t; ... - if (n & 512) *dst.t512++ = *src.t512++; `Portable'? I suggest you try it on a Data General MV/10000. -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
alverson@decwrl.dec.com (Robert Alverson) (09/03/88)
In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
> Incidentally, this ran about 8x faster (on a VAX 11/780) than using
> the usual copy loop.  Unfortunately, the above code should have been
> written as:
>
>	if (n & 512) {
>		*(((struct t512 *) q)++) = *(((struct t512 *) p)++);
>	}
>	...
>
> but, for some unknown reason, the VAX C compiler didn't like that.
>
> >Enjoy.
> hankd@ee.ecn.purdue.edu

Actually, the VAX C compiler was completely correct.  Attempting to post-increment the result of a cast is not legal C.  One way around this is to use a little extra indirection:

	char* p;

	/* was: ((int *)p)++; */
	(*((int **) &p))++;	/* phew! */

The reasoning is that the result of a cast is an expression, not an lvalue.  Quite often, compilers relax this rule for coercions which produce no code.  In this case, it looks like VAX C doesn't.

Bob
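Both spellings, side by side as compilable functions (my names).  The alias trick assumes char* and int* share a representation, which K&R-era C let you get away with but which strict-aliasing rules in later compilers do not guarantee; plain pointer arithmetic does the same job portably.

```c
/* Alverson's extra-indirection trick: &p is a char**, reinterpret
   it as an int**, and the dereference *is* an lvalue of type int*,
   so ++ applies.  Works where char* and int* have one representation;
   undefined by today's aliasing rules. */
char *step_by_alias(char *p)
{
    (*((int **) &p))++;
    return p;
}

/* The portable spelling: same arithmetic, no lvalue trickery. */
char *step_portably(char *p)
{
    return p + sizeof(int);
}
```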
hankd@pur-ee.UUCP (Hank Dietz) (09/05/88)
Relative to my article <9064@pur-ee.UUCP> giving a struct-based hack for doing block copy and to all followups thereof....

Correction: I hereby apologize for having omitted a (:-) and hence having made it sound like I didn't know why the VAX compiler didn't want to use a cast pointer as the operand of ++.  I do know (and I knew even before a dozen of you reminded me :-), I just think it should be allowed in such cases; i.e., be implementation dependent rather than illegal.  Anyway, the union-based fix given by Jabir is certainly a good way to do it (once you fix his typedef to make t512 and t256 pointers to structs, rather than structs).

Comments: As I noted in the original posting, the use of a structure-hack block copy is subject to alignment problems -- if the target machine has alignment constraints.  There are three additional notes on this:

1. Both the source AND destination pointers may need to be aligned, which implies that the difference between the pointers MUST BE A MULTIPLE OF THE ALIGNMENT FACTOR.  Arbitrary copies will not have this property, hence, on such machines one must always have a byte-oriented copy routine to fall back on.  For a 4-byte alignment machine, the test is something like:

	if ((p - q) & 3) *byte copy* else *struct copy*

2. If the difference between the pointers is a multiple of the machine alignment factor, then copying up until one is aligned will ensure alignment of both.  Hence, we simply need to insert a few moves to achieve alignment BEFORE the struct copy code I gave earlier.  For a 4-byte alignment machine, the additional code is:

	if ((n >= 1) && (p & 1)) { *q = *p; ++q; ++p; --n; }
	if ((n >= 2) && (p & 2)) { *((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2; n-=2; }

Notice that, with the above code as a prefix, trailing alignment is also satisfied (proof left as an exercise to the reader).

3.
Although I know of no such C compiler, it is legal for a C compiler to make the size of a struct anything big enough to hold the things inside it. Hence, for example, a struct containing an array of 4 things might be the size of 5 things.... If this were done, the struct copy would nearly work; the only problem is that it could overshoot the end of the block. There is no general fix for this. hankd@ee.ecn.purdue.edu
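Hank's binary-tree copy, cut down to a runnable toy for n <= 15 ints (his full example goes up to t512; this condensation is mine).  Struct assignment does the block moves, and each power-of-two bit of n selects one assignment, so at most four branches copy any count.  As his closing caveat says, it assumes the compiler gives these structs no padding.

```c
struct t8 { int t[8]; };
struct t4 { int t[4]; };
struct t2 { int t[2]; };

/* Binary-tree struct copy for 0 <= n <= 15 ints.  Both p and q must
   be int-aligned, per the alignment discussion above. */
void tree_copy(int *q, const int *p, int n)
{
    if (n & 8) { *(struct t8 *)q = *(const struct t8 *)p; q += 8; p += 8; }
    if (n & 4) { *(struct t4 *)q = *(const struct t4 *)p; q += 4; p += 4; }
    if (n & 2) { *(struct t2 *)q = *(const struct t2 *)p; q += 2; p += 2; }
    if (n & 1) *q = *p;
}
```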
mark@bnr-rsc.UUCP (Mark MacLean) (09/06/88)
In article <10329@tekecs.TEK.COM> andrew@frip.gwd.tek.com (Andrew Klossner) writes:
>Several contributors have suggested that unrolling a byte-copy loop is
>a win.  On some architectures it is, but on a good pipelined system it
>may not be.  As an example, the program fragment
>
>	while (count--) {
>		to[i] = from[i];
>		++i;
>	}
>
>can be compiled to code on the M88k which copies memory as fast as a
>DMA controller could; the instructions to decrement, increment, and
>branch overlap with the data load/store requests.

If you have instructions to load, store, increment, decrement, and branch in your loop, then presumably this takes at least 5 clocks to perform only 2 memory accesses.  Wouldn't a DMA controller be able to perform a memory access every clock cycle?  Is it not possible to unroll the loop into an inline stream of instructions (if the length is small and known at compile time) to produce an instruction sequence which could perform a memory access every cycle?  If not, why not?  It would seem very un-RISCy if the 88000 was unable to do this.

Mark MacLean
pardo@june.cs.washington.edu (David Keppel) (09/07/88)
hankd@pur-ee.UUCP (Hank Dietz) writes:
> if ((p - q) & 3) *byte copy* else *struct copy*

I believe that the VAX "movc" command takes arbitrary pointers and does the following:

* If both are word-aligned, do a word copy (I mean a 4-byte word).
* If both are non-aligned and could be aligned with 1, 2, or 3 bytes of byte-copy at either end, then do a byte copy at either end and do a word copy down the middle.
* If neither aligned then ??

Unfortunately, my VAX hardware reference is out of town for a couple of weeks, so I can't ask him about neither aligned.  Anybody know?

I can imagine that on some machines it is faster to copy words into registers and repack the words in the registers rather than do a byte copy, since you could be taking advantage of some hardware gak.  Simple example:

	machine X has register W1 divided into B4, B5, B6, B7.  To do a
	copy, align the source pointer (doing byte copies) then read a
	word at a time into the W1 register, write it back out by writing
	B4, B5, B6, B7 (little-endian).

This is beginning to look suspiciously like the kinds of optimizations that get done for bit BLTs.  Anybody know if this ever really gets done?

;-D on ( Ahh.  Architecture at its finest )  Pardo
--
pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
rik@june.cs.washington.edu (Rik Littlefield) (09/07/88)
In article <5654@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes: > > I can immagine that on some machines it is faster to copy words into > register and repack the words in the registers rather than do a byte > copy, since you could be taking advantage of some hardware gak. > On the old CDC 6000-series machines (early RISCs...) that was the *only* practical way to do it, as well as being blazingly fast. We had copies that would handle arbitrary *bit* alignments at a cost of around 6 instructions and 2 memory references per 60-bit word, in the middle of the string. The sequence was basically fetch, shift, mask, mask, OR, and store, appropriately rearranged to minimize memory delay and functional unit conflicts, of course. I vaguely remember that this thing could even be unrolled a couple of times and still fit in the instruction cache ("stack", in those days) for machines expensive enough to have one. VAXen I don't know about for sure, but I'd be real surprised if their microcode didn't do the same thing. --Rik
news@amdcad.AMD.COM (Network News) (09/07/88)
In article <5654@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes:
| hankd@pur-ee.UUCP (Hank Dietz) writes:
| > if ((p - q) & 3) *byte copy* else *struct copy*
|
| I believe that the VAX "movc" command takes arbitrary pointers and
| does the following:
|
| * If both are word-aligned, do a word copy (I mean a 4-byte word).
| * If both are non-aligned and could be aligned with 1, 2, or 3 bytes
|   of byte-copy at either end, then do a byte copy at either end and do
|   a word copy down the middle.
| * If neither aligned then ??
|
| Unfortunately, my VAX hardware reference is out of town for a couple
| of weeks, so I can't ask him about neither aligned.  Anybody know?

I assume that "both are non-aligned" in the above list means that the source and the destination have the same alignment, but are not aligned with respect to a 4-byte boundary, and that "neither aligned" means that the source and the destination are mutually misaligned.  We use an interesting trick in the Am29000 memcpy routine for source/destination misalignment.  In this case, we set up the alignment difference in the funnel-count register, read in two source words, and "extract" a destination word using the funnel-shifter's ability to extract any 32-bit word from a 64-bit double-word in a single cycle.  The inner loop then consists of shifting the low source word to the high source word, reading a new low source word, extracting a destination and storing it (well, there is also the overhead of counting down the correct number of "word" moves, but you get the idea.)
--
Tim Olson Advanced Micro Devices (tim@delirun.amd.com)
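The funnel-shift "extract" Tim describes can be emulated portably with two shifts and an OR (this sketch is mine, not AMD's routine; on the Am29000 it is a single instruction).  Given the previous and current source words and a bit offset, it produces one aligned destination word from the 64-bit pair.

```c
#include <stdint.h>

/* Extract one aligned 32-bit word from the 64-bit pair (hi,lo),
   starting bit_offset bits into hi.  Big-endian word order within
   the pair is assumed here; swap the shifts for the other
   convention.  The offset-0 case is special because a 32-bit shift
   of a 32-bit value is undefined in C. */
uint32_t extract_word(uint32_t hi, uint32_t lo, unsigned bit_offset)
{
    if (bit_offset == 0)
        return hi;
    return (hi << bit_offset) | (lo >> (32 - bit_offset));
}
```

The inner loop of a misaligned copy then slides `lo` into `hi`, loads a fresh `lo`, and calls this once per stored word, matching Tim's description.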
aglew@urbsdc.Urbana.Gould.COM (09/07/88)
>I can immagine that on some machines it is faster to copy words into >register and repack the words in the registers rather than do a byte >copy, since you could be taking advantage of some hardware gak. > >Simple example: > > machine X has register W1 divided into B4, B5, B6, B7. To do a > copy, align the source pointer (doing byte copies) then read a > wrod-at-a-time into the W1 register, write it back out by writing > B4, B5, B6, B7 (little-endian). > >This is beginning to look suspiciously like the kinds of optimizations >that get done for bit BLTs. Anybody know if this ever really gets >done? Yep, it's done on Gould machines for halfwords to words in some copy routines. However, on the NP1 for "typical" distributions of operands, it turns out to be better to just copy at the greatest common denominator of alignment using the appropriate vector moves.
rpw3@amdcad.AMD.COM (Rob Warnock) (09/10/88)
In article <22859@amdcad.AMD.COM> tim@delirun.amd.com (Tim Olson) writes:
+---------------
| We use an interesting trick in the Am29000 memcpy routine for
| source/destination misalignment.  In this case, we set up the alignment
| difference in the funnel-count register, read in two source words, and
| "extract" a destination word using the funnel-shifter's ability to
| extract any 32-bit word from a 64-bit double-word in a single cycle.
| The inner loop then consists of shifting the low source word to the high
| source word, reading a new low source word, extracting a destination and
| storing it...
+---------------

It's even better than that, Tim.  In the non-aligned case, the inner loop of the "bcopy()" routine in the 4.3 port for the 29000 uses load-multiple to grab a bunch of source words (making good use of "burst mode" in the memory, if any), a straight-line series of extract's, then a store-multiple, with a couple more instructions at either end of the inner block to propagate the "left-over" bits to the next block.  Using this trick, an Am29000 can do large block transfers between arbitrary *bit* boundaries with an asymptotic overhead of only one more cycle per word than word-aligned block-copy.  Given single-cycle burst-mode memories (not hard to achieve, it's starting the burst that's expensive), that puts the asymptotic performance of non-aligned transfers at 3 cycles/word, and word-aligned transfers at 2 cycles/word.  Not too shabby...  An assembly-language version of Duff's device keeps the overhead for small transfers reasonable, too.

Rob Warnock Systems Architecture Consultant UUCP: {amdcad,fortune,sun}!redwood!rpw3 ATTmail: !rpw3 DDD: (415)572-2607 USPS: 627 26th Ave, San Mateo, CA 94403
mouse@mcgill-vision.UUCP (der Mouse) (09/11/88)
In article <339@quintus.UUCP>, jabir@quintus.uucp (Jabir Hussain) writes: > In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: >> *(((struct t512 *) q)++) = *(((struct t512 *) p)++); >> but, for some unknown reason, the VAX C compiler didn't like that. > one portable way around that is to do something like: [summarized] > union { struct { int t[512]; } *t512; ...; char *caddr; } [then use ++ on .t512] This falls hopelessly flat if char * and struct {...} * have significantly different representation. der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu
df@nud.UUCP (Dale Farnsworth) (09/14/88)
Mark MacLean (mark@bnr-rsc.UUCP) writes:
> Is it not possible to unroll the loop into an inline stream of instructions
> (if the length is small and known at compile time) to produce an instruction
> sequence which could perform a memory access every cycle?  If not, why not?
> It would seem very un-RISCy if the 88000 was unable to do this.

Of course, it is possible to do a memory access every cycle on the MC88100 (with a 2-cycle pipeline startup delay on the first memory access), but at 25 MHz your memory will have to have an effective cycle time of 40 ns.  If memory is slower, the processor can increment pointers and check termination conditions while waiting for memory.

-Dale