schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) (08/25/88)
The following piece of wonderful obscurity comes from Stroustup's C++ Programming Language book, page 100: void send(int *to,int *from, int count) { int n = (count + 7) / 8; switch(count % 8) { case 0: do { *to++ = *from++; case 7: *to++ = *from++; case 6: *to++ = *from++; case 5: *to++ = *from++; case 4: *to++ = *from++; case 3: *to++ = *from++; case 2: *to++ = *from++; case 1: *to++ = *from++; } while (--n > 0); } } Now, much to my surprise, this is not only valid C++, it is also valid C! Could some one please explain to me why this is so? It seems like the case 7-1 labels are actually nested inside the do {} while loop, and thus not in the scope of the switch (should a break statement exit both the switch and the loop, or just one?!?!). Finally, Stroustrup asks the rhetorical question ``why would anyone want to write something like this.'' Any guesses?! thanks, Doug Schmidt -- Douglas Schmidt "If our behavior is strict, we do not need fun." -Zippy th' Pinhead "If our behavior is struct, we do not need defun." -Anon
chris@mimsy.UUCP (Chris Torek) (08/25/88)
In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes: >The following piece of wonderful obscurity comes from Stroustup's >C++ Programming Language book, page 100: > switch(count % 8) { > case 0: do { *to++ = *from++; > case 7: *to++ = *from++; ... > case 1: *to++ = *from++; > } while (--n > 0); > } >Now, much to my surprise, this is not only valid C++, it is also valid C! >Could some one please explain to me why this is so? It seems like >the case 7-1 labels are actually nested inside the do {} while loop, >and thus not in the scope of the switch (should a break statement exit >both the switch and the loop, or just one?!?!). `break' exits the innermost switch or loop, hence a `break' in cases 7 through 1 exits the do-while, not the switch. >Finally, Stroustrup asks the rhetorical question ``why would anyone >want to write something like this.'' Any guesses?! This has been called `Duff's device' (after Tom Duff, who probably did not invent it first), and it looks exactly like what an optimising compiler generates when it does loop unrolling. It works because case labels are just that---labels. As long as there is a switch in scope, a case label is legal; the case applies to the closest switch. (And if C had used a separate keyword for `break switch', the whole thing could be consistent :-) .) Why? To quote a certain infamous C language hack :-) , `It looks exactly like what an optimising compiler generates when it does loop unrolling.' Incidentally, there are some compilers that choke on that form. I would have to look hard at the dpANS to decide whether it is Officially Legal. -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
gwyn@smoke.ARPA (Doug Gwyn ) (08/25/88)
In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes:
- switch(count % 8) {
- case 0: do { *to++ = *from++;
- case 7: *to++ = *from++;
- case 6: *to++ = *from++;
- case 5: *to++ = *from++;
- case 4: *to++ = *from++;
- case 3: *to++ = *from++;
- case 2: *to++ = *from++;
- case 1: *to++ = *from++;
- } while (--n > 0);
- }
-Now, much to my surprise, this is not only valid C++, it is also valid C!
-Could some one please explain to me why this is so?
Could you explain why it shouldn't be? "case" labels are just a
special form of label. You can stick a label on most statements.
-It seems like the case 7-1 labels are actually nested inside the
-do {} while loop, and thus not in the scope of the switch
?? "Thus"? Why would the labels go out of scope? They're
definitely within the body of the switch.
-(should a break statement exit both the switch and the loop, or just one?!?!)
A "break" applies to whatever the innermost container is. A "break"
between two of the "*to++ = *from++;" statements would exit the
do-while loop.
-Finally, Stroustrup asks the rhetorical question ``why would anyone
-want to write something like this.'' Any guesses?!
It's about the fastest way to move arbitrarily-aligned data in
portable C with a guarantee as to what happens in the case that the
data overlaps. memcpy() doesn't guarantee anything for overlaps.
memmove() does, but that's a recent X3J11 invention that probably
doesn't exist on your system yet.
levy@ttrdc.UUCP (Daniel R. Levy) (08/26/88)
< void send(int *to,int *from, int count) { < int n = (count + 7) / 8; < < switch(count % 8) { < case 0: do { *to++ = *from++; < case 7: *to++ = *from++; < case 6: *to++ = *from++; < case 5: *to++ = *from++; < case 4: *to++ = *from++; < case 3: *to++ = *from++; < case 2: *to++ = *from++; < case 1: *to++ = *from++; < } while (--n > 0); < } < } Question: what if count==0? -- |------------Dan Levy------------| THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY | Bell Labs Area 61 (R.I.P., TTY)| AND ARE NOT TO BE IMPUTED TO AT&T. | Skokie, Illinois | |-----Path: att!ttbcad!levy-----|
jbs@fenchurch.MIT.EDU (Jeff Siegal) (08/26/88)
In article <2873@ttrdc.UUCP> levy@ttrdc.UUCP (Daniel R. Levy) writes: >>[Duff's device loop example from C++ book] > >Question: what if count==0? The program breaks. I prefer (and also use, in highly-speed-sensitive code): #define duff16(counter, block) \ switch (counter & 0x0f) { \ do \ { \ counter -= 16; \ { block; } \ case 15: { block; } \ case 14: { block; } \ case 13: { block; } \ case 12: { block; } \ case 11: { block; } \ case 10: { block; } \ case 9: { block; } \ case 8: { block; } \ case 7: { block; } \ case 6: { block; } \ case 5: { block; } \ case 4: { block; } \ case 3: { block; } \ case 2: { block; } \ case 1: { block; } \ case 0: /* null statement */; \ } while (counter >= 16); \ } duff16(n, *to++ = *from++) Jeff Siegal
eric@snark.UUCP (Eric S. Raymond) (08/27/88)
(Code below reproduced so that comp.arch people seeing this followup only won't get terminally frustrated. This is *really neat*, gang...) In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes: > > void send(int *to,int *from, int count) { > int n = (count + 7) / 8; > > switch(count % 8) { > case 0: do { *to++ = *from++; > case 7: *to++ = *from++; > case 6: *to++ = *from++; > case 5: *to++ = *from++; > case 4: *to++ = *from++; > case 3: *to++ = *from++; > case 2: *to++ = *from++; > case 1: *to++ = *from++; > } while (--n > 0); > } > > } > > Finally, Stroustrup asks the rhetorical question ``why would anyone > want to write something like this.'' Any guesses?! Yeah. That's the most hackish way of trying to write a portable optimized copy routine I've ever seen. I gather the whole point of the shenanigans is to get all the *from++ -> *to++ instructions in the generated code to be adjacent. This only makes if the author knows he's got a hardware instruction pipeline or cache that's no less than 8 and no more than 9 byte-copy instruction widths long, and stuff executing out of the pipeline is a lot faster than if the copies are interleaved with control transfers. Dollars to doughnuts this code was written on a RISC machine. (Gawrsh. That sounded just like one of the big boys on comp.arch tawkin'. I think I'll cross-post over there just to see if I get shot down in flames...) -- Eric S. Raymond (the mad mastermind of TMN-Netnews) UUCP: ...!{uunet,att,rutgers}!snark!eric = eric@snark.UUCP Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718
henry@utzoo.uucp (Henry Spencer) (08/27/88)
In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes: >The following piece of wonderful obscurity comes from Stroustup's >C++ Programming Language book, page 100... Tom Duff, what hast thou wrought? :-) >... >Now, much to my surprise, this is not only valid C++, it is also valid C! >Could some one please explain to me why this is so? It seems like >the case 7-1 labels are actually nested inside the do {} while loop, >and thus not in the scope of the switch (should a break statement exit >both the switch and the loop, or just one?!?!). You're thinking of switch as if it were Pascal's case; the correct model is closer to Fortran's computed goto. The labels must appear within the case body, but they can be (essentially) *anywhere* within the body, including within nested blocks. The sensible programmer will not exploit this freedom except in truly unusual situations, but it is available. Break breaks out of the innermost construct that it could apply to, i.e. "just one". >Finally, Stroustrup asks the rhetorical question ``why would anyone >want to write something like this.'' Any guesses?! This construct is called "Duff's Device", after its discoverer. One can often make a loop run faster by "unrolling" it, duplicating its body N times and running one-Nth as many iterations -- this reduces the loop- control overhead by a factor of N. The annoying part is that the number of times one needs to do the body is usually not an even multiple of N, so one has to do a partial iteration at beginning or end. The usual assembly-language trick is to compute a start address in the middle of the loop and start by branching there, not to the beginning, thus doing a partial iteration first. Most high-level languages have no way to express this. Tom Duff (of Bell Labs) realized that it could be done in C. Ugh. -- Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology they just act that way. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
davidsen@steinmetz.ge.com (William E. Davidsen Jr) (08/27/88)
In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes: | The following piece of wonderful obscurity comes from Stroustup's | C++ Programming Language book, page 100: [ a do loop within a switch, with cases all through the do loop ] This does not seem to be clearly covered in the latest dpANS (section 3.6). The issue is if it is legal *and defined* to jump into a loop. While a case label behaves just like any other label, it's not clear what a jump into a loop implies, and I don't see that the existing standard is any clearer on the topic than the discussion was three years ago. Consider: i = 15; if (j < 10) goto BadMove; /* more code here */ do { int i = 7; int myvect[400]; BadMove: /* entry into block */ myvect[300] = i; if (j < 5) return i; } while (j-- > 302); There was a lot of discussion of this and I don't see a clarification in the standard as it stands. I think there are several possibilities: 1) it's not allowed 2) it's allowed, but the initializations at the start of the block don't take place. What does this imply about the block local variables? 3) it's allowed and the block local variables are allocated and initialized as you would expect. 4) it's allowed but the results are implementation dependent. In all other cases entry into a block causes allocation of the block local (auto) variables. Initialization is performed at the time of allocation. By allowing entry at any point the expected behavior is unspecified. Suggestion: Forget (1), it's there for completeness. Pick any of 2-4 and add it to the standard NOW, as an editorial change. We talked about this three years ago, it can hardly be considered a recent issue. I would not suggest (1) because it breaks a lot of existing code, but I would be happy to see a (5) "you can't jump into a block with local variables." This is one of the rare times when I would like to be wrong, to be told "oh, it's there in a footnote, you can, or you can't, and this is what it does." If the action is implementation dependent then I (personally) wouldn't use the feature, since I value portability highly. Please tell me X3J11 didn't let this go unresolved for three years. -- bill davidsen (wedu@ge-crd.arpa) {uunet | philabs | seismo}!steinmetz!crdos1!davidsen "Stupidity, like virtue, is its own reward" -me
gwyn@smoke.ARPA (Doug Gwyn ) (08/27/88)
In article <9957@eddie.MIT.EDU> jbs@fenchurch.MIT.EDU (Jeff Siegal) writes: >... I prefer (and also use, in highly-speed-sensitive code): ... It is worth noting that byte-at-a-time is not optimal on most architectures. Duff's device is useful for arbitrarily-aligned data, but if you can arrange for (most of) the source and destination buffers to be relatively word-aligned, then along with processing the head and tail pieces via something like Duff's device, the bulk of the data can be transferred in bigger chunks, gaining considerable increase in speed. One should really use memcpy() in all cases where the source and destination do not overlap, because all the work done to optimize block-move for a specific architecture are supposed to be centralized there. A couple of people have complained about the tone of my previous response on this topic. I meant it to be informative and minimal, i.e. requiring some thought on the reader's part. Since some have missed the point, let me summarize the C style issue: Since the constructs are not forbidden by the C language rules, they are perforce permitted. One should have an explicit rule to point at as being violated in order to think that the code is not valid. "Intuition" is a lousy guide, particularly when it was formed by experience in other contexts (e.g. Pascal). Now, because of the demands made on one's understanding of the language, the cited code would be considered bad programming style IF what it did could be effectively accomplished in some other way. As I pointed out in my previous posting on this, Duff's device does indeed have use in cases where other approaches fail, namely moving overlapping unaligned data. Eventually we can all use memmove() for this, but not today. Incidentally, Duff's device is used to manipulate text strings in the "sam" text editor. I recently changed it to use the device for both directions, only in cases of overlap, using memcpy() for all other cases. (Originally the device was used for all transfers in one direction and memcpy() for all in the other, which assumed more about the behavior of memcpy() than is officially guaranteed.) So this is not just an academic issue..
schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) (08/27/88)
Hi, Since I posted my original question there has been a great deal of abstract discussion about the relative merits of the loop unrolling scheme. The topic has piqued my curiousity, so I when ahead and implemented a short test program, included below, to test Duff's device against the ``ordinary for loop w/index variable'' technique. See for yourself.... After some quick testing I found that gcc 1.26 -O on a Sun 3 and a Sequent Balance was pretty heavily in favor of the regular (non-Duff) loop. Your mileage may vary. I realize that there may be other tests, and if anyone has a better version, I'd like to see it! Doug Schmidt ---------------------------------------- #include <sys/time.h> #include <sys/resource.h> double Start_Timer(); double Return_Elapsed_Time(); #define MAX_NUM 100000 int array1[MAX_NUM ]; int array2[MAX_NUM ]; int *A = array1, *B = array2; main(int argc, char *argv[]) { double Elapsed_Time; int Count = argc > 1 ? atoi(argv[1]) : MAX_NUM; int i; for (i = 0;i < Count ;i++) { array1[i] = i + 1; array2[i] = i; } printf("Starting Duff's device timing...\n"); Start_Timer(); { int n = (Count + 7) / 8; switch(Count % 8) { case 0: do { *A++ = *B++; case 7: *A++ = *B++; case 6: *A++ = *B++; case 5: *A++ = *B++; case 4: *A++ = *B++; case 3: *A++ = *B++; case 2: *A++ = *B++; case 1: *A++ = *B++; } while (--n > 0); } } Elapsed_Time = Return_Elapsed_Time(0.0 ); printf("Elapsed time = %.3f\n",Elapsed_Time); for (i = 0;i < Count ;i++) { if (array1[i] != array2[i]) { printf("Yow, problems at location %d!\n",i); break; } } for (i = 0;i < Count ;i++) { array1[i] = i + 1; array2[i] = i; } printf("Starting ordinary copy timing...\n"); Start_Timer(); for (i = 0;i < Count ;i++) { array1[i] = array2[i]; } Elapsed_Time = Return_Elapsed_Time(0.0 ); printf("Elapsed time = %.3f\n",Elapsed_Time); for (i = 0;i < Count ;i++) { if (array1[i] != array2[i]) { printf("Yow, problems at location %d!\n",i); break; } } } /* no such thing as "negative time"! */ #define ERROR_VALUE -1.0 static struct rusage Old_Time; static struct rusage New_Time; static int Timer_Set = 0; #ifdef __STDC__ double Start_Timer(void) #else double Start_Timer() #endif { Timer_Set = 1; getrusage(RUSAGE_SELF,&Old_Time); /* set starting process time */ return(Old_Time.ru_utime.tv_sec + (Old_Time.ru_utime.tv_usec / 1000000.0)); } /* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */ /* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then */ /* the time since the Old_Time was set is returned. Returns ERROR_VALUE */ /* if Start_Timer() is not called first */ #ifdef __STDC__ double Return_Elapsed_Time(double Last_Time) #else double Return_Elapsed_Time(Last_Time) double Last_Time; #endif { if (!Timer_Set) { return(ERROR_VALUE); } else { /* get process time */ getrusage(RUSAGE_SELF,&New_Time); if (Last_Time == 0.0) { return((New_Time.ru_utime.tv_sec - Old_Time.ru_utime.tv_sec) + ((New_Time.ru_utime.tv_usec - Old_Time.ru_utime.tv_usec) / 1000000.0)); } else { return((New_Time.ru_utime.tv_sec + (New_Time.ru_utime.tv_usec / 1000000.0)) - Last_Time); } } } -- schmidt@bonnie.ics.uci.edu (ARPA) "If our behavior is strict, we do not need fun." -Zippy th' Pinhead "If our behavior is struct, we do not need defun." -Anon
bill@proxftl.UUCP (T. William Wells) (08/27/88)
In article <638@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes: : The following piece of wonderful obscurity comes from Stroustup's : C++ Programming Language book, page 100: : : void send(int *to,int *from, int count) { : int n = (count + 7) / 8; : : switch(count % 8) { : case 0: do { *to++ = *from++; : case 7: *to++ = *from++; : case 6: *to++ = *from++; : case 5: *to++ = *from++; : case 4: *to++ = *from++; : case 3: *to++ = *from++; : case 2: *to++ = *from++; : case 1: *to++ = *from++; : } while (--n > 0); : } : : } : : Now, much to my surprise, this is not only valid C++, it is also valid C! : Could some one please explain to me why this is so? It seems like : the case 7-1 labels are actually nested inside the do {} while loop, : and thus not in the scope of the switch (should a break statement exit : both the switch and the loop, or just one?!?!). A switch statement has the syntax switch (<expression>) <statement> While it is conventional to use a compound statement as the embedded statement, there is nothing requiring it. What a switch does is permit the definition of a set of labels within the statement, to which control is passed by the head of the switch. The exclusive use of compound statements is not only conventional coding practice, it is also good coding practice. : Finally, Stroustrup asks the rhetorical question ``why would anyone : want to write something like this.'' Any guesses?! The false god of efficiency has reared it ugly head. This routine would be imagined to be more efficient than the almost equivalent: void send( int *to, int *from, int count) { if (count <= 0) { create progam bug } while (--count >= 0) { *to++ = *from++; } } However, it often (always?) is not. Consider a machine which has a fast means of moving small blocks of memory. The following routine should do the same thing (without the bug, of course), might be more efficient, and is more understandable as well. Of course, it might generate a smidgin more code, but I think that, in this case, this is a small price to pay. void send( int *to, int *from, int count) { if (count <= 0) { return; } switch (count % 7) { case 7: *to++ = *from++; case 6: *to++ = *from++; case 5: *to++ = *from++; case 4: *to++ = *from++; case 3: *to++ = *from++; case 2: *to++ = *from++; case 1: *to++ = *from++; } count >>= 3; while (--count >= 0) { *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; } } However, at least two IBM-PC compilers ought to generate better code from the "unoptimized" version. (Actually, they do it if the copied items are characters; I presume that they'll do it for integers.) I wouldn't be surprised if other compilers did as well. void send( int *to, int *from, int count) { /* If you are really paranoid, add the following code, it catches the case where count is minus full scale on a two's complement machine. I tend to not do this, since I consider minus full scale to be an illegal value and take the minimal pains needed to never generate it. if (count <= 0) { return; } */ while (--count >= 0) { *to++ = *from++; } } --- Bill novavax!proxftl!bill
bill@proxftl.UUCP (T. William Wells) (08/27/88)
In article <13198@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
: Incidentally, there are some compilers that choke on that form. I
: would have to look hard at the dpANS to decide whether it is Officially
: Legal.
It is.
---
Bill
novavax!proxftl!bill
henry@utzoo.uucp (Henry Spencer) (08/28/88)
In article <653@paris.ICS.UCI.EDU> schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes: >After some quick testing I found that gcc 1.26 -O on a Sun 3 and a >Sequent Balance was pretty heavily in favor of the regular (non-Duff) >loop... The odds are excellent that the compilers are applying optimizations to the regular loop that they can't do for the Duff loop as you've written it. In particular, your pointers aren't even register variables (and they are externs, so the compiler can't safely promote them quietly), whereas a good optimizing compiler will certainly be using register pointers in the regular loop. My partner in crime, Geoff Collyer, applied Duff's Device in a number of places in the C News rnews; the performance improvements were substantial. -- Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology they just act that way. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (08/28/88)
In article <634@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes: >[Duff's Device] >The false god of efficiency has reared it ugly head. This >routine would be imagined to be more efficient than the almost >equivalent: ... > while (--count >= 0) > *to++ = *from++; > >However, it often (always?) is not... On the contrary, it often (usually?) is. This is from experience, not theory. (Specifically, experience in C News and related code.) The major exceptions are (a) the 68010, on which unrolling of simple loops is a mistake because of "loop mode" [the hardware's very limited loop cache], and (b) machines with fairly smart compilers and specialized bulk-transfer instructions, as the compilers may be able to recognize and optimize the simple loop but not the complex one. In general, however, in the long run the correct way to implement bulk data copying is to call "memcpy", which (in the long run) is likely to be recognized and given special attention by most compilers. -- Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology they just act that way. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
levy@ttrdc.UUCP (Daniel R. Levy) (08/28/88)
In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes: > Hi, > > Since I posted my original question there has been a great deal of > abstract discussion about the relative merits of the loop unrolling > scheme. The topic has piqued my curiousity, so I when ahead and > implemented a short test program, included below, to test Duff's > device against the ``ordinary for loop w/index variable'' technique. > See for yourself.... > > After some quick testing I found that gcc 1.26 -O on a Sun 3 and a > Sequent Balance was pretty heavily in favor of the regular (non-Duff) > loop. Your mileage may vary. I realize that there may be other > tests, and if anyone has a better version, I'd like to see it! [schmidt's program deleted, please refer to parent article if you want it] I modified this program to run under System V, changed the arrays to be dynam- ically allocated, and changed both the Duff and ordinary copies to use register pointers instead of global pointers (for the Duff copy) and array indexing (for the ordinary copy). I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a Sun-4 both with and without -O optimization (using the standard pcc-type C compiler on each system). The result? Duff wins by about 10%-20% on all machines tested. Here is the code thus modified: #include <sys/types.h> #include <sys/times.h> #include <sys/param.h> double Start_Timer(); double Return_Elapsed_Time(); #define DEFAULT_NUM 100000 main(argc,argv) char **argv; { double Elapsed_Time_Duff; double Elapsed_Time_Ordinary; int Count = argc > 1 ? atoi(argv[1]) : DEFAULT_NUM; int i; register int *A, *B; register int n; register int *A_end; int *array1, *array2; char *malloc(); if (Count <= 0) { printf("Bad Count\n"); return 1; } else { printf("Count=%d\n", Count); } if (!(array1=(int *)malloc(sizeof(int) * Count))) { printf("Can't malloc %u bytes for array1\n", sizeof(int) * Count); return 1; } else if (!(array2=(int *)malloc(sizeof(int) * Count))) { printf ("Can't malloc %u bytes for array2\n", sizeof(int) * Count); return 1; } for (i = 0;i < Count ;i++) { array1[i] = i + 1; array2[i] = i; } printf("Starting Duff's device timing...\n"); Start_Timer(); { n = (Count + 7) / 8; A = array1; B = array2; switch(Count & 0x7) { case 0: do { *A++ = *B++; case 7: *A++ = *B++; case 6: *A++ = *B++; case 5: *A++ = *B++; case 4: *A++ = *B++; case 3: *A++ = *B++; case 2: *A++ = *B++; case 1: *A++ = *B++; } while (--n > 0); } } Elapsed_Time_Duff = Return_Elapsed_Time(0.0 ); printf("Elapsed time = %.3lf\n",Elapsed_Time_Duff); for (i = 0;i < Count ;i++) { if (array1[i] != array2[i]) { printf("Yow, problems at location %d!\n",i); break; } } for (i = 0;i < Count ;i++) { array1[i] = i + 1; array2[i] = i; } printf("Starting ordinary copy timing...\n"); Start_Timer(); { A = array1; B = array2; A_end = array1 + Count - 1; while (A <= A_end) *A++ = *B++; } Elapsed_Time_Ordinary = Return_Elapsed_Time(0.0 ); printf("Elapsed time = %.3lf\n",Elapsed_Time_Ordinary); for (i = 0;i < Count ;i++) { if (array1[i] != array2[i]) { printf("Yow, problems at location %d!\n",i); break; } } printf("Duff/Ordinary copy time ratio = %.3lf\n", Elapsed_Time_Duff/Elapsed_Time_Ordinary); } /* no such thing as "negative time"! */ #define ERROR_VALUE -1.0 static struct tms Old_Time; static struct tms New_Time; static int Timer_Set = 0; #ifdef __STDC__ double Start_Timer(void) #else double Start_Timer() #endif { Timer_Set = 1; times(&Old_Time); /* set starting process time */ return ((double)Old_Time.tms_utime)/(double)HZ; } /* returns process time since Last_Time (if parameter is not DEFAULT_TIME, */ /* i.e., (double) 0.0 ),otherwise, if parameter == DEFAULT_TIME then */ /* the time since the Old_Time was set is returned. Returns ERROR_VALUE */ /* if Start_Timer() is not called first */ #ifdef __STDC__ double Return_Elapsed_Time(double Last_Time) #else double Return_Elapsed_Time(Last_Time) double Last_Time; #endif { if (!Timer_Set) { return(ERROR_VALUE); } else { /* get process time */ times(&New_Time); if (Last_Time == 0.0) { return ((double)(New_Time.tms_utime - Old_Time.tms_utime))/(double) HZ; } else { return (((double)New_Time.tms_utime)/(double)HZ) - Last_Time; } } } -- |------------Dan Levy------------| THE OPINIONS EXPRESSED HEREIN ARE MINE ONLY | Bell Labs Area 61 (R.I.P., TTY)| AND ARE NOT TO BE IMPUTED TO AT&T. | Skokie, Illinois | |-----Path: att!ttbcad!levy-----|
gwyn@smoke.ARPA (Doug Gwyn ) (08/28/88)
In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes: >Dollars to doughnuts this code was written on a RISC machine. Send me my doughnuts! (Or dollars, if you prefer.) RISC machines were practically non-existent when this C code was first written down (whether or not by Tom Duff, I'm not sure). It probably appeared on a PDP-11, perhaps a VAX. I suspect the first use of Duff's device was as an "efficiency hack", but as you observe it is not universally the most efficient method. The modern VALID reason for such code is that it is a relatively efficient way to do something portably that CANNOT BE DONE by using the "preferred block copy routine" supplied by an implementation (be it bcopy() or memcpy()). Whenever the preferred routine will work correctly (in the portable case, that means whenever the source and destination fields do not overlap), then of course one should use it, especially since if the C vendors are doing their job, the C library routine (which could also be expanded in-line by the compiler) will be near-optimally tuned for each specific architecture. P.S. Some implementations of bcopy() and memcpy() do guarantee proper handling of overlapping fields. But not all do, so one cannot count on it if one's trying to write truly portable code. ANSI C will specify a new function memmove() with this property guaranteed.
henry@utzoo.uucp (Henry Spencer) (08/29/88)
In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes: >This only makes if the author knows he's got a hardware instruction pipeline >or cache that's no less than 8 and no more than 9 byte-copy instruction widths >long, and stuff executing out of the pipeline is a lot faster than if the >copies are interleaved with control transfers. Dollars to doughnuts this code >was written on a RISC machine. Nope. Bell Labs Research uses VAXen and 68Ks, mostly. The key point is not pipelining, but loop-control overhead. There is in fact a tradeoff here: unrolling the loop further will reduce control overhead further, but will increase code size. That last is of some significance when caching gets into the act: cache-loading overhead favors short loops, and small cache sizes very strongly favor short ones. In general there is an optimal point in there somewhere, and an unrolling factor of 8 or 16 is a pretty good guess at it on the machines I've looked at closely. -- Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology they just act that way. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
bill@proxftl.UUCP (T. William Wells) (08/29/88)
In article <11997@steinmetz.ge.com> davidsen@crdos1.UUCP (bill davidsen) writes:
: [ a do loop within a switch, with cases all through the do loop ]
:
: This does not seem to be clearly covered in the latest dpANS (section
: 3.6). The issue is if it is legal *and defined* to jump into a loop.
It is defined, and the same as it always has been. To quote from
section 3.1.2.4:
"Storage is guaranteed to be reserved for a new instance of such
an object [objects with *automatic storage duration*] on each
normal entry into the block in which it is declared, or on a jump
from outside the block to a label in the block or in an enclosed
block. If an initialization is specified for the value stored in
the object, it is performed on each normal entry, but not if the
block is entered by a jump to a label."
: While a case label behaves just like any other label, it's not clear
: what a jump into a loop implies, and I don't see that the existing
: standard is any clearer on the topic than the discussion was three years
: ago.
:
: i = 15;
: if (j < 10) goto BadMove;
:
: /* more code here */
:
: do {
: int i = 7;
: int myvect[400];
: BadMove: /* entry into block */
: myvect[300] = i;
: if (j < 5) return i;
: } while (j-- > 302);
The above tells us that your example 1) allocates space for i and
myvect before or as the block is entered, and 2) does not
initialize `i' when the loop is entered via the goto, but does on
each subsequent iteration of the loop.
: This is one of the rare times when I would like to be wrong, to be
: told "oh, it's there in a footnote, you can, or you can't, and this is
: what it does." If the action is implementation dependent then I
: (personally) wouldn't use the feature, since I value portability
: highly. Please tell me X3J11 didn't let this go unresolved for three
: years.
Nope. All they did was move it from 3.6.2 (the equivalent of K&R
9.2) to 3.1.2.4. I too have been bitten by not having the
standard say things in the expected place. (Remember my
NULL!=NULL posting? I got corrected by no less than dmr
himself! Flaming red :-)
From K&R section 9.2:
"Any initializations of auto or register variables are performed
each time the block is entered at the top. It is currently
possible (but a bad practice) to transfer into a block; in that
case the initializations are not performed."
Nothing could be clearer, eh?
---
Bill
novavax!proxftl!bill
bill@proxftl.UUCP (T. William Wells) (08/29/88)
In article <1988Aug28.031926.19222@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: : In article <634@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes: : >[Duff's Device] : >The false god of efficiency has reared it ugly head. This : >routine would be imagined to be more efficient than the almost : >equivalent: ... : > while (--count >= 0) : > *to++ = *from++; : > : >However, it often (always?) is not... : : On the contrary, it often (usually?) is. This is from experience, not : theory. (Specifically, experience in C News and related code.) Well, I'll bow to experimental evidence. However, that does make me wish that more compilers went to the effort to deal with this kind (the "unoptimized" version) of loop, seeing as how it is so commonly important. : In general, however, in the long run the correct way to implement bulk : data copying is to call "memcpy", which (in the long run) is likely : to be recognized and given special attention by most compilers. Agreed. That is how I would wish it to be done. --- Bill novavax!proxftl!bill
dave@onfcanim.UUCP (Dave Martindale) (08/30/88)
In article <1988Aug28.031926.19222@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: > >In general, however, in the long run the correct way to implement bulk >data copying is to call "memcpy", which (in the long run) is likely >to be recognized and given special attention by most compilers. For the case of copying bytes, yes. But there are other operations done in tight loops that do a bit more work than a simple byte move, so memcpy is not usable, yet little enough that unrolling the loop is still worthwhile. Thus Duff's Device will remain useful, until all compilers are smart enough to do loop unrolling on their own. For example: move while packing or unpacking data, table lookup, perhaps CRC generation. Dave Martindale
td@alice.UUCP (Tom Duff) (08/30/88)
I normally do not read comp.lang.c, but Jim McKie told me that ``Duff's device'' had come up in comp.lang.c again. I have lost the version that was sent to netnews in May 1984, but I have reproduced below the note in which I originally proposed the device. (If anybody has a copy of the netnews version, I would gratefully receive a copy at research!td or td@research.att.com.) To clear up a few points: 1) The point of the device is to express general loop unrolling directly in C. People who have posted saying `just use memcpy' have missed the point, as have those who have criticized it using various machine-dependent memcpy implementations as support. In fact, the example in the message is not implementable as memcpy, nor is any computer likely to have an memcpy-like idiom that implements it. 2) Somebody claimed that while the device was named for me, I probably didn't invent it. I almost certainly did invent it. I had definitely not seen or heard of it when I came upon it, and nobody has ever even claimed prior knowledge, let alone provided dates and times. Note the headers on the message below: apparently I invented the device on November 9, 1983, and was proud (or disgusted) enough to send mail to dmr. Please note that I do not claim to have invented loop unrolling, merely this particular expression of it in C. 3) The device is legal dpANS C. I cannot quote chapter and verse, but Larry Rosler, who was chairman of the language subcommittee (I think), has assured me that X3J11 considered it carefully and decided that it was legal. Somewhere I have a note from dmr certifying that all the compilers that he believes in accept it. Of course, the device is also legal C++, since Bjarne uses it in his book. 4) Somebody invoked (or more properly, banished) the `false god of efficiency.' Careful reading of my original note will put this slur to rest. The alternative to genuflecting before the god of code-bumming is finding a better algorithm. It should be clear that none such was available. If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles. 5) The same person claimed that the device wouldn't exhibit the desired speed-up. The argument was flawed in two regards: first, it didn't address the performance of the device, but rather the performance of one of its few uses (implementing memcpy) for which many machines have a high-performance idiom. Second, the poster made his claims in the absence of timing data, which renders his assertion suspect. A second poster tried the test, but botched the implementation, proving only that with diligence it is possible to make anything run slowly. 6) Even Henry Spencer, who hit every other nail square on the end with the flat round thing stuck to it, made a mistake (albeit a trivial one). Here is Henry replying to bill@proxftl.UUCP (T. William Wells): >>... Dollars to doughnuts this code >>was written on a RISC machine. >Nope. Bell Labs Research uses VAXen and 68Ks, mostly. I was at Lucasfilm when I invented the device. 7) Transformations like this can only be justified by measuring the resulting code. Be careful when you use this thing that you don't unwind the loop so much that you overflow your machine's instruction cache. Don't try to be smarter than an over-clever C compiler that recognizes loops that implement block move or block clear and compiles them into machine idioms. Here then, is the original document describing Duff's device: From research!ucbvax!dagobah!td Sun Nov 13 07:35:46 1983 Received: by ucbvax.ARPA (4.16/4.13) id AA18997; Sun, 13 Nov 83 07:35:46 pst Received: by dagobah.LFL (4.6/4.6b) id AA01034; Thu, 10 Nov 83 17:57:56 PST Date: Thu, 10 Nov 83 17:57:56 PST From: ucbvax!dagobah!td (Tom Duff) Message-Id: <8311110157.AA01034@dagobah.LFL> To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr, ucbvax!research!rob Consider the following routine, abstracted from code which copies an array of shorts into the Programmed IO data register of an Evans & Sutherland Picture System II: send(to, from, count) register short *to, *from; register count; { do *to = *from++; while(--count>0); } (Obviously, this fails if the count is zero.) The VAX C compiler compiles the loop into 2 instructions (a movw and a sobleq, I think.) As it turns out, this loop was the bottleneck in a real-time animation playback program which ran too slowly by about 50%. The standard way to get more speed out of something like this is to unwind the loop a few times, decreasing the number of sobleqs. When you do that, you wind up with a leftover partial loop. I usually handle this in C with a switch that indexes a list of copies of the original loop body. Of course, if I were writing assembly language code, I'd just jump into the middle of the unwound loop to deal with the leftovers. Thinking about this yesterday, the following implementation occurred to me: send(to, from, count) register short *to, *from; register count; { register n=(count+7)/8; switch(count%8){ case 0: do{ *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; }while(--n>0); } } Disgusting, no? But it compiles and runs just fine. I feel a combination of pride and revulsion at this discovery. If no one's thought of it before, I think I'll name it after myself. It amazes me that after 10 years of writing C there are still little corners that I haven't explored fully. (Actually, I have another revolting way to use switches to implement interrupt driven state machines but it's too horrid to go into.) Many people (even bwk?) have said that the worst feature of C is that switches don't break automatically before each case label. This code forms some sort of argument in that debate, but I'm not sure whether it's for or against. yrs trly Tom
chuck@amdahl.uts.amdahl.com (Charles Simmons) (08/30/88)
In article <2877@ttrdc.UUCP> levy@ttrdc.UUCP (Daniel R. Levy) writes: >In article <653@paris.ICS.UCI.EDU>, schmidt@bonnie.ics.uci.edu (Douglas C. Schmidt) writes: >> Since I posted my original question there has been a great deal of >> abstract discussion about the relative merits of the loop unrolling >> scheme. The topic has piqued my curiousity, so I when ahead and >> implemented a short test program, included below, to test Duff's >> device against the ``ordinary for loop w/index variable'' technique. >> See for yourself.... >> >> After some quick testing I found that gcc 1.26 -O on a Sun 3 and a >> Sequent Balance was pretty heavily in favor of the regular (non-Duff) >> loop. Your mileage may vary. I realize that there may be other >> tests, and if anyone has a better version, I'd like to see it! > >I modified this program to run under System V, changed the arrays to be dynam- >ically allocated, and changed both the Duff and ordinary copies to use register >pointers instead of global pointers (for the Duff copy) and array indexing (for >the ordinary copy). I then tried it on a SVR2 3B20, a SVR3 3B2, a Sun-3, and a >Sun-4 both with and without -O optimization (using the standard pcc-type C >compiler on each system). The result? Duff wins by about 10%-20% on all >machines tested. I then added a piece to the program to use 'memcpy'. The results? Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than Duff. So why do people spend so much time avoiding standard subroutines? -- Chuck
lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (08/30/88)
In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes: [discussion of Duff copy deleted] >I then added a piece to the program to use 'memcpy'. The results? >Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than >Duff. So why do people spend so much time avoiding standard subroutines? Sometimes the standard subroutines are implemented horribly. I was horrified when I saw that the machine dependent version of memcpy on the AT&T 3Bs is nothing but a byte by byte transfer written in assembly language. It is tricky, but doable, to speed this up by a roughly a factor of sizeof(long). In fact it already is done in the 3B implementation of the UNIX(tm) operating system in the copyin (?) routine. Why wasn't it done in memcpy too? Sigh. -- Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc lvc@cbnews.ATT.COM
nat@bales.UUCP (Nathaniel Stitt) (08/31/88)
In article <dpmuY#2EBC4R=eric@snark.UUCP> eric@snark.UUCP (Eric S. Raymond) writes: >(Code below reproduced so that comp.arch people seeing this followup only >won't get terminally frustrated. This is *really neat*, gang...) > >In article <638@paris.ics.uci.edu> Douglas C. Schmidt writes: >> >> void send(int *to,int *from, int count) { >> int n = (count + 7) / 8; >> >> switch(count % 8) { >> case 0: do { *to++ = *from++; >> case 7: *to++ = *from++; >> case 6: *to++ = *from++; >> case 5: *to++ = *from++; >> case 4: *to++ = *from++; >> case 3: *to++ = *from++; >> case 2: *to++ = *from++; >> case 1: *to++ = *from++; >> } while (--n > 0); >> } >> >> } >> >> Finally, Stroustrup asks the rhetorical question ``why would anyone >> want to write something like this.'' Any guesses?! > >Yeah. That's the most hackish way of trying to write a portable optimized >copy routine I've ever seen. I gather the whole point of the shenanigans >is to get all the *from++ -> *to++ instructions in the generated code to be >adjacent. > Here is my own personal version of the "Portable Optimized Copy" routine. It certainly seems more clear than the above example, and I would expect it to be at least as fast on virtually any machine. (Note the use of division and mod in the above routine.) Code does NOT have to be ugly to be fast. /* Copy 'count' ints from *from to *to. */ void nats_send(to, from, count) register int *to; register int *from; register int count; { /* Copy the bulk of the data */ while(count >= 8) { *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; count -= 8; } /* Copy the rest. */ if(count >= 4) { *to++ = *from++; *to++ = *from++; *to++ = *from++; *to++ = *from++; count -= 4; } if(count >= 2) { *to++ = *from++; *to++ = *from++; count -= 2; } if(count) *to++ = *from++; } Hope this helps. -- Nathaniel Stitt | This life is a test. It is only a test. Had Guidelines Software, Inc. | this been an actual life, you would have received ucbvax!cogsci!z!nat | further instructions as to what to do and where (415) 376-1395 | to go.
henry@utzoo.uucp (Henry Spencer) (08/31/88)
In article <8144@alice.UUCP> td@alice.UUCP (Tom Duff) writes: > [me] >Nope. Bell Labs Research uses VAXen and 68Ks, mostly. > > [td] I was at Lucasfilm when I invented the device. Mea culpa. I remembered that the article came from Bell Labs, but I forgot the little note on the end saying that the invention happened at Lucasfilm. Hokay, ve correct it: Nope. Lucasfilm used VAXen and 68Ks, mostly. -- Intel CPUs are not defective, | Henry Spencer at U of Toronto Zoology they just act that way. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
bill@proxftl.UUCP (T. William Wells) (08/31/88)
In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
: I then added a piece to the program to use 'memcpy'. The results?
: Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
: Duff. So why do people spend so much time avoiding standard subroutines?
Try some history, bud; it's good for what ails you.
I doubt that memcpy even existed then; and it is *not* standard
now. Perhaps it will be several years after the ANSI standard is
adopted, but not till then.
---
Bill
novavax!proxftl!bill
earl@mips.COM (Earl Killian) (09/01/88)
In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
: I then added a piece to the program to use 'memcpy'. The results?
: Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
: Duff. So why do people spend so much time avoiding standard subroutines?
Try some history, bud; it's good for what ails you.
I doubt that memcpy even existed then; and it is *not* standard
now. Perhaps it will be several years after the ANSI standard is
adopted, but not till then.
Perhaps a better reason:
According to the author, that code was used for copying to a 16-bit IO
device. It would have be illegal to use memcpy or bcopy because they
would make word references to the device.
--
UUCP: {ames,decwrl,prls,pyramid}!mips!earl
USPS: MIPS Computer Systems, 930 Arques Ave, Sunnyvale CA, 94086
lvc@cbnews.ATT.COM (Lawrence V. Cipriani) (09/01/88)
In article <1002@cbnews.ATT.COM>, lvc@cbnews.ATT.COM (that's me) wrote: > Sometimes the standard subroutines are implemented horribly. I was > horrified when I saw that the machine dependent version of memcpy > on the AT&T 3Bs is nothing but a byte by byte transfer written in > assembly language. ... Oops. I should have said this was on an early version (~1984) of a 3B5. New 3Bs don't have this deficiency. -- Larry Cipriani, AT&T Network Systems, Columbus OH, cbnews!lvc lvc@cbnews.ATT.COM
hankd@pur-ee.UUCP (Hank Dietz) (09/01/88)
In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes: > Here is my own personal version of the "Portable Optimized Copy" routine. .... then he gives a rather verbose, but structured, encoding.... As long as we're getting into structured, portable, hacks, let me suggest the following two ways of doing block copy: 1. If the number of items/bytes is known at compile time, then you can define a struct type of the appropriate size and use struct assign. with type casts to make it fly. For example, suppose p and q are pointers to ints and I want to copy 601 ints from p to q. Then I can write the fast and surprizingly portable: struct t601 { int t[601]; }; *((struct t601 *) q) = *((struct t601 *) p); Of course, you do have to watch-out for alignment problems, but if your compiler doesn't generate very fast code for this.... 2. If the number of items/bytes is not known, then build a binary tree of such structs and copy half, then half of what remains, etc. This is funny looking, but very fast also. Suppose the number of ints (n) is not known at compile time, but can't be more than 601. You can write: struct t512 { int t[512]; }; struct t256 { int t[256]; }; struct t128 { int t[128]; }; struct t64 { int t[64]; }; struct t32 { int t[32]; }; struct t16 { int t[16]; }; struct t8 { int t[8]; }; struct t4 { int t[4]; }; struct t2 { int t[2]; }; if (n & 512) { *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512; } if (n & 256) { *((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256; } if (n & 128) { *((struct t128 *) q) = *((struct t128 *) p); q+=128; p+=128; } if (n & 64) { *((struct t64 *) q) = *((struct t64 *) p); q+=64; p+=64; } if (n & 32) { *((struct t32 *) q) = *((struct t32 *) p); q+=32; p+=32; } if (n & 16) { *((struct t16 *) q) = *((struct t16 *) p); q+=16; p+=16; } if (n & 8) { *((struct t8 *) q) = *((struct t8 *) p); q+=8; p+=8; } if (n & 4) { *((struct t4 *) q) = *((struct t4 *) p); q+=4; p+=4; } if (n & 2) { *((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2; } if (n & 1) *q = *p; Notice that, in this case, n, p, and q should be declared as being register variables and that p and q are altered by this routine. Of course, you can copy larger things by making larger power-of-2 sized structs. Incidentally, this ran about 8x faster (on a VAX 11/780) than using the usual copy loop. Unfortunately, the above code should have been written as: if (n & 512) { *(((struct t512 *) q)++) = *(((struct t512 *) p)++); } ... but, for some unknown reason, the VAX C compiler didn't like that. Enjoy. hankd@ee.ecn.purdue.edu
gwyn@smoke.ARPA (Doug Gwyn ) (09/01/88)
In article <2945@wright.mips.COM> earl@mips.COM (Earl Killian) writes: >According to the author, that code was used for copying to a 16-bit IO >device. It would have be illegal to use memcpy or bcopy because they >would make word references to the device. Wrong reason. memcpy() etc. don't have the right behavior for stashing a sequence of values into the same destination location.
loren@pixar.UUCP (Loren Carpenter) (09/01/88)
The Duff Loop (as far as I know) was first cast into C by Tom Duff when he was at Lucasfilm in the early 1980's. We used it at Lucasfilm wherever we needed reasonable speed without resorting to assembly language. It obviously generalizes to more than memory copy & clear. I and others have used this control construct for many years, but always in assembly language. I learned it from Howard Schmeising at Boeing in 1969, where we were writing optimal stack code for CDC 6600's (a $7M RISC machine). We could get 5+ 60-bit Mflops if we worked at it. Loren Carpenter ...{ucbvax,sun}!pixar!loren
mouse@mcgill-vision.UUCP (der Mouse) (09/01/88)
In article <1988Aug28.031926.19222@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes: > In article <634@proxftl.UUCP> bill@proxftl.UUCP (T. William Wells) writes: >> [Duff's Device] >> The false god of efficiency has reared it ugly head. > In general, however, in the long run the correct way to implement > bulk data copying is to call "memcpy", which (in the long run) is > likely to be recognized and given special attention by most > compilers. Ultimately, Henry is correct (as usual). However, for people who want their code to run fast today, when memcpy() is a real routine call, it is worthwhile to actually time things. In particular, if you have a small amount of data to move, the routine call overhead may well be enough to completely swamp the speedup obtained by using a highly-tuned copy routine, particularly on machines like the VAX with high call overhead. Moral: If you care, take the time to find out how it *really* is. der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu
jim.nutt@p11.f15.n114.z1.fidonet.org (jim nutt) (09/01/88)
-> In article <ac4GLe9fit1010twl3.@amdahl.uts.amdahl.com>
-> chuck@amdahl.uts.amdahl.com (Charles Simmons) writes:
-> : I then added a piece to the program to use 'memcpy'. The results?
-> : Duff beats a simple loop by 10%. 'memcpy' is 9 times faster than
-> : Duff. So why do people spend so much time avoiding standard
-> subroutines?
->
-> Try some history, bud; it's good for what ails you.
->
-> I doubt that memcpy even existed then; and it is *not* standard
-> now. Perhaps it will be several years after the ANSI standard is
-> adopted, but not till then.
memcpy() IS in the ansi standard library, along with a similar function called memmove(). admittedly, the standard is not finalized yet but it is scheduled to be so soon. i would think that it would behoove compiler vendors to begin tracking the standard now if they aren't already...
jim nutt
'the computer handyman'
--
St. Joseph's Hospital/Medical Center - Usenet <=> FidoNet Gateway
Uucp: ...ncar!noao!asuvax!stjhmc!15.11!jim.nutt
Internet: jim.nutt@p11.f15.n114.z1.fidonet.org
rodc@unet.unet.pacbell.COM (Rod Creason) (09/01/88)
Newsgroups: comp.lang.c,comp.arch Subject: Re: Explanation, please! Summary: Expires: References: <638@paris.ics.uci.edu> <dpmuY#2EBC4R=eric@snark.UUCP> <189@bales.UUCP> <9064@pur-ee.UUCP> Sender: Reply-To: rodc@unet.PacBell.COM (Rod Creason) Followup-To: Distribution: Organization: Network Equipment Technologies, Redwood City, CA Keywords: byte copy In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: >2. If the number of items/bytes is not known, then build a binary tree of > such structs and copy half, then half of what remains, etc. This is > funny looking, but very fast also. Suppose the number of ints (n) is > not known at compile time, but can't be more than 601. You can write: > > struct t512 { int t[512]; }; > struct t256 { int t[256]; }; > struct t128 { int t[128]; }; > struct t64 { int t[64]; }; > struct t32 { int t[32]; }; > struct t16 { int t[16]; }; > struct t8 { int t[8]; }; > struct t4 { int t[4]; }; > struct t2 { int t[2]; }; > if (n & 512) { > *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512; > } > [ rest of the example ] A key point is that this code is *not* portable unless you can *guarantee* that q and p are ALWAYS aligned at a four-byte boundary. Any 3b2 (WE32000), for example, will core dump if this is not the case. Although 68000 based computers only require even alignment, they would still fail in most cases where this routine would be used. The key to a fast memory-to-memory copy is getting to these boundaries. (and in the case where q or p is an odd address and the other is at an even address, none of this will work) > hankd@ee.ecn.purdue.edu Rod Creason ...!{ames,amdahl,oliveb,pacbell}!unet!rodc
pf@csc.ti.com (Paul Fuqua) (09/01/88)
Date: Wednesday, August 31, 1988 9:13pm (CDT) From: loren at pixar.UUCP (Loren Carpenter) Subject: Re: Explanation, please! Newsgroups: comp.lang.c,comp.arch The Duff Loop (as far as I know) was first cast into C by Tom Duff when he was at Lucasfilm in the early 1980's. We used it at Lucasfilm wherever we needed reasonable speed without resorting to assembly language. It obviously generalizes to more than memory copy & clear. There was a discussion of the device in comp.protocols.tcp-ip not too long ago, as a way to speed up the calculation of checksums. The one point that I haven't seen brought out here yet is that if the unrolling factor is a power of two, the divide/remainder operations are simply a shift and a mask. pf Paul Fuqua Texas Instruments Computer Science Center, Dallas, Texas CSNet: pf@csc.ti.com (ARPA too, sometimes) UUCP: {smu, texsun, cs.utexas.edu, im4u, rice}!ti-csl!pf
klg@njsmu.UUCP (Kenneth Goodwin) (09/02/88)
In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes: > In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes: > > Here is my own personal version of the "Portable Optimized Copy" routine. > 2. If the number of items/bytes is not known, then build a binary tree of > such structs and copy half, then half of what remains, etc. This is > struct t512 { int t[512]; }; > struct t256 { int t[256]; }; > struct t128 { int t[128]; }; .... etc ..... > if (n & 512) { > *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512; > } > if (n & 256) { > *((struct t256 *) q) = *((struct t256 *) p); q+=256; p+=256; > } ... etc ... > Incidentally, this ran about 8x faster (on a VAX 11/780) than using > the usual copy loop. Unfortunately, the above code should have been > written as: > > if (n & 512) { > *(((struct t512 *) q)++) = *(((struct t512 *) p)++); > } > ... BUT This is where UNIONS come in handy, I used a similar although more brief technique for a faster version of a bmov() (byte move) subroutine on our PDP11-70 a while ago, and subsequently ported it to memcpy when we updated from V6 to System V. The basic idea that was used is to create a union of long, int, (short), and char pointers, use the character pointer to achieve the needed alignments and then use the largest available pointer to do the copy. There is no reason why a stucture copy could not be used, although I suspect on NON-VAX systems it may actually be detremental (sp?) in some cases. The PDP11 C compiler used to stuff registers onto the stack and create a 16 bit word copy loop to do structure copies using the freed registers, restoring them when it was done. So a structure copy would be the same as a word copy on that style of a system (ie, ones without block move instructions) So In the case of your example, a modified brief version of it would be: union ptr_types { struct t512 { int t512[512] } *t512; .... struct t32 { int t32[32] } *t32; long *t_long; int *t_int; short *t_short; char *t_char; } ; (probably could dispense with long and short pointers and related tests) memcpy(a, b, len) char *a; *b; { register union ptr_types a_ptr, b_ptr; a_ptr.t_char = a; b_ptr.t_char = b; while(NOT ON A WORD BOUNDARY AND CHARS LEFT) { *a_ptr.t_char++ = *b_ptr.t_char++; len--; } if(len >= sizeof(int) * 512) { /* if we can use a 512 int structure copy */ *a_ptr.t512++ = *b_ptr.t512++; len -= (512 * sizeof(int)); } /*M the biggest win is that the pointers increment correctly len -= (sizeof(*element pointer)) is the correct form over N INTS * sizeof int */ ....... I guess the rest is obvious, some GLUE may be needed that has not be shown.... :-) Boundaries should be checked on source and destination addresses to avoid memory faults.... As you may be given incompatible source and destination address that may require a full char by char copy. The first test loop sort of does this, but all the other copies should also check for proper address alignments before proceeding. Ken Goodwin NJSMU.
andrew@frip.gwd.tek.com (Andrew Klossner) (09/02/88)
Nathaniel Stitt writes: "Here is my own personal version of the "Portable Optimized Copy" routine. It certainly seems more clear than the above example, and I would expect it to be at least as fast on virtually any machine." then goes on to present a routine which uses follow-on code to handle the last few bytes after all octets have been copied. It's cleaner code, but it won't be quite as fast on many systems with instruction caches because it has fifteen byte-move instructions, replacing eight in the original, so more time is spent loading the loop into the i-cache. On systems with very small i-caches (my favorite example is the IBM 360/91 with 16 bytes), the bigger loop may not all fit into cache, and would be considerably slower. Several contributors have suggested that unrolling a byte-copy loop is a win. On some architectures it is, but on a good pipelined system it may not be. As an example, the program fragment while (count--) { to[i] = from[i]; ++i; } can be compiled to code on the M88k which copies memory as fast as a DMA controller could; the instructions to decrement, increment, and branch overlap with the data load/store requests. [If everything's in registers, indexing in this case is actually faster than keeping separate "to" and "from" pointers and incrementing both.] This assumes that "to" and "from" are pointers-to-ints or pointers-to-doubles. Copying less than a word at a time is slower. -=- Andrew Klossner (decvax!tektronix!tekecs!andrew) [UUCP] (andrew%tekecs.tek.com@relay.cs.net) [ARPA]
jabir@quintus.uucp (Jabir Hussain) (09/02/88)
In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: > struct t512 { int t[512]; }; > struct t256 { int t[256]; }; [...] > if (n & 512) { > *((struct t512 *) q) = *((struct t512 *) p); q+=512; p+=512; > } [...] > Unfortunately, the above code should have been > written as: > > if (n & 512) { > *(((struct t512 *) q)++) = *(((struct t512 *) p)++); > } > ... > > but, for some unknown reason, the VAX C compiler didn't like that. one portable way around that is to do something like: typedef char *caddr_t; typedef union { struct { int t[512]; } t512; struct { int t[256]; } t256; caddr_t caddr; } ustruct_t; { register ustruct_t src,dst; src.caddr = (caddr_t) p; dst.caddr = (caddr_t) q; if (n & 512) *dst.t512++ = *src.t512++; ... } jh
rick@pcrat.UUCP (Rick Richardson) (09/02/88)
I did a quick comparison of the times for "memcpy(,,n*4)", Duff's device, Nathaniel's "nat", and Hank's versions. Here's the results: TEST MACHINE ATT cc VENIXcc GNU1.26 memcpy 8Mhz 286, Venix SVR2 3.35 11.15 - duff 8Mhz 286, Venix SVR2 16.58 4.56 - nat 8Mhz 286, Venix SVR2 16.73 5.11 - hank 8Mhz 286, Venix SVR2 1.83 ccbarf - memcpy 16Mhz 386, 386/ix 1.0.4 .82 - .82 duff 16Mhz 386, 386/ix 1.0.4 2.01 - 2.26 nat 16Mhz 386, 386/ix 1.0.4 1.95 - 2.21 hank 16Mhz 386, 386/ix 1.0.4 .86 - .89 * Everything was compiled with -O * test moved 2000,1999,...,2,1 int's I learned this: Venix obviously doesn't have a good memcpy, and Duff, or nat make a good replacement. AT&T uses the block move in memcpy, and it is a definite win on both machines. The AT&T 286 C compiler is terrible for these program on the 286. AT&T still has a slight edge over gcc *for these programs*. Hanks version, while not portable, is a clear win on the 286 (disassemble memcpy and hank's to see why), and virtually as good as memcpy on 386. Conclusion: you can't make any. It all depends on your computer and compiler. -- Rick Richardson, PC Research, Inc. rick%pcrat.uucp@uunet.uu.net (INTERNET) uunet!pcrat!rick (UUCP, Personal Mail) ..!pcrat!jetroff (JetRoff Info) ..!pcrat!dry2 (Dhrystone Submissions)
mouse@mcgill-vision.UUCP (der Mouse) (09/02/88)
In article <9064@pur-ee.UUCP>, hankd@pur-ee.UUCP (Hank Dietz) writes: > In article <189@bales.UUCP>, nat@bales.UUCP (Nathaniel Stitt) writes: >> ["Portable Optimized Copy" routine] > As long as we're getting into structured, portable, hacks, let me > suggest the following two ways of doing block copy: > 1. If the number of items/bytes is known at compile time, then you > can define a struct type of the appropriate size and use struct > assignment with type casts to make it fly. For example, [ int *p; int *q; /* want to copy 601 ints from p to q */ ] > struct t601 { int t[601]; }; > *((struct t601 *) q) = *((struct t601 *) p); Surprise! Your compiler has sizeof(int)=2, sizeof(long)=4, and aligns structs on long boundaries. You just copied 602 ints! > Unfortunately, the above code should have been written as: > if (n & 512) { > *(((struct t512 *) q)++) = *(((struct t512 *) p)++); > } > but, for some unknown reason, the VAX C compiler didn't like > that. Perhaps the "unknown reason" is that it's illegal! An expression formed by casting another expression is not an lvalue and never has been. Any compiler that accepts that is broken. (Yes, there exist such broken compilers.) What would you expect from double d; ((int)d) ++; Perhaps you'd want d+=1? Perhaps d=1+(int)d? Perhaps *(int*)&d=1+(int)d even? der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu
chris@mimsy.UUCP (Chris Torek) (09/02/88)
-In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: -> *(((struct t512 *) q)++) = *(((struct t512 *) p)++); -> but, for some unknown reason, the VAX C compiler didn't like that. In article <339@quintus.UUCP> jabir@quintus.uucp (Jabir Hussain) writes: -one portable way around that is to do something like: - - typedef char *caddr_t; - typedef union { - struct { int t[512]; } t512; - struct { int t[256]; } t256; - caddr_t caddr; - } ustruct_t; ... - if (n & 512) *dst.t512++ = *src.t512++; `Portable'? I suggest you try it on a Data General MV/10000. -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
alverson@decwrl.dec.com (Robert Alverson) (09/03/88)
In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: > Incidentally, this ran about 8x faster (on a VAX 11/780) than using > the usual copy loop. Unfortunately, the above code should have been > written as: > > if (n & 512) { > *(((struct t512 *) q)++) = *(((struct t512 *) p)++); > } > ... > > but, for some unknown reason, the VAX C compiler didn't like that. > > >Enjoy. > hankd@ee.ecn.purdue.edu Actually, the VAX C compiler was completely correct. Attempting to post-increment the result of a cast is not legal C. One way around this is to use a little extra indirection: char* p; /* was: ((int *)p)++; */ (*((int **) &p))++; /* phew! */ The reasoning is that the result of a cast is an expression, not a lvalue. Quite often, compilers relax this rule for coercions which produce no code. In this case, it looks like the VAX C doesn't Bob
hankd@pur-ee.UUCP (Hank Dietz) (09/05/88)
Relative to my article <9064@pur-ee.UUCP> giving a struct-based hack for doing block copy and to all followups thereof.... Correction: I hereby appologize for having ommitted a (:-) and hence having made it sound like I didn't know why the VAX compiler didn't want to use a cast pointer as the operand of ++. I do know (and I knew even before a dozen of you reminded me :-), I just think it should be allowed in such cases; i.e., be implementation dependent rather than illegal. Anyway, the union-based fix given by Jabir is certainly a good way to do it (once you fix his typedef to make t512 and t256 pointers to structs, rather than structs). Comments: As I noted in the original posting, the use of a structure-hack block copy is subject to alignment problems -- if the target machine has alignment constraints. There are three additional notes on this: 1. Both the source AND destination pointers may need to be aligned, which implies that the difference between the pointers MUST BE A MULTIPLE OF THE ALIGNMENT FACTOR. Arbitrary copies will not have this property, hence, on such machines one must always have a byte-oriented copy routine to fall back on. For a 4-byte alignment machine, the test is something like: if ((p - q) & 3) *byte copy* else *struct copy* 2. If the difference between the pointers is a multiple of the machine alignment factor, then copying up until one is aligned will insure alignment of both. Hence, we simply need to insert a few moves to achieve alignment BEFORE the struct copy code I gave earlier. For a 4-byte alignment machine, the additional code is: if ((n >= 1) && (p & 1)) { *q = *p; ++q; ++p; --n; } if ((n >= 2) && (p & 2)) { *((struct t2 *) q) = *((struct t2 *) p); q+=2; p+=2; n-=2; } Notice that, with the above code as a prefix, trailing alignment is also satisfied (proof left as an exercise to the reader). 3. Although I know of no such C compiler, it is legal for a C compiler to make the size of a struct anything big enough to hold the things inside it. Hence, for example, a struct containing an array of 4 things might be the size of 5 things.... If this were done, the struct copy would nearly work; the only problem is that it could overshoot the end of the block. There is no general fix for this. hankd@ee.ecn.purdue.edu
pardo@june.cs.washington.edu (David Keppel) (09/07/88)
hankd@pur-ee.UUCP (Hank Dietz) writes: > if ((p - q) & 3) *byte copy* else *struct copy* I believe that the VAX "movc" command takes arbitrary pointers and does the following: * If both are word-aligned, do a word copy (I mean a 4-byte word). * If both are non-aligned and could be aligned with 1, 2, or 3 bytes of byte-copy at either end, then do a byte copy at either end and do a word copy down the middle. * If niether aligned then ?? Unfortunately, my VAX hardware reference is out of town for a couple of weeks, so I can't ask him about neither aligned. Anybody know? I can immagine that on some machines it is faster to copy words into register and repack the words in the registers rather than do a byte copy, since you could be taking advantage of some hardware gak. Simple example: machine X has register W1 divided into B4, B5, B6, B7. To do a copy, align the source pointer (doing byte copies) then read a wrod-at-a-time into the W1 register, write it back out by writing B4, B5, B6, B7 (little-endian). This is beginning to look suspiciously like the kinds of optimizations that get done for bit BLTs. Anybody know if this ever really gets done? ;-D on ( Ahh. Architecture at its finest ) Pardo -- pardo@cs.washington.edu {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
rik@june.cs.washington.edu (Rik Littlefield) (09/07/88)
In article <5654@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes: > > I can immagine that on some machines it is faster to copy words into > register and repack the words in the registers rather than do a byte > copy, since you could be taking advantage of some hardware gak. > On the old CDC 6000-series machines (early RISCs...) that was the *only* practical way to do it, as well as being blazingly fast. We had copies that would handle arbitrary *bit* alignments at a cost of around 6 instructions and 2 memory references per 60-bit word, in the middle of the string. The sequence was basically fetch, shift, mask, mask, OR, and store, appropriately rearranged to minimize memory delay and functional unit conflicts, of course. I vaguely remember that this thing could even be unrolled a couple of times and still fit in the instruction cache ("stack", in those days) for machines expensive enough to have one. VAXen I don't know about for sure, but I'd be real surprised if their microcode didn't do the same thing. --Rik
news@amdcad.AMD.COM (Network News) (09/07/88)
In article <5654@june.cs.washington.edu> pardo@uw-june.UUCP (David Keppel) writes: | hankd@pur-ee.UUCP (Hank Dietz) writes: | > if ((p - q) & 3) *byte copy* else *struct copy* | | I believe that the VAX "movc" command takes arbitrary pointers and | does the following: | | * If both are word-aligned, do a word copy (I mean a 4-byte word). | * If both are non-aligned and could be aligned with 1, 2, or 3 bytes | of byte-copy at either end, then do a byte copy at either end and do | a word copy down the middle. | * If niether aligned then ?? | | Unfortunately, my VAX hardware reference is out of town for a couple | of weeks, so I can't ask him about neither aligned. Anybody know? I assume that "both are non-aligned" in the above list means that the source and the destination have the same alignment, but are not aligned with respect to a 4-byte boundary, and that "neither aligned" means that the source and the destination are misaligned. We use an interesting trick in the Am29000 memcpy routine for source/destination misalignment. In this case, we set up the alignment difference in the funnel-count register, read in two source words, and "extract" a destination word using the funnel-shifter's ability to extract any 32-bit word from a 64-bit double-word in a single cycle. The inner loop then consists of shifting the low source word to the high source word, reading a new low source word, extracting a destination and storing it (well, there is also the overhead of counting down the correct number of "word" moves, but you get the idea.) -- Tim Olson Advanced Micro Devices (tim@delirun.amd.com)
bill@proxftl.UUCP (T. William Wells) (09/08/88)
In article <485.231DAAC3@stjhmc.fidonet.org> jim.nutt@p11.f15.n114.z1.fidonet.org (jim nutt) writes:
: -> I doubt that memcpy even existed then; and it is *not* standard
: -> now. Perhaps it will be several years after the ANSI standard is
: -> adopted, but not till then.
:
: memcpy() IS in the ansi standard library, along with a similar
: function called memmove().
Yes, I know. I said "it isn't standard", not "it isn't in the
standard". The difference is that "it isn't standard" means that
you can't write portable code that assumes that it exists; the
other should be obvious.
---
Bill
novavax!proxftl!bill
mouse@mcgill-vision.UUCP (der Mouse) (09/11/88)
In article <339@quintus.UUCP>, jabir@quintus.uucp (Jabir Hussain) writes: > In article <9064@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes: >> *(((struct t512 *) q)++) = *(((struct t512 *) p)++); >> but, for some unknown reason, the VAX C compiler didn't like that. > one portable way around that is to do something like: [summarized] > union { struct { int t[512]; } *t512; ...; char *caddr; } [then use ++ on .t512] This falls hopelessly flat if char * and struct {...} * have significantly different representation. der Mouse old: mcgill-vision!mouse new: mouse@larry.mcrcim.mcgill.edu