mjr@decuac.dec.com (Marcus J. Ranum) (12/24/90)
Does anyone have any notes on what Duff's Device is, and how it works that they can point me to? mjr. -- "Don't include .signature twice" [From the notebooks of a heretic, 1990] -- "Don't include .signature twice" [From the motebooks of a heretic, 1990]
bobmon@iuvax.cs.indiana.edu (RAMontante) (12/26/90)
Saved from an old posting (headers included for historical reference): __________________________________________________________________________ Article 12105 of comp.lang.c: Subject: Re: Explanation, please! Summary: Original citation From: td@alice.UUCP (Tom Duff) Organization: AT&T Bell Laboratories, Murray Hill NJ Date: 29 Aug 88 20:33:51 GMT Message-ID: <8144@alice.UUCP> I normally do not read comp.lang.c, but Jim McKie told me that ``Duff's device'' had come up in comp.lang.c again. I have lost the version that was sent to netnews in May 1984, but I have reproduced below the note in which I originally proposed the device. (If anybody has a copy of the netnews version, I would gratefully receive a copy at research!td or td@research.att.com.) To clear up a few points: 1) The point of the device is to express general loop unrolling directly in C. People who have posted saying `just use memcpy' have missed the point, as have those who have criticized it using various machine-dependent memcpy implementations as support. In fact, the example in the message is not implementable as memcpy, nor is any computer likely to have an memcpy-like idiom that implements it. 2) Somebody claimed that while the device was named for me, I probably didn't invent it. I almost certainly did invent it. I had definitely not seen or heard of it when I came upon it, and nobody has ever even claimed prior knowledge, let alone provided dates and times. Note the headers on the message below: apparently I invented the device on November 9, 1983, and was proud (or disgusted) enough to send mail to dmr. Please note that I do not claim to have invented loop unrolling, merely this particular expression of it in C. 3) The device is legal dpANS C. I cannot quote chapter and verse, but Larry Rosler, who was chairman of the language subcommittee (I think), has assured me that X3J11 considered it carefully and decided that it was legal. Somewhere I have a note from dmr certifying that all the compilers that he believes in accept it. Of course, the device is also legal C++, since Bjarne uses it in his book. 4) Somebody invoked (or more properly, banished) the `false god of efficiency.' Careful reading of my original note will put this slur to rest. The alternative to genuflecting before the god of code-bumming is finding a better algorithm. It should be clear that none such was available. If your code is too slow, you must make it faster. If no better algorithm is available, you must trim cycles. 5) The same person claimed that the device wouldn't exhibit the desired speed-up. The argument was flawed in two regards: first, it didn't address the performance of the device, but rather the performance of one of its few uses (implementing memcpy) for which many machines have a high-performance idiom. Second, the poster made his claims in the absence of timing data, which renders his assertion suspect. A second poster tried the test, but botched the implementation, proving only that with diligence it is possible to make anything run slowly. 6) Even Henry Spencer, who hit every other nail square on the end with the flat round thing stuck to it, made a mistake (albeit a trivial one). Here is Henry replying to bill@proxftl.UUCP (T. William Wells): >>... Dollars to doughnuts this code >>was written on a RISC machine. >Nope. Bell Labs Research uses VAXen and 68Ks, mostly. I was at Lucasfilm when I invented the device. 7) Transformations like this can only be justified by measuring the resulting code. Be careful when you use this thing that you don't unwind the loop so much that you overflow your machine's instruction cache. Don't try to be smarter than an over-clever C compiler that recognizes loops that implement block move or block clear and compiles them into machine idioms. Here then, is the original document describing Duff's device: From research!ucbvax!dagobah!td Sun Nov 13 07:35:46 1983 Received: by ucbvax.ARPA (4.16/4.13) id AA18997; Sun, 13 Nov 83 07:35:46 pst Received: by dagobah.LFL (4.6/4.6b) id AA01034; Thu, 10 Nov 83 17:57:56 PST Date: Thu, 10 Nov 83 17:57:56 PST From: ucbvax!dagobah!td (Tom Duff) Message-Id: <8311110157.AA01034@dagobah.LFL> To: ucbvax!decvax!hcr!rrg, ucbvax!ihnp4!hcr!rrg, ucbvax!research!dmr, ucbvax!research!rob Consider the following routine, abstracted from code which copies an array of shorts into the Programmed IO data register of an Evans & Sutherland Picture System II: send(to, from, count) register short *to, *from; register count; { do *to = *from++; while(--count>0); } (Obviously, this fails if the count is zero.) The VAX C compiler compiles the loop into 2 instructions (a movw and a sobleq, I think.) As it turns out, this loop was the bottleneck in a real-time animation playback program which ran too slowly by about 50%. The standard way to get more speed out of something like this is to unwind the loop a few times, decreasing the number of sobleqs. When you do that, you wind up with a leftover partial loop. I usually handle this in C with a switch that indexes a list of copies of the original loop body. Of course, if I were writing assembly language code, I'd just jump into the middle of the unwound loop to deal with the leftovers. Thinking about this yesterday, the following implementation occurred to me: send(to, from, count) register short *to, *from; register count; { register n=(count+7)/8; switch(count%8){ case 0: do{ *to = *from++; case 7: *to = *from++; case 6: *to = *from++; case 5: *to = *from++; case 4: *to = *from++; case 3: *to = *from++; case 2: *to = *from++; case 1: *to = *from++; }while(--n>0); } } Disgusting, no? But it compiles and runs just fine. I feel a combination of pride and revulsion at this discovery. If no one's thought of it before, I think I'll name it after myself. It amazes me that after 10 years of writing C there are still little corners that I haven't explored fully. (Actually, I have another revolting way to use switches to implement interrupt driven state machines but it's too horrid to go into.) Many people (even bwk?) have said that the worst feature of C is that switches don't break automatically before each case label. This code forms some sort of argument in that debate, but I'm not sure whether it's for or against. yrs trly Tom
johnl@iecc.cambridge.ma.us (John R. Levine) (12/27/90)
In article <1990Dec25.135417.27999@news.cs.indiana.edu> you write: >[a letter from Tom Duff in which he reports that he invented his device >in 1983.] I have personal knowledge of a previous, entirely independent invention of this technique, unrolling a loop into a C switch statement, at Yale in 1977 or 1978. In the early 1970's the Yale computer science department built the GEM system one of the earlier bit-mapped graphics systems based on then state-of-the-art 2K RAMs that Intel had given us. (Part of the appeal was that the fact that the RAMs weren't very reliable didn't matter, the worst that would happen would be a little screen bit rot.) It had a total of 256K 16-bit words of screen memory, 16K for each of 16 screens. Each screen also had a keyboard with a multiplexor feeding all the keystrokes back to the host. The host system was two loosely coupled computers, a PDP-11/45 which was the main system, and a PDP-11/05 that ran the terminal emulator and handled the keyboards. Both hosts could map chunks of screen memory into their address spaces, and there was a 12 bit wide serial link between them that looked enough like a paper tape reader/punch to fool the boostrap ROMs. Originally both machines ran homebrew code written in a homebrew language called IMP11, a relative of Irons' IMP10 and IMP72. In 1974-75 we decided to ditch the local operating system on the 11/45 and switch to 5th edition Unix which was already in use at Harvard and Princeton. We still used the IMP11 terminal emulator, though. In the 1975-76 academic year, I was working for the department maintaining the Gem system, and one of the things that I did was to rewrite the terminal emulator from scratch in C, saving only the character bitmaps from the IMP11 code. Even though the 11/05 only implemented a subset of the 11/45's instruction set, I used the regular Ritchie PDP-11 C compiler, having discovered that it was fairly easy to write code so that the C compiler didn't generate any of the instructions that the 11/05 didn't have. (It mainly lacked multiply and divide.) The new C terminal emulator was about twice as fast as the IMP11 version, partly because it was a lot simpler, having no task switcher but instead processing each character to completion as it arrived from the host, and partly because the C compiler had a better code generator. It processed about 700 characters per second, which seemed pretty fast at the time, and I went on to add lots of bells and whistles such as multiple character sets, area scrolls, and an "editor mode" for our local full-screen editor, a cousin of the Rand editor, that did all the simple character processing in the 11/05, only waking up the editor program on the 11/45 when it needed to do something non-trivial. This made the system load from the full screen editor comparable to that from "ed" which was important when we were trying to time-share ten students on a poor little 11/45. The following year, 1977-78, I became a graduate student and Gem management was taken over by John Mostenan, an undergraduate, supervised Bob Tuttle, a former grad student who was the facility manager, and had been my boss, too. One of the things they did was to speed up my emulator not by any fundamental changes but by careful recoding and tuning, trading size for speed since the emulator came nowhere near filling the 24K words available to it on the 11/05. For example, they went from line by line scrolling to jump scrolling, and from a block to a line cursor. Another thing they did was to use what is now known as Duff's device to unroll the code for arbitrary area scrolls. I think I had unrolled full-screen scrolls where I knew in advance that the screen was 72 characters and hence 36 words wide, but had the usual tight loop for other scrolls. By unrolling the area scrolls into switches they greatly increased scrolling speed. When they were done, their emulator was twice as fast as mine, a very noticable improvement. For various not very good reasons, the entire Gem project went almost entirely undocumented. Yale's CS department was very late to get on the Arpanet or any other network, and there was little incentive to publish anything other than one's thesis. I wrote a short overview as a departmental tech report in 1979 which appeared as an article in Software Practice and Experience in 1982. As far as I can tell, no other publications or theses resulted from it, though there was considerable good work. (We used xor-mode drawing all over the place from the beginning but thought it was so obvious that we never published anything about it, to my later regret.) The last notable thing that happened on the Gem system was that I used its 11/45 as a host to port 4BSD to the Vax 11/750 at Yale about the same time as Bill Joy was doing the same thing at Berkeley on an 11/780. (The 11/750 had a lot of microcode bugs that made the 11/780 code fail, most notably one that broke the inner loop in printf(). The Vax assembler code for printf probably still has Joy's comment "comet sucks" where he patched around it.) In about 1984, after I had left, all of the Gem listings and a lot of its hardware went into a dumpster when the department was running out of space.) I have no reason to doubt that Tom Duff did independently invent the unrolling technique in 1983, and he deserves considerable credit for bringing it to the attention of the C community. But I wouldn't be surprised to hear that it had been invented and lost several more times between the dawn of C in about 1970 and Duff's invention in 1983. Regards, John Levine, johnl@iecc.cambridge.ma.us, {spdcc|ima|world}!iecc!johnl
jc@minya.UUCP (John Chambers) (01/08/91)
[History of "Duff's Device"] > send(to, from, count) > register short *to, *from; > register count; > { > register n=(count+7)/8; > switch(count%8){ > case 0: do{ *to = *from++; > case 7: *to = *from++; > case 6: *to = *from++; > case 5: *to = *from++; > case 4: *to = *from++; > case 3: *to = *from++; > case 2: *to = *from++; > case 1: *to = *from++; > }while(--n>0); > } > } > > Disgusting, no? But it compiles and runs just fine. I feel a combination > of pride and revulsion at this discovery. If no one's thought of it before, > I think I'll name it after myself. Uh, why is this disgusting? It strikes me as simple and elegant. I have severe doubts that any minimally-competent C programm would have even the slightest trouble understanding it. It's not tricky, convoluted, confusing, or anything else negative. So why would someone object to it? Myself, I think that people object to simple, elegant things just because they are simple and elegant [but I don't claim to read minds. ;-] What I was hoping for was a simple, elegant solution that involved a goto. Now THAT would offend people! Oh, well, maybe next time. Of course, I would have put the while before the loop, but then, I'm rather paranoid. For n==0, the above code will copy 8 bytes, which could cause a SEG fault, and guess whose routine dbx/sbd/whatever would finger? [To satisfy the line count daemon.] -- Zippy-Says: Imagine ... a world without clothing folds, chiaroscuro, or marital difficulties ... Home: 1-617-484-6393 Work: 1-508-952-3274 Uucp: ...!{harvard.edu,ima.com,eddie.mit.edu,ora.com}!minya!jc (John Chambers) Uucp-map: minya adelie(DEAD)