speck@cit-vlsi (Don Speck) (08/05/85)
I have noticed, of late, a few people trying to program around the fast dump that I wrote half a year ago. I'm still out here, and occasionally hack dump still more, in a vain effort to keep up with our burgeoning collection of disks (which has tripled in one year).

The version posted in February had several deficiencies, viz: it tended to "run on" after aborting, squashed interactive response, and couldn't stream a TU80 faster than 25 ips. Lack of streaming was particularly annoying, since we were about to get a streamer for our new 750's.

I already knew that 85% of the cpu time went to syscalls, so I profiled the kernel while running pieces of dump. Write() to a Unibus raw device has amazing overhead, most of it at spl6. Spl6 locks out clock interrupts - causing our 750 to lose time while dumping. Most of this overhead was a waste. For example, setting the lock bit on 20 pages took 3 ms, because each bit took a subroutine call, bit-field instructions, and a call on udiv() with divisor 2. Unlocking the pages takes as long, and setting the UBA map registers is not much faster.

I made a score of optimizations, in pcc, c2, and the kernel code, profiling after each one. Profiling from the system clock is pretty "noisy", so I often couldn't be sure whether a particular optimization had actually improved anything. I knew, though, that I was breaking the C compiler and kernel with distressing frequency. So I started over from backups, and left out the optimizations that I hadn't been able to measure. This left me with 3 code changes, which I posted in early February. Together they trimmed 2 ms from the write() time.

The dump that I posted still wouldn't stream on my friend's TU80, yet doing a bunch of write calls would stream beautifully. I discovered that a minimum of 15 back-to-back writes would kick the TU80 into 100 ips mode. It was trivial to modify standard 4.2bsd dump to group its writes into bursts of 15, and my friend still uses that variant. I posted it in late February. The approach ought to work on any Unix.

I now had two versions of dump, one for TU80's and one for everything else. The concurrent-processes version used pipes to tell the next process when it could safely write(). This worked correctly, but pipe writes don't force a context switch, so the next writer didn't wake up until the previous writer happened to block on a syscall. A TU80 at 100 ips can't wait that long - it repositions if the next write doesn't come within 9 milliseconds. Signals DO force a context switch to the signaled process. They're faster than pipes, too, so I got the TU80 to stream - on those occasions when the program didn't deadlock. Yuk!

I appealed to Unix-Wizards for suggestions, and got a dozen. Most suggested sticking with pipes and following the pipe write() with something that will immediately block, such as read() or select(). Mike Muuss sent me Doug Kingston's "dbcp", which works along those lines, and it did stream the TU80 - but just barely, off-and-on. Dbcp fell back to 25 ips after I threw out the kernel optimizations that I hadn't been able to measure. If I were to try to make "tar" stream, I'd use Doug Kingston's approach; it's portable.

Stew@mazama suggested flock(). The way he said it made me think he was just hyping featurism, but eventually I figured out how to apply flock() to token-passing. I wrote a crude double-buffered copy program that streams the TU80 at an average of 60 ips or so. I'm posting it to Unix-Sources.
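If you just want the flavor of the trick, here is a bare-bones sketch of the pipe variant - the one most of the Wizards suggested, and roughly what dbcp does - since it's the easiest to show in a few lines. My actual program uses flock() instead, and everything below (the names, BLKSIZE, the pipe plumbing) is made up for illustration; it is not code from dbcp or from what I'm posting. Two processes take turns: each waits for the read token, reads a disk block, hands the read token to its sibling, then waits for the write token, writes the block to tape, and hands the write token on. Every token write() is immediately followed by a read() that blocks, which is what gives the scheduler its chance.

/*
 * Sketch only: a two-process, pipe-token copy in the style of dbcp.
 * Each process reads stdin (the disk) and writes stdout (the tape);
 * one-byte "read" and "write" tokens circulate so the blocks stay in
 * order while one process writes and the other reads.
 */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSIZE (10*1024)       /* one tape record - pick your own */

static void get(int fd) { char c; if (read(fd, &c, 1) != 1) _exit(0); }
static void put(int fd) { char c = 0; if (write(fd, &c, 1) != 1) _exit(0); }

static void copier(int rtok_in, int rtok_out, int wtok_in, int wtok_out)
{
    char buf[BLKSIZE];
    int n;

    for (;;) {
        get(rtok_in);           /* my turn to read the disk           */
        n = read(0, buf, BLKSIZE);
        put(rtok_out);          /* sibling may start its read now     */
        get(wtok_in);           /* block at once: my turn to write?   */
        if (n > 0 && write(1, buf, n) != n)
            _exit(1);
        put(wtok_out);          /* sibling may write its block now    */
        if (n <= 0)
            _exit(n < 0);       /* end of input: both sides wind down */
    }
}

int main(void)
{
    int r_ab[2], r_ba[2], w_ab[2], w_ba[2];     /* token pipes, A<->B */
    pid_t pid;

    if (pipe(r_ab) || pipe(r_ba) || pipe(w_ab) || pipe(w_ba)) {
        perror("pipe");
        return 1;
    }
    put(r_ba[1]);               /* prime both tokens so A goes first  */
    put(w_ba[1]);

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0)               /* process B */
        copier(r_ab[0], r_ba[1], w_ab[0], w_ba[1]);
    else                        /* process A */
        copier(r_ba[0], r_ab[1], w_ba[0], w_ab[1]);
    return 0;                   /* not reached */
}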
My latest dump works like the double-buffered copy program, except that triple-buffering turned out to be the optimum. The third process ensures that there's always a process reading the disk, even while the other two are context-switching. This matters because slow disks like RA81's need all the help they can get to keep up with a TU80. It's sad that after all that work, the RA81 is usually enough of a bottleneck to force the TU80 to fall back to 25 ips mode. Eagles do better: on the vaxes that have Eagles I get average throughputs of 95 ips with a TU77, and 22 ips @ 6250 bpi with a CDC 92185 streamer.

While I was recoding everything, I fixed some of the deficiencies I'd mentioned. It no longer "runs on" after an abort. It no longer uses any more CPU than the dump that came with 4.2bsd. Ignored interrupts now stay ignored. (A typical Berkeley bug, common to csh users, but inexcusable in a program intended to be run *only* from the single-user [Bourne] shell.) Tape length is accounted more accurately.

Davet@rand-unix reported that the "#ifdef RDUMP" code in the old version would hang on Sun Unix 1.x. It was sending a second command to the server before waiting for the first command to finish, which works fine on VAX 4.2bsd but confuses Suns. That optimization has been ifdef'd out. Be aware that I only have VAX sources; I can't guess what hacks Sun needed in their dump, and I refuse to decompile any more of their stripped a.out's.

I'm posting my heavily modified dumptape.c to unix-sources as a complete file, instead of posting the 913-line context diff containing essentially all of the old and new versions. This keeps the submission small and avoids giving away lots of Unix(TM) code. It also makes installation easy: put it in the source directory and type "make". Of course, you'll need 4.2bsd source code to do this.

	Don Speck	speck@cit-vax.arpa	ihnp4!cithep!cit-vax!speck
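P.S. For the curious: the two-process sketch above turns into the triple-buffered arrangement just by letting the tokens chase each other around a ring of three. A main() along the lines below (reusing get(), put(), and copier() from that sketch) would do it; again, this is my illustration, not the code in dumptape.c. With three processes, whenever one is writing tape and another is in the middle of a context switch, the third can already have a disk read outstanding.

#define NPROC 3                 /* triple-buffering: one extra reader */

int main(void)
{
    int rtok[NPROC][2], wtok[NPROC][2];         /* token-pipe ring    */
    int i;

    for (i = 0; i < NPROC; i++)
        if (pipe(rtok[i]) || pipe(wtok[i])) {
            perror("pipe");
            return 1;
        }
    put(rtok[0][1]);            /* prime both tokens: process 0 first */
    put(wtok[0][1]);

    /* Process i takes its tokens from pipe i and passes them on to
       pipe (i+1) mod NPROC, so they circulate 0 -> 1 -> 2 -> 0.      */
    for (i = 1; i < NPROC; i++)
        switch (fork()) {
        case -1:
            perror("fork");
            return 1;
        case 0:
            copier(rtok[i][0], rtok[(i+1)%NPROC][1],
                   wtok[i][0], wtok[(i+1)%NPROC][1]);
        }
    copier(rtok[0][0], rtok[1][1], wtok[0][0], wtok[1][1]);
    return 0;                   /* not reached */
}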