speck@cit-vlsi (Don Speck) (08/05/85)
I have noticed, of late, a few people trying to program around the fast dump that I wrote half a year ago. I'm still out here, and occasionally hack dump still more, in a vain effort to keep up with our burgeoning collection of disks (which has tripled in one year).

The version posted in February had several deficiencies, viz: it tended to "run on" after aborting, squashed interactive response, and couldn't stream a TU80 faster than 25 ips. Lack of streaming was particularly annoying, since we were about to get a streamer for our new 750's.

I already knew that 85% of the cpu time went to syscalls, so I profiled the kernel while running pieces of dump. Write() to a Unibus raw device has amazing overhead, most of it at spl6. Spl6 locks out clock interrupts - causing our 750 to lose time while dumping. Most of this overhead was a waste. For example, setting the lock bit on 20 pages took 3 ms, because each bit took a subroutine call, bit-field instructions, and a call on udiv() with divisor 2. Unlocking the pages takes as long, and setting the UBA map registers is not much faster.

I made a score of optimizations, in pcc, c2, and the kernel code, profiling after each one. Profiling from the system clock is pretty "noisy", so I often couldn't be sure whether a particular optimization had actually improved anything. I knew, though, that I was breaking the C compiler and kernel with distressing frequency. So I started over from backups, and left out the optimizations that I hadn't been able to measure. This left me with 3 code changes, which I posted in early February. Together they trimmed 2 ms from the write() time.

The dump that I posted still wouldn't stream on my friend's TU80, yet doing a bunch of write calls would stream beautifully. I discovered that a minimum of 15 back-to-back writes would kick the TU80 into 100 ips mode. It was trivial to modify standard 4.2bsd dump to group its writes into bursts of 15, and my friend still uses that variant. I posted it in late February. The approach ought to work on any Unix.

I now had two versions of dump, one for TU80's and one for everything else. The concurrent-processes version used pipes to tell the next process when it could safely write(). This worked correctly, but pipe writes don't force a context switch, so the next writer didn't wake up until the previous writer happened to block on a syscall. A TU80 at 100 ips can't wait that long - it repositions if the next write doesn't come within 9 milliseconds. Signals DO force a context switch to the signaled process. They're faster than pipes, too, so I got the TU80 to stream - on those occasions when the program didn't deadlock. Yuk!

I appealed to Unix-Wizards for suggestions, and got a dozen. Most suggested sticking with pipes and following the pipe write() with something that will immediately block, such as read() or select(). Mike Muuss sent me Doug Kingston's "dbcp", which works along those lines, and it did stream the TU80 - but just barely, off-and-on. Dbcp fell back to 25 ips after I threw out the kernel optimizations that I hadn't been able to measure. If I were to try to make "tar" stream, I'd use Doug Kingston's approach; it's portable.

Stew@mazama suggested flock(). The way he said it made me think he was just hyping featurism, but eventually I figured out how to apply flock() to token-passing. I wrote a crude double-buffered copy program that streams the TU80 at an average of 60 ips or so. I'm posting it to Unix-Sources.
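If you just want the flavor of the trick, here is a bare-bones sketch of the pipe variant - the one most of the Wizards suggested, and roughly what dbcp does - since it's the easiest to show in a few lines. My actual program uses flock() instead, and everything below (the names, BLKSIZE, the pipe plumbing) is made up for illustration; it is not code from dbcp or from what I'm posting. Two processes take turns: each waits for the read token, reads a disk block, hands the read token to its sibling, then waits for the write token, writes the block to tape, and hands the write token on. Every token write() is immediately followed by a read() that blocks, which is what gives the scheduler its chance.

/*
 * Sketch only: a two-process, pipe-token copy in the style of dbcp.
 * Each process reads stdin (the disk) and writes stdout (the tape);
 * one-byte "read" and "write" tokens circulate so the blocks stay in
 * order while one process writes and the other reads.
 */
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLKSIZE (10*1024)       /* one tape record - pick your own */

static void get(int fd) { char c; if (read(fd, &c, 1) != 1) _exit(0); }
static void put(int fd) { char c = 0; if (write(fd, &c, 1) != 1) _exit(0); }

static void copier(int rtok_in, int rtok_out, int wtok_in, int wtok_out)
{
    char buf[BLKSIZE];
    int n;

    for (;;) {
        get(rtok_in);           /* my turn to read the disk           */
        n = read(0, buf, BLKSIZE);
        put(rtok_out);          /* sibling may start its read now     */
        get(wtok_in);           /* block at once: my turn to write?   */
        if (n > 0 && write(1, buf, n) != n)
            _exit(1);
        put(wtok_out);          /* sibling may write its block now    */
        if (n <= 0)
            _exit(n < 0);       /* end of input: both sides wind down */
    }
}

int main(void)
{
    int r_ab[2], r_ba[2], w_ab[2], w_ba[2];     /* token pipes, A<->B */
    pid_t pid;

    if (pipe(r_ab) || pipe(r_ba) || pipe(w_ab) || pipe(w_ba)) {
        perror("pipe");
        return 1;
    }
    put(r_ba[1]);               /* prime both tokens so A goes first  */
    put(w_ba[1]);

    pid = fork();
    if (pid < 0) {
        perror("fork");
        return 1;
    }
    if (pid == 0)               /* process B */
        copier(r_ab[0], r_ba[1], w_ab[0], w_ba[1]);
    else                        /* process A */
        copier(r_ba[0], r_ab[1], w_ba[0], w_ab[1]);
    return 0;                   /* not reached */
}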
My latest dump works like the double-buffered copy program, except that triple-buffering turned out to be the optimum. The third process ensures that there's always a process reading the disk, even while the other two are context-switching. This matters because slow disks like RA81's need all the help they can get to keep up with a TU80. It's sad that after all that work, the RA81 is usually enough of a bottleneck to force the TU80 to fall back to 25 ips mode. Eagles do better: on the vaxes that have Eagles I get average throughputs of 95 ips with a TU77, and 22 ips @ 6250 bpi with a CDC 92185 streamer.

While I was recoding everything, I fixed some of the deficiencies I'd mentioned. It no longer "runs on" after an abort. It no longer uses any more CPU than the dump that came with 4.2bsd. Ignored interrupts now stay ignored. (A typical Berkeley bug, common to csh users, but inexcusable in a program intended to be run *only* from the single-user [Bourne] shell.) Tape length is accounted more accurately.

Davet@rand-unix reported that the "#ifdef RDUMP" code in the old version would hang on Sun Unix 1.x. It was sending a second command to the server before waiting for the first command to finish, which works fine on VAX 4.2bsd but confuses Suns. That optimization has been ifdef'd out. Be aware that I only have VAX sources; I can't guess what hacks Sun needed in their dump, and I refuse to decompile any more of their stripped a.out's.

I'm posting my heavily modified dumptape.c to unix-sources as a complete file, instead of posting the 913-line context diff containing essentially all of the old and new versions. This keeps the submission small and avoids giving away lots of Unix(TM) code. It also makes installation easy: put it in the source directory and type "make". Of course, you'll need 4.2bsd source code to do this.

	Don Speck	speck@cit-vax.arpa	ihnp4!cithep!cit-vax!speck
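P.S. For the curious: the two-process sketch above turns into the triple-buffered arrangement just by letting the tokens chase each other around a ring of three. A main() along the lines below (reusing get(), put(), and copier() from that sketch) would do it; again, this is my illustration, not the code in dumptape.c. With three processes, whenever one is writing tape and another is in the middle of a context switch, the third can already have a disk read outstanding.

#define NPROC 3                 /* triple-buffering: one extra reader */

int main(void)
{
    int rtok[NPROC][2], wtok[NPROC][2];         /* token-pipe ring    */
    int i;

    for (i = 0; i < NPROC; i++)
        if (pipe(rtok[i]) || pipe(wtok[i])) {
            perror("pipe");
            return 1;
        }
    put(rtok[0][1]);            /* prime both tokens: process 0 first */
    put(wtok[0][1]);

    /* Process i takes its tokens from pipe i and passes them on to
       pipe (i+1) mod NPROC, so they circulate 0 -> 1 -> 2 -> 0.      */
    for (i = 1; i < NPROC; i++)
        switch (fork()) {
        case -1:
            perror("fork");
            return 1;
        case 0:
            copier(rtok[i][0], rtok[(i+1)%NPROC][1],
                   wtok[i][0], wtok[(i+1)%NPROC][1]);
        }
    copier(rtok[0][0], rtok[1][1], wtok[0][0], wtok[1][1]);
    return 0;                   /* not reached */
}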