[comp.unix.wizards] Why is restore so slow?

jerry@olivey.olivetti.com (Jerry Aguirre) (01/26/91)

Those familiar with using dump and restore will have noticed the
difference in speed between them.  The dump procedure, especially with
the current multi-buffered version, usually sings along at close to full
tape speed.  Restore, on the other hand, is a real dog, taking up to 10
times as long for the same amount of data.

Has anyone done any evaluations of why there is such an extreme
difference in speed?  Granted that creating files involves more overhead
than dumping them, restore still seems very slow.  As restore operates on
the mounted file system, it has the advantage of accessing a buffered
file system with write-behind.

My particular theory is that the disk buffering algorithms are precisely
wrong for restore.  By this I mean they keep in the buffers the data
that will never be needed again and flush the data that will.  I plan to
do some experimentation and would appreciate hearing any ideas you might
offer.

				Jerry Aguirre

liam@cs.qmw.ac.uk (William Roberts;) (01/28/91)

In <50235@olivea.atc.olivetti.com> jerry@olivey.olivetti.com (Jerry Aguirre) 
writes:

>Has anyone done any evaluations of why there is such an extreme
>difference in speed?  Granted that creating files involves more overhead
>than dumping them, restore still seems very slow.  As restore operates on
>the mounted file system, it has the advantage of accessing a buffered
>file system with write-behind.

>My particular theory is that the disk buffering algorithms are precisely
>wrong for restore.  By this I mean they keep in the buffers the data
>that will never be needed again and flush the data that will.  I plan to
>do some experimentation and would appreciate hearing any ideas you might
>offer.

Restore suffers from the fact that files are stored in inode-number order: 
this is not the ideal order for creating files, as it thrashes the namei cache 
because the files are recreated randomly all over the place. We reorganised 
our machine once and used dump/restore to move our /usr/spool and /usr/mail 
partitions around: /usr/spool contains lots of tiny files called things like 
/usr/spool/news/comp/unix/internals/5342, and this took an incredibly long time 
to restore. /usr/mail contains several hundred files but no subdirectories, and 
restored in about the same sort of time as it took to dump. 

Restore suffers the same accidents of history as a lot of other system 
utilities. It dates back to sub-1-megabyte memory machines (maybe 64K separate 
I/D space PDPs) and so it uses pathetically small buffers. If you want to 
speed up restore, steal maybe 2 Megabytes of memory as file buffers, fill that 
with file images to be restored and then write them to disk in something 
resembling a depth-first traversal of the directory tree. This costs a little 
bit of ingenuity but doesn't involve any kernel changes.
--

William Roberts                 ARPA: liam@cs.qmw.ac.uk
Queen Mary & Westfield College  UUCP: liam@qmw-cs.UUCP
Mile End Road                   AppleLink: UK0087
LONDON, E1 4NS, UK              Tel:  071-975 5250 (Fax: 081-980 6533)

slevy@poincare.geom.umn.edu (Stuart Levy) (01/29/91)

In article <2880@redstar.cs.qmw.ac.uk> liam@cs.qmw.ac.uk (William Roberts;) writes:
>In <50235@olivea.atc.olivetti.com> jerry@olivey.olivetti.com (Jerry Aguirre) 
>writes:
>
>>Has anyone done any evaluations of why there is such an extreme
>>difference in speed?  Granted that creating files involves more overhead
>>than dumping them, restore still seems very slow.  As restore operates on
>>the mounted file system, it has the advantage of accessing a buffered
>>file system with write-behind.

Aha, that's the problem.  On BSD-derived systems (e.g. Suns) at least,
there are lots of synchronous operations done as files are created,
space allocated, and directories modified.  The point is to make the
filesystem more robust -- a system crash in mid-update doesn't leave corrupted
directories, blocks on the free list that also belong to a file, and so on.
Write-behind still applies to file *data*, but the restore bottleneck is
in creating all those files.

A few years ago I hacked the BSD filesystem code so, when a mode was set,
most of those synchronous bwrite() calls would get changed to delayed
bdwrite()s.  It helped -- restore performance rose by about a factor of 2,
still much slower than dump but more nearly tolerable.  Of course this was
unsafe, and the mode was only enabled when we were restoring a wrecked
filesystem.

	Stuart Levy, Geometry Group, University of Minnesota
	slevy@geom.umn.edu

rbj@uunet.UU.NET (Root Boy Jim) (02/01/91)

In article <173@skyking.UUCP> jc@skyking.UUCP (J.C. Webber III) writes:
?What I have been doing is using "find . -print | cpio -pudmv /new.slice
?/usr/spool" to move the files to a different partition while I clean
?up the /usr/spool slice.  I do a rm -r * on /usr/spool, umount it,
?fsck it, remount it and then cpio all the files back from the backup
?partition.  

Why don't you just newfs (or mkfs) rather than removing everything
and fscking? BTW, you might want to try making the filesystem with
more inodes than usual. This might not solve your problem, but
it might make it less frequent. Get a better OS if you can. Good Luck.
-- 

	Root Boy Jim Cottrell <rbj@uunet.uu.net>
	Close the gap of the dark year in between

andys@ulysses.att.com (Andy Sherman) (02/04/91)

In article <1013@eplunix.UUCP> das@eplunix.UUCP (David Steffens) writes:
>All right, my $0.02 on this issue.
>
>Who cares how slow restore is?  How often do you have to do a
>full restore on a filesystem or a whole disk?  Once or twice a year?
>If it's more often than that, then you have a REAL problem
>and maybe you ought to spend your time and energy fixing THAT!

The frequency of file system restores, even in the best run shops, is
directly proportional to the number of disks spinning.  We have
something like 100 disks in our computer center (*NOT* counting the
things attached to workstations and PCs).  MTBFs are 30000 hours on
about a third of them and 100000 hours on the rest.  Go ahead and do
the math.  We are statistically doomed to an awful lot of disk
failures just based on the volume.  Now, not every failure is a
catastrophic failure requiring a full restore of the file system, but
I suspect that more than one or two a year will be.  With users
clamoring for us to get their data back on line, restore performance
is a concern.  Reliability is a more urgent concern, of course, but
you try fighting these folks off.....
--
Andy Sherman/AT&T Bell Laboratories/Murray Hill, NJ
AUDIBLE:  (201) 582-5928
READABLE: andys@ulysses.att.com  or att!ulysses!andys
What? Me speak for AT&T?  You must be joking!

torek@elf.ee.lbl.gov (Chris Torek) (02/18/91)

In article <2880@redstar.cs.qmw.ac.uk> liam@cs.qmw.ac.uk (William Roberts)
writes:
>Restore suffers from the fact that files are stored in inode-number order: 
>this is not the ideal order for creating files, as it thrashes the namei cache 
>because the files are recreated randomly all over the place.

Well, no and yes.

While the files are indeed in inode order, and the restore program (as
opposed to the old `restor' program) does recreate them in this order,
the Fast File System tends to set things up so that all the files in
any one directory are in the same cylinder group as that directory.
Depending on cylinder group sizes this may or may not overload the name
cache, since only the directory parts of the names are cached (each
trailing name is unique within its directory, but the directory must be
searched anyway to verify this first).

More important are two other facts:

 - Each directory must be scanned entirely (to make sure the name is unique);
 - Directory operations are synchronous.

The latter is usually the performance-killer since the directory blocks
tend to remain in the buffer cache.  Directory writes are done
synchronously to make crash recovery possible.  Ordered (but otherwise
delayed) writes should give the same effect with a much smaller
performance penalty; this is being investigated.

>/usr/spool/news/comp/unix/internals/5342 and this took an incredibly long time 
>to restore. /usr/mail contains several hundred files but no subdirectories and 
>restored in about the same sort of time as it took to dump. 

The presence or absence of subdirectories is largely irrelevant: the
problem is the large number of files.  One big file restores much
faster than several dozen small files, even though both take the same
amount of space, because one big file equals one synchronous directory
write (preceded by one synchronous inode write) followed by many
asynchronous data writes.

If you do many full file system restores, it would probably be worth
your effort to make a kernel that does delayed writes for inode and
directory operations, and run it (or enable delayed writes on each file
system in question) each time you do such a restore.  If the system
crashes, you can just start over.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

bzs@world.std.com (Barry Shein) (02/19/91)

We've suffered thru this slow restore problem here, so it's on my
mind.

Anyone have any thoughts on the idea of writing a restore which
restores a standard dump thru the raw device?  This would be for the
first, zero-level dump image (though I guess it would be easy enough
to apply incrementals this way also; they're usually less of a
problem.)

It would seem that would bypass the synchronous directory problem w/o
too much disruption to, e.g., the kernel.  And one could still use good
ol' restore on the same tapes if there were any doubts or problems.  I
like the conservative nature of this approach: your tapes and backup
procedures remain unchanged; only restoral is affected.

Mostly a matter of simulating the file system at user level and
deciding which block it would throw things into as they come off the
tape.

Thoughts? Think it would be a lot faster?
-- 
        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

torek@elf.ee.lbl.gov (Chris Torek) (02/19/91)

In article <BZS.91Feb18182857@world.std.com> bzs@world.std.com
(Barry Shein) writes:
>Anyone have any thoughts on the idea of writing a restore which
>restores a standard dump thru the raw device?

Funny how these things come full circle: the original `restor' scribbled
directly on the raw device.

Kirk threw it out when he designed the 4.2BSD Fast File System, and not
just because it only wrote 4.1BSD-style file systems.  It also required
that you restore onto the same *size* file system.

You could, of course, move the kernel FFS code into user space (or
maybe find a copy of Kirk's original implementation, which lived in
user space) and make restore talk to that.

If you really wanted to reimplement wheels, you could make your user FFS
run via RPC+XDR over the network (sockets/pipes).
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

bzs@world.std.com (Barry Shein) (03/04/91)

>Who cares how slow restore is?  How often do you have to do a
>full restore on a filesystem or a whole disk?  Once or twice a year?

This is a value judgement that may or may not be true in other
people's facilities. C'mon, not everyone does exactly what you do
for a living.

>If it's more often than that, then you have a REAL problem
>and maybe you ought to spend your time and energy fixing THAT!

No, not clear. I worked in a place where they used huge scratch files
and at the time scratch space was at a premium (this is in the days of
washing machine drives.) What they did was take turns running to a
point in their computations (which took days) and then yielding to the
next group (say, over the weekend.) This involved backing up and
restoring all the files for each swap.

In theory it was no big deal and the stop/start had already been honed
down to a simple procedure in the code (given a signal it would write
out its state and exit.)  The only painful part was slow-moving tapes,
each hour of compute time was precious and the switch-over could take
a couple of hours or more.  Just adding more disks wasn't an option in
the short term since these were fixed grant contracts and, alas, using
a coupla grad students who were already paid for made a fair amount of
sense (besides, to get more money would invariably involve promising
more work, diminishing returns.)

Now, there are other ways to do this and they were used (e.g. "dd"),
but that begs the question.

But I think to just cast off a reasonable question with "no reasonable
person would ever want this" often just belies a limit of one's own
imagination. It's a bad knee-jerk in systems work (particularly
because systems people are usually woefully ignorant and even callous
about what their systems are actually used for, and tend to consider
any feature they're personally not interested in as "unnecessary"; I
consider that to be the dark side of the systems religion.)
-- 
        -Barry Shein

Software Tool & Die    | bzs@world.std.com          | uunet!world!bzs
Purveyors to the Trade | Voice: 617-739-0202        | Login: 617-739-WRLD

lm@slovax.Berkeley.EDU (Larry McVoy) (03/04/91)

I believe that a key component to the slowness of restore is the synchronous
nature of directory operations in the Unix file system.  For example, a create,
something that occurs quite often in restore :-), is synchronous.  It has to
be; those are the semantics of a Unix file system (can you say lock files?).
It actually has to be atomic and complete when the system call returns to the
user.  The fact that it is synchronous is an implementation issue that has been
much discussed in comp.arch; I took the point of view that it was a "good thing"
and somebody from Japan took the point of view that it was "too slow".  Before
everyone starts complaining, think back to the days when you had to repair
file systems with fsdb (remember that?  If not, be quiet).

The correct fix, in my opinion, is hardware, not software.  Use NVRAM and reclaim
the directory pages from that.  The semantics remain and you get the performance
back.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

torek@elf.ee.lbl.gov (Chris Torek) (03/04/91)

In article <480@appserv.Eng.Sun.COM> lm@Eng.Sun.COM writes:
>I believe that a key component to the slowness of restore is the
>synchronous nature of directory operations in the Unix file system.
>For example, a create, something that occurs quite often in restore
>:-), is synchronous.  It has to be, those are the semantics of a Unix
>file system (can you say lock files?).

(Funny to hear someone from Sun arguing for Unix FS semantics :-) )

Seriously, `synchronous' is more restrictive than necessary.  Directory
operations must be ordered.  They need not be complete by the time the
call returns.  If they are properly ordered, the inode will exist before
the directory entry, and the directory entry will exist before the first
file block appears, so that fsync() will guarantee that the file exists
and is in permanent storage.
-- 
In-Real-Life: Chris Torek, Lawrence Berkeley Lab EE div (+1 415 486 5427)
Berkeley, CA		Domain:	torek@ee.lbl.gov

terryl@sail.LABS.TEK.COM (03/05/91)

In article <BZS.91Mar3135546@world.std.com> bzs@world.std.com (Barry Shein) writes:
>But I think to just cast off a reasonable question with "no reasonable
>person would ever want this" often just belies a limit of one's own
>imagination. It's a bad knee-jerk in systems work (particularly
>because systems people are usually woefully ignorant and even callous
>about what their systems are actually used for, and tend to consider
>any feature they're personally not interested in as "unnecessary"; I
>consider that to be the dark side of the systems religion.)

     Barry hit the nail squarely on the head, so to speak, and I'll do a
little confessional thing here to demonstrate why (good for the soul, and all
that rubbish...)

     I do a lot of systems work in my job, the majority of which is kernel-
related work, although I do have some user level experience (have to use what
I develop, ya know...).

     Anyways, rewind back to the early '80s.  We rolled our own hardware for
what would now be similar to the original Sun (not a clone, but functionally
equivalent). I was heavily involved in the kernel work, first on V7 Unix(tm),
and then on 4.2 BSD later in the decade.

     Our disk drive of choice was a Micropolis drive with an embedded
controller.  The early drives had some "interesting" failure modes, so to
speak.  Once a particular failure mode appeared (and there were several), the
drive was basically out to lunch.  Since I had one of the systems built early
in the game, and since I heavily used my system to continue development, it
was a royal pain in the keester.  So I put a LOT of error recovery in the
device driver, and things were hunky dory, and it was released to our general
user community.

     Now for the confessional part: If it wasn't MY system that was
experiencing the difficulties, I doubt that all of that error recovery would
have made it into the device driver.  Let's face it, that kind of stuff is not
all that interesting to do, anyways.......

     BTW, I finally powered off my old system about a year ago, `cause I needed
the space it was occupying on my desk in my office, and you wouldn't believe
how emotional it was (NO smileys here). It was just like telling a parent to
abandon a child, because the child was no longer useful to the parent. I've
worked through my guilt now!!!! (-:

__________________________________________________________
Terry Laskodi		"There's a permanent crease
     of			 in your right and wrong."
Tektronix		Sly and the Family Stone, "Stand!"
__________________________________________________________

lerman@stpstn.UUCP (Ken Lerman) (03/05/91)

In article <480@appserv.Eng.Sun.COM> lm@Eng.Sun.COM writes:
>
>I believe that a key component to the slowness of restore is the synchronous
>nature of directory operations in the Unix file system.  For example, a create,
...
>---
>Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

I was always taught that there is no point in debating how many angels
can dance on the head of a pin when one can just go count them. :-)

Has anyone out there with the appropriate source done some measurement
of where the time goes in restore?  How many reads/writes does it do?
How long does each take?  Do those figures seem reasonable? ...etc...

Just a humble suggestion from someone who neither has the problem nor
the solution. :-)

Ken

lm@slovax.Eng.Sun.COM (Larry McVoy) (03/06/91)

In article <6511@stpstn.UUCP> lerman@stpstn.UUCP (Ken Lerman) writes:
>In article <480@appserv.Eng.Sun.COM> lm@Eng.Sun.COM writes:
>>
>>I believe that a key component to the slowness of restore is the synchronous
>>nature of directory operations in the Unix file system.  For example, a create,
>Has anyone out there with the appropriate source done some measurement
>of where the time goes in restore?  How many reads/writes does it do?
>How long does each take?  Do those figures seem reasonable? ...etc...

Well, I didn't want to tip my hand, but someone at Sun actually tried turning
off the sync writes (dir ops) while restoring a system.  A speedup of 4X
is what I remember, but I might be a little off.  Your mileage may vary.

NVRAM in the disk interface is the easy answer; an option to mount is the
sleazy answer.
---
Larry McVoy, Sun Microsystems     (415) 336-7627       ...!sun!lm or lm@sun.com

stefano@angmar.sublink.ORG (Stefano Longano) (03/06/91)

In <480@appserv.Eng.Sun.COM> lm@slovax.Berkeley.EDU (Larry McVoy) writes:


>I believe that a key component to the slowness of restore is the synchronous
>nature of directory operations in the Unix file system.  For example, a create,
>something that occurs quite often in restore :-), is synchronous.  It has to
>be, those are the semantics of a Unix file system (can you say lock files?).

These operations need not be synchronous to retain the semantics of the
Unix file system.  You could read the paper presented at the Summer '90 USENIX
Technical Conference by Mendel Rosenblum and John Ousterhout about the
Sprite Log-structured File System.  This paper is available for anonymous FTP
from sprite.berkeley.edu.
-- 
Stefano Longano           WW           WW    EMAIL : stefano@angmar.sublink.ORG
Viale Trento 1            ||wwwwwwwwwww||     Happy are those who dream dreams
38068 Rovereto (TN)       ||    ---    ||      and are ready to pay the price
Tel : +39 (464) 436042    ||____|_|____||          to make them come true

grr@cbmvax.commodore.com (George Robbins) (03/07/91)

In article <485@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes:
> In article <6511@stpstn.UUCP> lerman@stpstn.UUCP (Ken Lerman) writes:
> >In article <480@appserv.Eng.Sun.COM> lm@Eng.Sun.COM writes:
> >>
> >>I believe that a key component to the slowness of restore is the synchronous
> >>nature of directory operations in the Unix file system.  For example, a create,
> >Has anyone out there with the appropriate source done some measurement
> >of where the time goes in restore?  How many reads/writes does it do?
> >How long does each take?  Do those figures seem reasonable? ...etc...
> 
> Well, I didn't want to tip my hand, but someone at Sun actually tried turning
> off the sync writes (dir ops) while restoring a system.  A speedup of 4X
> is what I remember, but I might be a little off.  Your mileage may vary.
> 
> NVRAM in the disk interface is the easy answer; an option to mount is the
> sleazy answer.

I don't see what's easy about NVRAM, it's expensive and still requires some
new software action on restart that unix doesn't do presently.

The mount option isn't sleazy, it just represents putting some options at
a very key point where the "one size fits all" philosophy is getting painful.

I've felt for a long time that a mount option for 100% synchronous
writes (for floppies) was pretty obvious; providing a similar option for
non-synchronous operation, for either restores or "don't care" temporary
filesystems, seems painless.  I shouldn't mention the never-sync option to
confine writes to a ROM-based filesystem to the buffer pool...

---
A hardware type who gets very bored waiting to restore ~500 MB news partitions.
-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)