[comp.os.research] Question for you.

gibson@rosemary.Berkeley.EDU (Garth Gibson) (08/14/89)

> From: eho@cognito.Princeton.EDU (Eric Ho)
> Newsgroups: comp.os.research
> Message-Id: <EHO.89Aug12232420@cognito.Princeton.EDU>
> Subject: raid downtime fault-tolerance & more ??
> 
> Is there a way to increase the downtime fault-tolerance for RAID ?
> I mean if the server that "manages" RAID crashed or got some other hardware
> faults (or other software bugs) then the whole thing will be screwed even
> though the disk arrays are completely ok.

If the CPU hardware or software on a file server fails, then,
as Eric suggests, that server is hosed regardless of the availability
of the disks (RAID or not).  The UNIX reaction is to reboot, forget
what was in progress and clean up the disks so that partially
completed operations are forgotten.  Sprite has a similar approach,
but if the file system is log-structured then "cleanup" means
reading in the most recently written file system "superblock".
Since the log-structured file system is (almost) non-overwrite
(almost means you have to wrap around eventually), this superblock
points to a completely clean filesystem.  All the partially
complete operations were written after this superblock, but because a
new superblock describing them has not been found they are forgotten.
So cleanup is not a complete scan of all disks, as long as the superblock
can be found more easily (it is timestamped and appears only at certain
offsets into each "cylinder group").  This improves availability by
recovering more quickly.
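
As a rough illustration of that recovery step, here is a minimal sketch
in Python.  It is not Sprite code: the Superblock fields, the "valid"
flag standing in for a checksum, and the idea that the candidate slots
have already been read into a list are all assumptions made for the
example.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Superblock:
    timestamp: int      # when this superblock was written
    log_tail: int       # hypothetical: end of the log this superblock describes
    valid: bool = True  # stands in for a checksum test (assumption)


def recover(slots: Iterable[Optional[Superblock]]) -> Optional[Superblock]:
    """Return the most recently written valid superblock, or None.

    `slots` holds whatever was read from the fixed candidate offsets;
    None marks an offset that held no superblock at all.
    """
    best = None
    for sb in slots:
        if sb is None or not sb.valid:
            continue  # empty or torn slot: ignore it
        if best is None or sb.timestamp > best.timestamp:
            best = sb
    return best
```

Anything written after the superblock that recover() returns is simply
never referenced, which is the "forgetting" described above.  The cost is
proportional to the number of candidate offsets, not to the size of the
disks.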

To preserve availability during the server downtime, the disks can be
dual ported to another host.  In SCSI this means that two hosts are
connected to each cable (though one may never use it because of cache
inconsistency issues).  The problem with this scheme is again incomplete
metadata.  So the other host, if it is to pick up the ball, must clean up
the disks.  This gives you a problem when the failed server reboots and wants
to clean up its disks itself.  It needs to assume that it is no longer master
of its disks and go looking for the usurper.
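
One way to picture the reboot rule is the sketch below.  It assumes some
ownership record that both hosts can read (for instance a reserved
block on the shared disks); that record, and the names OwnerRecord,
take_over and on_reboot, are hypothetical and not part of any system
described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OwnerRecord:
    owner: str  # host currently acting as master of the disks
    epoch: int  # bumped on every takeover (assumption)


def take_over(rec: OwnerRecord, host: str) -> OwnerRecord:
    """Backup host claims the disks after the primary has failed."""
    return OwnerRecord(owner=host, epoch=rec.epoch + 1)


def on_reboot(rec: OwnerRecord, host: str) -> str:
    """Decide what a rebooting server may do with its old disks."""
    if rec.owner == host:
        return "cleanup"  # nobody took over while we were down
    # Someone else is master now: do not touch the disks, go find them.
    return "defer:" + rec.owner
```

The point of the sketch is only the ordering: the rebooting server
checks who is master before it runs any cleanup of its own.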

Of course, one could journal all metadata changes to the paired host so that
it might be able to avoid cleanup, but this is rather expensive given failures
are rare and Sprite clients can tell the server what files they have open
and what dirty blocks they have.  In short, our main vulnerability is
dirty data in the server's cache whose write has been delayed (in the hope
that it will be deleted first).
Databases will require that their journal data go to disk, so it is just
your basic operations at risk.  How much do you want to pay for their
protection?
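
To make the client-assisted part concrete, here is a small sketch of a
server rebuilding its tables from what clients report after a reboot.
The report format (client name, list of open files, list of
(file, block) pairs for dirty blocks) is invented for the example and is
not Sprite's actual protocol.

```python
from collections import defaultdict


def rebuild_state(reports):
    """Rebuild open-file and dirty-block tables from client reports.

    reports: iterable of (client, open_files, dirty_blocks), where
    dirty_blocks is a list of (file, block_number) pairs.
    """
    open_by_file = defaultdict(set)
    dirty_by_file = defaultdict(set)
    for client, open_files, dirty_blocks in reports:
        for f in open_files:
            open_by_file[f].add(client)
        for f, block in dirty_blocks:
            dirty_by_file[f].add((client, block))
    return dict(open_by_file), dict(dirty_by_file)
```

Note what this cannot recover: dirty blocks that lived only in the
server's own cache when it went down.  Those are exactly the delayed
writes at risk.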

garth gibson