[comp.unix.wizards] Really, redundant file servers

dss@fatkid.UUCP (02/10/87)

In article <2592@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>
>	How hard would it be to incorporate a "shadow server" mode into
>NFS?  Imagine 2 servers serving the same file system.  When a write request
>comes in, both servers do it and the client waits to hear an ack from both
>of them.  When a read request comes in, both servers try to do it and the
>client takes the data from whichever server responds first.

First off, you've assumed a single shadow.  Think, instead, of
a set of shadow 'devices' (or filesystems).
Second, if you 'broadcast' read requests to all shadow servers, you're
imposing a lot of distributed overhead for the no-error (99.9%) case.
If, instead, you attempt the read first from a 'primary' source
then you have to decide which filesystem is primary (it could rotate,
but you'd probably lose any advantages that might be gained from
buffered read-ahead).
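
A minimal sketch of the two read policies, to make the overhead difference
concrete.  Everything here is an assumption for illustration: each shadow is
modelled as a hypothetical callable that returns a block's data or raises
IOError, and "broadcast" is done sequentially so the requests can be counted
(a real client would send in parallel and take the first reply).

```python
class ShadowReadError(Exception):
    """Raised when no shadow can satisfy a read."""


def read_primary_first(shadows, block):
    """Try shadows in order; shadows[0] is the 'primary'.

    Returns (data, requests_sent).
    """
    requests_sent = 0
    for shadow in shadows:
        requests_sent += 1
        try:
            return shadow(block), requests_sent
        except IOError:
            continue  # soft or hard failure: fall through to the next shadow
    raise ShadowReadError(block)


def read_broadcast(shadows, block):
    """Send the read to every shadow and use the first good reply.

    Returns (data, requests_sent); requests_sent is always len(shadows).
    """
    replies = []
    for shadow in shadows:
        try:
            replies.append(shadow(block))
        except IOError:
            pass  # this shadow failed; others may still answer
    if not replies:
        raise ShadowReadError(block)
    return replies[0], len(shadows)
```

In the no-error case the primary-first policy costs one request per read,
while the broadcast policy costs one request per shadow, whether or not
anything has gone wrong.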

In many ways it is simpler to have a physical disk shadow than a
filesystem shadow.  There, the intent is to provide some reasonable
degree of redundancy in case of hard failure.  Then, the issues are:
    Which disk do you read first?
    If you get an error writing to a shadow, was the 'write' successful?
    Are you concerned with data integrity between shadows (i.e., what
    do you do if one shadow has different data than the other)?
    How do you deal with bad-block mapping?

Since disk shadowing is fundamentally intended to reduce the number of
read errors (by providing redundancy), all the interesting decisions
must be made when an error on one shadow occurs.  When you shadow file-
systems, you increase dramatically the number of types of errors that
can occur.  Consequently, the decision matrix gets far more complex.

For instance, if you are shadowing filesystems, you cannot tolerate
soft failures (e.g., timeout) on writing to any of the shadows.
This is because a subsequent read to that shadow may succeed where the
corresponding write failed.  Of course, there are also naming-conflict
problems: what if creat() works on one filesystem but not another?
What happens if one filesystem is used to shadow multiple clients?
But even if you ignored those problems, there's still a whole can
of worms involving failure recovery.  For example, you raised the
question of what happens if one shadow filesystem fills up before
another.  Consider, also, what happens when a read request returns
end-of-file.  Do you accept this or do you try all the shadows to
see if one got further (and if it did, is it necessarily correct)?
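
The soft-failure problem can be sketched as follows.  This is only an
illustration under stated assumptions: the Shadow class, its fail_next_write
test hook, and the "mark the shadow stale" policy are all hypothetical, and
marking a shadow stale is just one possible answer to the questions above.

```python
class Shadow:
    """Toy in-memory stand-in for one shadow filesystem (hypothetical)."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}
        self.stale = False            # set once this shadow misses a write
        self.fail_next_write = False  # test hook: simulate one timeout

    def write(self, block, data):
        if self.fail_next_write:
            self.fail_next_write = False
            raise TimeoutError(self.name)
        self.blocks[block] = data

    def read(self, block):
        return self.blocks.get(block)


def shadow_write(shadows, block, data):
    """Write to every current shadow; a shadow that times out is marked
    stale so later reads skip it.  Returns how many shadows took the write."""
    ok = 0
    for s in shadows:
        if s.stale:
            continue
        try:
            s.write(block, data)
            ok += 1
        except TimeoutError:
            s.stale = True  # it may still answer reads, but with old data
    if ok == 0:
        raise IOError("write failed on every shadow")
    return ok


def shadow_read(shadows, block):
    """Read from the first shadow that has not missed a write."""
    for s in shadows:
        if not s.stale:
            return s.read(block)
    raise IOError("no current shadow")
```

Without the stale flag, a read sent to the shadow that timed out would
succeed and return the old contents.  Bringing a stale shadow back up to
date is the failure-recovery problem, which this sketch does not attempt.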

Daniel Steinberg
(ihnp4|ucbvax)!sun!dss

jans@stalker.UUCP (02/10/87)

In article <12959@sun.uucp> dss@sun.UUCP (Daniel Steinberg) writes:
>In article <2592@phri.UUCP> roy@phri.UUCP (Roy Smith) writes:
>>
>>	How hard would it be to incorporate a "shadow server" mode into
>>NFS?  Imagine 2 servers serving the same file system...
>
>In many ways it is simpler to have a physical disk shadow than a
>filesystem shadow...

Simpler, and MUCH more efficient.  The Tandem NonStop fault-tolerant
computers provide parallel writes, but split seeks on reads.  I.e.: One
disk is assigned cylinders 0-405, the other cylinders 406-811, reducing
the average seek time by a factor approaching two.  The actual improvement
is tempered by accelerated seeks, out-of-area writes, and non-locality of
reference.
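
A rough check of the "factor approaching two" figure, under idealized
assumptions that are not in the original post: requests are reads only,
independent and uniformly distributed over the cylinders an arm covers, and
seek time is proportional to seek distance.  The function name is made up
for this sketch.

```python
from fractions import Fraction


def mean_seek_distance(n_cylinders):
    """Exact mean |a - b| for a, b independent and uniform on n cylinders.

    A distance d occurs for 2 * (n - d) of the n * n ordered pairs.
    """
    n = n_cylinders
    total = sum(d * 2 * (n - d) for d in range(1, n))
    return Fraction(total, n * n)


whole_disk = mean_seek_distance(812)  # one arm covering cylinders 0-811
half_disk = mean_seek_distance(406)   # one arm covering 0-405 (or 406-811)
ratio = whole_disk / half_disk        # just over 2 under these assumptions
```

The mean works out to about 270.7 cylinders for the whole disk and about
135.3 for half of it.  Since real seek time is not linear in distance, and
writes still go to both disks, the measured gain will be smaller, as noted
above.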

:::::: Artificial   Intelligence   Machines   ---   Smalltalk   Project ::::::
:::::: Jan Steinman		Box 1000, MS 60-405	(w)503/685-2956 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::

mangler@cit-vax.UUCP (02/16/87)

In article <12959@sun.uucp> dss@sun.UUCP (Daniel Steinberg) writes:
>In many ways it is simpler to have a physical disk shadow than a
>filesystem shadow...

Another variation is two servers for a set of dual-ported disks.
In this case there's no duplication of effort, and you're not
buying twice as many disks.  Eagles are so reliable that you'll
see more downtime from power outages than from broken drives.
This would be an attractive mode for us, because we've got the
hardware for it.

The hard part is getting two machines to share dual-ported disks
read-write.  The problem is caching; dual-port kits have no way
to let the other CPU know that a block was written and that its
copy in the in-core cache should be invalidated.  Some other
way has to be provided to communicate this (Ethernet?).  The SI
SIMACS controller is supposed to be able to do this - I don't
understand how.  (Maybe they just don't cache anything?  But
that's too horrible to contemplate...).
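
To make the caching problem concrete, here is a toy model.  It is purely an
illustration: the Server class, the dict standing in for the shared disk,
and the direct method call standing in for a message over some side channel
are all assumptions, and it says nothing about how the SIMACS controller
actually works.

```python
class Server:
    """One of two hosts sharing a dual-ported disk (toy model)."""

    def __init__(self, disk):
        self.disk = disk    # shared dict standing in for the disk
        self.cache = {}     # this host's private in-core buffer cache
        self.peer = None    # the other host, reachable by a side channel

    def read(self, block):
        # Serve from the in-core cache when possible; otherwise go to disk.
        if block not in self.cache:
            self.cache[block] = self.disk[block]
        return self.cache[block]

    def write(self, block, data, notify=True):
        self.disk[block] = data
        self.cache[block] = data
        if notify and self.peer is not None:
            # Stand-in for an invalidation message sent to the other host.
            self.peer.invalidate(block)

    def invalidate(self, block):
        # Drop the cached copy so the next read goes back to the disk.
        self.cache.pop(block, None)
```

With notify=False the second server keeps returning its cached copy after
the first one writes, which is exactly the failure described above.  A real
scheme would also have to order the notification against the disk write and
cope with lost messages.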

Don Speck   speck@vlsi.caltech.edu  {seismo,rutgers,ames}!cit-vax!speck