[comp.os.research] References for Fault Tolerant, "safe" file system

vlohia@orville.nas.nasa.gov (Ved P. Lohia) (05/04/89)

I am looking for references regarding fault-tolerant, "safe" file systems.
Our group is developing a Mass Storage Subsystem. We need to meet the
following goals related to the file system.
 
1. A single failure of "media hardware or software" should not cause an
   irrecoverable loss of data.

2. Better recovery techniques should be provided (FSCK is far too slow on
   a "standard"-sized file system; it will be impossible for an MSS-sized
   file system).

Pointers from netters will be much appreciated.



vlohia@navier.nas.nasa.gov 

dave@lethe.UUCP (Dave Collier-Brown) (05/06/89)

In article <7013@saturn.ucsc.edu> vlohia@orville.nas.nasa.gov (Ved P. Lohia) writes:
>I am looking for references regarding fault-tolerant, "safe" file systems.
>Our group is developing a Mass Storage Subsystem. We need to meet the
>following goals related to the file system.
> 
>1. A single failure of "media hardware or software" should not cause an
>   irrecoverable loss of data.
>
>2. Better recovery techniques should be provided (FSCK is far too slow on
>   a "standard"-sized file system; it will be impossible for an MSS-sized
>   file system).
>
>Pointers from netters will be much appreciated.
>vlohia@navier.nas.nasa.gov 

  The subject isn't new, but a fair bit of work has been done on it.  One of
the better papers was by Ian Davis (then of ICL) at the University of Toronto.
This was distantly related to their Unix-like TUNIS operating system project.

  Davis, Ian John, "Towards Reliable File Systems", Master's thesis, date
unknown. 91 pp., refs.

Abstract:
	The purpose of this thesis is to investigate the potential
damage caused to file systems by system failures, and to present ways of
improving the tolerance of file systems to such failures. It will be
shown that many of the problems associated with system failures can be
avoided if certain facilities can be made available, and these are used
wisely by the system designer. These facilities will allow the designer
to ensure that following system failure the files held on the disc are
in a valid state, and thus continue to be usable. In addition, some
causes of system failure will be detected by the file system, and
corrected automatically. These include the detection of suspect storage
areas, and data overflow when storage devices become full. We will be
concerned primarily with UNIX type file structures, but hope that any
conclusions drawn will be applicable to other systems. It is considered
of some importance that the methods proposed do not unduly degrade
machine performance.

  Ian was a colleague many years ago, and was rather subtle. I was
suitably impressed with his verbal descriptions and extracted a
(lineprinter) copy of his thesis.

--dave (who wonders where he went) c-b

-- 
David Collier-Brown,  | {toronto area...}lethe!dave
72 Abitibi Ave.,      |  Joyce C-B:
Willowdale, Ontario,  |     He's so smart he's dumb.
CANADA. 223-8968      |

adp@cs.rochester.edu (Alan Percy) (05/22/89)

In article <7013@saturn.ucsc.edu> vlohia@orville.nas.nasa.gov (Ved P. Lohia) writes:
>
>I am looking for references regarding fault tolerent, "safe" file system.


When we examined building a "safer" file system on top of a standard
operating system, we came up with a rather simple conclusion:

We were going to use dual hard disks and controllers.  The system
would have the dual media and a driver that would write to both,
but read from only one.  If a media failure was detected, the
backup disk would be read from.  The bad track on the primary would
be reassigned and rewritten with data from the backup.

This could be done with only one controller and drive, by keeping
duplicate copies of each track on another platter in the same cylinder.
In our system, halving the total storage and slowing writes down
was an acceptable trade-off to gain reliability.


-- 
Alan Percy..........................{rutgers,ames,cmcl2}!rochester!moscom!adp

eugene@eos.arc.nasa.gov (Eugene Miya) (05/24/89)

In article <7597@saturn.ucsc.edu> moscom!adp@cs.rochester.edu (Alan Percy) writes:
> Redundancy in fault-tolerance....
>simple conclusion:
>
>We were going to use dual hard disks and controllers.

General question:
Does this mean that Tandems are the wave of the fault-tolerant future?
We are certainly going to see multiprocessors and arrays of disks, and
I have never regarded computers as reliable.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "You trust the `reply' command with all those different mailers out there?"
  "If my mail does not reach you, please accept my apology."
  {ncar,decwrl,hplabs,uunet}!ames!eugene
  				Live free or die.

savela@uunet.UU.NET (Markku Savela) (05/24/89)

In article <7597@saturn.ucsc.edu>, moscom!adp@cs.rochester.edu (Alan Percy) writes:
> 
> We were going to use dual hard disks and controllers.  The system
> would have the dual media and a driver that would write to both,
> but read from only one.  If a media failure was detected, the
> backup disk would be read from.  The bad track on the primary would
> be reassigned and rewritten with data from the backup.

    This method was an option in a PDP-11-based multiuser operating
system which we designed in the 70s at my earlier employer. One
additional detail has to be noted:

   - if a media failure is detected, no further attempts should be
     made on that disk. The system should revert to backup only.
     All kinds of havoc may result if the failure is transient.

    The "dual write" option wasn't very popular, although some
sites used it. The trouble was just those transient errors (or
someone hitting "write protect" or "off line" accidentally).
The system reverted fully to backup and nobody noticed anything.
And, naturally, nobody read the error messages on the console,
so the next time the system was booted users had trashed disks,
because the primary disk was back in use... :-(  I guess the
backup disk should have carried some mark that the primary had
been dropped, but we never got around to implementing that.

reggie@dinsdale.nm.paradyne.com (George W. Leach) (05/24/89)

In article <7622@saturn.ucsc.edu> eos!eugene@eos.arc.nasa.gov (Eugene Miya) writes:

>General question:

>Does this mean that Tandems are the wave of the fault-tolerant future?

    I seem to recall AT&T offering a fault-tolerant system that consisted
of two 3B20 minis configured to run off the same set of disk drives.  This
was in the early '80s.

George W. Leach					AT&T Paradyne 
.!uunet!pdn!reggie				Mail stop LG-129
reggie@pdn.paradyne.com				P.O. Box 2826
Phone: (813) 530-2376				Largo, FL  USA  34649-2826