vlohia@orville.nas.nasa.gov (Ved P. Lohia) (05/04/89)
I am looking for references on fault-tolerant, "safe" file systems. Our group is developing a Mass Storage Subsystem (MSS). We need to meet the following goals for the file system:

1. A single failure of "media hardware or software" should not cause an irrecoverable loss of data.

2. Better recovery techniques should be provided (FSCK is far too slow on a "standard"-sized file system; it will be impossible for an MSS-sized file system).

Pointers from netters will be much appreciated.

vlohia@navier.nas.nasa.gov
dave@lethe.UUCP (Dave Collier-Brown) (05/06/89)
In article <7013@saturn.ucsc.edu> vlohia@orville.nas.nasa.gov (Ved P. Lohia) writes:
>I am looking for references regarding fault tolerant, "safe" file system.
>Our group is developing Mass Storage Subsystem. We need to meet following
>goals related to file system.
>
>1. Single failure of "media hardware or software" should not cause an
>   irrecoverable loss of data.
>
>2. Better recovery techniques should be provided (FSCK is far too slow on
>   a "standard" sized file system, it will be impossible for an MSS sized
>   file system.
>
>Pointers from netters will be much appreciated.
>vlohia@navier.nas.nasa.gov

The subject isn't new, and a fair bit of work has been done on it. One of the better papers was by Ian Davis (then of ICL) at the University of Toronto. This was distantly related to their Unix-like TUNIS operating system project.

Davis, Ian John, "Towards Reliable File Systems", Masters thesis, date unknown. 91 pp., refs.

Abstract:
    The purpose of this thesis is to investigate the potential damage caused to file systems by system failures, and to present ways of improving the tolerance of file systems to such failures. It will be shown that many of the problems associated with systems failures can be avoided if certain facilities can be made available, and these are used wisely by the system designer. These facilities will allow the designer to ensure that following system failure the files held on the disc are in a valid state, and thus continue to be usable. In addition, some causes of system failure will be detected by the file system, and corrected automatically. These include the detection of suspect storage areas, and data overflow when storage devices become full.
    We will be concerned primarily with UNIX type file structures, but hope that any conclusions drawn will be applicable to other systems. It is considered of some importance that the methods proposed do not unduly degrade machine performance.

Ian was a colleague many years ago, and was rather subtle. I was suitably impressed with his verbal descriptions and extracted a (lineprinter) copy of his thesis.

--dave (who wonders where he went) c-b
--
David Collier-Brown,  | {toronto area...}lethe!dave
72 Abitibi Ave.,      | Joyce C-B:
Willowdale, Ontario,  |     He's so smart he's dumb.
CANADA. 223-8968      |
adp@cs.rochester.edu (Alan Percy) (05/22/89)
In article <7013@saturn.ucsc.edu> vlohia@orville.nas.nasa.gov (Ved P. Lohia) writes:
>
>I am looking for references regarding fault tolerant, "safe" file system.

When we examined making a "safer" file system that would be based on a standard operating system, we came to a rather simple conclusion:

We were going to use dual hard disks and controllers. The system would have the dual media and a driver that would write to both, but read from only one. If a media failure was detected, the backup disk would be read from. The bad track on the primary would be reassigned and rewritten with data from the backup.

This could also be done with only one controller and drive, by keeping duplicate copies of each track on another platter in the same cylinder.

In our system, halving the total storage and slowing down writes was an acceptable trade-off to gain reliability.

--
Alan Percy..........................{rutgers,ames,cmcl2}!rochester!moscom!adp
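
[Editorial illustration, not part of the original post.] The dual-disk scheme described above can be sketched roughly as follows. This is only a toy model under stated assumptions: the disks are in-memory dictionaries, and the names `Disk`, `MirroredDisk`, `MediaError`, and `reassign` are hypothetical, not taken from the poster's driver.

```python
class MediaError(Exception):
    """Raised by a disk when a track cannot be read (hypothetical)."""


class Disk:
    """Toy in-memory disk: maps a track number to its data."""

    def __init__(self):
        self.tracks = {}
        self.bad = set()  # tracks that currently fail on read

    def write(self, track, data):
        self.tracks[track] = data

    def read(self, track):
        if track in self.bad:
            raise MediaError(track)
        return self.tracks[track]

    def reassign(self, track):
        # Stand-in for remapping a bad track to a spare one.
        self.bad.discard(track)


class MirroredDisk:
    """Write to both disks, read from the primary, repair from the backup."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup

    def write(self, track, data):
        # Every write goes to both media, so writes cost roughly double.
        self.primary.write(track, data)
        self.backup.write(track, data)

    def read(self, track):
        try:
            return self.primary.read(track)
        except MediaError:
            # Media failure: fetch the good copy from the backup, then
            # reassign the bad primary track and rewrite it.
            data = self.backup.read(track)
            self.primary.reassign(track)
            self.primary.write(track, data)
            return data
```

A caller would use `MirroredDisk(Disk(), Disk())` wherever a single disk was used before. The sketch treats every failure as a repairable media error, which is the simplest reading of the description and not necessarily how the real driver behaved.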
eugene@eos.arc.nasa.gov (Eugene Miya) (05/24/89)
In article <7597@saturn.ucsc.edu> moscom!adp@cs.rochester.edu (Alan Percy) writes:
> Redundancy in fault-tolerance....
>simple conclusion:
>
>We were going to use dual hard disks and controllers.

General question:
Does this mean that Tandems are the wave of the fault-tolerant future? We are certainly going to see multiprocessors and arrays of disks, and I have never regarded computers as reliable.

Another gross generalization from

--eugene miya, NASA Ames Research Center, eugene@aurora.arc.nasa.gov
  resident cynic at the Rock of Ages Home for Retired Hackers:
  "You trust the `reply' command with all those different mailers out there?"
  "If my mail does not reach you, please accept my apology."
  {ncar,decwrl,hplabs,uunet}!ames!eugene
  Live free or die.
savela@uunet.UU.NET (Markku Savela) (05/24/89)
In article <7597@saturn.ucsc.edu>, moscom!adp@cs.rochester.edu (Alan Percy) writes:
>
> We were going to use dual hard disks and controllers. The system
> would have the dual media and a driver that would write to both,
> but read from only one. If a media failure was detected the
> backup disk would be read from. The bad track on the primary would
> be reassigned and rewritten with data from the backup.

This method was an option in a PDP-11 based multiuser operating system which we designed in the '70s at my previous employer.

One additional detail has to be noted: if a media failure is detected, no further attempts should be made on that disk. The system should revert to the backup only. All kinds of havoc may result if the failure is transient.

The "dual write" option wasn't very popular, although some sites used it. The trouble was just those transient errors (or someone hitting "write protect" or "off line" accidentally). The system reverted fully to the backup and nobody noticed anything. And, naturally, nobody read the error messages on the console, so the next time the system was booted, users had trashed disks, because the primary disk was in use again... :-(

I guess the backup disk should have had some mark that the primary had been dropped, but we never got to implement that.
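
[Editorial illustration, not part of the original post.] The "mark on the backup disk" idea, which the poster says was never implemented, might look something like the sketch below. Everything here is an assumption made for illustration: the in-memory `Drive` class, the `primary_dropped` flag stored with the backup, and the `resync` step are hypothetical names, not features of the PDP-11 system described.

```python
class DriveError(Exception):
    """Raised when a drive cannot service a request (hypothetical)."""


class Drive:
    """Toy in-memory drive that can be switched off line."""

    def __init__(self):
        self.blocks = {}
        self.online = True
        # Marker kept on the drive itself, so it survives a reboot.
        self.primary_dropped = False

    def write(self, n, data):
        if not self.online:
            raise DriveError(n)
        self.blocks[n] = data

    def read(self, n):
        if not self.online:
            raise DriveError(n)
        return self.blocks[n]


class StickyMirror:
    """Dual-write mirror that never silently returns to a dropped primary."""

    def __init__(self, primary, backup):
        self.primary = primary
        self.backup = backup
        # At boot, honour the marker left on the backup by an earlier run.
        self.degraded = backup.primary_dropped

    def _drop_primary(self):
        # After the first failure, make no further attempts on the primary.
        self.degraded = True
        self.backup.primary_dropped = True

    def write(self, n, data):
        if not self.degraded:
            try:
                self.primary.write(n, data)
            except DriveError:
                self._drop_primary()
        self.backup.write(n, data)

    def read(self, n):
        if not self.degraded:
            try:
                return self.primary.read(n)
            except DriveError:
                self._drop_primary()
        return self.backup.read(n)

    def resync(self):
        """Operator action: copy the backup over the primary, clear the marker."""
        if not self.primary.online:
            raise DriveError("primary is still off line")
        self.primary.blocks = dict(self.backup.blocks)
        self.backup.primary_dropped = False
        self.degraded = False
```

Because the marker lives with the backup rather than in memory, a reboot after a transient failure keeps the system on the backup until someone explicitly resynchronizes, which is the scenario that trashed disks in the account above.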
reggie@dinsdale.nm.paradyne.com (George W. Leach) (05/24/89)
In article <7622@saturn.ucsc.edu> eos!eugene@eos.arc.nasa.gov (Eugene Miya) writes:
>General question:
>Does this mean that Tandems are the wave of the fault-tolerant future?

I seem to recall AT&T offering a fault-tolerant system that consisted of two 3B20 minis configured to run off the same set of disk drives. This was in the early 80's.

George W. Leach                    AT&T Paradyne
.!uunet!pdn!reggie                 Mail stop LG-129
reggie@pdn.paradyne.com            P.O. Box 2826
Phone: (813) 530-2376              Largo, FL  USA  34649-2826