shekita@crystal.UUCP (07/12/85)
The problem is this: We have a database file system that sits on a Unix raw disk. Our current goal is to add recovery to the database. In order to do this we need to know some things about disk controllers.

Suppose a write operation is initiated (i.e., the controller begins processing the write request) and a system crash occurs.

1) Will the write finish? It seems that it shouldn't, since RAM will probably get flakey as power drops, and then a block of garbage will get written to disk.

2) If the write doesn't finish, will the block be detectably bad? For example, would the block's CRC be wrong, causing the controller to return an error on subsequent reads?

In essence, we'd like to know if a block write can be considered atomic, and if it's not atomic, we'd like to know if there's a way to detect whether the write was interrupted and/or whether garbage got written.

Granted, the answers to these questions will be device dependent, but we (unfortunately) seek general information. Any particular expertise that you could share would certainly be useful, though. Incidentally, we currently run on an Eagle drive.

Eugene Shekita
Computer Science Department
University of Wisconsin
hahn@AMES-NAS.ARPA (Jonathan Hahn) (07/13/85)
> Suppose a write operation is initiated (i.e., the controller
> begins processing the write request) and a system crash
> occurs.
>
> 1) Will the write finish? It seems that it shouldn't, since
> RAM will probably get flakey as power drops, and then
> a block of garbage will get written to disk.

There is a significant difference between a "system crash" (i.e. software crash) and an unexpected power failure (or other hardware catastrophe)...

> 2) If the write doesn't finish, will the block be detectably
> bad? For example, would the block's CRC be wrong, causing
> the controller to return an error on subsequent reads?

In the event of a software crash, the disk sector(s) should be written properly (i.e. data and ECC written out in proper format). Of course, there's no telling how corrupted the data may have gotten as a result of the crash. The best protection against this is one or more internal consistency checks of some sort.

In the event of a hardware failure such as a power failure during a write, I think it's pretty much undefined and depends a lot on the hardware in question and the timing particulars of the incident. A formatted sector is made up of read-only, writable, and gap regions. If the power went out while the disk head was over the read-only or gap regions, the write would probably terminate successfully. If the power went out during the writable region, you would probably end up with a bad sector that returned hard ECC errors when read.

I believe that most controllers are wired such that if they lose power, all disk operations are immediately disabled, since the disks may still be powered. You should check the technical manuals for your controller and disk drive.

-jonathan hahn
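
[Editor's illustration, not part of the original post.] One possible form of the "internal consistency checks" mentioned above is for the database to wrap each block with its own checksum and a sequence number repeated at both ends, so that a garbage or half-old, half-new block can be recognized on read without relying on the controller. The sketch below is only that: the 4096-byte block size, the field layout, and the names `seal_block` and `check_block` are all assumptions made for the example, and it says nothing about what any particular controller or drive does.

```python
import struct
import zlib

BLOCK_SIZE = 4096                  # hypothetical database block size
HEADER = struct.Struct("<II")      # (sequence number, CRC-32 of the body)
TRAILER = struct.Struct("<I")      # sequence number repeated at the end
PAYLOAD_SIZE = BLOCK_SIZE - HEADER.size - TRAILER.size


def seal_block(seq, payload):
    """Pad the payload and wrap it with a header and trailer."""
    if len(payload) > PAYLOAD_SIZE:
        raise ValueError("payload too large for one block")
    body = payload.ljust(PAYLOAD_SIZE, b"\0")
    crc = zlib.crc32(body) & 0xFFFFFFFF
    return HEADER.pack(seq, crc) + body + TRAILER.pack(seq)


def check_block(block):
    """Return (seq, body) if the block is self-consistent, else None."""
    if len(block) != BLOCK_SIZE:
        return None
    seq, crc = HEADER.unpack_from(block, 0)
    body = block[HEADER.size:HEADER.size + PAYLOAD_SIZE]
    (tail_seq,) = TRAILER.unpack_from(block, BLOCK_SIZE - TRAILER.size)
    if tail_seq != seq:
        return None                # front and back come from different writes
    if zlib.crc32(body) & 0xFFFFFFFF != crc:
        return None                # body does not match its recorded checksum
    return seq, body


if __name__ == "__main__":
    old = seal_block(1, b"old record")
    new = seal_block(2, b"new record")
    torn = new[:512] + old[512:]   # simulate a write cut off after 512 bytes
    print(check_block(new) is not None)
    print(check_block(torn) is None)
```

A check like this only detects a bad block; recovering the previous contents would still need a log or a second copy of the block, which is outside what the sketch shows.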
rpw3@redwood.UUCP (Rob Warnock) (07/17/85)
Just how bad can a power failure be? Well, how about wiping out the formatting (and therefore the data, to say the least) on several (even "many") cylinders on the disk? (Under Unix, might as well just reformat and hope your backup tapes are healthy!)

This can happen even if the disk drive has power-fail protection, if the drive is in an "expansion box" and powered by a separate power supply. As the power to the disk controller (in the main box) goes down, so does the power to the drive cable terminating resistors (which are normally pulled to +5 volts). This can, if you are unlucky, cause the "WRITE ENABLE L" signal to drop below the TTL threshold and start writing on the disk BEFORE the power to the disk (in the expansion box) drops enough to shut off the write amps in the disk. It all depends on the relative "hold-up" time of the two power supplies in the two boxes.

Conversely, if you have TWO disks on the same controller sharing a bussed "control" cable, if the expansion box power drops first, you can wipe the data & formatting on the disk in the main box.

One of the nice things about large DEC systems is that they have a power-fail line which is bussed between all of the expansion boxes (if the installer hooked them up correctly) and which causes ALL the boxes to panic and protect themselves if ANY of the boxes loses power. (Of course, if you are having troubles with power supplies, this bussed line makes it hard to figure out which box is causing the problem, so sometimes it gets unhooked for debugging and never gets put back...)

It is possible (and not too expensive) to protect disks fairly well from this sort of thing, but a lot of the current low-cost "desktop" computers don't bother. (*sigh*)

Rob Warnock
Systems Architecture Consultant

UUCP: {ihnp4,ucbvax!dual}!fortune!redwood!rpw3
DDD: (415)572-2607
USPS: 510 Trinidad Lane, Foster City, CA 94404