rcd@ico.isc.com (Dick Dunn) (09/25/90)
art@pilikia.pegasus.com (Art Neilson) writes:
> ...ir@cel.co.uk (ian reid) writes:
> > [ stuff about file system getting hosed when power is cycled
> > without performing graceful shutdown ... ]
> Every UNIX I have seen behaves in the manner you describe.  If you
> hit the red switch or experience a power outage without performing a
> graceful shutdown, you deserve whatever you get...

Years ago, that was generally true...and it was one of the major objections
to using UNIX in "commercial" systems.  As a result, essentially all
variants of UNIX have had file system changes to "harden" them against
problems caused by power failure.  Damage from a power outage should be
limited to files being written at the time the power went away, and should
be localized (e.g., a frozzed/missing block of data, not an entire file
gone or destroyed).  Going back to the original question:  If you're
seeing major file system damage due to power failures, there's something
wrong that should be fixed.  I'm not just spouting applehood/motherpie; I
haven't seen a file system damaged by power failure in years.  I've even
tried to damage file systems by getting things as busy as I could, then
turning off machines.  (Of course, the T-storm just now gathering over the
hills will probably destroy all my files and prove me to be drastically
wrong.:-)

The software in hardened file systems is pretty good at ensuring that
things get written when they should, as they should, so that fsck can pick
up the pieces.

This leaves some questions about hardware which were brought up in a couple
other postings on this topic.  There are old but
unfortunately-not-apocryphal stories about disk controllers which would
start writing zeros as power dropped.  That was a hardware bug; if it
happened to you nowadays you'd need to get your disk controller fixed or
replaced.  Taking the 386 PCish world in particular, there is no excuse for
a controller writing because of a power failure.

(Detail:  One pin out of a PC power supply is POWER GOOD.  On a low-voltage
condition, the power supply is expected to drop POWER GOOD; the
motherboard logic must use this to drive RESET on the bus.  Bus cards
must honor RESET as an indication of either system start-up or power
failure.  If this doesn't work, you've got a hardware problem.)

> ...If your UNIX box is used for real
> production work, you are quite foolish not to put it on an UPS...

Neilson signs himself from "Bank of Hawaii"--and I'm glad that someone
associated with banking is taking a conservative attitude on system
failure!  I hate to argue against cautiousness, but not all applications
are critical enough to make an UPS worthwhile.  (The cost of an UPS might
be 10-25% of the cost of the rest of the hardware.  They're getting more
affordable, but they're not cheap.)

If you need constant availability of systems, an UPS is essential.  If data
integrity is paramount, an UPS helps but there are other things you need to
do as well.  My point is that file systems and hardware are expected to be
robust enough that you should *not* tolerate power failures corrupting
file systems.
--
Dick Dunn     rcd@ico.isc.com -or- ico!rcd     Boulder, CO   (303)449-2870
   ...Worst-case analysis must never begin with "No one would ever want..."
karl@naitc.naitc.com (Karl Denninger) (09/26/90)
In article <1990Sep24.231148.18053@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>> Every UNIX I have seen behaves in the manner you describe.  If you
>> hit the red switch or experience a power outage without performing a
>> graceful shutdown, you deserve whatever you get...
>
>Years ago, that was generally true...and it was one of the major objections
>to using UNIX in "commercial" systems.  As a result, essentially all
>variants of UNIX have had file system changes to "harden" them against
>problems caused by power failure.  Damage from a power outage should be
>limited to files being written at the time the power went away, and should
>be localized (e.g., a frozzed/missing block of data, not an entire file
>gone or destroyed).  Going back to the original question:  If you're
>seeing major file system damage due to power failures, there's something
>wrong that should be fixed.  I'm not just spouting applehood/motherpie; I
>haven't seen a file system damaged by power failure in years.  I've even
>tried to damage file systems by getting things as busy as I could, then
>turning off machines.  (Of course, the T-storm just now gathering over the
>hills will probably destroy all my files and prove me to be drastically
>wrong.:-)

Ok, I've seen filesystem damage of this type, on your Operating System
(2.0.2), and another employee here has seen the same thing on his copy of
ISC 2.2.

To put it bluntly, there's something wrong that should be fixed.

>The software in hardened file systems is pretty good at ensuring that
>things get written when they should, as they should, so that fsck can pick
>up the pieces.

OK, so why did my /etc/default/boot file get whacked a few months back when
we had a power failure?

(For the unknowing, lacking an /etc/default/boot file, which is READ ONLY,
you can't boot the machine!)

>(Detail:  One pin out of a PC power supply is POWER GOOD.  On a low-voltage
>condition, the power supply is expected to drop POWER GOOD; the
>motherboard logic must use this to drive RESET on the bus.  Bus cards
>must honor RESET as an indication of either system start-up or power
>failure.  If this doesn't work, you've got a hardware problem.)

Host adapter was an Adaptec 1542B, disk a Maxtor (which has power-safe logic
that disables the write gate when power goes out of safe margins).

>If you need constant availability of systems, an UPS is essential.  If data
>integrity is paramount, an UPS helps but there are other things you need to
>do as well.  My point is that file systems and hardware are expected to be
>robust enough that you should *not* tolerate power failures corrupting
>file systems.

Ok Mr. Dunn, the gauntlet has been thrown down.  If you want details of the
failures we have had with YOUR OS (btw, SunOS4.1 doesn't seem to take these
hits) you're welcome to call me here.

I await your response.
--
Karl Denninger	AC Nielsen
kdenning@ksun.naitc.com
(708) 317-3285
Disclaimer: Contents represent opinions of the author; I do not speak for
	    AC Nielsen on Usenet.
rcd@ico.isc.com (Dick Dunn) (09/27/90)
I had opined that you shouldn't see file system damage on a power hit, and
also noted that I hadn't seen damage (beyond files being written during the
hit) for quite a few years.

karl@naitc.naitc.com (Karl Denninger) writes:
> Ok, I've seen filesystem damage of this type, on your Operating System
> (2.0.2), and another employee here has seen the same thing on his copy of
> ISC 2.2.
>
> To put it bluntly, there's something wrong that should be fixed.

This sort of thing is tough to work out without a lot of detail, but since
Karl has said, "the gauntlet has been thrown down" let's see if we can make
some progress on it here.  I'm game--I don't want to find out "the hard
way" that there are ways to take major damage from a power failure, so if
Karl has seen it, I'd like to learn from it.

> OK, so why did my /etc/default/boot file get whacked a few months back when
> we had a power failure?
...
> (For the unknowing, lacking an /etc/default/boot file, which is READ ONLY,
> you can't boot the machine!)

"Whacked" is a little too technical for me just yet.  Do you mean that it
ended up empty, or missing entirely?  After you recreated it and got the
system back up, did anything like the boot file show up in lost+found?
Was the rest of /etc/default OK, or did it take out the whole directory?

Here's what I'm trying to get at:  If the file was corrupted or gone,
something got written that shouldn't have been written.  The first task is
to find out what got written.  The sort of reasoning goes like this:
    - If /etc/default (the directory containing boot) got corrupted,
      I'd want to know what ended up there, because that directory
      shouldn't be subject to change during "normal" system operation.
    - If the inode for boot got corrupted, you'd expect a chunk of
      inodes (one disk sector) to get it...and it's likely that other
      files would be hit also.  The boot parameter file is likely to
      share its inode sector with other files that are "important" but
      seldom modified.  An access-time update could have been in
      progress when the power failed.  If it toasted a full sector,
      you'd expect to see other important files damaged or gone.

> Host adapter was an Adaptec 1542B, disk a Maxtor (which has power-safe logic
> that disables the write gate when power goes out of safe margins).

Sounds good so far.  What's the box?  If you've built it up from parts,
then what's the motherboard?  As you can guess, I don't yet see cause to
say that either hardware or software is either guilty or innocent.

Again, if something got corrupted, it means that something got written
that shouldn't have been written.  The problem--and it's NOT likely to be
an easy one--is to find out what was written wrong.  That's likely to give
a clue whether it's hardware or software (or a conspiracy of the two:-).

> Ok Mr. Dunn, the gauntlet has been thrown down.  If you want details of the
> failures we have had with YOUR OS (btw, SunOS4.1 doesn't seem to take these
> hits) you're welcome to call me here.

I don't follow the connection to SunOS4.1--correct me if I'm wrong, but I
didn't think there was a hardware platform common to ISC's Sys V.3.2 and
SunOS.  (386i???)

(My OS???  Let's clarify:  I do use ISC systems, both at work and at home.
I'm taking an interest in this because I want to know how and why the
failures you've seen can happen--it's an important question.  But I'm not
speaking for ISC on the net.)
--
Dick Dunn     rcd@ico.isc.com -or- ico!rcd     Boulder, CO   (303)449-2870
   ...Worst-case analysis must never begin with "No one would ever want..."
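
[Editorial sketch]  The inode-sector reasoning in the post above can be
made concrete with a little arithmetic: given one inode number, which other
inodes live in the same disk sector?  The 64-byte on-disk inode, the
512-byte sector, the inode number 42, and the function name
inodes_sharing_sector are all assumptions for illustration; none of them is
stated in this thread, so check the parameters of your own file system.

```python
# Sketch only: which inodes share one disk sector with a given inode?
# ASSUMPTIONS (not from the thread): 64-byte on-disk inodes, 512-byte
# sectors, inode numbers starting at 1, and an inode list that begins
# on a sector boundary.

INODE_SIZE = 64      # bytes per on-disk inode (assumed)
SECTOR_SIZE = 512    # bytes per disk sector (assumed)


def inodes_sharing_sector(inum, inode_size=INODE_SIZE, sector_size=SECTOR_SIZE):
    """Return the inode numbers stored in the same sector as inode `inum`."""
    if inum < 1:
        raise ValueError("inode numbers start at 1")
    per_sector = sector_size // inode_size
    # Inode 1 sits in slot 0 of the first sector of the inode list.
    first = ((inum - 1) // per_sector) * per_sector + 1
    return list(range(first, first + per_sector))


if __name__ == "__main__":
    # 42 is a hypothetical inode number for the boot parameter file.
    print(inodes_sharing_sector(42))
```

Under these assumptions, a write that trashed one whole sector would take
out every inode in the returned list, which is why a single missing file
with healthy neighbors would argue against a whole-sector hit.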
rdc30med@nmrdc1.nmrdc.nnmc.navy.mil (LCDR Michael E. Dobson) (09/27/90)
My system, an AT&T 3B2/600G running AT&T Sys V R 3.2.2, supposedly has a
hardened file system; however, I have had to restore from a boot floppy and
tape on occasion after a power failure.  Because of this, in addition to an
UPS, I have installed a power-failure monitor which automatically begins
the shutdown sequence when it senses a power failure from the primary power
source.  The UPS provides sufficient time for users to log off and for the
shutdown sequence to complete.  For a cost of ~$250 for the
transducer/software, it's a very good investment.  E-mail if you want
details on the product.
--
Mike Dobson, Sys Admin for      | Internet: rdc30med@nmrdc1.nmrdc.nnmc.navy.mil
nmrdc1.nmrdc.nnmc.navy.mil      | UUCP:     ...uunet!mimsy!nmrdc1!rdc30med
AT&T 3B2/600G Sys V R 3.2.2     | BITNET:   dobson@usuhsb.bitnet
WIN/TCP for 3B2                 | MCI-Mail: 377-2719 or 0003772719@mcimail.com
karl@naitc.naitc.com (Karl Denninger) (09/29/90)
In article <1990Sep26.192446.22110@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>I had opined that you shouldn't see file system damage on a power hit, and
>also noted that I hadn't seen damage (beyond files being written during the
>hit) for quite a few years.
....
>> OK, so why did my /etc/default/boot file get whacked a few months back when
>> we had a power failure?
>...
>> (For the unknowing, lacking an /etc/default/boot file, which is READ ONLY,
>> you can't boot the machine!)
>
>"Whacked" is a little too technical for me just yet.  Do you mean that it
>ended up empty, or missing entirely?  After you recreated it and got the
>system back up, did anything like the boot file show up in lost+found?
>Was the rest of /etc/default OK, or did it take out the whole directory?

Whacked means that fsck unlinked it entirely; the file was gone.  It did not
show up in lost+found; it was "cleared".

If it wasn't for my figuring out what happened (for the uninitiated, you
never want to be in this position; you don't get to see the error message
about the file being missing, the machine just doesn't come up, with no
apparent cause) and recreating it from a boot floppy, I would have had to
reload the entire OS.  As it was I lost a couple of hours figuring out why
my machine wouldn't come up and fixing it (the majority of that time was
the figuring out part).

>Here's what I'm trying to get at:  If the file was corrupted or gone,
>something got written that shouldn't have been written.  The first task is
>to find out what got written.  The sort of reasoning goes like this:
>    - If /etc/default (the directory containing boot) got corrupted,
>      I'd want to know what ended up there, because that directory
>      shouldn't be subject to change during "normal" system operation.
>    - If the inode for boot got corrupted, you'd expect a chunk of
>      inodes (one disk sector) to get it...and it's likely that other
>      files would be hit also.  The boot parameter file is likely to
>      share its inode sector with other files that are "important" but
>      seldom modified.  An access-time update could have been in
>      progress when the power failed.  If it toasted a full sector,
>      you'd expect to see other important files damaged or gone.

The boot file itself was good, as the system did give the "Booting" message
and once the default file was put back, all was well.

Also note that the physical format on the disk was just fine; normally if
power is interrupted >during< a write and you get damage the sector will
then have a "hard" error on it.  This was not the case.

>> Host adapter was an Adaptec 1542B, disk a Maxtor (which has power-safe logic
>> that disables the write gate when power goes out of safe margins).
>
>Sounds good so far.  What's the box?  If you've built it up from parts,
>then what's the motherboard?  As you can guess, I don't yet see cause to
>say that either hardware or software is either guilty or innocent.

Compaq 386, and we've seen the same kind of problem with an AT&T 6386 with
the same (and different) disk/adapter combinations.  A colleague of mine who
sits across the hall has had many files killed or corrupted (like X11 files
which are normally read-only, /etc/netd.cf, minor things like that) from
power failures.  We finally gave up and put UPSs on both machines; that has
stopped the insanity.

I've seen this same failure mode with MFM, RLL, ESDI and SCSI disks across
lots of different platforms -- but only with ISC OSs.

>Again, if something got corrupted, it means that something got written
>that shouldn't have been written.  The problem--and it's NOT likely to be
>an easy one--is to find out what was written wrong.  That's likely to give
>a clue whether it's hardware or software (or a conspiracy of the two:-).

Yep.
I've seen this with every ISC release since the dawn of time, and have
NEVER seen this kind of problem on identical hardware with SCO Xenix (don't
know about SCO Unix, haven't run it for any length of time).  Examples have
been had from 1.0.6, 2.0, 2.0.1, 2.0.2, and now 2.2.

There's something stinky in that "bitmapped filesystem monster" that is used
in the ISC system.  Yes, it does speed up the filesystem.  It's also
dangerous without power protection.  It doesn't bite you all the time, but
it does get you often enough to make a UPS a mandatory part of all ISC
systems unless you like testing the integrity of your backup media under
fire.

>> Ok Mr. Dunn, the gauntlet has been thrown down.  If you want details of the
>> failures we have had with YOUR OS (btw, SunOS4.1 doesn't seem to take these
>> hits) you're welcome to call me here.
>
>I don't follow the connection to SunOS4.1--correct me if I'm wrong, but I
>didn't think there was a hardware platform common to ISC's Sys V.3.2 and
>SunOS.  (386i???)

We have lots of Sun machines here at the same location; they take the same
power plunge that the rest of the gear does.  Never has one of these
machines been burned when there is a problem with power.  Files which are
open for writing often do get lunched, yes, but that's a risk with ANY
filesystem.  Read-only parameter files, kernels, etc. have never been
damaged on the Suns.  They have, many times, on ISC.

American Power Conversion loves us; we're using lots of their UPS systems to
back up the Compaqs.

>(My OS???  Let's clarify:  I do use ISC systems, both at work and at home.
>I'm taking an interest in this because I want to know how and why the
>failures you've seen can happen--it's an important question.  But I'm not
>speaking for ISC on the net.)

Ok... that was a misunderstanding on my part.  I have, however, heard you
champion the company's products more than once here.

To make it short and sweet -- if you run ISC, make sure you have a UPS or
risk the loss of your filesystems.
--
Karl Denninger	AC Nielsen
kdenning@ksun.naitc.com
(708) 317-3285
Disclaimer: Contents represent opinions of the author; I do not speak for
	    AC Nielsen on Usenet.
rdc30med@nmrdc1.nmrdc.nnmc.navy.mil (LCDR Michael E. Dobson) (09/29/90)
In article <1990Sep27.132549.10168@nmrdc1.nmrdc.nnmc.navy.mil> I wrote:
> [ ..... ]  Because of this, in addition to an UPS, I
>have installed a power-failure monitor which automatically begins the shutdown
>sequence when it senses a power failure from the primary power source.  The
>UPS provides time for users to log off and for the shutdown sequence to
>complete.  For a cost of ~$250 for the transducer/software, it's a very good
>investment.  E-mail if you want details on the product.

Because of several requests, I am posting the following:

The system I mentioned is called Showdown and is distributed by:

	Continental Information Systems Corporation
	PO Box 248
	Itasca, IL 60143-0248
	(312) 250-8111

It is available for a variety of platforms; I don't recall the complete
list right off and can't find the product brochure.  It consists of a
transducer that plugs into the wall and a serial port on the computer, and
has software to monitor that port.  When a power failure is sensed, it
begins your system's normal shutdown procedure with user-definable delays
for warnings and the start of the final shutdown sequence.  These should be
tailored to your UPS to ensure things get closed down before the power
really dies.

Hope this helps,
--
Mike Dobson, Sys Admin for      | Internet: rdc30med@nmrdc1.nmrdc.nnmc.navy.mil
nmrdc1.nmrdc.nnmc.navy.mil      | UUCP:     ...uunet!mimsy!nmrdc1!rdc30med
AT&T 3B2/600G Sys V R 3.2.2     | BITNET:   dobson@usuhsb.bitnet
WIN/TCP for 3B2                 | MCI-Mail: 377-2719 or 0003772719@mcimail.com
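
[Editorial sketch]  The advice to tailor the delays to your UPS comes down
to simple arithmetic: the warning delay plus the time the shutdown itself
takes has to fit inside the UPS runtime, preferably with some reserve.  The
sketch below is not Showdown's configuration or interface, which the post
does not describe; the function name shutdown_plan, its parameters, and the
sample figures are all hypothetical.

```python
# Sketch only: does a warning delay plus shutdown time fit inside the
# UPS runtime?  This is NOT the Showdown product's interface; every
# name and number here is a hypothetical illustration.


def shutdown_plan(ups_runtime_s, warning_delay_s, shutdown_duration_s,
                  safety_margin_s=60):
    """Return (fits, slack_s) for a power-failure shutdown schedule.

    ups_runtime_s       -- seconds the UPS can carry the load
    warning_delay_s     -- seconds users get to log off before shutdown starts
    shutdown_duration_s -- seconds the normal shutdown procedure takes
    safety_margin_s     -- reserve kept for an aged or partly charged battery
    """
    needed = warning_delay_s + shutdown_duration_s + safety_margin_s
    slack = ups_runtime_s - needed
    return slack >= 0, slack


if __name__ == "__main__":
    # Hypothetical figures: 10-minute UPS, 5-minute warning, 2-minute shutdown.
    fits, slack = shutdown_plan(600, 300, 120)
    print("fits" if fits else "too slow", "slack:", slack, "seconds")
```

A negative slack means the delays are too generous for the battery, and the
warning period is usually the easiest number to shorten.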