[comp.unix.questions] Reliability of

rcd@ico.isc.com (Dick Dunn) (09/25/90)

art@pilikia.pegasus.com (Art Neilson) writes:
> ...ir@cel.co.uk (ian reid) writes:
> > [ stuff about file system getting hosed when power is cycled
> >   without performing graceful shutdown ... ]

> Every UNIX I have seen behaves in the manner you describe.  If you
> hit the red switch or experience a power outage without performing a
> graceful shutdown, you deserve whatever you get...

Years ago, that was generally true...and it was one of the major objections
to using UNIX in "commercial" systems.  As a result, essentially all
variants of UNIX have had file system changes to "harden" them against
problems caused by power failure.  Damage from a power outage should be
limited to files being written at the time the power went away, and should
be localized (e.g., a frozzed/missing block of data, not an entire file
gone or destroyed).  Going back to the original question:  If you're
seeing major file system damage due to power failures, there's something
wrong that should be fixed.  I'm not just spouting applehood/motherpie; I
haven't seen a file system damaged by power failure in years.  I've even
tried to damage file systems by getting things as busy as I could, then
turning off machines.  (Of course, the T-storm just now gathering over the
hills will probably destroy all my files and prove me to be drastically
wrong.:-)

The software in hardened file systems is pretty good at ensuring that
things get written when they should, as they should, so that fsck can pick
up the pieces.

This leaves some questions about hardware which were brought up in a couple
of other postings on this topic.  There are old but unfortunately-not-
apocryphal stories about disk controllers which would start writing zeros as
power dropped.  That was a hardware bug; if it happened to you nowadays
you'd need to get your disk controller fixed or replaced.  Taking the 386
PCish world in particular, there is no excuse for a controller writing
because of a power failure.

(Detail:  One pin out of a PC power supply is POWER GOOD.  On a low-
voltage condition, the power supply is expected to drop POWER GOOD; the
motherboard logic must use this to drive RESET on the bus.  Bus cards
must honor RESET as an indication of either system start-up or power
failure.  If this doesn't work, you've got a hardware problem.)

> ...If your UNIX box is used for real
> production work, you are quite foolish not to put it on an UPS...

Neilson signs himself from "Bank of Hawaii"--and I'm glad that someone
associated with banking is taking a conservative attitude on system
failure!  I hate to argue against cautiousness, but not all applications
are critical enough to make an UPS worthwhile.  (The cost of an UPS might
be 10-25% of the cost of the rest of the hardware.  They're getting more
affordable, but they're not cheap.)

If you need constant availability of systems, an UPS is essential.  If data
integrity is paramount, an UPS helps but there are other things you need to
do as well.  My point is that file systems and hardware are expected to be
robust enough that you should *not* tolerate power failures corrupting
file systems.
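
The post does not say what the "other things" are.  As one assumed example
(not something the poster named), an application that cares about a file
surviving a power cut can write the new contents to a temporary file, force
them to disk, and only then rename over the old file.  The sketch below is a
minimal Python illustration for POSIX systems; the function name
atomic_write is hypothetical.

```python
import os
import tempfile


def atomic_write(path, data):
    """Replace the file at `path` with `data` (bytes).

    A reader, or a reboot after power loss, should see either the old
    contents or the new contents, not a half-written mixture.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file in the same directory so the final
    # rename stays within one file system.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".tmp-")
    try:
        with os.fdopen(fd, "wb") as tmp:
            tmp.write(data)
            tmp.flush()
            os.fsync(tmp.fileno())   # push the data blocks to the disk
        os.replace(tmp_path, path)   # atomic rename on POSIX
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself is on disk (POSIX only).
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Note that the replacement file carries the temporary file's permissions
(owner read/write only), so a real program would probably want to copy the
original mode across before the rename.  None of this protects against
hardware that writes garbage as power drops.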
-- 
Dick Dunn     rcd@ico.isc.com -or- ico!rcd       Boulder, CO   (303)449-2870
   ...Worst-case analysis must never begin with "No one would ever want..."

karl@naitc.naitc.com (Karl Denninger) (09/26/90)

In article <1990Sep24.231148.18053@ico.isc.com> rcd@ico.isc.com (Dick Dunn) writes:
>> Every UNIX I have seen behaves in the manner you describe.  If you
>> hit the red switch or experience a power outage without performing a
>> graceful shutdown, you deserve whatever you get...
>
>Years ago, that was generally true...and it was one of the major objections
>to using UNIX in "commercial" systems.  As a result, essentially all
>variants of UNIX have had file system changes to "harden" them against
>problems caused by power failure.  Damage from a power outage should be
>limited to files being written at the time the power went away, and should
>be localized (e.g., a frozzed/missing block of data, not an entire file
>gone or destroyed).  Going back to the original question:  If you're
>seeing major file system damage due to power failures, there's something
>wrong that should be fixed.  I'm not just spouting applehood/motherpie; I
>haven't seen a file system damaged by power failure in years.  I've even
>tried to damage file systems by getting things as busy as I could, then
>turning off machines.  (Of course, the T-storm just now gathering over the
>hills will probably destroy all my files and prove me to be drastically
>wrong.:-)

Ok, I've seen filesystem damage of this type, on your Operating System
(2.0.2), and another employee here has seen the same thing on his copy of
ISC 2.2.

To put it bluntly, there's something wrong that should be fixed.

>The software in hardened file systems is pretty good at ensuring that
>things get written when they should, as they should, so that fsck can pick
>up the pieces.

OK, so why did my /etc/default/boot file get whacked a few months back when
we had a power failure?

(For the unknowing, lacking an /etc/default/boot file, which is READ ONLY,
you can't boot the machine!)

>(Detail:  One pin out of a PC power supply is POWER GOOD.  On a low-
>voltage condition, the power supply is expected to drop POWER GOOD; the
>motherboard logic must use this to drive RESET on the bus.  Bus cards
>must honor RESET as an indication of either system start-up or power
>failure.  If this doesn't work, you've got a hardware problem.)

Host adapter was an Adaptec 1542B, disk a Maxtor (which has power-safe logic
that disables the write gate when power goes out of safe margins).

>If you need constant availability of systems, an UPS is essential.  If data
>integrity is paramount, an UPS helps but there are other things you need to
>do as well.  My point is that file systems and hardware are expected to be
>robust enough that you should *not* tolerate power failures corrupting
>file systems.

Ok Mr. Dunn, the gauntlet has been thrown down.  If you want details of the
failures we have had with YOUR OS (btw, SunOS4.1 doesn't seem to take these
hits) you're welcome to call me here.  

I await your response.


--
Karl Denninger	AC Nielsen
kdenning@ksun.naitc.com
(708) 317-3285
Disclaimer:  Contents represent opinions of the author; I do not speak for
	     AC Nielsen on Usenet.

rdc30med@nmrdc1.nmrdc.nnmc.navy.mil (LCDR Michael E. Dobson) (09/27/90)

My system, an AT&T 3B2/600G running AT&T Sys V R 3.2.2, supposedly has a
hardened file system; however, I have had to restore from a boot floppy and
tape on occasion after a power failure.  Because of this, in addition to an
UPS, I have installed a power-failure monitor which automatically begins the
shutdown sequence when it senses a power failure from the primary power
source.  The UPS provides sufficient time for users to log off and for the
shutdown sequence to complete.  For a cost of ~$250 for the
transducer/software, it's a very good investment.  E-mail if you want
details on the product.
-- 
Mike Dobson, Sys Admin for      | Internet: rdc30med@nmrdc1.nmrdc.nnmc.navy.mil
nmrdc1.nmrdc.nnmc.navy.mil      | UUCP:   ...uunet!mimsy!nmrdc1!rdc30med
AT&T 3B2/600G Sys V R 3.2.2     | BITNET:   dobson@usuhsb.bitnet
WIN/TCP for 3B2                 | MCI-Mail: 377-2719 or 0003772719@mcimail.com

rdc30med@nmrdc1.nmrdc.nnmc.navy.mil (LCDR Michael E. Dobson) (09/29/90)

In article <1990Sep27.132549.10168@nmrdc1.nmrdc.nnmc.navy.mil> I wrote: 
> [ ..... ]                       Because of this, in addition to an UPS, I
>have installed a power-failure monitor which automatically begins the shutdown
>sequence when it senses a power failure from the primary power source.  The
>UPS provides time for users to log off and for the shutdown sequence to
>complete.  For a cost of ~$250 for the transducer/software, it's a very good
>investment.  E-mail if you want details on the product.

Because of several requests, I am posting the following:

The system I mentioned is called Showdown and is distributed by:

	Continental Information Systems Corporation
	PO Box 248
	Itasca, IL 60143-0248
	(312) 250-8111

It is available for a variety of platforms; I don't recall the complete list
offhand and can't find the product brochure.  It consists of a transducer
that plugs into the wall and a serial port on the computer and has software
to monitor that port.  When a power failure is sensed, it begins your system's
normal shutdown procedure with user definable delays for warnings and the
start of the final shutdown sequence.  These should be tailored to your UPS
to ensure things get closed down before the power really dies.
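
To illustrate the tailoring described above, here is a rough Python sketch of
the arithmetic: given how long the UPS can carry the load, check that the
warning delay, the grace period for users, and the shutdown itself all fit
with some margin to spare.  This is not the Showdown software, whose
configuration is not shown here; plan_shutdown, action_at, and all the
parameter names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ShutdownPlan:
    warn_at: float       # seconds after power loss when users are warned
    shutdown_at: float   # seconds after power loss when shutdown starts
    done_by: float       # seconds after power loss when shutdown should end


def plan_shutdown(ups_runtime_s, warn_delay_s, grace_s, shutdown_duration_s,
                  safety_margin_s=60.0):
    """Build a plan, or raise ValueError if it cannot fit the UPS runtime."""
    warn_at = warn_delay_s
    shutdown_at = warn_at + grace_s
    done_by = shutdown_at + shutdown_duration_s
    if done_by + safety_margin_s > ups_runtime_s:
        raise ValueError(
            "shutdown would finish at %.0fs; UPS lasts only %.0fs "
            "(margin %.0fs)" % (done_by, ups_runtime_s, safety_margin_s))
    return ShutdownPlan(warn_at, shutdown_at, done_by)


def action_at(plan, seconds_without_power):
    """What the monitor should be doing this long after power was lost."""
    if seconds_without_power >= plan.shutdown_at:
        return "shutdown"
    if seconds_without_power >= plan.warn_at:
        return "warn"
    return "wait"
```

For example, with a UPS good for 600 seconds, a 30-second delay before
warning (to ride out brief glitches), 120 seconds for users to log off, and
a shutdown that takes 180 seconds, the plan finishes 330 seconds after power
loss and fits with the default 60-second margin.  The same delays on a
300-second UPS are rejected.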

Hope this helps,

-- 
Mike Dobson, Sys Admin for      | Internet: rdc30med@nmrdc1.nmrdc.nnmc.navy.mil
nmrdc1.nmrdc.nnmc.navy.mil      | UUCP:   ...uunet!mimsy!nmrdc1!rdc30med
AT&T 3B2/600G Sys V R 3.2.2     | BITNET:   dobson@usuhsb.bitnet
WIN/TCP for 3B2                 | MCI-Mail: 377-2719 or 0003772719@mcimail.com