[net.unix-wizards] "bitrot" on magnetic media: is there such a thing?

charles@c3pe.UUCP (07/31/86)

In article <826@PUCC.BITNET> D0430@PUCC.BITNET writes:
>In article <147@itcatl.UUCP>, robin@itcatl.UUCP (Robin Cutshaw) writes:
>>I have been getting numerous soft errors on the G partition
>>and a few hard errors there.
> 
>We had the same problem with an rd53, lots of soft errors, turning into
>hard errors, running rabads every day to replace it, etc.  We finally
>just reformatted the disk and all the problems vanished.

That raises an interesting question:  I've noticed a similar phenomenon with
certain 5.25" Winchesters.  In our case (Xenix), reformatting destroys special
"bad sector" flags in the address fields of bad blocks.  These are used to
construct a map later used by the kernel to make all partitions appear "clean";
if one reformats the disk without noting previously-bad sectors, they may come
back to bite later.  Thus, I'm reluctant to tell someone to reformat their disk
unless they're getting "Address Not Found" errors.
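
The precaution described above (note the flagged sectors before reformatting, then put the flags back) can be sketched as follows. This is only an illustration, not the Xenix mechanism: the dictionary standing in for the on-disk address fields and all three function names are hypothetical.

```python
# Hypothetical model: a disk is a dict mapping (cylinder, head, sector)
# to True when that sector's address field carries a "bad sector" flag.

def collect_bad_sectors(address_fields):
    """Return the sorted addresses currently flagged as bad."""
    return sorted(addr for addr, flagged in address_fields.items() if flagged)

def reformat(address_fields):
    """Model a reformat: every address field is rewritten, every flag is lost."""
    return {addr: False for addr in address_fields}

def reapply_bad_sectors(address_fields, bad_sectors):
    """Return a copy of the address fields with the saved flags restored."""
    restored = dict(address_fields)
    for addr in bad_sectors:
        restored[addr] = True
    return restored

disk = {(0, 0, 1): False, (0, 0, 2): True, (3, 1, 7): True}
saved = collect_bad_sectors(disk)          # note the bad sectors first
disk = reapply_bad_sectors(reformat(disk), saved)
```

Skipping the first step is the case to worry about: after `reformat` alone, the previously bad sectors look clean until they fail again.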

But I'm beginning to wonder:  after the address marks are written on a disk
during formatting, as the years go by, do they gradually "entrophy"
(atrophy via entropy!), or melt into the noise?
-- 
_____-__---__-_----_-__-_-_----__-_--_--___-_---__-__-_-_-_--__-_-__--___-_____
-Charles Green at C3 Inc.	{{styx!seismo,cvl}!decuac,dolqci}!c3pe!charles
You hear the howling of the Winchester. The voltage spike hits! You crash.-More

grr@cbmvax.cbm.UUCP (George Robbins) (08/08/86)

In article <217@c3pe.UUCP> charles@c3pe.UUCP (Charles Green) writes:
>
>But I'm beginning to wonder:  after the address marks are written on a disk
>during formatting, as the years go by, do they gradually "entrophy"
>(atrophy via entropy!), or melt into the noise?

Yes, there is such a thing as 'bitrot', but I'm sure you could get quite a
bit of argument going on the subject...

The following notions should be considered:

Magnetic storage devices work when a magnetic field from a read/write head
magnetizes the vast majority of magnetic particles in an area on the media
in the same orientation.  Not every particle cooperates, and some
are likely to change, as a function of time and temperature.  Also, the head
assembly may retain a slight magnetic field, and/or be subject to leakage
currents from the circuitry.  This can also encourage particles to change
orientation.  As the number of appropriately oriented particles diminishes,
the signal to noise ratio seen by the read circuitry will decrease and you
may eventually see problems.

Note that there are other problems that can cause similar symptoms, such
as media wear, gradual change of drive speed, thermal effects and shifts
in positioner repeatability.  Drives that do not move the heads to a parking
position may also suffer occasional glitches on power down.

The problem is not limited to 5.25" drives; I've had problems with 100MB
disk pack drives, where 'read-only' system packs had to be reformatted
maybe twice a year when random errors started to occur.  (note that ECC
and/or track substitution was not involved here).

Thoughtful comment appreciated...
--
George Robbins - now working with,      uucp: {ihnp4|seismo|caip}!cbmvax!grr
but no way officially representing      arpa: cbmvax!grr@seismo.css.GOV
Commodore, Engineering Department       fone: 215-431-9255 (only by moonlite)

cprice@vianet.UUCP (Charlie Price) (08/08/86)

> -Charles Green at C3 Inc.	{{styx!seismo,cvl}!decuac,dolqci}!c3pe!charles
> But I'm beginning to wonder:  after the address marks are written on a disk
> during formatting, as the years go by, do they gradually "entrophy"
> (atrophy via entropy!), or melt into the noise?

The answer is -- YES.
Data (and "format info" is just data) *can* degrade on a magnetic disk;
though not just from some random evaporation into the air.

I used to work for Storage Technology Corp (in the very recent past)
and I'm familiar with at least one mechanism for gradually degrading recorded
data on the disk.  Of course, Storage Tek makes fairly big disks
(2.5 Gbyte Head Disk Assembly using 14" disks), but the physics is the same.

In a winchester technology disk you have read/write heads flying
REALLY CLOSE to a disk.  What happens if there are any little particles
of gruk in the drive?  If it is the right kind of gruk and the right
sized particles the particle can either provide a material to rub
"under" the head or it can just bang the head around and cause it
to "bounce" and momentarily touch down on the surface.
If this is really bad, you have a crash in the making.
If it isn't too bad, the contact (in the disk business this is
called head-disk-interface) maybe knocks some more particles loose from
the disk surface and generates a whole bunch of short-lived localized heat.
If you heat up a magnetized material above some particular temperature
for the material, called the Curie point, the magnetic domains can move.
Since the media isn't in a strong field here, it will probably demagnetize.
If this happens repeatedly in the same area, the stored data can actually
degrade to the point it can't be read.
Though they believe it had always been going on, Storage Tech only noticed
this behavior with the most recent generation of drive technology
(very low-mass thin-film heads flying REALLY close to the media surface).
[A cleaner clean room eliminated the problem].

A typical cheap winchester is using technology that isn't as prone
to this sort of problem (head flight fairly far away from the disk).
If it weren't build-it-and-ship-it technology the drives would
be too expensive.

Gradually degrading behavior on a disk drive can indicate that it
is gradually getting dirtier (start-stop can kick loose particles).
Reformatting and/or rewriting all data CAN help but clearly doesn't
make the problem go away.

-- 
Charlie Price    {hao stcvax nbires}!vianet!cprice    (303) 440-0700
ViaNetix, Inc. / 2900 Center Green Ct. South / Boulder, CO   80301

jc@sdcsvax.UUCP (John Cornelius) (08/14/86)

Bits normally do not rot on magnetic media, at least not in the lifetime of a
winchester disk drive.  Bits have been known to spread and/or migrate on reels
of magnetic tape that have been stored for long periods of time (7-10 years)
but one would not expect this behaviour on a disk drive.

The most probable causes of 'bit rot' on a disk drive are:
1)	Worn or defective erase heads, might be caused by head crashes or
	chronic power off on the disk.  When you power down a winchester the
	heads crash and are subject to wear.

2)	Worn or defective write heads, they can spread the bits out resulting
	in lower signal/noise ratio.

3)	Defective media.  The retentivity of the media may not be good enough.
	The tendency of magnetic media is toward a uniform polarization.  The
	media is designed to have a half life in the tens of years and in some
	cases hundreds of years.  The drive itself should disintegrate before
	the half life is reached.  On the other hand, nobody's perfect and on
	occasion less than ideal media winds up in disk drives.

4)	Impurities in the HDA environment can make the heads less sensitive and
	less precise.  This is the infamous problem with the RA-81 where glue
	vaporized inside the HDA and began coating the heads and disks.  The
	usually untimely result is a catastrophic head crash but read errors
	often precede the crash so you have some warning.

5)	Defective head selection matrix resulting in small write currents on
	unselected heads during writing.  This will often be followed by a
	catastrophic failure and hard 'write-fault' or 'head select' errors.
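
As a rough check on the half-life remark in item 3: if the decay is taken to be a simple exponential (an assumption, as is the 30-year figure used below), the fraction of the original magnetization left after a given time is:

```python
def remaining_fraction(years, half_life_years):
    """Fraction left after `years`, assuming simple exponential decay."""
    return 0.5 ** (years / half_life_years)

# Assumed figures: a 5-year drive life against a 30-year media half-life.
after_five_years = remaining_fraction(5, 30)   # roughly 0.89
```

On those assumed numbers about 89% is still there when the drive is retired, which fits the point that good media should outlast the drive.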

The first two of these causes can be avoided by never turning your winchester
off.  The last three are harder to avoid but judicious selection of disk
vendors can be a help.  Lowest purchase price does not usually have anything to
do with lowest cost to own.

The wisdom of leaving your winchester running, even if the system it is
connected to is not running, cannot be too heavily stressed.  Winchesters are
designed for a continuous operating environment, not a sporadic one.  There is
a school of thought that being nice to your disk drive involves turning it off
when it is not in use.  I recognize that this thinking has some intuitive basis
but it is, alas, quite incorrect.

John Cornelius
aka jc@sdcsvax

geoff@desint.UUCP (Geoff Kuenning) (08/16/86)

In article <1978@sdcsvax.UUCP> jc@sdcsvax.UUCP (John Cornelius) writes:

> The wisdom of leaving your winchester running, even if the system it is
> connected to is not running, cannot be too heavily stressed.  Winchesters are
> designed for a continuous operating environment, not a sporadic one.  There is
> a school of thought that being nice to your disk drive involves turning it off
> when it is not in use.  I recognize that this thinking has some intuitive
> basis but it is, alas, quite incorrect.

I wonder if John could give us some references to support this contention.
In particular, one of the failure modes I have seen in Winchesters is
bearing failure.  Bearing wear is directly related to on-time, not to
the number of startup/shutdown cycles.

Let's remember that a lot of Winchesters are spec'ed with MTBF's of
10,000 hours or less.  There are 8760 hours in a year, so if you leave
your Winchesters on 24 hours a day, you can expect the average one to
fail after about 14 months.
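
The arithmetic above can be worked through in a short sketch. Two things in it are assumptions rather than anything from a drive spec: that failures depend only on powered-on hours, and that the failure rate is constant (the usual exponential model behind an MTBF figure). The function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760

def months_to_mtbf(mtbf_hours, hours_on_per_day=24.0):
    """Calendar months needed to accumulate mtbf_hours of on-time."""
    hours_per_month = hours_on_per_day * 365 / 12
    return mtbf_hours / hours_per_month

def surviving_fraction(hours, mtbf_hours):
    """Fraction of drives still working after `hours`, constant failure rate."""
    return math.exp(-hours / mtbf_hours)

def median_life_hours(mtbf_hours):
    """On-time by which half the drives have failed, constant failure rate."""
    return mtbf_hours * math.log(2)

always_on = months_to_mtbf(10_000)            # about 13.7 months
eight_hour_day = months_to_mtbf(10_000, 8.0)  # about 41 months
```

Under the constant-rate assumption only about 37% of drives are still running at the MTBF point, and half have failed by roughly 6,900 hours.  None of this accounts for wear from start/stop cycles.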
-- 

	Geoff Kuenning
	{hplabs,ihnp4}!trwrb!desint!geoff

bass@dmsd.UUCP (John Bass) (08/18/86)

In article <247@desint.UUCP>, geoff@desint.UUCP (Geoff Kuenning) writes:
>In article <1978@sdcsvax.UUCP> jc@sdcsvax.UUCP (John Cornelius) writes:
>
>>The wisdom of leaving your winchester running, even if the system it is
>>connected to is not running, cannot be too heavily stressed.  Winchesters are
>>designed for a continuous operating environment, not a sporadic one.  There is
>>a school of thought that being nice to your disk drive involves turning it off
>>when it is not in use.  I recognize that this thinking has some intuitive
>>basis but it is, alas, quite incorrect.
> 
> I wonder if John could give us some references to support this contention.
> In particular, one of the failure modes I have seen in Winchesters is
> bearing failure.  Bearing wear is directly related to on-time, not to
> the number of startup/shutdown cycles.

Sorry, but bearing wear is also a function of the number of cold starts,
running temp, and thermal cycling. On-time is just one component in the
life factor. Furthermore the media/head life is also a function of
start/stops, as is the life of the spindle motor control circuit in most
smaller drives (startup current rush).

> 
> Let's remember that a lot of Winchesters are spec'ed with MTBF's of
> 10,000 hours or less.  There are 8760 hours in a year, so if you leave
> your Winchesters on 24 hours a day, you can expect the average one to
> fail after about 14 months.
> -- 

Most vendors don't spec the number of cold start cycles, the number of
hot start cycles, or the effects of thermal cycling on life. I don't think
very many drives will run over 1,000 hours of a power/thermal cycling
combination.

I think that a survey of 10 drives under continuous service compared to
10 drives under cycling of 1 hour on/off will result in a VERY skewed
comparison favoring continuous duty when plotted against operating time.

This cycling rate is not out of line, given most desk top micro usage
is for a very short interval.
-- 

John Bass (DBA:DMS Design)
DMS Design (System Design, Performance and Arch Consultants)
{dual,fortune,polyslo,hpda}!dmsd!bass     (805) 541-1575