[net.arch] MTBF of Cray-1's

baskett@decwrl.UUCP (Forest Baskett) (05/03/84)

[History as vaguely remembered by a participant.]

Los Alamos had serial one of the Cray-1 in 1976 for a six month evaluation
period.  During that six months, the machine was exercised in a variety of
ways in order to test performance and reliability.  One of the interesting
things that the testing uncovered was that, at 7000 feet above sea level, the
mean time between memory parity errors when running memory tests over the
whole eight megabytes was about 30 minutes!  This number was discovered early
in the 6 month period and continued to be confirmed as the months wore on
despite the attempts of the on site engineers to find something marginal in
the memory system.  This discovery seems to have led to the decision by
Cray Research Inc. to change the machine from parity main memory to error
corrected main memory.  Serial two was already built but never delivered.
Serial three was about three inches taller and a memory reference took 11
cycles instead of the original 10.  When it was delivered to Los Alamos we
discovered that the MTBF under the memory system tests was now about 6 hours.
That is, it took about 6 hours to get a double bit, uncorrectable memory error.
Ordinary programs would run the machine for more than 24 hours between double
bit errors.  This wasn't wonderful but it was considered acceptable because
the machine was so fast.  The on site and off site engineers continued to
look for something marginal in the memory system and I wrote the first
program to take the log of single bit error correction syndromes and
translate them into board number, chip number pairs and then sort the
resulting list by frequency of correction.  What that revealed was that some
chips would occasionally go thru a period of apparently marginal operation
where they generated a lot of one bit errors and then they would apparently
return to normal.  The on site engineers started taking this translated and
sorted log and using it as a guide for replacing "weak" memory chips during
the weekly preventive maintenance period.  The MTBF started going up.  By
1978, as I vaguely recall, it was on the order of one or two weeks.  We never
did figure out what was wrong with the "weak" memory chips but the key to the
success of this program did seem to be the assumption that the whole memory
chip was weak and that the location on the chip that failed was just noise
information (it was).  Serial one was sold by Cray Research to the British
government and I can easily believe it was made reliable by careful
attention to the memory system.  (I mentioned the altitude at Los Alamos
because of the cosmic ray theory :-)
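
A minimal sketch of the kind of tally described above: translate each logged
single bit correction into a (board, chip) pair and sort by frequency.  This is
an illustration in Python, not the original program.  The log record layout
(bank, bit, address) and the locate_chip() mapping are hypothetical; the real
mapping from syndrome to board and chip on the Cray-1 is not given in this post.

```python
from collections import Counter

def locate_chip(bank, bit):
    """Hypothetical mapping from a corrected bit to a (board, chip) pair.

    Assumption for illustration only: one board per bank and one chip per
    bit position.  The real machine's layout would replace this function.
    """
    return (bank, bit)

def rank_weak_chips(log):
    """Count corrections per chip, most frequently corrected chip first.

    Each log record is (bank, bit, address).  The address within the chip
    is deliberately ignored: the whole chip is treated as suspect.
    """
    counts = Counter(locate_chip(bank, bit) for bank, bit, _addr in log)
    return counts.most_common()

# Made-up sample log of single bit corrections: (bank, bit, address).
sample_log = [
    (3, 17, 0x0100),
    (3, 17, 0x2040),
    (5, 2, 0x0008),
    (3, 17, 0x7FFC),
    (5, 2, 0x0010),
    (9, 60, 0x0400),
]

ranking = rank_weak_chips(sample_log)
for (board, chip), n in ranking:
    print(f"board {board:2d} chip {chip:2d}: {n} corrections")
```

The chips at the top of such a list would be the candidates for replacement at
the next preventive maintenance period.  Dropping the address from the key is
the point of the exercise: corrections at different locations on the same chip
all count against that chip.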

Forest Baskett - Western Research Laboratory - Digital Equipment Corporation