[comp.arch] Reliability of Crays

eugene@nike.uucp (Eugene Miya N.) (11/15/86)

> A friend of mine who has used them tells me that when Crays go down, they
> tend to be down for periods measured in weeks, not hours or days.  Now tell
> me again that manufacturers don't trade off reliability and speed.
> Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka

Frank (I still respect you)--
I don't think many people would be buying Crays (now measured in
hundreds) if this were so.  Typical Mean Time Between Failure (Hardware)
for most sites is 3 months.  Time to repair tends to be measured from
minutes to at most days.  You pay for this with some good maintenance
fees (however).  Your friend must have worked on some early machine.
Also, the newer multiprocessor Crays degrade "gracefully."  Our X-MP/4
had a CPU down for a day this week, but the other three ran fine.
Fast, expensive machines have to be more reliable.  This was learned
with the ILLIAC IV (whose MTBF started at 10 minutes in 1972 and had
reached 1 week by 1981).
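
For a rough feel of what those figures imply, here is a minimal sketch
using the usual steady-state formula, availability = MTBF / (MTBF + repair
time).  The numbers are assumptions for illustration only: "3 months" is
taken as about 90 days, and the three repair times are stand-ins for
"minutes", "days", and the "weeks" claimed in the quoted article.  None of
them are measured site data.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: fraction of time the machine is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

HOURS_PER_DAY = 24
# Assumption: "3 months" MTBF approximated as 90 days.
MTBF_HOURS = 90 * HOURS_PER_DAY

# Assumption: illustrative repair times, not figures from any real site.
REPAIR_TIMES = [
    ("30 minutes", 0.5),
    ("1 day", 1 * HOURS_PER_DAY),
    ("3 weeks", 21 * HOURS_PER_DAY),
]

for label, mttr_hours in REPAIR_TIMES:
    a = availability(MTBF_HOURS, mttr_hours)
    print(f"repair time {label:>10}: availability {a:.4f}")
```

Under these assumptions, repairs measured in minutes or a day keep the
machine up roughly 99% of the time or better, while a weeks-long repair
would cut that to around 80%.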

An aside: the first Cray delivered to Japan died about 3 months after
delivery.  It was packed up and sent home by the Japanese with a note
that this was "unacceptable."  Hopefully, some day we will have even more
reliable machines as the parts count continues to drop and densities go
higher and faster.

From the Rock of Ages Home for Retired Hackers:

--eugene miya
  NASA Ames Research Center
  eugene@ames-aurora.ARPA
  "You trust the `reply' command with all those different mailers out there?"
  {hplabs,hao,nike,ihnp4,decwrl,allegra,tektronix,menlo70}!ames!aurora!eugene