[comp.arch] Speed is the one true performance metric

franka@mmintl.UUCP (Frank Adams) (11/11/86)

In article <107@pembina.alberta.UUCP> cdshaw@pembina.UUCP (Chris Shaw) writes:
>I suppose one could trade off reliability for speed, but most manufacturers
>realize that unreliable machines are extremely costly in service and annoyance
>time, and therefore manufacturers try to maximize the reliability.
>Unreliable machines are hard to sell.

A friend of mine who has used them tells me that when Crays go down, they
tend to be down for periods measured in weeks, not hours or days.  Now tell
me again that manufacturers don't trade off reliability and speed.

Frank Adams                           ihnp4!philabs!pwa-b!mmintl!franka
Multimate International    52 Oakland Ave North    E. Hartford, CT 06108

weemba@brahms (Matthew P Wiener) (11/15/86)


In article <1913@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>A friend of mine who has used them tells me that when Crays go down, they
>tend to be down for periods measured in weeks, not hours or days.

What utter nonsense.  A company that can afford a Cray can afford proper
prophylaxes, and keep it humming.  If management wants to cut corners,
they deserve what they get.

ucbvax!brahms!weemba	Matthew P Wiener/UCB Math Dept/Berkeley CA 94720

rpw3@amdcad.UUCP (Rob Warnock) (11/16/86)

In article <3576@utcsri.UUCP>, greg@utcsri.UUCP (Gregory Smith) writes:
> This is silly. Broken computers don't give wrong answers. They crash,
> or they log soft errors, or they act flaky. It is almost impossible to
> imagine a hardware fault that would have no visible effect other than
> to make the 'value' (whatever it may be) of the output wrong.

Hard to imagine? Maybe, but I've run into it, more than once. (Not a LOT,
you understand, but when you've been around a long time...) Besides, don't
you consider "wrong answers" to be "acting flaky"?

Anyway, try these on for size (the first two occurred on a PDP-10 I was
involved in administering circa 1970-1972):

1. A marginal core memory power supply (but the same thing could happen
   with RAMs) which caused bad data to be read ONLY when certain data sets
   were being processed. (Only two programs could cause the failure, and
   then only with certain inputs. One program was a cross-assembler which
   failed only when assembling certain versions [!] of a particular program;
   the other was an NMR simulation program, again, with certain inputs.)
   In each case, the problem occurred only after the program had consumed
   at least 5 minutes of CPU time.

   The failure affected NO other programs (that we could tell), including
   the operating system, and memory diagnostics did NOT find the problem!
   (The diagnostic patterns were worst-case for the *memories*; the bug was
   a pattern worst-case for the *power supplies*...)

2. A FORTRAN program which "occasionally" got "slightly different" answers
   when run with the same input data. Seems there was a leaky transistor
   driving a reset line to a register which held the number of the general
   register the floating-point result should go into. Certain programs would
   sometimes generate enough noise to cause this (already marginal) line to
   twitch, dumping the results into register 0 instead of the correct one.
   The program in question did a LOT of very involved matrix calculations
   (that NMR stuff, again), and the odds of the error making a big change
   in the answers were slim. (Caused a major panic when discovered... all
   the programs used in calculating published research results had to be
   re-run, to see whether retractions or corrections were needed.)

3. The infamous ARPAnet "black-hole", wherein an IMP had a memory failure
   whose only effect was to make the routing table entries return zero for
   a large number of hosts (it just happened that the bad memory was where
   the table lived). "Zero" meant "I'm directly connected", so when it told
   its neighbors this (during the normal exchange of routing info), they
   cheerfully sent all their packets to the confused IMP, who sent them back
   out... to IMPs who sent them back in...  [I hope I got the story right]
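The black-hole failure above can be sketched as one round of a toy distance-vector exchange. This is a minimal model, not the actual IMP routing code; the hosts, hop counts, and link costs are invented for illustration.

```python
# Toy model of the black-hole failure: one IMP's routing table memory
# fails to all zeros, "zero" means "directly connected", and after one
# round of advertisements every neighbor routes through the broken IMP.

def choose_routes(dests, advertised, link_cost):
    """Pick, per destination, the neighbor with the cheapest advertised path."""
    routes = {}
    for dest in dests:
        # cost via each neighbor = link cost + that neighbor's advertised cost
        cost_via = {n: link_cost[n] + table[dest]
                    for n, table in advertised.items()}
        routes[dest] = min(cost_via, key=cost_via.get)
    return routes

healthy_imp = {"A": 1, "B": 2, "C": 1, "D": 3}    # real hop counts
broken_imp  = {d: 0 for d in healthy_imp}         # table zeroed by bad memory

routes = choose_routes(healthy_imp,
                       {"healthy": healthy_imp, "broken": broken_imp},
                       {"healthy": 1, "broken": 1})
print(routes)   # every destination now points at the broken IMP
```

In the incident as told, the broken IMP then bounced the packets back out, and they looped between neighbors until the bad memory was found.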

Yes, parity-protected memory would have prevented this one, but that's
not always the case. Memories can fail into the all-ones condition, too,
and simple parity is not enough.
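The limits of simple parity are easy to demonstrate. The sketch below uses odd parity over an 8-bit byte (the widths and the sample values are made up): the all-ones failure, with the stored parity bit stuck at 1 along with the data, still checks out, and so does any even number of flipped bits.

```python
# Why simple parity is not enough: odd parity over an 8-bit byte.

def popcount(x):
    return bin(x).count("1")

def odd_parity_bit(data):
    """Choose the parity bit so data + parity has an odd number of ones."""
    return 0 if popcount(data) % 2 == 1 else 1

def check(data, parity):
    return (popcount(data) + parity) % 2 == 1

good = 0x5A
p = odd_parity_bit(good)
assert check(good, p)            # healthy word passes, as it should

# Memory fails to all ones: data = 0xFF AND the parity bit reads as 1.
print(check(0xFF, 1))            # True -- the failure is invisible

# A two-bit flip in the data also escapes detection.
print(check(good ^ 0b11, p))     # True
```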

4. A memory card address-decoder that was shorted, causing two banks of memory
   to be read at the same time (each got *written* with the correct data).
   Due to the fact that they collided at a TTL bus, as long as one bank had
   not been addressed "recently" (within a few microseconds), the correct bank
   won the "bus fight" (since the "older" bank's internal logic levels drifted
   up to TTL "high", and since a TTL "low" [usually] wins a fight with a "high").
   The normal memory diagnostics worked just fine, as did the simple address
   test. But when a certain user program was run, it made frequent references to
   both banks "quickly", causing bad data to be read. (Still, NOT necessarily
   causing parity errors! ...though they did occasionally occur.)
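The timing dependence in item 4 can be modeled in a few lines. This is only a sketch: the drift threshold and the data values are invented, and "low wins" is modeled as a bitwise AND of whatever both banks drive onto the bus.

```python
# Toy model of the shorted address decoder: two banks drive the same TTL
# bus, and a low bit wins the fight, so a read sees the bitwise AND of
# both banks' outputs. If the other bank has been idle long enough, its
# outputs drift to TTL high (all ones) and the AND is harmless; a quick
# back-to-back access pattern corrupts the read.

DRIFT_TIME_US = 5   # microseconds until an idle bank floats high (invented)

def bus_read(addr_bank, other_bank, idle_us):
    other = 0xFF if idle_us > DRIFT_TIME_US else other_bank
    return addr_bank & other   # TTL "low" wins each bit of the fight

# Same stored data, two access patterns:
slow = bus_read(0xC3, 0x3C, idle_us=50)   # other bank long idle
fast = bus_read(0xC3, 0x3C, idle_us=1)    # other bank just accessed

print(hex(slow))   # 0xc3 -- correct, so diagnostics pass
print(hex(fast))   # 0x0  -- corrupted only under the fast access pattern
```

A conventional memory diagnostic marches through addresses one at a time, so it only ever exercises the "slow" path above.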

> Of course, floating point hardware is a little different, since it
> is used only for numerical calculations which are part of the problem
> ( as opposed to the CPU alu which is also used for indexing, etc.)
> You can always arrange to run an FPU diagnostic every 5 mins if this
> is an issue.

In case #2 above, it didn't help. The floating-point diagnostics didn't
find the problem. The fault wasn't, in fact, in the floating-point
hardware per se, but in the very same CPU ALU used for indexing, etc.
It was just that there were very very few operations other than F.P.
which used that auxiliary "where should the result go?" register, and
none (that we ever knew of) other than the program in question which
generated the right pattern of noise to clear it WHEN IT WAS BEING USED.

Incidentally, problems #1 & #2 (occurred about a year apart) were eventually
solved when yours truly finally ignored the diagnostics (which was the only
thing the DEC serviceman had been trained to use), and got out an oscilloscope
and started probing around looking for something "not quite right". Both
errant signals showed up quite clearly as being "not right" on a 'scope,
though the systems passed all the diagnostics.

MORAL: "Testing can show the presence of bugs, but not their absence."
	[E. W. Dijkstra]

COROLLARY: I bet the same thing happens soon (if it hasn't already) inside
	   somebody's fancy new CPU chips...   And this time they won't
	   be able to just poke around with a scope, looking for "something".
	   The only solution will be to tell the customer, "Well, don't
	   run that program!"   ;-}


Rob Warnock
Systems Architecture Consultant

UUCP:	{amdcad,fortune,sun}!redwood!rpw3
DDD:	(415)572-2607
USPS:	627 26th Ave, San Mateo, CA  94403

kludge@gitpyr.gatech.EDU (Scott Dorsey) (11/17/86)

In article <1913@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
>A friend of mine who has used them tells me that when Crays go down, they
>tend to be down for periods measured in weeks, not hours or days.  Now tell
>me again that manufacturers don't trade off reliability and speed.

   When some machines go down, the manufacturer essentially comes out and
replaces the CPU board-by-board.  When DEC machines go down, DEC comes out
and uses software diagnostics to find the bad board, then goes into the truck
and pulls out a replacement.  Cray equipment is too expensive to keep such a
parts supply; you can't keep every possible board in the van like some other
companies do.  
   That's why it takes so long.  State of the art does not come without some
disadvantages.

-- 
Scott Dorsey
ICS Programming Lab (Where old terminals go to die),  Rich 110,
    Georgia Institute of Technology, Box 36681, Atlanta, Georgia 30332
    ...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!kludge

garry@batcomputer.tn.cornell.edu (Garry Wiegand) (11/17/86)

In article <3576@utcsri.UUCP>, greg@utcsri.UUCP (Gregory Smith) writes:
> This is silly. Broken computers don't give wrong answers. They crash,
> or they log soft errors, or they act flaky. It is almost impossible to
> imagine a hardware fault that would have no visible effect other than
> to make the 'value' (whatever it may be) of the output wrong.

Can't help donating my favorite horror story: there's a Vax upstairs that 
once upon a time had a little trouble with some very involved computations. 
After telling the user in question, for a couple weeks, to fix his program, 
we decided to investigate. Diagnostics ran fine. After much elaborate head-
scratching, we happened to ask it to plot out a nice sine curve. The curve
looked fine too - except for a spike or two. Our jaws dropped. We 
checked that plotting program five ways from Sunday, and it was OK. We 
pulled the FP card and the drop-outs vanished.

The machine was only two years old, so I guess that puts an upper bound on 
how long it had been giving wrong answers. :-)

garry wiegand   (garry%cadif-oak@cu-arpa.cs.cornell.edu)

henry@utzoo.UUCP (Henry Spencer) (11/17/86)

> COROLLARY: I bet the same thing happens soon (if it hasn't already) inside
> 	   somebody's fancy new CPU chips...

Arguably it has.  For quite a while the 32032 had an obscure (although
documented) bug reminiscent of your examples.  If I recall correctly, the
problem was that if a lot of the high bits in the stack pointer were 1s,
and an increment caused a carry to go up all the way and turn a lot of
those bits to 0s all at once, the increment gave the wrong result.  My
guess would be that the pattern-sensitivity indicates an analog electrical
problem rather than a logic error.

(I can't be sure I've got the details right, since I threw out my 32032
bug lists when I finally concluded that I was never going to use the chip
and didn't care about it any more, but it was something along those lines.)
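A bug of that shape can be modeled, and the model shows why ordinary diagnostics miss it. Everything below is hypothetical: the ripple threshold, the failure behavior (a dropped carry), and the sample stack pointer are all invented to illustrate pattern-sensitivity, not taken from the 32032 errata.

```python
# Hypothetical model of a carry-propagation bug: the increment comes out
# wrong only when the carry must ripple through more than MAX_RIPPLE
# consecutive 1 bits. Threshold and failure mode are invented.

MASK = 0xFFFFFFFF
MAX_RIPPLE = 12

def ripple_length(x):
    """How many consecutive low-order 1 bits an increment carries through."""
    n = 0
    while x & 1:
        n += 1
        x >>= 1
    return n

def buggy_increment(x):
    k = ripple_length(x)
    if k > MAX_RIPPLE:
        # the final carry bit is dropped mid-ripple
        return (x + 1) & MASK & ~(1 << k)
    return (x + 1) & MASK

# A diagnostic sweep of ordinary small values never triggers the bug...
assert all(buggy_increment(x) == x + 1 for x in range(4096))

# ...but a stack-pointer-like value with a long run of trailing ones does:
sp = 0xFFFFDFFF                    # 13 trailing 1 bits
print(hex(buggy_increment(sp)))    # 0xffffc000, not the correct 0xffffe000
```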
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

rcb@rti-sel.UUCP (Random) (11/18/86)

In article <1510@batcomputer.tn.cornell.edu> garry%cadif-oak@cu-arpa.cs.cornell.edu writes:
>In article <3576@utcsri.UUCP>, greg@utcsri.UUCP (Gregory Smith) writes:
>> This is silly. Broken computers don't give wrong answers. They crash,
>> or they log soft errors, or they act flaky. It is almost impossible to
>> imagine a hardware fault that would have no visible effect other than
>> to make the 'value' (whatever it may be) of the output wrong.
>
>Can't help donating my favorite horror story: there's a Vax upstairs that 
>once upon a time a little trouble with some very involved computations. 
>After telling the user in question, for a couple weeks, to fix his program, 
>we decided to investigate. Diagnostics ran fine. After much elaborate head-
>scratching, we happened to ask it to plot out a nice sine curve. The curve
>looked fine too - except for a spike or two. Our jaws dropped. We 
>checked that plotting program five ways from Sunday, and it was OK. We 
>pulled the FP card and the drop-outs vanished.
>

I guess that DEC floating point boards get real strange when they break.
We ran some benchmarks recently on some new hardware and our old 750.
The benchmarks on the 750 took twice as long as they had a year ago.
We examined everything we could and eventually determined that the
FP card had gone bad. The computer still recognized it as being there, but
all floating point instructions generated faults and I guess the CPU
interpreted the faults as a missing FP board and activated the emulation 
routines. So, no crashes, no wrong answers, just half speed. Real weird, huh?
-- 
					Random (Randy Buckland)
					Research Triangle Institute
					...!mcnc!rti-sel!rcb

wb8foz@ncoast.UUCP (11/19/86)

> Article <297@cartan.Berkeley.EDU> From: weemba@brahms (Matthew P Wiener)

| In article <1913@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
......discussion about how long Crays are down.....
The only one I was able to observe was repaired quickly.
BUT
There were 4 or 5 people assigned to it by Cray.
Said support cost over $500k/year, and that figure is
several years old.


-- 

		      decvax!cwruecmp!ncoast!wb8foz
			ncoast!wb8foz@case.csnet 
		(ncoast!wb8foz%case.csnet@csnet-relay.ARPA)

    	         		"SERIOUS?
		Bones, it could upset the entire percentage!"
		

chris@usc-oberon.UUCP (Christopher Ho) (11/21/86)

Two anecdotes about VAX '780s...

There was a story I heard a while ago where an early '780 had not-quite-
up-to-spec components, and (similar to the ns32k mentioned earlier)
increments of small negative numbers (e.g. -1) would yield unpredictable
results because the carry didn't propagate fast enough...

At one point we had a user on our '780 whose Fortran program would
generate access violations at a particular point in his program.  What
was interesting was that the very same binary wouldn't do this when run
on another '780 or our '750s.  Turned out to be the FP board, although
that took our 3rd party maintenance weeks to find out (the old swap 1
board, wait, try again, trick).

hes@ecsvax.UUCP (Henry Schaffer) (11/25/86)

In article <1709@ncoast.UUCP>, wb8foz@ncoast.UUCP (David Lesher) writes:
> > Article <297@cartan.Berkeley.EDU> From: weemba@brahms (Wimpy Math Grad Student)
> 
> | In article <1913@mmintl.UUCP> franka@mmintl.UUCP (Frank Adams) writes:
> ......discussion about how long Crays are down.....
> ...
> BUT
> There were 4 or 5 people assigned to it by Cray.
> Said support cost over $500k/year, and that figure is
> several years old.
> 		      decvax!cwruecmp!ncoast!wb8foz

What's wrong with that figure?  My rule of thumb is that hardware
maintenance is 10% of the list price of the equipment per year.  (The
common range is 6% - 12%.)  That would put the system price at upwards
of $5 million  - which seems about right.
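The arithmetic behind that estimate is a one-liner. A quick sketch, using the quoted $500k/year figure and the stated 6%-12% range:

```python
# Back-of-the-envelope check of the "maintenance is ~10% of list price
# per year" rule of thumb, applied to the quoted $500k/year support cost.

support_per_year = 500_000
for rate in (0.06, 0.10, 0.12):
    implied_price = support_per_year / rate
    print(f"{rate:.0%} rule -> system list price about ${implied_price:,.0f}")
```

At the 10% rule that is exactly $5 million; the 6%-12% range brackets it between roughly $4.2 million and $8.3 million.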

--henry schaffer  n c state univ