[comp.sys.dec] 8800 crashing way too often

stergios@portia.Stanford.EDU (Stergios) (10/30/90)

We have an 8800 crashing at the rate of once every two days.  This has
been happening since January.  This is a big problem since we have
over 5000 active accounts on the system, and a total of 7700 accounts.
As you can imagine, we have quite a few very angry clients.

Quite a number of DEC people have been, and still are, looking into the
problem.  Every board has been replaced, and even a new BI bus has been
installed.  DEC software engineering is leaning towards a problem in the
MSCP code.

We've installed and run 8 different DEC-supplied debuggers inside the
kernel.  None of them ever tells us what the problem is, only what the
problem is not.  Progress, I suppose.

It originally took a couple of months to escalate the problem to the
point where we got attention.  Now we have attention to the point of
twice-weekly meetings with DEC sales staff regarding our 8800
crashing.  Lots-o-fun, but we still have a poorly performing machine.

There is talk of replacing our KDB50s with HSEs in the hope that the
problem will disappear.  This seems reasonable, I guess, but sounds
like a desperation move at this point.

Now we are starting to talk replacement systems (this is another story
altogether, probably worse, and I won't air that kind of laundry in
public) and DEC is pushing a 5500 at us.  I don't think the 5500's
Q-bus is going to take the beating our 8800 does.  We are currently
running a 5400 as an optional machine to the 8800, and the poor little
thing is choking.  I refuse to install Ada and a number of other
packages on it because of its performance so far under our
environment.  This does not make our clients any happier: a machine
not running the necessary software is not any better than a crashed
machine, and we have plenty of both.

Are there any other buses or solutions available on the 5500?  I'm
asking here because I've already been told "there is this neat way to
hook up an RA92 as a swap disk, avoiding the Q-bus, that gives an extra
M/s" by the sales types.  An extra M/s over the Q-bus is not going to
cut it for us.

What good is a maintenance contract? Are we being too lenient with DEC
by letting them drag this out as far as they have?

Any and all suggestions welcome.

sm
stergios@jessica.stanford.edu

grr@cbmvax.commodore.com (George Robbins) (10/31/90)

In article <STERGIOS.90Oct29193129@kt22.Stanford.EDU> stergios@jessica.stanford.edu writes:
> 
> We have an 8800 crashing at the rate of once every two days.  This has
> been happening since January.  This is a big problem since we have
> over 5000 active accounts on the system, and a total of 7700 accounts.
> As you can imagine, we have quite a few very angry clients.
...
> It originally took a couple of months to escalate the problem to the
> point where we got attention.  Now we have attention to the point of
> twice-weekly meetings with DEC sales staff regarding our 8800
> crashing.  Lots-o-fun, but we still have a poorly performing machine.
...
> What good is a maintenance contract? Are we being too lenient with DEC
> by letting them drag this out as far as they have?

I think the most important step would be to escalate the problem within
DEC until you reach the level where the players have enough discretion
to deal with your problem outside of normal channels.  You will probably
have to have someone up your hierarchy with a title or some claim to the
purse strings get involved.  Anyway, what you want is to force DEC to define
a plan to both resolve your problem and also ameliorate the pain while they
are doing so.

I'd be inclined to reject the 5400/5500 approach and insist they supply
a second similar machine, either another 8800 or 64X0 with suitable memory
and disks.  This should help pin down whether it's a hardware or software
problem, and then you can demand appropriate engineering or support talent.

Considerable dynamism on your part will be required, but it's better to
take the offensive than to stay mired in the low-level runaround.  DEC can't
afford to let large, vocal and visible embarrassments fester, but it's
up to you to elevate your problem to that category.

*** these are, of course, only personal notions and not legal advice ***

-- 
George Robbins - now working for,     uucp:   {uunet|pyramid|rutgers}!cbmvax!grr
but no way officially representing:   domain: grr@cbmvax.commodore.com
Commodore, Engineering Department     phone:  215-431-9349 (only by moonlite)

alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) (10/31/90)

In article <STERGIOS.90Oct29193129@kt22.Stanford.EDU>, stergios@portia.Stanford.EDU (Stergios) writes:
> 
> [ Customer has a VAX 8800 crashing very frequently. ]

	I have a VAX 8800 that crashed 96 days ago.  That's the
	last time it was down.  The time before that was 80 days.
	The I/O configuration is three VAXBIs with two KDB50s
	(2 RA90s each) and CIBCA with HSC70 and a bunch of disks.
	There's a DEBNI and DMB32 in there somewhere.  This kind
	of uptime seems to be typical for my system.  Use it for
	comparison purposes.
> 
> Quite a number of DEC people have been, and still are, looking into the
> problem.  Every board has been replaced, and even a new BI bus has been
> installed.  DEC software engineering is leaning towards a problem in the
> MSCP code.

	Is it the same error each time, or a different one?  Which
	one: panic, machine check, or "it stops"?  What version of
	ULTRIX are you running?  If V4.0, have you installed and
	booted the mandatory upgrade?  Any non-DEC devices on
	the VAXBI or KDB50s?  Is there a UNIBUS on the system?
	Does it have anything important on it?  Could it be
	replaced by a native VAXBI device?
> 
> We've installed and run 8 different DEC-supplied debuggers inside the
> kernel.  None of them ever tells us what the problem is, only what the
> problem is not.  Progress, I suppose.
> 
> It originally took a couple of months to escalate the problem to the
> point where we got attention.  Now we have attention to the point of
> twice-weekly meetings with DEC sales staff regarding our 8800
> crashing.  Lots-o-fun, but we still have a poorly performing machine.

	A couple of months feels too long to me, but it depends on
	the situation.
> 
> There is talk of replacing our KDB50s with HSEs in the hope that the
> problem will disappear.  This seems reasonable, I guess, but sounds
> like a desperation move at this point.

	Find the problem first.  There is one out there somewhere
	and it is findable.
> 
> Now we are starting to talk replacement systems (this is another story
> altogether, probably worse, and I won't air that kind of laundry in
> public) and DEC is pushing a 5500 at us.  I don't think the 5500's
> Q-bus is going to take the beating our 8800 does.  We are currently
> running a 5400 as an optional machine to the 8800, and the poor little
> thing is choking.  I refuse to install Ada and a number of other
> packages on it because of its performance so far under our
> environment.  This does not make our clients any happier: a machine
> not running the necessary software is not any better than a crashed
> machine, and we have plenty of both.

	Actually most of the interesting I/O on a DECsystem 5500 will
	stay off the Q-bus unless you insist upon using KDA50s for
	most of the disks.  A couple of gigabyte SCSI disks and DSSI
	disks should be very impressive.  A VAX 8800 is good for
	moving bits between disk and memory, but a well configured
	DECsystem 5500 should be able to do better.  You'll need more
	memory to make up for the VAX to RISC switch.
> 
> Are there any other buses or solutions available on the 5500?  I'm
> asking here because I've already been told "there is this neat way to
> hook up an RA92 as a swap disk, avoiding the Q-bus, that gives an extra
> M/s" by the sales types.  An extra M/s over the Q-bus is not going to
> cut it for us.

	There are three places to connect disks to a DECsystem 5500;
	one or more KDA50s on the Q-bus, the DSSI adapter and the
	SCSI adapter.  The only place >>>I<<< know of to connect an
	RA{anything} is the KDA50.  Find out what your sales critter
	is talking about.  If you go to a DECsystem 5500 you'll almost
	certainly want to switch from the RA{anything} to RFs or RZs,
	or at least move some of the I/O load off the RAs.
> 
> What good is a maintenance contract? Are we being too lenient with DEC
> by letting them drag this out as far as they have?

	I'll put it this way.  You've been very patient.  I wouldn't
	have been that patient.

	Of course it also depends on what level of support you have.
	An 8 hour a day, 5 days a week Basic support contract is a
	very different beast from 24x7 DECsupport.  Each contract
	has time limits for how long things are allowed to "drag out".
	I don't think any of them are months though.
> 
> Any and all suggestions welcome.
>

	Tell us more about the errors in the hopes that we might
	recognize the problem from previous experience.
> sm
> stergios@jessica.stanford.edu


-- 
Alan Rollow				alan@nabeth.enet.dec.com

pavlov@canisius.UUCP (Greg Pavlov) (11/01/90)

In article <STERGIOS.90Oct29193129@kt22.Stanford.EDU>, stergios@portia.Stanford.EDU (Stergios) writes:
> 
> We have a 8800 crashing at the rate of once every two days.  ....
> 
> Now we are starting to talk replacement systems (this is another story
> altogether, probably worse, and I won't air that kind of laundry in
> public) and DEC is pushing a 5500 at us.  I don't think the 5500's
> Q-bus is going to take the beating our 8800 does.  We are currently
> running a 5400 as an optional machine to the 8800, and the poor little
> thing is choking.  ....

  I would not base a decision about a 5500 on experiences with a 5400.
  I would push DEC to provide you with non-disclosure info on the 5500 if, in
  fact, it was not already announced (this week, the rumors have been saying..)

  On the other hand, I don't think that I would try accommodating your user load
  on one of these systems either  (then again, I don't think that I would 
  try to do so on an 8800...).

   greg pavlov, fstrf, amherst, ny

skl@odin.math.uiuc.edu (Soren Lundsgaard) (11/02/90)

This reminds me of a story that I heard where I used to work.  Said
company had a VAX-11/780, pretty normal setup, and was running 4.1
BSD.  This machine started crashing, often.  DEC comes in, replaces
something, still crashes.  This happens for a long time, and of
course, DEC blames 4.1 for the problem; it must be doing something wrong.
Please excuse me, but the details are sketchy.  They end up replacing
practically everything except the backplane, or something like that.
They try running VMS over the weekends with all the exercisers going (maybe
an exaggeration, but they did try VMS out).
Now, as everyone knows, aps and company worked quite closely with
Berkeley porting UNIX to the VAXes, probably one of the reasons why it
was so successful.  It turns out that part of the code in the system relied on
a certain specification in the hardware, to a closer tolerance than
VMS.  They finally fixed it, and 4.1 BSD was exonerated.
Please write if you would like the address of people who can tell you
more exact details about this.
And get this stuff fixed.  There is no excuse for letting DEC drag
their feet on this stuff.  Ask for a replacement 8800.

skl.

pavlov@canisius.UUCP (Greg Pavlov) (11/02/90)

In article <STERGIOS.90Oct29193129@kt22.Stanford.EDU>, stergios@portia.Stanford.EDU (Stergios) writes:
> 
> We have an 8800 crashing at the rate of once every two days.  This has
> been happening since January.  This is a big problem ...


  Now I am CERTAIN that you either a) have a UPS, or b) have had your
  incoming power supply monitored for at least a week recently...

  We had a similar experience several years ago with a venerable and well-
  loved DEC 2060.  It did not occur as frequently or continue as long as
  yours.  The problem?  A semi-loose, approx. 20-gauge, approx. 2" wire segment...


   greg pavlov, fstrf, amherst, ny.

dennis@aus.stanford.edu (Dennis Michael) (11/03/90)

In article <1908@shodha.enet.dec.com> alan@shodha.enet.dec.com ( Alan's Home for Wayward Notes File.) writes:
>In article <STERGIOS.90Oct29193129@kt22.Stanford.EDU>, stergios@portia.Stanford.EDU (Stergios) writes:
>> 
>> [ Customer has a VAX 8800 crashing very frequently. ]
>
>	I have a VAX 8800 that crashed 96 days ago.  That's the
>	last time it was down.  The time before that was 80 days.
>	The I/O configuration is three VAXBIs with two KDB50s
>	(2 RA90s each) and CIBCA with HSC70 and a bunch of disks.
>	There's a DEBNI and DMB32 in there somewhere.  This kind
>	of uptime seems to be typical for my system.  Use it for
>	comparison purposes.
>> 
>> Quite a number of DEC people have been, and still are, looking into the
>> problem.  Every board has been replaced, and even a new BI bus has been
>> installed.  DEC software engineering is leaning towards a problem in the
>> MSCP code.
>
>	Is it the same error each time, or a different one?  Which
>	one: panic, machine check, or "it stops"?  What version of
>	ULTRIX are you running?  If V4.0, have you installed and
>	booted the mandatory upgrade?  Any non-DEC devices on
>	the VAXBI or KDB50s?  Is there a UNIBUS on the system?
>	Does it have anything important on it?  Could it be
>	replaced by a native VAXBI device?

It is the same error every time - a trap type 8, segmentation fault.
The footprint in the crash dump is the same every time.  We are running
ULTRIX 3.1 with every fix we could find.  There are no non-DEC devices
on the machine, no UNIBUS.  There are also no modifications to the
ULTRIX kernel.  Everything is 'vanilla' DEC.

The problem occurs in the connection block between the MSCP code and
the drivers.  I quote from a problem statement we received from
ULTRIX Engineering: "The panics occurring at Stanford appear to be caused
by a flink (forward link) of a request packet getting corrupted while
the request packet sits on the active queue (queue of active requests)
of the connection block for the underlying device.  In each case,
the low byte of the flink is overwritten with the value '04' (this
is always the case where the corrupted flink was from a request packet
that was the only or last request packet queued in the active queue
and therefore had a flink pointing back to the active queue of the
connection block)."
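
To make that concrete for anyone who hasn't stared at one of these queues,
here is a rough sketch in C of the structure being described.  The type and
field names are my own guesses for illustration (not the actual ULTRIX MSCP
source), but the invariant is the one the problem statement describes:

/*
 * Illustrative sketch only: type and field names here are guesses,
 * not the real ULTRIX MSCP source.
 */
#include <stdio.h>
#include <stdint.h>

struct req_pkt {
    struct req_pkt *flink;          /* forward link                 */
    struct req_pkt *blink;          /* backward link                */
    /* ... command, buffer and status fields ... */
};

struct conn_blk {
    struct req_pkt active;          /* listhead of the active queue */
    /* ... other connection state ... */
};

/* The invariant from the problem statement: the last (or only) packet
 * on the active queue has a flink pointing back at the listhead.     */
static int last_pkt_ok(struct conn_blk *cb, struct req_pkt *last)
{
    return last->flink == &cb->active;
}

int main(void)
{
    struct conn_blk cb;
    struct req_pkt pkt;

    /* one request outstanding: listhead <-> pkt, circularly linked */
    cb.active.flink = &pkt;  cb.active.blink = &pkt;
    pkt.flink = &cb.active;  pkt.blink = &cb.active;

    printf("before: %s\n", last_pkt_ok(&cb, &pkt) ? "ok" : "corrupt");

    /* simulate what we see in the dumps: the low byte of the flink
     * is overwritten with the value 0x04                            */
    pkt.flink = (struct req_pkt *)
        (((uintptr_t)pkt.flink & ~(uintptr_t)0xff) | 0x04);

    printf("after:  %s\n", last_pkt_ok(&cb, &pkt) ? "ok" : "corrupt");
    return 0;
}

When the kernel later follows that corrupted flink to dequeue the packet, it
presumably dereferences garbage, which matches the trap type 8 footprint above.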

There are currently 4 possible explanations, and investigation is
continuing (and continuing, and continuing...).

Anyone seen this before?

>> 
>> [ mentioning replacement systems - particularly a 5500]
>>
>
>	Actually most of the interesting I/O on a DECsystem 5500 will
>	stay off the Q-bus unless you insist upon using KDA50s for
>	most of the disks.  A couple of gigabyte SCSI disks and DSSI
>	disks should be very impressive.  A VAX 8800 is good for
>	moving bits between disk and memory, but a well configured
>	DECsystem 5500 should be able to do better.  You'll need more
>	memory to make up for the VAX to RISC switch.

We are looking at a 5500 with SCSI disks and possibly a DSSI swap disk.
We will definitely avoid using KDA50s on the Q-bus.  Our memory configuration
will be 128MB.

>> sm
>> stergios@jessica.stanford.edu
>
>
>-- 
>Alan Rollow				alan@nabeth.enet.dec.com
>

Dennis Michael
dennis@jessica.stanford.edu

frank@croton.enet.dec.com (Frank Wortner) (11/07/90)

>   I would push DEC to provide you with non-disclosure info on the 5500 if, in
>   fact, it was not already announced (this week, the rumors have been saying..)

It was announced on October 31, and the machine is in the current Systems and
Options Catalog.  We can talk about it now.  ;-)

					Frank