[net.unix-wizards] strange problems

hubert@entropy.UUCP (Steve Hubert) (04/22/86)

I wonder if anyone recognizes the following symptoms as symptoms of
something concrete I can try to fix.  We are running 4.3BSD on a
VAX11/785.  The disks are 3 RA81s on a single UDA.  The uda device
driver is version 6.12 from Berkeley (9/16/85) which seems to be equal
to or derived from a DEC driver from January 84.  I am not getting any
kernel error messages at all.  Here is symptom number 1:

% ls -l data
-rw-r--r--  1 pcraig     4480000 Apr 17 11:19 data

% cmp data data
data data differ: char 1777665, line 30650

% cmp data data
data data differ: char 1654785, line 28531

% cmp data data
data data differ: char 1683457, line 28955

--------------
Here is symptom number 2:

% cc -DBSD4 -DUWASH -DTLOG -c ckwart.c
"ckwart.c", line 391: warning: undeclared initializer name fp
"ckwart.c", line 391: warning: undeclared initializer name nam
"ckwart.c", line 391: warning: undeclared initializer name cont
"ckwart.c", line 391: warning: undeclared initializer name siz
"ckwart.c", line 392: syntax error
"ckwart.c", line 396: cannot recover from earlier errors: goodbye!

% cc -DBSD4 -DUWASH -DTLOG -c ckwart.c
  (ok this time, no changes to ckwart.c, it compiles correctly)

Something similar to this has happened several times.

-------------

All of our symptoms could be explained by bad reads.  That is, if we
don't always get the same data off the disk when we read it we would
get the symptoms we're getting.  However, we have never gotten any sort
of disk read error messages on the console or anywhere else.  Thanks.

Steve Hubert
 Dept. of Stat., U. of Wash, Seattle
 {decvax,ihnp4,ucbvax!lbl-csam}!uw-beaver!entropy!hubert
 hubert%entropy@uw-beaver.arpa
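
The cmp test above can be automated.  What follows is a minimal sketch in
Python, not something posted in the thread; the names digest_of and
repeat_read_check are made up for illustration.  It reads the same file
several times and collects the checksum of each pass.  On a healthy machine
every pass should agree; more than one distinct digest would point to the
same kind of non-repeatable read that cmp showed.

```python
import hashlib


def digest_of(path, chunk_size=65536):
    """Return the SHA-256 hex digest of the file at path, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            h.update(chunk)
    return h.hexdigest()


def repeat_read_check(path, passes=5):
    """Read the file `passes` times; return the set of distinct digests seen."""
    return {digest_of(path) for _ in range(passes)}
```

A result such as len(repeat_read_check("data")) == 1 means all passes
matched.  Repeated reads may be served from the system's buffer cache
rather than from the disk, so a file larger than the cache is probably a
better test subject.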

dyer@spdcc.UUCP (Steve Dyer) (04/25/86)

The problems you are having are well known to at least some of us
who have suffered through them earlier.

There are problems with the 785 data path boards and timings which
cause such spurious and irreproducible errors.  An ECO should be
available to fix it.  Before you receive it, you can type some sort
of command at the console which runs the VAX CPU clock slower.  It
sort of makes your machine a 782.5.  I wish I remembered the details,
but this was all about 9 months ago.

I advise you to contact your DEC field service office ASAP.
-- 
Steve Dyer
dyer@harvard.HARVARD.EDU
{bbncca,bbnccv,harvard,ima,ihnp4}!spdcc!dyer

dave@onfcanim.UUCP (Dave Martindale) (04/25/86)

In article <279@entropy.UUCP> hubert@entropy.UUCP (Steve Hubert) writes:
>I wonder if anyone recognizes the following symptoms as symptoms of
>something concrete I can try to fix.  We are running 4.3BSD on a
>VAX11/785.  The disks are 3 RA81s on a single UDA.  The uda device
>driver is version 6.12 from Berkeley (9/16/85) which seems to be equal
>to or derived from a DEC driver from January 84.  I am not getting any
>kernel error messages at all.  Here is symptom number 1:
>
> [examples of cmp'ing a file with itself and getting non-repeatable errors,
> and C compiles which sometimes worked, sometimes not]

I had the same problem when installing our 780, and asked the disk
controller vendor to swap controller boards (Emulex SC780, driving
Eagles).  The problem remained.  The rest of the system passed DEC
diagnostics, so I didn't know where to look next.

Then we started occasionally getting soft ECC errors.  I like to keep
the memory system error-free, so I figured out which memory array board
the error was on and swapped it with another board, just to be sure.
The error remained in the same place!  So I swapped memory
controllers, and the problem did move.  (On the MS780-E memory system,
there are two controllers, on either side of the central bus interface
board.)  So I pulled the bad controller entirely; the memory reverted
to non-interleaved operation on the remaining half of memory, and the
mysterious data problems went away.  DEC has since replaced the bad
controller.

Moral of the story: a bad memory controller can mess up your data while
still passing DEC diagnostics and without giving any sort of error.
The memory ECC will catch bad RAM chips, and not much else.
There are also a number of places in the CPU unprotected by parity
checking where an intermittent hardware fault will damage data.
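
One way to separate a memory-path fault from a disk-path fault, in the
spirit of the story above, is to exercise memory without touching the disk
at all.  The following is only a rough sketch in Python and no substitute
for real diagnostics: it fills a buffer with a known pattern, re-reads it
several times, and returns the offsets that do not match.  The name
pattern_check and its parameters are illustrative, and a user-level test
like this has no control over which physical memory board or controller
the buffer lands on.

```python
def pattern_check(size=1 << 20, passes=4, patterns=(0x55, 0xAA, 0x00, 0xFF)):
    """Fill a buffer with each pattern byte and verify it on repeated reads.

    Returns a list of (pattern, pass_number, offset) for every mismatch;
    an empty list means every read returned what was written.
    """
    failures = []
    for pat in patterns:
        buf = bytearray([pat]) * size
        expected = bytes([pat]) * size
        for n in range(passes):
            if buf != expected:
                # Slow path, taken only when something differs:
                # record each offset whose byte is not the pattern.
                failures.extend(
                    (pat, n, i) for i, b in enumerate(buf) if b != pat
                )
    return failures
```

An empty result proves little, since an intermittent fault can easily
stay quiet for a few passes; a non-empty result with the disk idle
would suggest looking at memory or the CPU before the disk controller.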

rick@nyit.UUCP (Rick Ace) (04/28/86)

> I wonder if anyone recognizes the following symptoms as symptoms of
> something concrete I can try to fix.  We are running 4.3BSD on a
> VAX11/785.  The disks are 3 RA81s on a single UDA.  The uda device
> driver is version 6.12 from Berkeley (9/16/85) which seems to be equal
> to or derived from a DEC driver from January 84.  I am not getting any
> kernel error messages at all.  Here is symptom number 1:
> 
> % ls -l data
> -rw-r--r--  1 pcraig     4480000 Apr 17 11:19 data
> 
> % cmp data data
> data data differ: char 1777665, line 30650
> 
> % cmp data data
> data data differ: char 1654785, line 28531
> 
> % cmp data data
> data data differ: char 1683457, line 28955

...

> All of our symptoms could be explained by bad reads.  That is, if we
> don't always get the same data off the disk when we read it we would
> get the symptoms we're getting.  However, we have never gotten any sort
> of disk read error messages on the console or anywhere else.  Thanks.
> 
> Steve Hubert
>  Dept. of Stat., U. of Wash, Seattle
>  {decvax,ihnp4,ucbvax!lbl-csam}!uw-beaver!entropy!hubert
>  hubert%entropy@uw-beaver.arpa

Sounds like flaky hardware.  Trouble is figuring out which piece of
gear is the culprit.  Here are some ideas:

1.  The UDA50 is sick.  See if Field Service will swap it for a spare
and try your experiments again.

2.  Another peripheral on the UNIBUS with the UDA50 is misbehaving and
corrupting the data transfer between the UDA50 and the UBA.  Try your
experiment after removing all UNIBUS devices except the UDA50 (remember
to install grant cards and NPG jumpers where necessary).  I've seen a
malfunctioning UNIBUS device make trouble for its neighbors before!

3.  The UNIBUS DD11 backplane has a problem.  This is a bit of a pain
to troubleshoot unless you have a spare backplane.  Or, if your
backplane is in two or more sections, shorten it to one section
and run the experiment, then try another one of the sections.

4.  The UBA or the UNIBUS cable is malfunctioning.  Again, ask Field
Service to swap as much gear as they can.

5.  Other unix wizards suggested possible problems in the KA785 CPU
and the memory controllers; these are also suspect.  Ask Field Service
to check the revision level of your CPU hardware, and to apply any
FCOs that you don't already have (i.e., get your money's worth for
your service contract).

If I read the DEC PDP11 Bus Handbook correctly, it appears that the
data lines on the UNIBUS are not parity-checked.  This would explain
why you're not seeing any diagnostic printf's from the kernel:
UNIBUS data can get mangled undetectably on its journey from the
UDA50 to the UBA.
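
To see why a missing parity check matters, here is a small Python sketch.
It is illustrative only: it models a generic 16-bit word with one
even-parity bit, not the actual UNIBUS signalling, and the function names
are invented.  With a parity bit, a single flipped bit is noticed at the
receiver; without one, the damaged word looks just like good data.

```python
def even_parity(word):
    """Return the bit (0 or 1) that makes the total count of 1 bits even."""
    return bin(word & 0xFFFF).count("1") % 2


def send(word):
    """Sender side: pair a 16-bit word with its parity bit."""
    return word & 0xFFFF, even_parity(word)


def received_ok(word, parity_bit):
    """Receiver side: True if the word still matches its parity bit."""
    return even_parity(word) == parity_bit


word, p = send(0x1234)
damaged = word ^ 0x0100          # one bit flipped in transit
print(received_ok(word, p))      # True: intact word passes the check
print(received_ok(damaged, p))   # False: single-bit damage is caught
```

Even with parity, two flipped bits in the same word cancel out and pass
the check, so parity narrows the window for silent corruption rather
than closing it.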

The tried-and-true "swap it for a spare" approach is often the most
expedient route to solving problems like yours.  For what it's worth,
we're using an RA81/UDA50 on a Vax-11/780-5 (that's a CPU that was
born a 780 but received a 785 CPU transplant later in life) under
4.2bsd with the RIACS UDA50 driver, so such a hardware configuration
*can* work.  Our UDA50 sits alone on its own UBA because it won't play
nice with the boys on the other UBA, tho.

-----
Rick Ace
Computer Graphics Laboratory
New York Institute of Technology
Old Westbury, NY  11568
(516) 686-7644

{decvax,seismo}!philabs!nyit!rick