hubert@entropy.UUCP (Steve Hubert) (04/22/86)
I wonder if anyone recognizes the following symptoms as symptoms of something concrete I can try to fix. We are running 4.3BSD on a VAX11/785. The disks are 3 RA81s on a single UDA. The uda device driver is version 6.12 from Berkeley (9/16/85) which seems to be equal to or derived from a DEC driver from January 84. I am not getting any kernel error messages at all. Here is symptom number 1:

% ls -l data
-rw-r--r-- 1 pcraig 4480000 Apr 17 11:19 data

% cmp data data
data data differ: char 1777665, line 30650

% cmp data data
data data differ: char 1654785, line 28531

% cmp data data
data data differ: char 1683457, line 28955

--------------

Here is symptom number 2:

% cc -DBSD4 -DUWASH -DTLOG -c ckwart.c
"ckwart.c", line 391: warning: undeclared initializer name fp
"ckwart.c", line 391: warning: undeclared initializer name nam
"ckwart.c", line 391: warning: undeclared initializer name cont
"ckwart.c", line 391: warning: undeclared initializer name siz
"ckwart.c", line 392: syntax error
"ckwart.c", line 396: cannot recover from earlier errors: goodbye!

% cc -DBSD4 -DUWASH -DTLOG -c ckwart.c
(ok this time, no changes to ckwart.c, it compiles correctly)

Something similar to this has happened several times.

-------------

All of our symptoms could be explained by bad reads. That is, if we don't always get the same data off the disk when we read it we would get the symptoms we're getting. However, we have never gotten any sort of disk read error messages on the console or anywhere else. Thanks.

Steve Hubert
Dept. of Stat., U. of Wash, Seattle
{decvax,ihnp4,ucbvax!lbl-csam}!uw-beaver!entropy!hubert
hubert%entropy@uw-beaver.arpa
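[Editor's sketch] The "cmp data data" experiment above can be automated so that it reports which part of the file changed between reads. The following is a minimal sketch, not anything from the original post: the function names, the 64 KB chunk size, and the choice of SHA-256 are all assumptions made for illustration. Note also that an operating system's buffer cache may satisfy repeated reads from memory, so a clean run does not prove that the disk path is healthy.

```python
import hashlib

CHUNK = 64 * 1024  # bytes per read; an arbitrary, assumed chunk size


def digest_file(path):
    """Read the file in CHUNK-sized pieces and return one SHA-256 digest per piece."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(CHUNK)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def compare_passes(path, passes=5):
    """Read the file `passes` times and compare every later pass with the first.

    Returns a list of (pass_number, byte_offset) pairs, one for each chunk
    whose digest disagrees with the first pass. An empty list means every
    pass returned identical data.
    """
    baseline = digest_file(path)
    mismatches = []
    for n in range(2, passes + 1):
        current = digest_file(path)
        if len(current) != len(baseline):
            # The file appeared to change length between passes.
            mismatches.append((n, min(len(current), len(baseline)) * CHUNK))
        for i, (a, b) in enumerate(zip(baseline, current)):
            if a != b:
                mismatches.append((n, i * CHUNK))
    return mismatches
```

Calling compare_passes("data") on a healthy system should return an empty list. Offsets that move around from run to run, as the cmp offsets do in the post, would point toward an intermittent fault rather than a bad spot on the disk.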
dyer@spdcc.UUCP (Steve Dyer) (04/25/86)
The problems you are having are well known to some of us, at least, who've suffered through them earlier. There are problems with the 785 data path boards and timings which cause such spurious and irreproducible errors. An ECO should be available to fix it. Before you receive it, you can type some sort of command at the console which runs the VAX CPU clock slower. It sort of makes your machine a 782.5. I wish I remembered the details, but this was all about 9 months ago. I advise you to contact your DEC field service office ASAP.

--
Steve Dyer
dyer@harvard.HARVARD.EDU
{bbncca,bbnccv,harvard,ima,ihnp4}!spdcc!dyer
dave@onfcanim.UUCP (Dave Martindale) (04/25/86)
In article <279@entropy.UUCP> hubert@entropy.UUCP (Steve Hubert) writes:
>I wonder if anyone recognizes the following symptoms as symptoms of
>something concrete I can try to fix. We are running 4.3BSD on a
>VAX11/785. The disks are 3 RA81s on a single UDA. The uda device
>driver is version 6.12 from Berkeley (9/16/85) which seems to be equal
>to or derived from a DEC driver from January 84. I am not getting any
>kernel error messages at all. Here is symptom number 1:
>
> [examples of cmp'ing a file with itself and getting non-repeatable errors,
> and C compiles which sometimes worked, sometimes not]

I had the same problem when installing our 780, and asked the disk controller vendor to swap controller boards (Emulex SC780, driving Eagles). The problem remained. The rest of the system passed DEC diagnostics, so I didn't know where to look next.

Then we started occasionally getting soft ECC errors. I like to keep the memory system error-free, so I figured out which memory array board the error was on and swapped it with another board, just to be sure. The error remained in the same place! So I swapped memory controllers, and the problem did move. (On the MS780-E memory system, there are two controllers, on either side of the central bus interface board).

So I pulled the bad controller entirely, the memory reverted to non-interleaved operation on the remaining half memory, and the mysterious data problems went away. DEC has since replaced the bad controller.

Moral of the story: a bad memory controller can mess up your data while still passing DEC diagnostics and without giving any sort of error. The memory ECC will catch bad RAM chips, and not much else. There are also a number of places in the CPU unprotected by parity checking where an intermittent hardware fault will damage data.
rick@nyit.UUCP (Rick Ace) (04/28/86)
> I wonder if anyone recognizes the following symptoms as symptoms of
> something concrete I can try to fix. We are running 4.3BSD on a
> VAX11/785. The disks are 3 RA81s on a single UDA. The uda device
> driver is version 6.12 from Berkeley (9/16/85) which seems to be equal
> to or derived from a DEC driver from January 84. I am not getting any
> kernel error messages at all. Here is symptom number 1:
>
> % ls -l data
> -rw-r--r-- 1 pcraig 4480000 Apr 17 11:19 data
>
> % cmp data data
> data data differ: char 1777665, line 30650
>
> % cmp data data
> data data differ: char 1654785, line 28531
>
> % cmp data data
> data data differ: char 1683457, line 28955
...
> All of our symptoms could be explained by bad reads. That is, if we
> don't always get the same data off the disk when we read it we would
> get the symptoms we're getting. However, we have never gotten any sort
> of disk read error messages on the console or anywhere else. Thanks.
>
> Steve Hubert
> Dept. of Stat., U. of Wash, Seattle
> {decvax,ihnp4,ucbvax!lbl-csam}!uw-beaver!entropy!hubert
> hubert%entropy@uw-beaver.arpa

Sounds like flaky hardware. Trouble is figuring out which piece of gear is the culprit. Here are some ideas:

1. The UDA50 is sick. See if Field Service will swap it for a spare and try your experiments again.

2. Another peripheral on the UNIBUS with the UDA50 is misbehaving and corrupting the data transfer between the UDA50 and the UBA. Try your experiment after removing all UNIBUS devices except the UDA50 (remember to install grant cards and NPG jumpers where necessary). I've seen a malfunctioning UNIBUS device make trouble for its neighbors before!

3. The UNIBUS DD11 backplane has a problem. This is a bit of a pain to troubleshoot unless you have a spare backplane. Or, if your backplane is in two or more sections, shorten it to one section and run the experiment, then try another one of the sections.

4. The UBA or the UNIBUS cable is malfunctioning. Again, ask Field Service to swap as much gear as they can.

5. Other unix wizards suggested possible problems in the KA785 CPU and the memory controllers; these are also suspect. Ask Field Service to check the revision level of your CPU hardware, and to apply any FCOs that you don't already have (i.e., get your money's worth for your service contract).

If I read the DEC PDP11 Bus Handbook correctly, it appears that the data lines on the UNIBUS are not parity-checked. This would explain why you're not seeing any diagnostic printf's from the kernel: UNIBUS data can get mangled undetectably on its journey from the UDA50 to the UBA.

The tried-and-true "swap it for a spare" approach is often the most expedient route to solving problems like yours.

For what it's worth, we're using an RA81/UDA50 on a Vax-11/780-5 (that's a CPU that was born a 780 but received a 785 CPU transplant later in life) under 4.2bsd with the RIACS UDA50 driver, so such a hardware configuration *can* work. Our UDA50 sits alone on its own UBA because it won't play nice with the boys on the other UBA, tho.

-----
Rick Ace
Computer Graphics Laboratory
New York Institute of Technology
Old Westbury, NY 11568
(516) 686-7644
{decvax,seismo}!philabs!nyit!rick
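[Editor's sketch] The point about unchecked data lines can be illustrated in general terms. The sketch below is a generic even-parity example over a 16-bit word, not a model of UNIBUS signaling or of any DEC hardware; the function names and the sample word are assumptions chosen for illustration. It shows that a path carrying a parity bit flags a single flipped bit, while a path with no check has nothing to compare against, so the bad word is accepted silently.

```python
def parity_bit(word):
    """Even-parity bit for a 16-bit word: 1 if the word has an odd number of 1 bits."""
    return bin(word & 0xFFFF).count("1") % 2


def send(word):
    """Return the 16-bit word together with its parity bit, as a checked path would carry it."""
    return word & 0xFFFF, parity_bit(word)


def check(word, parity):
    """True if the received word is consistent with the received parity bit."""
    return parity_bit(word) == parity


word, par = send(0o123456)   # hypothetical sample word, written in octal
flipped_one = word ^ 0b1     # one bit corrupted in transit
flipped_two = word ^ 0b11    # two bits corrupted in transit

print(check(word, par))         # intact word passes the check
print(check(flipped_one, par))  # single-bit error is detected
print(check(flipped_two, par))  # double-bit error slips past simple parity
```

Even where parity exists it only catches an odd number of flipped bits, which is consistent with the earlier observation in this thread that intermittent faults can damage data without producing any error message.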