[net.unix-wizards] Has anyone running 4.2BSD had similar problems?

sls@allegra.UUCP (11/11/83)

We recently brought up 4.2BSD on our 11/780. Our system only has
one disk at present, an RP07. Yesterday, we started getting 
soft errors on our / and /usr filesystems, followed by hard errors
(OPI, DVC, ECH, DCK). These errors (which were many) occurred over
a period of 10 minutes. The errors ceased, and the system stayed up
for about 30 minutes, and then crashed. / and /usr were totally
scrogged, but our user filesystem escaped unscathed. DEC tested out 
the drive and found absolutely nothing wrong with it. I booted
Unix from the distribution tape and restored /usr. We have had 
no errors of any kind since then.
Has anyone running 4.2 had a similar problem?

                 Susan Shaw
		 BTL - MH
		 allegra!sls

serge%ucbcory%berkeley@sri-unix.UUCP (11/15/83)

From:  Serge Granik <serge%ucbcory@berkeley>

	We are currently running 4.2 BSD UNIX on two 11/750s.  Both systems
seem to crash about once a day, often more.  The cause is often
"panic: Hard I/O error in swap".  Sometimes it's "panic: vgetu".  Any ideas?

							serge@ucbcory

alt@aids-unix@sri-unix.UUCP (11/19/83)

From:  Howard Alt <alt@aids-unix>

The people at utexas-780 (ut-sally) had a problem like that when
they first got the machine.  It was the RP07.  DEC had to run
diags on it for about 8 hours before the problems surfaced.
				Howard.

guy%ucla-locus@sri-unix.UUCP (11/22/83)

From:  Richard Guy <guy@ucla-locus>

This is a 4.1 tale, but I suspect it relates well to your 4.2 problem,
which I attribute to the swap code being overly sensitive to disk errors:

Here at UCLA we're running some dozen 750's with a variant of 4.1bsd.
Each system has a Fujitsu disk (160Mb or 450Mb), plus the inevitable RK07
disk.  (They were inevitable when we got the systems a year ago.)  For the
most part, we avoid using the RK07's whenever possible, since the controller
doesn't buffer data very well while waiting to grab the unibus.  To attempt
to deal with the problem, we recabled all the systems so that the RK07 is at
the physical/logical front of the bus--means the bus is 20' longer now, sigh.
(This helps because devices at the 'front' of the unibus have a slight edge
over other devices 'farther' away, when it comes to bus arbitration)
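
A minimal sketch of the 'front of the bus' effect, assuming every device
requests at the same priority level and the grant is passed down the chain
from the front. The function and device names are hypothetical, and real
arbitration also involves priority levels that are ignored here.

```python
def grant(bus_order, requesting):
    """Return the device that wins one arbitration round.

    bus_order:  device names listed from the front of the bus to the back.
    requesting: set of device names currently asking for the bus.
    The grant travels down the chain, so the first requester seen keeps it.
    """
    for dev in bus_order:
        if dev in requesting:
            return dev
    return None  # nobody asked for the bus this round


# With the RK07 at the front it wins a tie; at the back it waits its turn.
front = grant(["rk07", "dev_b", "dev_c"], {"rk07", "dev_c"})
back = grant(["dev_b", "dev_c", "rk07"], {"rk07", "dev_c"})
```

A controller that buffers poorly benefits from winning such ties, since it
has less time to spare before its data is lost.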

We finally ran out of swap space on the Fuji's, so we added two more swap
partitions on the RK07's.  To save time/effort, we enabled both partitions
for each system, all in the same day.  Within a week, each of our systems
was crashing at least once a day with 'panic: hard i/o error in swap'.  Turns
out the RK07 just can't seem to deliver the goods when it has multiple swap
partitions on the same spindle.  We backed off to using only one RK07 partition,
and our problems have been gone for 5 months now.  A better solution would have
been to beef up the code and have it retry at least once to get the data.
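
The retry idea can be sketched as follows. This is an illustration only, not
the 4.1bsd or 4.2BSD swap code: read_with_retry, HardIOError and FlakyDisk are
hypothetical names, and the simulated disk stands in for a drive that
occasionally fails to deliver a block.

```python
class HardIOError(Exception):
    """Stands in for a hard I/O error reported by a disk driver (assumed)."""


def read_with_retry(read_block, blkno, retries=1):
    """Call read_block(blkno), retrying up to `retries` more times on error."""
    attempts = retries + 1
    for attempt in range(attempts):
        try:
            return read_block(blkno)
        except HardIOError:
            if attempt == attempts - 1:
                raise  # out of retries: the caller decides whether to panic


class FlakyDisk:
    """Simulated disk that fails the first `failures` reads, then succeeds."""

    def __init__(self, failures):
        self.failures = failures
        self.calls = 0

    def read(self, blkno):
        self.calls += 1
        if self.calls <= self.failures:
            raise HardIOError("block %d" % blkno)
        return "contents of block %d" % blkno


disk = FlakyDisk(failures=1)
data = read_with_retry(disk.read, 7)  # first read fails, the retry succeeds
```

With a single retry, one transient failure is absorbed; a second consecutive
failure is still reported to the caller as a hard error.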


A question for those running a lot of RK07's:  How have they worked out for you?
Our experience has been a minor disaster. The basic problem is described above;
others had to do with pack unreliability--'DC' packs fall apart after three
months, so we replaced most with 'EF' (error-free) ones; they fall apart too, but
it takes six months.  ('Fall apart' means new bad sectors start appearing once a
week or more--real bad news if you're using it as a boot device!)  On the
positive side, DEC has been reasonably responsive about replacing the packs.
(all under maintenance, of course)  In summary, if we don't use the things, they
don't break. (very often)  As soon as they get any significant usage...they die.

richard