[fa.info-vax] wizardly help needed

bruce@THINK.COM (Bruce Nemnich) (09/13/85)

We took shipment of a new 785 about two weeks ago, and I have been
having big problems getting it up.  At the moment the configuration
has 8Mb and two UBAs, with only a UDA-50 (with 4 RA81s attached) on the
first and an Interlan Ethernet card on the second.  I am trying to run
4.2bsd.  The system runs diagnostics without error.

There are two distinct symptoms.  The less frequent one is a system
hang; after 30 seconds or so the port lights on the RA81s will go out.
Nothing is echoed on the terminal (not even interrupt chars, etc).  This
happens during the boot sequence, either right before or right after the
first single-user shell prompt.  The PC is within Swtch() as I recall.

The second symptom is that it takes a Segmentation Fault on a virtual
address which is always 3ffffffc or 5ffffffc; trouble is, that address
should never be referenced by the instruction it trapped on, which is
usually a push on the kernel stack.  When this happens while Unix is
running, it is usually in the syscall() routine before dispatching to
the appropriate system call routine.
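
For anyone following along, here is a minimal sketch of how the two
faulting addresses break down under the standard VAX virtual address
layout: bits 31:30 select the region, bits 29:9 are the virtual page
number, and bits 8:0 are the byte offset within a 512-byte page.  It is
written in Python purely for illustration; the names REGIONS and
decode_vax_va are made up for this example and are not part of 4.2bsd.

```python
# VAX virtual address fields:
#   bits 31:30  region (00 = P0, 01 = P1, 10 = S0, 11 = reserved)
#   bits 29:9   virtual page number within the region
#   bits  8:0   byte offset within the 512-byte page
# REGIONS and decode_vax_va are illustrative names, not kernel symbols.
REGIONS = {
    0: "P0 (program)",
    1: "P1 (control)",
    2: "S0 (system)",
    3: "reserved",
}


def decode_vax_va(va):
    """Split a 32-bit VAX virtual address into (region, vpn, offset)."""
    region = (va >> 30) & 0x3
    vpn = (va >> 9) & 0x1FFFFF
    offset = va & 0x1FF
    return REGIONS[region], vpn, offset


for va in (0x3FFFFFFC, 0x5FFFFFFC):
    name, vpn, offset = decode_vax_va(va)
    print("%08x  region=%s  vpn=0x%06x  offset=0x%03x" % (va, name, vpn, offset))
```

Both addresses have bit 31 clear, so neither falls in system space (S0).
The first is the last longword of P0, and the second is the last
longword below the midpoint of P1.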

When it happens during the boot process, it happens a few instructions
into process 0 (after the ldpctx and rei for the first process and
before the call of main()).  The latter case results in recursive
traps (it traps in the trap handler) until the kernel stack is
exhausted, and then continues to recursively trap until the interrupt
stack is exhausted.  The result is an ?INT-STK INVALID message on the
console just after the "xxxxx+yyyyy+zzzzz start at 0xnnnn" message
printed by BOOT.

This error is persistent.  It will sometimes not happen for a couple
of days, but when it does crash (first case above), I often can't
reboot for hours thereafter (second case above).  Power-cycling the
machine (including memory and unibuses) doesn't help.  The one thing
which DOES often work to get it out of this mode (discovered
accidentally) is to physically remove the connection to the first
unibus by reseating the UBA or the paddle card in the back of the
machine.  Both paddle cards and UBA have been swapped without helping
the problem.  Even when it is in "failure mode" it passes diagnostics.

I have observed these under three versions of 4.2bsd: the current
version I am running on a 750, a two-year-old 4.2bsd distribution
tape, and a recent Ultrix distribution tape.  I have been running my
current version most of the time.

There is one more disk-related problem.  When the machine had been
working for two days, I decided to run some filesystem tuning
benchmarks (nothing sophisticated, just the ones in the "disk subsystem
choices" paper from Berkeley).  I found I was getting a maximum
throughput to one drive of under 200kb/s, which is terrible.  I ran
the same tests on a similarly configured 750 and got 400kb/s, which is
what I expected.  Putting the 750's UDA-50 in the 785 gave it full
performance (a little better than the 750; most things were
device-speed limited).  I had DEC give me a new UDA-50 for the 785,
but it gets the low throughput!  I didn't try the 785's UDA on the 750
yet.  I am running a uda driver from daves@riacs. I plan to do further
investigation on this one.
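
As a rough illustration of the kind of measurement involved (this is
not the Berkeley benchmark itself, just a sketch under my own
assumptions), the following Python function times a sequential read of
one file and reports kilobytes per second.  The function name
sequential_read_rate and the 8192-byte default block size are
hypothetical choices for the example.

```python
import time


def sequential_read_rate(path, block_size=8192):
    """Read `path` front to back; return (bytes_read, kilobytes_per_second).

    Illustrative only: the name and the default block size are assumptions,
    and the result includes any effect of the operating system's buffer cache.
    """
    total = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffering so each read() is one request
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(block_size)
            if not chunk:
                break
            total += len(chunk)
    elapsed = time.perf_counter() - start
    rate = (total / 1024.0) / elapsed if elapsed > 0 else float("inf")
    return total, rate
```

To say anything about the drive rather than the buffer cache, the file
read should be much larger than memory, or the test should be run
against the raw device.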

If anyone has seen problems like these, please let me know.  Both
DEC Field Service and I are baffled.

--bruce