[net.unix-wizards] wizardly help needed

bruce@think.ARPA (Bruce Nemnich) (09/13/85)

We took shipment of a new 785 about two weeks ago, and I have been
having big problems getting it up.  At the moment the configuration
has 8Mb and two UBAs with only a UDA-50 (with 4 RA81s attached) on the
first and an Interlan ethernet card on the second.  I am trying to run
4.2bsd.  The system runs diagnostics without error.

There are two distinct symptoms.  The less frequent one is a system
hang; after 30 seconds or so the port lights on the RA81s will go out.
Nothing is echoed on the terminal (even interrupt chars, etc).  This
happens during the boot sequence, either right before or right after the
first single-user shell prompt.  The PC is within Swtch(), as I recall.

The 2nd symptom is that it takes a Segmentation Fault on a virtual
address which is always 3ffffffc or 5ffffffc; trouble is, that address
should never be referenced by the instruction it trapped on, which is
usually a push on the kernel stack.  When this happens while Unix is
running, it is usually in the syscall() routine before dispatching to
the appropriate system call routine.
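
For anyone who wants to pick those two addresses apart, here is a minimal
sketch that splits a VAX virtual address into its fields.  It assumes the
standard VAX layout (bits 31:30 select the region, bits 29:9 are the
virtual page number, bits 8:0 are the byte offset in a 512-byte page);
the function and table names are made up for illustration and are not
from any kernel source.

```python
# Standard VAX virtual address layout (an architectural assumption,
# not something stated in the post):
#   bits 31:30  region select
#   bits 29:9   virtual page number within the region
#   bits  8:0   byte offset within a 512-byte page
REGIONS = {
    0: "P0 (program)",
    1: "P1 (control)",
    2: "S0 (system)",
    3: "reserved",
}


def decode_vax_va(va):
    """Split a 32-bit VAX virtual address into (region name, VPN, offset)."""
    region = (va >> 30) & 0x3
    vpn = (va >> 9) & 0x1FFFFF
    offset = va & 0x1FF
    return REGIONS[region], vpn, offset


if __name__ == "__main__":
    # The two faulting addresses reported above.
    for va in (0x3FFFFFFC, 0x5FFFFFFC):
        name, vpn, offset = decode_vax_va(va)
        print(f"{va:08x}: region={name} vpn=0x{vpn:x} offset=0x{offset:x}")
```

Under that layout the first address is the last longword of P0 and the
second sits in the middle of P1; each is exactly 4 bytes below a round
boundary (0x40000000 and 0x60000000).  That is only an observation, not
a diagnosis.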

When it happens during the boot process, it happens a few instructions
into process 0 (after the ldpctx and rei for the first process and
before the call of main()).  The latter case results in recursive
traps (it traps in the trap handler) until the kernel stack is
exhausted, and then continues to recursively trap until the interrupt
stack is exhausted.  The result is an ?INT-STK INVALID message on the
console just after the "xxxxx+yyyyy+zzzzz start at 0xnnnn" message
printed by BOOT.

This error is persistent.  It will sometimes not happen for a couple
of days, but when it does crash (first case above), I often can't
reboot for hours thereafter (second case above).  Power-cycling the
machine (including memory and unibuses) doesn't help.  The one thing
which DOES often work to get it out of this mode (discovered
accidentally) is to physically remove the connection to the first
unibus by reseating the UBA or the paddle card in the back of the
machine.  Both paddle cards and UBA have been swapped without helping
the problem.  Even when it is in "failure mode" it passes diagnostics.

I have observed these under three versions of 4.2bsd: the current
version I am running on a 750, a two-year-old 4.2bsd distribution
tape, and a recent Ultrix distribution tape.  I have been running my
current version most of the time.

There is one more disk-related problem.  When the machine had been
working for two days, I decided to run some filesystem tuning
benchmarks (nothing sophisticated, just the ones in the "disk subsystem
choices" paper from Berkeley).  I found I was getting a maximum
throughput to one drive of under 200kb/s, which is terrible.  I ran
the same tests on a similarly configured 750 and got 400kb/s, which is
what I expected.  Putting the 750's UDA-50 in the 785 gave it full
performance (a little better than the 750; most things were
device-speed limited).  I had DEC give me a new UDA-50 for the 785,
but it gets the low throughput!  I didn't try the 785's UDA on the 750
yet.  I am running a uda driver from daves@riacs. I plan to do further
investigation on this one.

If anyone has seen problems like these, please let me know.  Both
DEC Field Service and I are baffled.

--bruce

mangler@CIT-VAX.ARPA (System Mangler) (09/13/85)

    The UDA-50 has a set of pluggable jumpers that set the "Unibus delay" -
the amount of time the UDA waits between DMA requests to give other devices
a chance at the bus.  This can be set for 0us, 6.7us, or 10us.  The burst
transfer rate at each setting is:
	0us	800 kilobytes/second
	6.7us	350 kilobytes/second
	10us	250 kilobytes/second
Some devices with very little buffering (RK07's and RL02's) will get data-
lates if competing with a UDA set to any but the slowest setting, so I guess
DEC has been tending to ship them set for 10us.  I found that an RL02
sharing a 750 Unibus with a UDA and an Interlan would get data lates
on ANY of the three settings.  (So we sold both the RL02 *and* the
UDA, and bought Eagles).
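
    As a rough cross-check against measured numbers, the sketch below lists
which delay settings could even in principle sustain a given throughput.
It assumes the "kb/s" figures quoted in the original post are kilobytes
per second and that sustained throughput cannot exceed the burst rate;
the names are hypothetical and only the three table values come from above.

```python
# Burst transfer rates from the table above, in kilobytes/second,
# keyed by the Unibus delay setting in microseconds.
BURST_KB_PER_S = {0.0: 800, 6.7: 350, 10.0: 250}


def settings_allowing(measured_kb_per_s):
    """Return the delay settings whose burst rate is at least the measured rate.

    Assumes sustained throughput can never exceed the burst rate.
    """
    return sorted(
        delay
        for delay, rate in BURST_KB_PER_S.items()
        if rate >= measured_kb_per_s
    )


if __name__ == "__main__":
    # Figures reported in the original post (assumed to be kilobytes/second).
    print("400 KB/s possible at delays (us):", settings_allowing(400))
    print("200 KB/s possible at delays (us):", settings_allowing(200))
```

A drive delivering 400 kilobytes/second is only possible with the 0us
setting, while a figure under 200 is consistent with all three, so the
slow number alone does not prove the jumpers are at 10us; it just does
not rule it out.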
    You might also try fiddling with "tunefs -d", which sets the rotational
distance between consecutive blocks of files.  When set optimally, the cpu
asks for the next block just as it comes under the heads.  Unfortunately,
if you use the default value, the cpu asks for the block just AFTER it
passes under the heads.  When we still had UDA-50's on our 750's I found
that the optimal value was a whole revolution - which you specify as 0,
to avoid forcing a track switch and the consequent quarter revolution
of head-switching delay that is built into the sector numbering.
(E.g., track 1 sector 0 is 1/4 revolution away from track 0 sector 0).
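
    Since tunefs -d takes its argument in milliseconds, it can help to see
what these fractions of a revolution amount to.  The sketch below assumes
a 3600 RPM spindle, which is an assumption about the RA81 and is not
stated in this thread; the function names are invented for illustration.

```python
RPM = 3600  # assumed spindle speed; not stated in the thread


def revolution_ms(rpm=RPM):
    """Time for one full revolution, in milliseconds."""
    return 60_000.0 / rpm


def head_switch_skew_ms(rpm=RPM, fraction=0.25):
    """Time for the quarter-revolution offset between consecutive tracks."""
    return revolution_ms(rpm) * fraction


if __name__ == "__main__":
    print(f"one revolution:    {revolution_ms():.2f} ms")
    print(f"quarter-rev skew:  {head_switch_skew_ms():.2f} ms")
```

At the assumed speed a full revolution is about 16.7 ms and the quarter
revolution about 4.2 ms, which gives a feel for how much a missed block
or a forced track switch costs per transfer.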
    Of course if you really need speed you don't use RA81's...

				Don Speck	speck@cit-vax.arpa