v.wales@ucla-locus@sri-unix.UUCP (09/26/83)
From: Rich Wales <v.wales@ucla-locus>
About a week ago, I sent a message to UNIX-Wizards asking for help in
tracking down a problem with zero UNIBUS interrupt vectors on our VAX
11/780 running 4.1BSD. I received several replies, which I am summar-
izing below. Unfortunately, we are still having the original problem.
Apparently I didn't explain the problem adequately in my first message.
I have been observing two different kinds of "zero vector" scenarios:
(1) A steady "trickle" of zero vectors (about 100-200 an hour) -- which
I assume is due to the "grant stealing"/"passive release" behavior
described below by Dave Martindale, and which therefore is probably
nothing to worry about.
(2) An occasional "glitch", where the entire UBA seems to lock up for
several seconds, registering hundreds of thousands of zero vectors
in succession until the count finally exceeds 250K and the error
code in dev/uba.c clears things up via a UBA reset.
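
The threshold behavior described in (2) can be sketched roughly as follows. This is an illustrative sketch only, not the actual code in dev/uba.c: the names `uba_stats` and `note_vector` are hypothetical, and clearing the count after a reset is an assumption. Only the 250K limit comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define ZV_LIMIT 250000L        /* the "250K" threshold quoted above */

/* Hypothetical per-adapter bookkeeping; not the kernel's real structure. */
struct uba_stats {
    long zero_vectors;          /* zero vectors counted so far */
    long resets;                /* forced adapter resets */
};

/*
 * Record one UNIBUS interrupt vector.  Returns true when the zero-vector
 * count has just exceeded the limit, meaning the caller should reset the
 * adapter.  Zeroing the count at that point is an assumption.
 */
static bool note_vector(struct uba_stats *s, unsigned vec)
{
    assert(s != NULL);
    if (vec != 0)
        return false;           /* ordinary interrupt, nothing to count */
    if (++s->zero_vectors <= ZV_LIMIT)
        return false;           /* still within the limit */
    s->zero_vectors = 0;
    s->resets++;
    return true;
}
```

Note that the count is cumulative: a slow "trickle" and a sudden burst both feed the same counter, so the limit says nothing about how quickly the zero vectors arrived.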
We have, on the average, one of these "glitches" about every three days
-- though sometimes we have gone "clean" for two weeks, and other times
we have had three or four in a single day. About a third of them are
associated with UBA error messages indicating UBSTO (UNIBUS Select
Timeout) conditions; the FUBAR register value in the error message has
generally pointed to one of our DZ's (but not the same DZ every time).
My specific questions right now are:
(1) Suppose I get a UBA error, like the following:
uba0: uba error sr=2<UBSTO> fmer=0 fubar=760104
Does this mean that the device whose register space is cited in the
FUBAR (e.g., 760104, which on our 780 is one of the registers on a
DZ) is defective? Or is this device simply the innocent victim of
some other problem in the UNIBUS or the UBA?
(2) We do have some people around here who are experienced in 'scoping
logic circuits, but none of us have ever tried to analyze a UNIBUS.
Can anyone suggest (in reasonable detail) an approach which might
locate the source of a "glitch" such as the ones we are having?
(3) Is there any way for the kernel to tell whether a zero vector is
the result of a "passive release" (see Martindale's description
below)? Since VMS logs zero vectors (at least, our DEC CE says it
does), I would think that the VMS error log would be swamped with
zero vector reports unless there were some way of weeding out the
red herrings.
The BRRVR FULL bits in the UBA Status Register (see page 287 of the
1982-83 VAX Hardware Handbook) sort-of sound like what I am looking
for, but the handbook seems to say that these bits remain set only
in case of an error. (What's the story, Armando?)
(4) We have several 750's, and I have never seen any zero UNIBUS vec-
tors on any of them (not even the "trickle" syndrome). Is the
750's UBA smarter than the 780's in this respect?
Here are summarized versions of the replies I have received to date,
with my comments.
-- Rich Wales <wales@UCLA-LOCUS>
-------
>> Zero UBA vector is caused by "grant stealing" in the 8647 chip
>> used in most DEC boards, including the DZ11. If one of these
>> boards sees NPR and bus grant simultaneously, it seizes the
>> grant, and then gives it up (passive release) if it didn't want
>> it. The UBA never receives a vector value, so it thinks it saw
>> a zero vector. Zero vectors would thus be common on a machine
>> with both lots of UNIBUS DMA and devices which interrupt
>> frequently (such as DZ's).
>> Dave Martindale <decvax!utzoo!watmath!watcgl!dmmartindale>
>> (not a direct reply -- from an old UNIX-Wizards message I
>> found amongst my vast backlog of mail)
This undoubtedly explains the "trickle", but not the "glitches".
-------
>> Pull out boards one at a time until the problem disappears.
>> Doug Gwyn <gwyn@BRL-VLD>
>> lacasse@RAND-UNIX
>> Peter Gross <hao!pag@SEISMO>
Unfortunately, the 780 in question is a heavily used production
machine, and the "glitches" occur at irregular and non-reproducible
intervals -- so we cannot afford the luxury of running for days or
weeks at a time without all the hardware in place.
-------
>> Find out what interrupt occurs right after a zero vector.
>> Greg Chesson <chesson%Shasta@SU-SCORE>
I added a few lines to sys/locore.s to tally (in an array) inter-
rupt vectors occurring right after a zero vector at the same IPL.
(There were far, far too many of them to use a console "printf".)
Just for fun, I am tallying vectors both before and after a zero
vector -- in two 1-D arrays, not a 2-D array (at least for now).
The "trickle" of vectors showed up distributed among all my UNIBUS
devices, essentially in proportion to how heavily they were being
used. (I had previously installed other code to tally interrupts
by device, so I knew how many inputs and outputs occurred on each
DZ, DH/DM, etc.) This behavior seems to agree quite well with
Martindale's "grant stealing" analysis.
A few non-zero vectors occurring right after zero vectors corres-
ponded to DMA devices (SI disk and DH/DM output), but not many.
After one of the "glitches", my "tallying" code showed me that I
had had some 248,000 zero vectors followed by zero vectors. In
other words, the zero vectors must have come all in a row, without
anything else intervening. (The reason for 248,000 rather than
250,000 is that the "trickle" had raised the total zero-vector
count to about 2,000 before the "glitch".)
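
For readers who want to try the same experiment, the two-array tally can be sketched in C along these lines. This is not the code added to sys/locore.s (which is assembler); the names are hypothetical, the table size assumes vectors 0 through 0774 in steps of 4, and the sketch ignores the "same IPL" condition for brevity.

```c
#include <assert.h>

#define NVEC 128                /* assumed: vectors 0..0774, multiples of 4 */

static long before_zero[NVEC];  /* vectors seen just before a zero vector */
static long after_zero[NVEC];   /* vectors seen just after a zero vector */
static unsigned prev_vec;       /* previous vector, valid if have_prev */
static int have_prev;

/* Tally one interrupt vector against its predecessor. */
static void tally_vector(unsigned vec)
{
    unsigned idx = vec >> 2;

    assert((vec & 3) == 0 && idx < NVEC);
    if (have_prev) {
        if (vec == 0)
            before_zero[prev_vec >> 2]++;   /* predecessor of a zero */
        if (prev_vec == 0)
            after_zero[idx]++;              /* successor of a zero */
    }
    prev_vec = vec;
    have_prev = 1;
}
```

With this layout, a run of zero vectors shows up as a large count in `after_zero[0]`, which corresponds to the "zero vectors followed by zero vectors" figure reported above.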
-------
>> Try moving the SI disk interface to the front of the UNIBUS.
>> Greg Chesson <chesson%Shasta@SU-SCORE>
I did this, with no effect whatsoever on the system (i.e., I con-
tinued to have "glitches" after moving the SI).
-------
>> Make sure every DMA interface on the UNIBUS has the NPR jumper
>> removed.
>> Rick Adams <rlgvax!ra@SEISMO>
I have checked this, and they do.
In any case, though, if I had a DMA interface with the NPR jumper
still in place, I wouldn't think the device in question would work
at all.
-------
>> DEC DZ's and ABLE DH/DM's are probably not at fault.
>> Bob Walsh <walsh@BBN-UNIX>
>> Culprit is probably the SI disk controller.
>> Sam Leffler <sam@BERKELEY>
Assuming the SI controller is at fault, any ideas on how to prove
it (keeping in mind that we depend on it and cannot run the system
in production without it)?
-------
>> Put a scope on the UNIBUS.
>> Sam Leffler <sam@BERKELEY>
We do have some people here who are experienced in 'scoping logic
circuits, but none of our people have troubleshot a UNIBUS before.
Can anyone suggest exactly what to monitor and how to do it?

hull@hao.UUCP (Howard Hull) (10/02/83)
doned bus requests, since an
interface that has been reset (probably by internal microcode)
will thereafter allow bus grants to pass freely. The active
terminator will assert BUS SACK if it sees any request. The
processor/arbitrator *will not* interpret this as a zero
interrupt vector *unless* somebody (anybody) asserts BUS INTR,
places a vector on the bus, and then drops away just after
negation of BUS SACK by the M9303.
2. The "Grant Blocking"/"Grant Stealing"/"Passive Release" action
of controllers is not a common feature of all DMA interfaces.
They have to be deliberately jumpered to obtain it. The event
occurs when the interface in question receives a *priority*
grant (not an NPG) at the level of its status reporting unit
and simultaneously sees NPR on the bus. In order to reduce
the latency in behalf of the DMA interface requesting the bus
(whichever interface that might be) the priority interrupt
controller on the interface in question blocks the priority
grant from interfaces installed further down the bus, and
asserts BUS SACK. When the arbitrator sees the SACK, it then
negates the priority grant. The interface in question then
negates the SACK, and the arbitrator, seeing SACK negated in
absence of an assertion of BUS BBSY, BUS INTR or MSYN, asserts
NPG for the DMA device. Obviously, the behavior of the bus will
depend on how close to the front of the bus (more in terms of the
number of interfaces than in terms of distance) the interface
in question has been installed. Also, double obviously, this
is a supreme moment for another microcoded (and thus confused)
interface to assert (erroneously) BUS BBSY and BUS INTR without
even knowing what vector it wants to put on the data lines
(so *no vector*).
3. A DMA interface can work EVEN IF THE NPG LINK HAS NOT BEEN CUT
provided that it is the last DMA device on the Unibus prior to
the terminator. The only problem that may occur is due to
possible differences in timing between the interface's SACK
assertion, and the "me too" SACK of the active terminator.
The fun begins when an unsuspecting (thus foolish) installer
adds a new DMA interface to the bus behind the one with the
uncut link. The poor fellow goes crazy trying to figure out
what is wrong with *his* installation that causes both DMA
interfaces (whose data, addresses and vectors are going to be
"wire or'd") to fail all of the diagnostics.
4. The best method I know of to "scope" a Unibus is to first check
all 56 of the Unibus levels with the computer in halt mode (this
will reduce the amount of bus activity to either none or nearly
none, depending on the processor and DMA devices installed).
Look for +3.4 to +3.7 volts for zeros, +0.3 volts for ones.
Anything "resting at 1.5 volts" is suspect. ACLO and DCLO are
usually less than 3 volts, though. Then start the system and
put the scope on BRx, BGx, BBSY, SACK, INTR, MSYN, SSYN one at
a time with the scope set up to see glitches of about 100 ns.
Detect these using AC sync, either + or - slope, and slowly
rotating the Trigger Level control through its range. Turn
the trace brightness up far enough that you can see the faint
stuff near the trigger edge against the brighter regular cycles.
Normal Unibus activity will not have high-low-high or low-high-
low control signal transitions shorter than around 275 ns.
A Tek 475A scope is ideal for this sort of work because of
its fast write.
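
The numbers in item 4 can be written down as a small checking aid, for instance when logging readings by hand. This is only a sketch: the function names are hypothetical and the 0.5 V upper bound for a "one" is an assumed tolerance around the +0.3 V figure. The other limits are taken from the text above.

```c
#include <assert.h>

enum level { LEVEL_ONE, LEVEL_ZERO, LEVEL_SUSPECT };

#define ONE_MAX_V   0.5     /* assumed tolerance around +0.3 V */
#define ZERO_MIN_V  3.4     /* from the text: +3.4 to +3.7 V for zeros */
#define ZERO_MAX_V  3.7
#define MIN_NORMAL_PULSE_NS 275   /* shortest normal control transition */

/*
 * Classify the resting voltage of one bus line with the CPU halted.
 * ACLO and DCLO usually rest below 3 V, so expect them to be reported
 * as LEVEL_SUSPECT here; treat them as a known exception.
 */
static enum level classify_idle_level(double volts)
{
    if (volts >= ZERO_MIN_V && volts <= ZERO_MAX_V)
        return LEVEL_ZERO;
    if (volts >= 0.0 && volts <= ONE_MAX_V)
        return LEVEL_ONE;
    return LEVEL_SUSPECT;   /* e.g. a line "resting at 1.5 volts" */
}

/* A high-low-high or low-high-low pulse shorter than ~275 ns is abnormal. */
static int pulse_is_glitch(unsigned width_ns)
{
    assert(width_ns > 0);
    return width_ns < MIN_NORMAL_PULSE_NS;
}
```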
So.
You're welcome in advance! YT, Howard Hull
{ucbvax!hplabs | allegra!nbires | decvax!brl-bmd | harpo!seismo | menlo70}
!hao!hull

dmmartindale@watcgl.UUCP (Dave Martindale) (10/03/83)
The "glitch" sounds like some device is grabbing the bus and then
refusing to release it. The Select Time Out error means that the
unibus adapter couldn't get access to the unibus to do an NPR cycle
to read or write a unibus address. The DZ's show up in the FUBAR
probably because they are polled 60 times a second by the clock
interrupt, so if the bus locks up they are almost certainly the first
thing to be referenced.

Finding out what happens during the lockup is an exercise in guessing
and experimentation. Try to look at the state of the bus next time it
locks up (you'll have to disable the unibus reset code to prevent an
automatic reset...). If a device is continually holding down BBSY,
that will be visible. If something is holding down NPR, or pulling it
down very frequently, that will be visible too.

Assuming that something is requesting NPR cycles continuously, you can
tell which device is accepting the grants (and thus probably doing the
requesting) by following the NPG line from slot to slot until you find
the device which is blocking it.
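
The slot-to-slot search for the device blocking NPG amounts to finding the first slot where the grant arrives but is not passed on. As a sketch of that bookkeeping, assuming you have written down for each slot (in bus order) whether NPG was seen going in and coming out, with all names here being hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* One scope observation per slot, listed in bus order from the UBA outward. */
struct slot_probe {
    int npg_in;     /* nonzero if the grant was seen arriving at the slot */
    int npg_out;    /* nonzero if the grant was seen leaving the slot */
};

/*
 * Return the index of the first slot that receives NPG but does not
 * pass it on, or -1 if every slot that sees the grant forwards it.
 */
static int find_blocking_slot(const struct slot_probe *slots, size_t n)
{
    assert(slots != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (slots[i].npg_in && !slots[i].npg_out)
            return (int)i;
    }
    return -1;
}
```

A result of -1 would suggest that the grant runs all the way down the bus, so the requester is probably not a device that is accepting grants in the normal way.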