v.wales@ucla-locus@sri-unix.UUCP (09/26/83)
From:            Rich Wales <v.wales@ucla-locus>
About a week ago, I sent a message to UNIX-Wizards asking for help in
tracking down a problem with zero UNIBUS interrupt vectors on our VAX
11/780 running 4.1BSD.  I received several replies, which I am summar-
izing below.  Unfortunately, we are still having the original problem.
Apparently I didn't explain the problem adequately in my first message.
I have been observing two different kinds of "zero vector" scenarios:
(1) A steady "trickle" of zero vectors (about 100-200 an hour) -- which
    I assume is due to the "grant stealing"/"passive release" behavior
    described below by Dave Martindale, and which therefore is probably
    nothing to worry about.
(2) An occasional "glitch", where the entire UBA seems to lock up for
    several seconds, registering hundreds of thousands of zero vectors
    in succession until the count finally exceeds 250K and the error
    code in dev/uba.c clears things up via a UBA reset.
We have, on the average, one of these "glitches" about every three days
-- though sometimes we have gone "clean" for two weeks, and other times
we have had three or four in a single day.  About a third of them are
associated with UBA error messages indicating UBSTO (UNIBUS Select
Timeout) conditions; the FUBAR register value in the error message has
generally pointed to one of our DZ's (but not the same DZ every time).
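For readers who want to picture the reset behaviour in scenario (2), here
is a small model of the counting logic. It is a sketch only, not the
4.1BSD code from dev/uba.c: the class name, the method name, and the
choice to restart the count after a reset are all assumptions made for
illustration. Only the 250K threshold comes from the description above.

```python
# Illustrative model only; this is not the 4.1BSD source.
ZERO_VECTOR_LIMIT = 250_000   # the "250K" threshold mentioned above


class ZeroVectorWatch:
    """Count zero vectors and request a reset once the limit is passed."""

    def __init__(self, limit=ZERO_VECTOR_LIMIT):
        self.limit = limit
        self.count = 0
        self.resets = 0

    def interrupt(self, vector):
        """Record one interrupt; return True if a reset was triggered."""
        if vector != 0:
            return False
        self.count += 1
        if self.count > self.limit:
            self.resets += 1
            self.count = 0      # assumption: the count restarts after a reset
            return True
        return False
```

In this model the slow "trickle" and a sudden "glitch" feed the same
counter, so a burst of zero vectors triggers the reset as soon as the
running total passes the limit.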
My specific questions right now are:
(1) Suppose I get a UBA error, like the following:
	    uba0: uba error sr=2<UBSTO> fmer=0 fubar=760104
    Does this mean that the device whose register space is cited in the
    FUBAR (e.g., 760104, which on our 780 is one of the registers on a
    DZ) is defective?  Or is this device simply the innocent victim of
    some other problem in the UNIBUS or the UBA?
(2) We do have some people around here who are experienced in 'scoping
    logic circuits, but none of us have ever tried to analyze a UNIBUS.
    Can anyone suggest (in reasonable detail) an approach which might
    locate the source of a "glitch" such as the ones we are having?
(3) Is there any way for the kernel to tell whether a zero vector is
    the result of a "passive release" (see Martindale's description
    below)?  Since VMS logs zero vectors (at least, our DEC CE says it
    does), I would think that the VMS error log would be swamped with
    zero vector reports unless there were some way of weeding out the
    red herrings.
    The BRRVR FULL bits in the UBA Status Register (see page 287 of the
    1982-83 VAX Hardware Handbook) sort-of sound like what I am looking
    for, but the handbook seems to say that these bits remain set only
    in case of an error.  (What's the story, Armando?)
(4) We have several 750's, and I have never seen any zero UNIBUS vec-
    tors on any of them (not even the "trickle" syndrome).  Is the
    750's UBA smarter than the 780's in this respect?
Here are summarized versions of the replies I have received to date,
with my comments.
-- Rich Wales <wales@UCLA-LOCUS>
-------
    >>  Zero UBA vector is caused by "grant stealing" in the 8647 chip
    >>  used in most DEC boards, including the DZ11.  If one of these
    >>  boards sees NPR and bus grant simultaneously, it seizes the
    >>  grant, and then gives it up (passive release) if it didn't want
    >>  it.  The UBA never receives a vector value, so it thinks it saw
    >>  a zero vector.  Zero vectors would thus be common on a machine
    >>  with both lots of UNIBUS DMA and devices which interrupt
    >>  frequently (such as DZ's).
    >>      Dave Martindale <decvax!utzoo!watmath!watcgl!dmmartindale>
    >>      (not a direct reply -- from an old UNIX-Wizards message I
    >>      found amongst my vast backlog of mail)
    This undoubtedly explains the "trickle", but not the "glitches".
-------
    >>  Pull out boards one at a time until the problem disappears.
    >>      Doug Gwyn <gwyn@BRL-VLD>
    >>      lacasse@RAND-UNIX
    >>      Peter Gross <hao!pag@SEISMO>
    Unfortunately, the 780 in question is a heavily used production
    machine, and the "glitches" occur at irregular and non-reproducible
    intervals -- so we cannot afford the luxury of running for days or
    weeks at a time without all the hardware in place.
-------
    >>  Find out what interrupt occurs right after a zero vector.
    >>      Greg Chesson <chesson%Shasta@SU-SCORE>
    I added a few lines to sys/locore.s to tally (in an array) inter-
    rupt vectors occurring right after a zero vector at the same IPL.
    (There were far, far too many of them to use a console "printf".)
    Just for fun, I am tallying vectors both before and after a zero
    vector -- in two 1-D arrays, not a 2-D array (at least for now).
    The "trickle" of vectors showed up distributed among all my UNIBUS
    devices, essentially in proportion to how heavily they were being
    used.  (I had previously installed other code to tally interrupts
    by device, so I knew how many inputs and outputs occurred on each
    DZ, DH/DM, etc.)  This behavior seems to agree quite well with
    Martindale's "grant stealing" analysis.
    A few non-zero vectors occurring right after zero vectors corres-
    ponded to DMA devices (SI disk and DH/DM output), but not many.
    After one of the "glitches", my "tallying" code showed me that I
    had had some 248,000 zero vectors followed by zero vectors.  In
    other words, the zero vectors must have come all in a row, without
    anything else intervening.  (The reason for 248,000 rather than
    250,000 is that the "trickle" had raised the total zero-vector
    count to about 2,000 before the "glitch".)
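The tallying scheme described above (two 1-D tallies, one for the vector
seen just before a zero vector and one for the vector seen just after)
can be sketched roughly as follows. This is not the sys/locore.s code,
which is not shown in this message; the function name and the list-based
trace are assumptions made for illustration.

```python
from collections import Counter


def tally_neighbours(vectors):
    """Tally the vectors seen just before and just after each zero vector.

    `vectors` is a list of interrupt vectors in arrival order, all taken
    at one IPL.  Returns two independent 1-D tallies: (before, after).
    """
    before = Counter()
    after = Counter()
    for prev, cur in zip(vectors, vectors[1:]):
        if cur == 0:
            before[prev] += 1   # vector that preceded a zero vector
        if prev == 0:
            after[cur] += 1     # vector that followed a zero vector
    return before, after
```

A run of consecutive zero vectors shows up as a large count under key 0
in the "after" tally, which is the signature reported for the "glitch".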
-------
    >>  Try moving the SI disk interface to the front of the UNIBUS.
    >>      Greg Chesson <chesson%Shasta@SU-SCORE>
    I did this, with no effect whatsoever on the system (i.e., I con-
    tinued to have "glitches" after moving the SI).
-------
    >>  Make sure every DMA interface on the UNIBUS has the NPR jumper
    >>  removed.
    >>      Rick Adams <rlgvax!ra@SEISMO>
    I have checked this, and they do.
    In any case, though, if I had a DMA interface with the NPR jumper
    still in place, I wouldn't think the device in question would work
    at all.
-------
    >>  DEC DZ's and ABLE DH/DM's are probably not at fault.
    >>      Bob Walsh <walsh@BBN-UNIX>
    >>  Culprit is probably the SI disk controller.
    >>      Sam Leffler <sam@BERKELEY>
    Assuming the SI controller is at fault, any ideas on how to prove
    it (keeping in mind that we depend on it and cannot run the system
    in production without it)?
-------
    >>  Put a scope on the UNIBUS.
    >>      Sam Leffler <sam@BERKELEY>
    We do have some people here who are experienced in 'scoping logic
    circuits, but none of our people have troubleshot a UNIBUS before.
    Can anyone suggest exactly what to monitor and how to do it?
hull@hao.UUCP (Howard Hull) (10/02/83)
doned bus requests, since an
	    interface that has been reset (probably by internal microcode)
	    will thereafter allow bus grants to pass freely.  The active
	    terminator will assert BUS SACK if it sees any request.  The
	    processor/arbitrator *will not* interpret this as a zero
	    interrupt vector *unless* somebody (anybody) asserts BUS INTR,
	    places a vector on the bus, and then drops away just after
	    negation of BUS SACK by the M9303.
	2. The "Grant Blocking"/"Grant Stealing"/"Passive Release" action
	    of controllers is not a common feature of all DMA interfaces.
	    They have to be deliberately jumpered to obtain it.  The event
	    occurs when the interface in question receives a *priority*
	    grant (not an NPG) at the level of its status reporting unit
	    and simultaneously sees NPR on the bus.  In order to reduce
	    the latency on behalf of the DMA interface requesting the bus
	    (whichever interface that might be) the priority interrupt
	    controller on the interface in question blocks the priority
	    grant from interfaces installed further down the bus, and
	    asserts BUS SACK.  When the arbitrator sees the SACK, it then
	    negates the priority grant.  The interface in question then
	    negates the SACK, and the arbitrator, seeing SACK negated in
	    absence of an assertion of BUS BBSY, BUS INTR or MSYN, asserts
	    NPG for the DMA device.  Obviously, the behavior of the bus will
	    depend on how close to the front of the bus (more in terms of the
	    number of interfaces than in terms of distance) the interface
	    in question has been installed.  Also, double obviously, this
	    is a supreme moment for another microcoded (and thus confused)
	    interface to assert (erroneously) BUS BBSY and BUS INTR without
	    even knowing what vector it wants to put on the data lines
	    (so *no vector*).
	3. A DMA interface can work EVEN IF THE NPG LINK HAS NOT BEEN CUT
	    provided that it is the last DMA device on the Unibus prior to
	    the terminator.  The only problem that may occur is due to
	    possible differences in timing between the interface's SACK
	    assertion, and the "me too" SACK of the active terminator.
	    The fun begins when an unsuspecting (thus foolish) installer
	    adds a new DMA interface to the bus behind the one with the
	    uncut link.  The poor fellow goes crazy trying to figure out
	    what is wrong with *his* installation that causes both DMA 
	    interfaces (whose data, addresses and vectors are going to be
	    "wire or'd") to fail all of the diagnostics.
	4. The best method I know of to "scope" a Unibus is to first check
	   all 56 of the Unibus levels with the computer in halt mode (this
	   will reduce the amount of bus activity to either none or nearly
	   none, depending on the processor and DMA devices installed).
	   Look for +3.4 to +3.7 volts for zeros, +0.3 volts for ones.
	   Anything "resting at 1.5 volts" is suspect.  ACLO and DCLO are
	   usually less than 3 volts, though.  Then start the system and
	   put the scope on BRx, BGx, BBSY, SACK, INTR, MSYN, SSYN one at
	   a time with the scope set up to see glitches of about 100 ns.
	   Detect these using AC sync, either + or - slope, and slowly
	   rotating the Trigger Level control through its range.  Turn
	   the trace brightness up far enough that you can see the faint
	   stuff near the trigger edge against the brighter regular cycles.
	   Normal Unibus activity will not have high-low-high or low-high-
	   low control signal transitions shorter than around 275 ns.
	   A Tek 475A scope is ideal for this sort of work because of
	   its fast write.
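If the edge times of one control signal can be captured (for example
with a logic analyzer rather than a scope), the 275 ns rule of thumb in
item 4 can be checked mechanically. The following is a minimal sketch
under that assumption; the function name and the input format are
hypothetical, and only the 275 ns figure comes from the text above.

```python
MIN_PULSE_NS = 275   # figure quoted above for normal control-signal pulses


def short_pulses(edges_ns, min_width_ns=MIN_PULSE_NS):
    """Return (start, width) for every pulse narrower than min_width_ns.

    `edges_ns` is a sorted list of times (in ns) at which one control
    signal changed level; each pair of consecutive edges bounds one pulse.
    """
    found = []
    for start, end in zip(edges_ns, edges_ns[1:]):
        width = end - start
        if width < min_width_ns:
            found.append((start, width))
    return found
```

Anything the function returns is a candidate glitch worth looking at on
the scope; an empty list means no pulse in the capture was narrower than
the threshold.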
So.
You're welcome in advance!     YT,  Howard Hull
 {ucbvax!hplabs | allegra!nbires | decvax!brl-bmd | harpo!seismo | menlo70}
       		        !hao!hull
dmmartindale@watcgl.UUCP (Dave Martindale) (10/03/83)
The "glitch" sounds like some device is grabbing the bus and then
refusing to release it.  The Select Time Out error means that the
unibus adapter couldn't get access to the unibus to do an NPR cycle to
read or write a unibus address.  The DZ's show up in the FUBAR probably
because they are polled 60 times a second by the clock interrupt, so if
the bus locks up they are almost certainly the first thing to be
referenced.
Finding out what happens during the lockup is an exercise in guessing
and experimentation.  Try to look at the state of the bus next time it
locks up (you'll have to disable the unibus reset code to prevent an
automatic reset...).  If a device is continually holding down BBSY,
that will be visible.  If something is holding down NPR, or pulling it
down very frequently, that will be visible too.  Assuming that
something is requesting NPR cycles continuously, you can tell which
device is accepting the grants (and thus probably doing the requesting)
by following the NPG line from slot to slot until you find the device
which is blocking it.
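The slot-by-slot NPG trace can be written down as a simple search. This
is only a sketch of the bookkeeping, assuming someone has probed each
slot and noted whether the grant was seen going in and coming out; the
function name, slot names, and tuple layout are hypothetical.

```python
def find_blocking_slot(slots):
    """Return the first slot that receives NPG but does not pass it on.

    `slots` is a list of (name, npg_in, npg_out) tuples in bus order,
    where the booleans record whether a grant was observed at each pin.
    Returns None if the grant reaches the end of the bus.
    """
    for name, npg_in, npg_out in slots:
        if npg_in and not npg_out:
            return name
    return None
```

For example, readings of [("slot1", True, True), ("slot2", True, False),
("slot3", False, False)] point at "slot2" as the device accepting the
grants.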