v.wales@ucla-locus@sri-unix.UUCP (09/26/83)
From: Rich Wales <v.wales@ucla-locus>

About a week ago, I sent a message to UNIX-Wizards asking for help in
tracking down a problem with zero UNIBUS interrupt vectors on our VAX
11/780 running 4.1BSD. I received several replies, which I am
summarizing below. Unfortunately, we are still having the original
problem.

Apparently I didn't explain the problem adequately in my first message.
I have been observing two different kinds of "zero vector" scenarios:

(1) A steady "trickle" of zero vectors (about 100-200 an hour) -- which
I assume is due to the "grant stealing"/"passive release" behavior
described below by Dave Martindale, and which therefore is probably
nothing to worry about.

(2) An occasional "glitch", where the entire UBA seems to lock up for
several seconds, registering hundreds of thousands of zero vectors in
succession until the count finally exceeds 250K and the error code in
dev/uba.c clears things up via a UBA reset.

We have, on the average, one of these "glitches" about every three days
-- though sometimes we have gone "clean" for two weeks, and other times
we have had three or four in a single day. About a third of them are
associated with UBA error messages indicating UBSTO (UNIBUS Select
Timeout) conditions; the FUBAR register value in the error message has
generally pointed to one of our DZ's (but not the same DZ every time).

My specific questions right now are:

(1) Suppose I get a UBA error, like the following:

    uba0: uba error sr=2<UBSTO> fmer=0 fubar=760104

Does this mean that the device whose register space is cited in the
FUBAR (e.g., 760104, which on our 780 is one of the registers on a DZ)
is defective? Or is this device simply the innocent victim of some
other problem in the UNIBUS or the UBA?

(2) We do have some people around here who are experienced in 'scoping
logic circuits, but none of us have ever tried to analyze a UNIBUS.
Can anyone suggest (in reasonable detail) an approach which might
locate the source of a "glitch" such as the ones we are having?

(3) Is there any way for the kernel to tell whether a zero vector is
the result of a "passive release" (see Martindale's description below)?
Since VMS logs zero vectors (at least, our DEC CE says it does), I
would think that the VMS error log would be swamped with zero vector
reports unless there were some way of weeding out the red herrings.
The BRRVR FULL bits in the UBA Status Register (see page 287 of the
1982-83 VAX Hardware Handbook) sort-of sound like what I am looking
for, but the handbook seems to say that these bits remain set only in
case of an error. (What's the story, Armando?)

(4) We have several 750's, and I have never seen any zero UNIBUS
vectors on any of them (not even the "trickle" syndrome). Is the 750's
UBA smarter than the 780's in this respect?

Here are summarized versions of the replies I have received to date,
with my comments.

-- Rich Wales <wales@UCLA-LOCUS>

-------

>> Zero UBA vector is caused by "grant stealing" in the 8647 chip
>> used in most DEC boards, including the DZ11. If one of these
>> boards sees NPR and bus grant simultaneously, it seizes the
>> grant, and then gives it up (passive release) if it didn't want
>> it. The UBA never receives a vector value, so it thinks it saw
>> a zero vector. Zero vectors would thus be common on a machine
>> with both lots of UNIBUS DMA and devices which interrupt
>> frequently (such as DZ's).
>>     Dave Martindale <decvax!utzoo!watmath!watcgl!dmmartindale>
>>     (not a direct reply -- from an old UNIX-Wizards message I
>>     found amongst my vast backlog of mail)

This undoubtedly explains the "trickle", but not the "glitches".

-------

>> Pull out boards one at a time until the problem disappears.
>>     Doug Gwyn <gwyn@BRL-VLD>
>>     lacasse@RAND-UNIX
>>     Peter Gross <hao!pag@SEISMO>

Unfortunately, the 780 in question is a heavily used production
machine, and the "glitches" occur at irregular and non-reproducible
intervals -- so we cannot afford the luxury of running for days or
weeks at a time without all the hardware in place.

-------

>> Find out what interrupt occurs right after a zero vector.
>>     Greg Chesson <chesson%Shasta@SU-SCORE>

I added a few lines to sys/locore.s to tally (in an array) interrupt
vectors occurring right after a zero vector at the same IPL. (There
were far, far too many of them to use a console "printf".) Just for
fun, I am tallying vectors both before and after a zero vector -- in
two 1-D arrays, not a 2-D array (at least for now).

The "trickle" of vectors showed up distributed among all my UNIBUS
devices, essentially in proportion to how heavily they were being
used. (I had previously installed other code to tally interrupts by
device, so I knew how many inputs and outputs occurred on each DZ,
DH/DM, etc.) This behavior seems to agree quite well with Martindale's
"grant stealing" analysis.

A few non-zero vectors occurring right after zero vectors corresponded
to DMA devices (SI disk and DH/DM output), but not many.

After one of the "glitches", my "tallying" code showed me that I had
had some 248,000 zero vectors followed by zero vectors. In other
words, the zero vectors must have come all in a row, without anything
else intervening. (The reason for 248,000 rather than 250,000 is that
the "trickle" had raised the total zero-vector count to about 2,000
before the "glitch".)

-------

>> Try moving the SI disk interface to the front of the UNIBUS.
>>     Greg Chesson <chesson%Shasta@SU-SCORE>

I did this, with no effect whatsoever on the system (i.e., I continued
to have "glitches" after moving the SI).

-------

>> Make sure every DMA interface on the UNIBUS has the NPR jumper
>> removed.
>>     Rick Adams <rlgvax!ra@SEISMO>

I have checked this, and they do. In any case, though, if I had a DMA
interface with the NPR jumper still in place, I wouldn't think the
device in question would work at all.

-------

>> DEC DZ's and ABLE DH/DM's are probably not at fault.
>>     Bob Walsh <walsh@BBN-UNIX>
>> Culprit is probably the SI disk controller.
>>     Sam Leffler <sam@BERKELEY>

Assuming the SI controller is at fault, any ideas on how to prove it
(keeping in mind that we depend on it and cannot run the system in
production without it)?

-------

>> Put a scope on the UNIBUS.
>>     Sam Leffler <sam@BERKELEY>

We do have some people here who are experienced in 'scoping logic
circuits, but none of our people have troubleshot a UNIBUS before.
Can anyone suggest exactly what to monitor and how to do it?
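
The before/after tallying described in the reply to Greg Chesson can be sketched roughly as follows. This is only an illustration in C, not the code that was added to sys/locore.s (which is VAX assembly and is not shown in the message). All names here (tally_vector, after_zero, before_zero, ZV_LIMIT) are hypothetical, the sketch ignores the "same IPL" condition, and the 250000 limit is simply the figure quoted in the message.

```c
#include <stdio.h>

#define NVEC     128       /* assumption: UNIBUS vectors 0..0774 in steps of 4 */
#define ZV_LIMIT 250000L   /* zero-vector count quoted in the message */

static long after_zero[NVEC];   /* vectors seen right after a zero vector */
static long before_zero[NVEC];  /* vectors seen right before a zero vector */
static long zero_count;         /* total zero vectors seen so far */
static int  have_prev;          /* 0 until the first interrupt is recorded */
static unsigned prev_vec;       /* vector of the previous interrupt */

/* Map a vector to its tally slot (hypothetical layout: one slot per vector). */
static int slot_of(unsigned vec)
{
    return (int)((vec >> 2) % NVEC);
}

/*
 * Record one interrupt vector in the two 1-D tallies.
 * Returns 1 once the zero-vector count exceeds ZV_LIMIT, else 0.
 */
int tally_vector(unsigned vec)
{
    if (have_prev && prev_vec == 0)
        after_zero[slot_of(vec)]++;
    if (vec == 0) {
        zero_count++;
        if (have_prev)
            before_zero[slot_of(prev_vec)]++;
    }
    prev_vec = vec;
    have_prev = 1;
    return zero_count > ZV_LIMIT;
}

/* Print every non-empty slot of one tally array. */
void print_tally(const char *label, const long *tab)
{
    int i;

    for (i = 0; i < NVEC; i++)
        if (tab[i] != 0)
            printf("%s vector %03o: %ld\n", label, (unsigned)(i << 2), tab[i]);
}
```

In this sketch, a run of zero vectors with nothing intervening piles up in after_zero[0], which is the pattern the message reports after a "glitch".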
hull@hao.UUCP (Howard Hull) (10/02/83)
doned bus requests, since an interface that has been reset (probably by
internal microcode) will thereafter allow bus grants to pass freely.
The active terminator will assert BUS SACK if it sees any request. The
processor/arbitrator *will not* interpret this as a zero interrupt
vector *unless* somebody (anybody) asserts BUS INTR, places a vector on
the bus, and then drops away just after negation of BUS SACK by the
M9303.

2. The "Grant Blocking"/"Grant Stealing"/"Passive Release" action of
controllers is not a common feature of all DMA interfaces. They have to
be deliberately jumpered to obtain it. The event occurs when the
interface in question receives a *priority* grant (not an NPG) at the
level of its status reporting unit and simultaneously sees NPR on the
bus. In order to reduce the latency in behalf of the DMA interface
requesting the bus (whichever interface that might be), the priority
interrupt controller on the interface in question blocks the priority
grant from interfaces installed further down the bus, and asserts BUS
SACK. When the arbitrator sees the SACK, it then negates the priority
grant. The interface in question then negates the SACK, and the
arbitrator, seeing SACK negated in absence of an assertion of BUS BBSY,
BUS INTR or MSYN, asserts NPG for the DMA device. Obviously, the
behavior of the bus will depend on how close to the front of the bus
(more in terms of the number of interfaces than in terms of distance)
the interface in question has been installed. Also, double obviously,
this is a supreme moment for another microcoded (and thus confused)
interface to assert (erroneously) BUS BBSY and BUS INTR without even
knowing what vector it wants to put on the data lines (so *no vector*).

3. A DMA interface can work EVEN IF THE NPG LINK HAS NOT BEEN CUT
provided that it is the last DMA device on the Unibus prior to the
terminator.
The only problem that may occur is due to possible differences in
timing between the interface's SACK assertion and the "me too" SACK of
the active terminator. The fun begins when an unsuspecting (thus
foolish) installer adds a new DMA interface to the bus behind the one
with the uncut link. The poor fellow goes crazy trying to figure out
what is wrong with *his* installation that causes both DMA interfaces
(whose data, addresses and vectors are going to be "wire or'd") to fail
all of the diagnostics.

4. The best method I know of to "scope" a Unibus is to first check all
56 of the Unibus levels with the computer in halt mode (this will
reduce the amount of bus activity to either none or nearly none,
depending on the processor and DMA devices installed). Look for +3.4 to
+3.7 volts for zeros, +0.3 volts for ones. Anything "resting at 1.5
volts" is suspect. ACLO and DCLO are usually less than 3 volts, though.
Then start the system and put the scope on BRx, BGx, BBSY, SACK, INTR,
MSYN, SSYN one at a time with the scope set up to see glitches of about
100 ns. Detect these using AC sync, either + or - slope, and slowly
rotating the Trigger Level control through its range. Turn the trace
brightness up far enough that you can see the faint stuff near the
trigger edge against the brighter regular cycles. Normal Unibus
activity will not have high-low-high or low-high-low control signal
transitions shorter than around 275 ns. A Tek 475A scope is ideal for
this sort of work because of its fast write.

So. You're welcome in advance!

YT, Howard Hull
{ucbvax!hplabs | allegra!nbires | decvax!brl-bmd | harpo!seismo | menlo70} !hao!hull
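
The checks in item 4 can be written down as a small C sketch, which may help if the readings are being logged rather than eyeballed. The 3.4 to 3.7 volt band and the 275 ns figure come from the message; the 0.5 volt ceiling for a "one" is an assumed tolerance around the quoted +0.3 volts, and every name below is hypothetical. Note that ACLO and DCLO, which the message says usually rest below 3 volts, would be flagged by this simple classifier and need to be judged separately.

```c
#include <stddef.h>

/* Resting state of one Unibus line, using the levels quoted in the message. */
enum bus_level { LEVEL_ZERO, LEVEL_ONE, LEVEL_SUSPECT };

#define ZERO_MIN_V   3.4   /* from the message: zeros rest at +3.4 to +3.7 V */
#define ZERO_MAX_V   3.7
#define ONE_MAX_V    0.5   /* assumption: tolerance around the quoted +0.3 V */
#define MIN_PULSE_NS 275   /* from the message: shortest normal control pulse */

/* Classify the resting voltage of one line measured with the CPU halted. */
enum bus_level classify_level(double volts)
{
    if (volts >= ZERO_MIN_V && volts <= ZERO_MAX_V)
        return LEVEL_ZERO;
    if (volts >= 0.0 && volts <= ONE_MAX_V)
        return LEVEL_ONE;
    return LEVEL_SUSPECT;   /* e.g. a line resting near 1.5 V */
}

/* Count pulses narrower than the shortest normal control-signal pulse. */
size_t count_glitches(const unsigned *width_ns, size_t n)
{
    size_t i, glitches = 0;

    for (i = 0; i < n; i++)
        if (width_ns[i] < MIN_PULSE_NS)
            glitches++;
    return glitches;
}
```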
dmmartindale@watcgl.UUCP (Dave Martindale) (10/03/83)
The "glitch" sounds like some device is grabbing the bus and then
refusing to release it. The Select Time Out error means that the unibus
adapter couldn't get access to the unibus to do an NPR cycle to read or
write a unibus address. The DZ's show up in the FUBAR probably because
they are polled 60 times a second by the clock interrupt, so if the bus
locks up they are almost certainly the first thing to be referenced.

Finding out what happens during the lockup is an exercise in guessing
and experimentation. Try to look at the state of the bus next time it
locks up (you'll have to disable the unibus reset code to prevent an
automatic reset...). If a device is continually holding down BBSY, that
will be visible. If something is holding down NPR, or pulling it down
very frequently, that will be visible too. Assuming that something is
requesting NPR cycles continuously, you can tell which device is
accepting the grants (and thus probably doing the requesting) by
following the NPG line from slot to slot until you find the device
which is blocking it.
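
The slot-to-slot search for the device blocking NPG amounts to finding the first slot where the grant goes in but does not come out. A minimal C sketch of that search, assuming the readings have been written down as two arrays ordered from the front of the bus (the function name and the array layout are hypothetical, not taken from any kernel or diagnostic):

```c
/*
 * Walk the NPG daisy chain from the front of the bus.
 * npg_in[i] and npg_out[i] are nonzero when the grant was observed on
 * the input and output side of slot i (hypothetical probe readings).
 * Returns the index of the first slot that receives the grant but does
 * not pass it on, or -1 if no slot is seen blocking it.
 */
int find_npg_blocker(const int *npg_in, const int *npg_out, int nslots)
{
    int i;

    for (i = 0; i < nslots; i++)
        if (npg_in[i] && !npg_out[i])
            return i;
    return -1;
}
```

With readings in = {1,1,1,0,0} and out = {1,1,0,0,0}, the sketch points at slot 2, the device that is accepting the grants.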