[mod.computers.vax] A very strange 4.2 looping problem

phil@RICE.EDU (William LeFebvre) (10/27/85)
Over the past few days, one of our vaxes has exhibited very strange
symptoms.  We were finally able to track down the problem, but I hope
that by sending this message I can help keep others out there from
spending all afternoon in a machine room pulling out their hair (as I
did).

The particular machine is a VAX 11/750 running 4.2BSD.  It has two
unibusses (unibi?), containing:  [bus 0] IN-Card (connection to
Telenet) and UDA-50, [bus 1] two DZ-11s, an Able VMZ, and an Interlan
1010-A (ethernet controller).

Tuesday evening, the machine went into a loop in the kernel.  The loop
didn't look very tight (judging by PC samples taken by repeatedly
typing ^P and C on the console).  I didn't think much of it, rebooted
the machine and went home.  When I got in the next
day, I discovered that the machine had hung again shortly after it had
rebooted.  So, I tried to take a dump (drop -1 in the PC and 1F0000 in
the PSL).  It promptly said "panic: segmentation fault" and followed it
with a hex stack trace -- exactly what I expected it to do.  After
that, I expected it to sync and dump.  But it didn't:  it went right
back to its looping behavior.  I finally gave up and rebooted.  This
time, I waited (over 20 minutes with this machine).  It went all the
way thru the boot procedure, ran /etc/rc (and /etc/rc.local), starting
all the daemons and doing everything just fine.  Then it came up
multi-user.  A getty prompt appeared on the console, and it
*immediately* went into its loop again.  I mean:  I hit return right
after I got the "login:" prompt, and it "didn't do nuthin'".

"This is REAL peculiar!" I said.  So I put on my hardware hat and
started running diagnostics.  I ran every diagnostic I could find that
was appropriate, and some that weren't.  The only thing I couldn't find
a diagnostic documented for was the DZ-11 boards.  I said:  "Ahhh, it
couldn't be those anyway."  Every diagnostic ran with no problems.  A
call to Field Service got the response:  "it's probably one of those
foreign boards.  Why don't you try booting the system without them?" In
other words:  "we won't even think about solving your problem until you
can prove that it's our fault" (not a wholly unreasonable attitude to
take, just annoying).  "Well," I said, "it must be that flakey IN-Card.
It's given us many problems before."  So I pulled the IN-Card and
rebooted.  Same symptoms.  "Well, it must be the software that drives
the IN-Card."  So I booted single user (which worked just fine) and,
with the help of those more in the know about network things, disabled
all the stuff in rc.local that "turned on" that interface.  Then I
booted the rest of the way.  Same problem.

At this point, we were all stumped.  So, I moved the disk to another
machine (as unit 1) so we could look at the data out there:
specifically the kernel image.  Adb and the stack traces told us that
it was spending a lot of time in the DZ interrupt handler.  I said "Too
bad there isn't a DZ diagnostic."  My officemate replied, "But there
is.  I know there is.  There's got to be."  So, after much poking
around I found it (it's EVDAA for those who want to know -- and typing
"help dev dz11" *won't* tell you, either (unless our help files are
mangled (which might be the case))).  Turned out that indeed one of our
DZs failed diagnostics.  We pulled that board and, whadayaknow, the
system stopped looping and started behaving normally.

My best guess is that the board was generating interrupts at a furious
rate -- enough to keep the kernel so busy that it couldn't do anything
else.
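
For anyone who wants to catch this sort of thing in software, here is
a rough sketch of a per-device interrupt counter.  It is NOT code from
the 4.2BSD DZ driver: the names (intr_stat, intr_note, intr_check) and
the STORM_LIMIT threshold are made up for illustration, and it assumes
the clock interrupt still gets serviced while the board is
misbehaving, which I have not verified for this failure.

```c
#include <stdio.h>

/* Hypothetical threshold: interrupts tolerated per device per clock tick. */
#define STORM_LIMIT 1000

struct intr_stat {
    const char *name;       /* device name, used only in the warning */
    unsigned long count;    /* interrupts seen since the last clock tick */
    int storming;           /* set once the device has exceeded the limit */
};

/* Call from the device's interrupt handler: just count the interrupt. */
void intr_note(struct intr_stat *s)
{
    s->count++;
}

/*
 * Call once per clock tick.  Flags every device whose count exceeded
 * STORM_LIMIT during the tick, warns the first time a device is
 * flagged, and resets all counts.  Returns how many devices were over
 * the limit on this tick.
 */
int intr_check(struct intr_stat *tab, int n)
{
    int i, flagged = 0;

    for (i = 0; i < n; i++) {
        if (tab[i].count > STORM_LIMIT) {
            if (!tab[i].storming)
                printf("%s: %lu interrupts in one tick, suspect hardware\n",
                       tab[i].name, tab[i].count);
            tab[i].storming = 1;
            flagged++;
        }
        tab[i].count = 0;
    }
    return flagged;
}
```

With something like this in place, a console message naming the
offending board might have pointed at the DZ hours earlier.  What to
do once a device is flagged (mask it, ignore it, just complain) is a
separate question that the sketch does not try to answer.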

This failure was not all that unusual, but the symptoms were so
peculiar and downright unhelpful that I felt this message would be
appropriate and, hopefully, appreciated.  Forgive me for its length.

			William LeFebvre
			Department of Computer Science
			Rice University
			<phil@Rice.arpa>
			or, for the daring: <phil@Rice.edu>