[net.unix-wizards] A very strange 4.2 looping problem

phil@RICE.ARPA (William LeFebvre) (10/27/85)

Over the past few days, one of our vaxes has exhibited very strange
symptoms.  We were finally able to track down the problem, but I hope
that by sending this message I can help keep others out there from
spending all afternoon in a machine room pulling out their hair (as I
did).

The particular machine is a VAX 11/750 running 4.2BSD.  It has two
unibusses (unibi?), containing:  [bus 0] IN-Card (connection to
Telenet) and UDA-50, [bus 1] two DZ-11s, an Able VMZ, and an Interlan
1010-A (ethernet controller).

Tuesday evening, the machine went into a loop in the kernel.  It didn't
look very tight (taking samples by repeating ^P and C).  I didn't think
much of it, rebooted the machine and went home.  When I got in the next
day, I discovered that the machine had hung again shortly after it had
rebooted.  So, I tried to take a dump (drop -1 in the PC and 1F0000 in
the PSL).  It promptly said "panic: segmentation fault" and followed it
with a hex stack trace -- exactly what I expected it to do.  After
that, I expected it to sync and dump.  But it didn't:  it went right
back to its looping behavior.  I finally gave up and rebooted.  This
time, I waited (over 20 minutes with this machine).  It went all the
way thru the boot procedure, ran /etc/rc (and /etc/rc.local), starting
all the daemons and doing everything just fine.  Then it came up
multi-user.  A getty prompt appeared on the console, and it
*immediately* went into its loop again.  I mean:  I hit return right
after I got the "login:" prompt, and it "didn't do nuthin'".

"This is REAL peculiar!" I said.  So I put on my hardware hat and
started running diagnostics.  I ran every diagnostic I could find that
was appropriate, and some that weren't.  The only thing I couldn't find
a diagnostic documented for was the DZ-11 boards.  I said:  "Ahhh, it
couldn't be those anyway."  Every diagnostic ran with no problems.  A
call to Field Service got the response:  "it's probably one of those
foreign boards.  Why don't you try booting the system without them?" In
other words:  "we won't even think about solving your problem until you
can prove that it's our fault" (not a wholly unreasonable attitude to
take, just annoying).  "Well," I said, "it must be that flakey IN-Card.
It's given us many problems before."  So I pulled the IN-Card and
rebooted.  Same symptoms.  "Well, it must be the software that drives
the IN-Card."  So I booted single user (which worked just fine) and,
with the help of those more in the know about network things, disabled
all the stuff in rc.local that "turned on" that interface.  Then I
booted the rest of the way.  Same problem.

At this point, we were all stumped.  So, I moved the disk to another
machine (as unit 1) so we could look at the data out there:
specifically the kernel image.  Adb and the stack traces told us that
it was spending a lot of time in the DZ interrupt handler.  I said "Too
bad there isn't a DZ diagnostic."  My officemate replied, "But there
is.  I know there is.  There's got to be."  So, after much poking
around I found it (it's EVDAA for those who want to know -- and typing
"help dev dz11" *won't* tell you, either (unless our help files are
mangled (which might be the case))).  Turned out that indeed one of our
DZs failed diagnostics.  We pulled that board and, whadayaknow, the
system stopped looping and started behaving normally.

My best guess is that the board was generating interrupts at a furious
rate -- enough to keep the kernel so busy that it couldn't do anything
else.

This failure was not all that unusual, but the symptoms were so
peculiar and downright unhelpful that I felt this message would be
appropriate and, hopefully, appreciated.  Forgive me for its length.

			William LeFebvre
			Department of Computer Science
			Rice University
			<phil@Rice.arpa>
                        or, for the daring: <phil@Rice.edu>

kre@munnari.OZ (Robert Elz) (10/27/85)

In article <1504.phil.Dione@Rice>, phil@RICE.EDU (William LeFebvre)
described a problem that he found to be a broken DZ.  This is a fairly
old problem - the original 32V dz driver had a special hack in it to
work around this (though when it occurs it's not usually as serious
as you describe - it usually goes away after a reboot, for a while).

Some dz's seem to like to give transmit interrupts when the transmit
clock is off.  When the 4.x bsd driver gets a transmit interrupt and
there is no data to send, it simply disables the transmit clock.  If
you have one of these broken dz's, that doesn't do a lot of good.
Fortunately, all that's needed is to give it a character to send,
with the clock still off, and the dz generally happily goes back to
sleep for a while.  That's what the 32v driver did.
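
To make the difference between the two handler strategies concrete, here
is a minimal sketch in C.  It is a toy model, not code from either
driver: struct fake_dz, its fields, and every function name below are
hypothetical, and the "stuck" flag is only an assumption standing in for
whatever the broken board does internally.  The model also says nothing
about whether the fed character ever reaches the wire.

```c
#include <stdbool.h>

/* Hypothetical model of one DZ line; NOT the real register layout. */
struct fake_dz {
    bool line_enabled;  /* stands in for the per-line transmit enable */
    bool stuck;         /* assumed fault: interrupts even when disabled */
    unsigned fed;       /* characters handed to the transmitter */
};

/* Would the board raise a transmit interrupt right now? */
static bool dz_interrupt_pending(const struct fake_dz *dz)
{
    return dz->line_enabled || dz->stuck;
}

/* Assumption: giving the board a character clears the stuck condition. */
static void dz_feed_char(struct fake_dz *dz, char c)
{
    (void)c;            /* the value is irrelevant in this model */
    dz->fed++;
    dz->stuck = false;
}

enum strategy {
    DISABLE_ONLY,       /* behavior described for the 4.x bsd driver */
    DISABLE_AND_FEED    /* behavior described for the 32v driver */
};

/* Transmit interrupt handler for the case with no data queued. */
static void dz_xint_idle(struct fake_dz *dz, enum strategy s)
{
    dz->line_enabled = false;
    if (s == DISABLE_AND_FEED)
        dz_feed_char(dz, '\0');
}

/*
 * Service interrupts from a faulty board until it goes quiet or the
 * limit is reached; return how many interrupts were serviced.
 */
unsigned interrupts_until_quiet(enum strategy s, unsigned limit)
{
    struct fake_dz dz = { .line_enabled = true, .stuck = true, .fed = 0 };
    unsigned n = 0;

    while (n < limit && dz_interrupt_pending(&dz)) {
        dz_xint_idle(&dz, s);
        n++;
    }
    return n;
}
```

In this model, interrupts_until_quiet(DISABLE_ONLY, 1000) runs all the
way to the limit, which is the looping behavior, while
interrupts_until_quiet(DISABLE_AND_FEED, 1000) stops after a single
interrupt.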

I'm not sure why Berkeley deleted this hack, but I guess there aren't
that many broken dz's about.  I have one (of 7) that does this, sometimes
a dozen times a day, sometimes not for weeks.  I guess I should have DEC
replace it, but with the driver hack installed it doesn't do a lot
of harm.

When it first occurred here, it certainly helped that I knew of the
couple of lines of code in the 32v dz driver.

Robert Elz		seismo!munnari!kre   kre%munnari.oz@seismo.css.gov