phil@RICE.EDU (William LeFebvre) (10/27/85)
Over the past few days, one of our vaxes has exhibited very strange symptoms. We were finally able to track down the problem, but I hope that by sending this message I can help keep others out there from spending all afternoon in a machine room pulling out their hair (as I did).

The particular machine is a VAX 11/750 running 4.2BSD. It has two unibusses (unibi?), containing: [bus 0] IN-Card (connection to Telenet) and UDA-50, [bus 1] two DZ-11s, an Able VMZ, and an Interlan 1010-A (ethernet controller).

Tuesday evening, the machine went into a loop in the kernel. It didn't look very tight (taking samples by repeating ^P and C). I didn't think much of it, rebooted the machine and went home. When I got in the next day, I discovered that the machine had hung again shortly after it had rebooted. So, I tried to take a dump (drop -1 in the PC and 1F0000 in the PSL). It promptly said "panic: segmentation fault" and followed it with a hex stack trace -- exactly what I expected it to do. After that, I expected it to sync and dump. But it didn't: it went right back to its looping behavior. I finally gave up and rebooted.

This time, I waited (over 20 minutes with this machine). It went all the way thru the boot procedure, ran /etc/rc (and /etc/rc.local), starting all the daemons and doing everything just fine. Then it came up multi-user. A getty prompt appeared on the console, and it *immediately* went into its loop again. I mean: I hit return right after I got the "login:" prompt, and it "didn't do nuthin'".

"This is REAL peculiar!" I said. So I put on my hardware hat and started running diagnostics. I ran every diagnostic I could find that was appropriate, and some that weren't. The only thing I couldn't find a diagnostic documented for was the DZ-11 boards. I said: "Ahhh, it couldn't be those anyway." Every diagnostic ran with no problems.

A call to Field Service got the response: "it's probably one of those foreign boards.
Why don't you try booting the system without them?" In other words: "we won't even think about solving your problem until you can prove that it's our fault" (not a wholly unreasonable attitude to take, just annoying).

"Well," I said, "it must be that flakey IN-Card. It's given us many problems before." So I pulled the IN-Card and rebooted. Same symptoms. "Well, it must be the software that drives the IN-Card." So I booted single user (which worked just fine) and, with the help of those more in the know about network things, disabled all the stuff in rc.local that "turned on" that interface. Then I booted the rest of the way. Same problem.

At this point, we were all stumped. So, I moved the disk to another machine (as unit 1) so we could look at the data out there: specifically the kernel image. Adb and the stack traces told us that it was spending a lot of time in the DZ interrupt handler.

I said, "Too bad there isn't a DZ diagnostic." My officemate replied, "But there is. I know there is. There's got to be." So, after much poking around I found it (it's EVDAA for those who want to know -- and typing "help dev dz11" *won't* tell you, either (unless our help files are mangled (which might be the case))). Turned out that indeed one of our DZs failed diagnostics. We pulled that board and, whadayaknow, the system stopped looping and started behaving normally.

My best guess is that the board was generating interrupts at a furious rate -- enough to keep the kernel so busy that it couldn't do anything else.

This failure was not all that unusual, but the symptoms were so peculiar and downright unhelpful that I felt this message would be appropriate and, hopefully, appreciated. Forgive me for its length.

William LeFebvre
Department of Computer Science
Rice University
<phil@Rice.arpa>
or, for the daring: <phil@Rice.edu>