kam@cs.utexas.edu (Katherine Minister Hosch) (08/01/89)
Howdy all. This has happened twice now, and I'm starting to get a little
worried. We have a standalone 4/260 that crashed the other day, leaving
only the message (on the console):
> watchdog reset
I rebooted, and ran fsck, but it didn't seem to have any big problems.
Today it crashed again, with the same console message. The messages file
didn't contain anything the first time, but the second time there was a
message from two days ago about a memory failure:
Jul 29 13:48:41 mars vmunix: mem3: soft ecc addr 4ce550 syn 5b<S16,S4,S2,S0,SX> 41 U1841
Somehow I suspect that the message is unrelated to the reset though.
My question is this: what *causes* a 'watchdog reset', other than pushing
the 'reset' button on the back of the machine? In neither case was the
button pushed.
This sounds like a bad problem to me; does anyone have any ideas about it?
Katherine Minister Hosch: kam@titan.tsd.arlut.utexas.edu
Applied Research Laboratories (512)-835-3148
University of Texas at Austin
P.O. Box 8029, Austin, TX 78713-8029
us214777@uc.msc.umn.edu (John C. Schultz) (08/19/89)
In article <718@brazos.Rice.edu> titan.tsd.arlut.utexas.edu!kam@cs.utexas.edu (Katherine Minister Hosch) writes:
>X-Sun-Spots-Digest: Volume 8, Issue 91, message 16 of 16
>
>Howdy all. This has happened twice now, and I'm starting to get a little
>worried. We have a standalone 4/260 that crashed the other day, leaving
>only the message (on the console):
>
> > watchdog reset
>

Don't know if this will help but...

Our SUN 3/160 gets a watchdog reset trying to boot if I do not have the VME box at the other end of the VME repeater (Bit 3 brand) powered up. Once the SUN has booted, I can power the remote VME system up and down.

Perhaps you might want to try removing extra hardware and then rearranging the boards in the backplane. Stranger things have worked.
ehrlich@cs.psu.edu (Daniel Ehrlich) (08/23/89)
In article <718@brazos.Rice.edu> titan.tsd.arlut.utexas.edu!kam@cs.utexas.edu (Katherine Minister Hosch) writes:

> Howdy all. This has happened twice now, and I'm starting to get a little
> worried. We have a standalone 4/260 that crashed the other day, leaving
> only the message (on the console):
> > watchdog reset

We have been seeing these on a regular basis on our `new' 4/280S. A question, is your 4/260 equipped with the FPU2 floating point daughter board? If the CPU occupies slots 1 and 2 it probably does. In any event our problem seems to be related to having the two 7053 disk controllers trying to access the VME bus at the exact same time.

> I rebooted, and ran fsck, but it didn't seem to have any big problems.
> Today it crashed again, with the same console message. The messages file
> didn't contain anything the first time, but the second time there was a
> message from two days ago about a memory failure:
> Jul 29 13:48:41 mars vmunix: mem3: soft ecc addr 4ce550 syn 5b<S16,S4,S2,S0,SX> 41 U1841
> Somehow I suspect that the message is unrelated to the reset though.
> My question is this: what *causes* a 'watchdog reset', other than pushing
> the 'reset' button on the back of the machine? In neither case was the
> button pushed.

What I have been told by the folks at Sun is that the 'Watchdog reset' occurs when there is a double bit parity error on the VME bus.

> This sounds like a bad problem to me; does anyone have any ideas about it?

Sun's standard response is to replace the CPU board. Although this may not be the real cause of the problem.

> Katherine Minister Hosch: kam@titan.tsd.arlut.utexas.edu
> Applied Research Laboratories (512)-835-3148
> University of Texas at Austin
> P.O. Box 8029, Austin, TX 78713-8029

--
Dan Ehrlich <ehrlich@shire.cs.psu.edu> | Disclaimer: The opinions expressed are
The Pennsylvania State University      | my own, and should not be attributed
Department of Computer Science         | to anyone else, living or dead.
University Park, PA 16802              |
rowe@cme.nist.gov (Walter Rowe) (08/31/89)
On 19 Aug 89 03:33:26 GMT, mmm!us214777@uc.msc.umn.edu (John C. Schultz) said:
> > watchdog reset
>
> Perhaps you might want to try removing extra hardware and then
> rearranging the boards in the backplane. Stranger things have
> worked.

I had this problem with a Sun 4/280, also, and it started to occur more frequently. Eventually, a memory card went bad. Once I replaced the memory board these crashes stopped happening.

Walter
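
Since a failing memory board turned out to be the culprit in this case, the soft ECC message in the original post may be worth tracking rather than dismissing. As a rough sketch only, the messages file could be scanned to count soft ECC reports per memory board. The pattern below is an assumption inferred from the single log line quoted in the original post, and `count_soft_ecc` is a hypothetical helper name, not a Sun tool.

```python
import re
from collections import Counter

# Pattern inferred from the one message quoted in this thread. Other SunOS
# releases or memory boards may log a different format (assumption).
SOFT_ECC = re.compile(
    r"vmunix: (mem\d+): soft ecc addr ([0-9a-fA-F]+) syn ([0-9a-fA-F]+)"
)


def count_soft_ecc(lines):
    """Return a Counter mapping memory board name (e.g. 'mem3') to hit count."""
    counts = Counter()
    for line in lines:
        match = SOFT_ECC.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# The only real sample available is the line from the original post.
sample = [
    "Jul 29 13:48:41 mars vmunix: mem3: soft ecc addr 4ce550 "
    "syn 5b<S16,S4,S2,S0,SX> 41 U1841",
]
for board, hits in count_soft_ecc(sample).most_common():
    print(board, hits)
```

Passing the open messages file instead of `sample` would give a per-board tally. A count that keeps climbing on one board would be a hint, not proof, that the board is on its way out.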
beau@ultra.com (Beau James {Manager - SW Devel - Ultra Networks}) (08/31/89)
In SunSpots v8n105, Daniel Ehrlich (ehrlich@cs.psu.edu) writes:
> My question is this: what *causes* a 'watchdog reset', other than pushing
> the 'reset' button on the back of the machine? In neither case was the
> button pushed.
>

A "Watchdog reset" occurs when the watchdog timer (a hardware circuit) on the CPU board detects that the processor is halted. The processor is restarted and vectored into the PROMs at the watchdog reset handler, which prints the "Watchdog reset" message on the system console.

At this point, the hardware maps and assorted other state have been reset, so there's no chance to go back to Unix to run the core dump subroutines in the kernel. Reboot and start over.

> What I have been told by the folks at Sun is that the 'Watchdog reset'
> occurs when there is a double bit parity error on the VME bus.
> ...
> Sun's standard response is to replace the CPU board. Although this may
> not be the real cause of the problem.

It can be a bad CPU board. Or it could be provoked by bad hardware - CPU, memory, or perhaps a peripheral. But it can also be a software problem.

Probably the most common cause of processor halts is double bus faults. That is, the processor gets a fault trap while processing a fault trap. The most common cause of this is overflowing the stack - especially the interrupt stack.

SunOS puts a guard page (invalid page) below the interrupt stack in kernel virtual address space. If the processor is in the middle of pushing an exception frame on the stack - perhaps a level 6 interrupt interrupting a level 5 interrupt interrupting a level 4 ... (you get the idea) - and the stack overflows into the guard page, that's a double bus fault. 'Taint nothing the thing can do but give up.
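
The guard-page overflow described above can be pictured with a toy model. This is only a sketch for illustration, not SunOS code: the page size, stack size, frame size, and the names `simulate_nested_interrupts` and `DoubleBusFault` are all made-up assumptions.

```python
# Toy model of the double bus fault described above. This is NOT SunOS code,
# and every size below is a made-up illustration value.
PAGE_SIZE = 8192      # assumed page size
STACK_PAGES = 1       # assumed interrupt stack size, in pages
FRAME_BYTES = 96      # assumed size of one exception frame


class DoubleBusFault(Exception):
    """The stack ran into the guard page while a trap was in progress."""


def simulate_nested_interrupts(handler_locals):
    """Model nested interrupts on a downward-growing interrupt stack.

    handler_locals lists the local stack bytes used by each nested handler,
    outermost first. Returns the bytes still free if everything fits.
    """
    guard_top = 0                     # addresses below 0 are the guard page
    sp = STACK_PAGES * PAGE_SIZE      # stack pointer starts at the top
    for level, local_bytes in enumerate(handler_locals, start=1):
        pushes = (("exception frame", FRAME_BYTES),
                  ("handler locals", local_bytes))
        for what, nbytes in pushes:
            sp -= nbytes
            if sp < guard_top:
                # Touching the guard page faults, and the frame for that
                # fault cannot be pushed either, so the processor halts.
                raise DoubleBusFault(
                    f"{what} overflowed at nesting level {level}")
    return sp - guard_top


print(simulate_nested_interrupts([100, 100]))   # two shallow handlers fit
try:
    simulate_nested_interrupts([2000] * 6)      # six greedy handlers do not
except DoubleBusFault as fault:
    print("watchdog reset:", fault)
```

The point of the model is that neither deep nesting nor large local frames is fatal alone. It is the worst-case combination that has to fit above the guard page.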
Software can contribute to the problem by
- not allocating a worst-case-size interrupt stack
- recursing in an interrupt handler
- using too big a local stack frame in an interrupt handler
- reenabling interrupts before exiting from the interrupt handler
and similar sorts of screwup.

By the way, there's no parity on the VME bus, so the idea of a "double bit parity error on the VMEbus" is nonsense. Someone might have been referring to "double bit ECC error on the memory bus"; that results in another type of hardware interrupt. Normally, Unix will catch it and panic. If the processor overflows the stack while trying to take the ECC trap, then it's double bus fault time, as described above.

Misbehaving hardware can cause other types of interrupts as well, for example timeout on access to the VMEbus. These all turn into processor traps that try to push exception frames on the stack.

The comment in your followup mail:

> The "Watchdog reset" errors seem to occur when both 7053 disk controllers
> are busy. One can usually generate a "Watchdog reset" in single user mode
> by running fsck(8) in parallel on disks attached to the two controllers.

unfortunately doesn't help resolve whether it's bad hardware or a software bug. The only way to deterministically figure out which is to blame is to hook up a bus analyzer. Non-deterministically, the usual procedure is to swap hardware until it seems probable that the problem is generic rather than a sample defect.

Beau James                      beau@Ultra.COM
Ultra Network Technologies      {sun,ames}!ultra.com!beau
poffen@sj.ate.slb.com (Russ Poffenberger) (09/01/89)
In article <1074@brazos.Rice.edu> ehrlich@cs.psu.edu (Daniel Ehrlich) writes:
>X-Sun-Spots-Digest: Volume 8, Issue 105, message 9 of 12
>
>In article <718@brazos.Rice.edu> titan.tsd.arlut.utexas.edu!kam@cs.utexas.edu (Katherine Minister Hosch) writes:
>
>Howdy all. This has happened twice now, and I'm starting to get a little
>worried. We have a standalone 4/260 that crashed the other day, leaving
>only the message (on the console):
>
> > watchdog reset
>
<Stuff deleted>
>My question is this: what *causes* a 'watchdog reset', other than pushing
>the 'reset' button on the back of the machine? In neither case was the
>button pushed.
>

I have found that a watchdog reset can also occur in the case of other VME errors, such as a board generating an interrupt but not responding to the interrupt acknowledge (IACK, controlled by jumper PX04 on the backplane), or a similar event relating to Bus Grant. If a board requests the bus but does not respond when it is granted (controlled by jumper PX03 on the backplane), you will get this.

The thing to check is ALL of the jumpers on your backplane. If a board doesn't use a line, the jumper must be in place so that the line is daisy chained to the rest of the boards after it. If the board does use the line, the jumper must be out. (If it is in, the board may ignore the line, or worse, two boards may try to respond at the same time.)

If these are standard Sun boards, check the requirements in the Sun manual:
Cardcage Slot Assignments and Backplane Configuration Procedures
to make sure your jumpers are correct. If they are not Sun boards, you will have to look through the manuals on the boards and decide what they need. Generally DMA type devices need both IACK and BG3 out. Boards that just generate interrupts need IACK out but BG3 in.

If you have more questions, e-mail me.

Russ Poffenberger
Schlumberger Technologies
poffen@sj.ate.slb.com
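
The jumper rules of thumb above can be written down as a small checklist. The sketch below is an illustration under stated assumptions: the slot description format and the names `EXPECTED` and `check_backplane` are hypothetical, and the "empty" rule (both jumpers in) is taken from the daisy-chain remark in the post. The board manuals remain the authority for any real configuration.

```python
# Rule of thumb from the post, per slot. True = jumper installed (line passed
# along the daisy chain), False = jumper removed (board handles the line).
# "empty" also covers a board that uses neither line.
EXPECTED = {
    "dma":       {"iack": False, "bg3": False},
    "interrupt": {"iack": False, "bg3": True},
    "empty":     {"iack": True,  "bg3": True},
}


def check_backplane(slots):
    """Return a list of jumper settings that disagree with the rule of thumb.

    slots maps a slot number to a dict with keys "kind" (one of the EXPECTED
    keys), "iack" and "bg3" (booleans). This layout is a made-up format.
    """
    problems = []
    for number, slot in sorted(slots.items()):
        wanted = EXPECTED[slot["kind"]]
        for jumper in ("iack", "bg3"):
            if slot[jumper] != wanted[jumper]:
                state = "in" if wanted[jumper] else "out"
                problems.append(
                    f"slot {number}: {jumper.upper()} jumper should be {state}")
    return problems


# Hypothetical cardcage: slot 2 is empty but its bus grant jumper was pulled,
# which would break the grant chain for every board after it.
cardcage = {
    1: {"kind": "dma",       "iack": False, "bg3": False},
    2: {"kind": "empty",     "iack": True,  "bg3": False},
    3: {"kind": "interrupt", "iack": False, "bg3": True},
}
for problem in check_backplane(cardcage):
    print(problem)
```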
drears@pica.army.mil (Dennis G. Rears (FSAC)) (02/28/90)
Recently my Sun 4 went down with the mysterious message "watchdog reset". I looked at all my Sun manuals and found only a few references to it. They all were for resetting the eeprom or nvram on what action to take when a "watchdog reset" happens. Does anybody have any idea what a watchdog reset is? In what manual is it fully explained? Please email me, as I don't read this list that much. (I barely have time to browse.)

[[Ed's Note: v8n118 has quite a bit of material, some of which I have included below. -bdg]]

X-Date: Wed, 30 Aug 89 19:31:47 PDT
X-From: beau@ultra.com (Beau James {Manager - SW Devel - Ultra Networks})
X-Subject: Re: Watchdog reset

[What is...?]

A "Watchdog reset" occurs when the watchdog timer (a hardware circuit) on the CPU board detects that the processor is halted. The processor is restarted and vectored into the PROMs at the watchdog reset handler, which prints the "Watchdog reset" message on the system console.

[Causes?]

It can be a bad CPU board. Or it could be provoked by bad hardware - CPU, memory, or perhaps a peripheral. But it can also be a software problem.

Probably the most common cause of processor halts is double bus faults. That is, the processor gets a fault trap while processing a fault trap. The most common cause of this is overflowing the stack - especially the interrupt stack.