[comp.sys.sun] Watchdog Reset

kam@cs.utexas.edu (Katherine Minister Hosch) (08/01/89)

Howdy all.  This has happened twice now, and I'm starting to get a little
worried.  We have a standalone 4/260 that crashed the other day, leaving
only the message (on the console):

	> watchdog reset

I rebooted, and ran fsck, but it didn't seem to have any big problems.
Today it crashed again, with the same console message.  The messages file
didn't contain anything the first time, but the second time there was a
message from two days ago about a memory failure:

Jul 29 13:48:41 mars vmunix: mem3: soft ecc addr 4ce550 syn 5b<S16,S4,S2,S0,SX> 41 U1841

Somehow I suspect that the message is unrelated to the reset though.
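
One way to check whether a message like this is a one-off or a trend is to count soft ECC reports per memory board in the messages file. The following is only a sketch: the pattern is inferred from the single sample line above, "syn" is presumed to be the ECC syndrome, the trailing "41 U1841" fields are left uninterpreted, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Pattern inferred from the one sample line in this thread (assumption);
# other SunOS releases or memory boards may log a different layout.
SOFT_ECC = re.compile(
    r"(?P<host>\S+) vmunix: (?P<board>mem\d+): soft ecc "
    r"addr (?P<addr>[0-9a-fA-F]+) syn (?P<syn>[0-9a-fA-F]+)<(?P<bits>[^>]*)>"
)

def parse_soft_ecc(line):
    """Return a dict of fields for a soft ECC line, or None if it does not match."""
    m = SOFT_ECC.search(line)
    if m is None:
        return None
    return {
        "host": m.group("host"),
        "board": m.group("board"),
        "addr": int(m.group("addr"), 16),
        "syndrome": int(m.group("syn"), 16),   # presumed ECC syndrome
        "bits": m.group("bits").split(","),    # names exactly as logged
    }

def count_by_board(lines):
    """Count soft ECC reports per memory board, skipping unrelated lines."""
    hits = (parse_soft_ecc(line) for line in lines)
    return Counter(h["board"] for h in hits if h)

sample = ("Jul 29 13:48:41 mars vmunix: mem3: soft ecc addr 4ce550 "
          "syn 5b<S16,S4,S2,S0,SX> 41 U1841")
```

A count that keeps rising for one board would point toward that board rather than the CPU.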

My question is this:  what *causes* a 'watchdog reset', other than pushing
the 'reset' button on the back of the machine?  In neither case was the
button pushed.  

This sounds like a bad problem to me; does anyone have any ideas about it?

Katherine Minister Hosch:		kam@titan.tsd.arlut.utexas.edu
Applied Research Laboratories		(512)-835-3148
University of Texas at Austin
P.O. Box 8029, Austin, TX  78713-8029

us214777@uc.msc.umn.edu (John C. Schultz) (08/19/89)

In article <718@brazos.Rice.edu> titan.tsd.arlut.utexas.edu!kam@cs.utexas.edu (Katherine Minister Hosch) writes:
>X-Sun-Spots-Digest: Volume 8, Issue 91, message 16 of 16
>
>Howdy all.  This has happened twice now, and I'm starting to get a little
>worried.  We have a standalone 4/260 that crashed the other day, leaving
>only the message (on the console):
>
>	> watchdog reset
>

Don't know if this will help but...  Our SUN 3/160 gets a watchdog reset
trying to boot if I do not have the VME box at the other end of the VME
repeater (Bit 3 brand) powered up.  Once the SUN has booted, I can power
the remote VME system up and down.

Perhaps you might want to try removing extra hardware and then rearranging
the boards in the backplane.  Stranger things have worked.

ehrlich@cs.psu.edu (Daniel Ehrlich) (08/23/89)

In article <718@brazos.Rice.edu> titan.tsd.arlut.utexas.edu!kam@cs.utexas.edu (Katherine Minister Hosch) writes:

> Howdy all.  This has happened twice now, and I'm starting to get a little
> worried.  We have a standalone 4/260 that crashed the other day, leaving
> only the message (on the console):
>
>	> watchdog reset

We have been seeing these on a regular basis on our `new' 4/280S.  A
question: is your 4/260 equipped with the FPU2 floating point daughter
board?  If the CPU occupies slots 1 and 2 it probably is.  In any event
our problem seems to be related to having the two 7053 disk controllers
trying to access the VME bus at the exact same time.

> I rebooted, and ran fsck, but it didn't seem to have any big problems.
> Today it crashed again, with the same console message.  The messages file
> didn't contain anything the first time, but the second time there was a
> message from two days ago about a memory failure:
>
> Jul 29 13:48:41 mars vmunix: mem3: soft ecc addr 4ce550 syn 5b<S16,S4,S2,S0,SX> 41 U1841
>
> Somehow I suspect that the message is unrelated to the reset though.
>
> My question is this:  what *causes* a 'watchdog reset', other than pushing
> the 'reset' button on the back of the machine?  In neither case was the
> button pushed.

What I have been told by the folks at Sun is that the 'Watchdog reset'
occurs when there is a double bit parity error on the VME bus.

> This sounds like a bad problem to me; does anyone have any ideas about it?

Sun's standard response is to replace the CPU board, although this may
not be the real cause of the problem.

>    Katherine Minister Hosch:		kam@titan.tsd.arlut.utexas.edu
>    Applied Research Laboratories		(512)-835-3148
>    University of Texas at Austin
>    P.O. Box 8029, Austin, TX  78713-8029

--
Dan Ehrlich <ehrlich@shire.cs.psu.edu> | Disclaimer: The opinions expressed are
The Pennsylvania State University      | my own, and should not be attributed
Department of Computer Science         | to anyone else, living or dead.
University Park, PA   16802            |

rowe@cme.nist.gov (Walter Rowe) (08/31/89)

On 19 Aug 89 03:33:26 GMT,
mmm!us214777@uc.msc.umn.edu (John C. Schultz) said:

>	> watchdog reset
>
> Perhaps you might want to try removing extra hardware and then
> rearranging the boards in the backplane.  Stranger things have
> worked.

I had this problem with a Sun 4/280, also, and it started to occur
more frequently.  Eventually, a memory card went bad.  Once I replaced
the memory board these crashes stopped happening.

Walter

beau@ultra.com (Beau James {Manager - SW Devel - Ultra Networks}) (08/31/89)

In SunSpots v8n105, Daniel Ehrlich (ehrlich@cs.psu.edu) writes:

> My question is this:  what *causes* a 'watchdog reset', other than pushing
> the 'reset' button on the back of the machine?  In neither case was the
> button pushed.  
> 

A "Watchdog reset" occurs when the watchdog timer (a hardware circuit) on
the CPU board detects that the processor is halted.  The processor is
restarted and vectored into the PROMs at the watchdog reset handler, which
prints the "Watchdog reset" message on the system console.

At this point, the hardware maps and assorted other state have been reset,
so there's no chance to go back to Unix to run the core dump subroutines
in the kernel.  Reboot and start over.

> What I have been told by the folks at Sun is that the 'Watchdog reset'
> occurs when there is a double bit parity error on the VME bus.
> ...
> Sun's standard response is to replace the CPU board.  Although this may
> not be the real cause of the problem.

It can be a bad CPU board.  Or it could be provoked by bad hardware - CPU,
memory, or perhaps a peripheral.  But it can also be a software problem.

Probably the most common cause of processor halts is double bus faults.
That is, the processor gets a fault trap while processing a fault trap.
The most common cause of this is overflowing the stack - especially the
interrupt stack.

SunOS puts a guard page (invalid page) below the interrupt stack in
kernel virtual address space. If the processor is in the middle of pushing
an exception frame on the stack - perhaps a level 6 interrupt
interrupting a level 5 interrupt interrupting a level 4 ... (you get the
idea) - and the stack overflows into the guard page, that's a double bus
fault.  'Taint nothing the thing can do but give up.
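
The nesting scenario above can be modelled in a few lines. This is a toy sketch, not SunOS code: the stack size, frame size, and per-handler locals are invented numbers, and the model treats any push that would reach the guard page as fatal, on the reasoning that the fault frame would have to be pushed onto the same full stack.

```python
class ProcessorHalted(Exception):
    """Raised when the model reaches a double bus fault."""

def nest_interrupts(levels, stack_bytes, frame_bytes, locals_bytes):
    """Nest one interrupt per entry in `levels`; return the bytes of stack used.

    Each interrupt costs an exception frame plus the handler's locals.
    All sizes are hypothetical and chosen only for illustration.
    """
    used = 0
    for level in levels:
        need = frame_bytes + locals_bytes
        if used + need > stack_bytes:
            # The push runs into the guard page: first fault.  Pushing the
            # fault frame hits the guard page again: second fault, halt.
            raise ProcessorHalted(f"double bus fault while taking level {level}")
        used += need
    return used
```

With these made-up numbers, `nest_interrupts([4, 5, 6], 1024, 64, 128)` fits in the stack, while the same nesting on a 512-byte stack halts when the level 6 interrupt arrives.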

Software can contribute to the problem by
	- not allocating a worst-case-size interrupt stack
	- recursing in an interrupt handler
	- using too big a local stack frame in an interrupt
		handler
	- reenabling interrupts before exiting from the
		interrupt handler
and similar sorts of screwup.
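
As a rough illustration of the first item, a worst-case interrupt stack budget can be estimated by assuming one handler active per priority level, each interrupting the level below it. The handler sizes and the frame size here are hypothetical, and the bound no longer holds if a handler recurses or reenables interrupts at its own level.

```python
def worst_case_interrupt_stack(handler_bytes_by_level, frame_bytes):
    """Upper bound on interrupt stack use when each level nests at most once.

    handler_bytes_by_level maps a priority level to the stack usage of every
    handler that can run at that level; the largest one is charged per level.
    """
    total = 0
    for handler_bytes in handler_bytes_by_level.values():
        total += frame_bytes + max(handler_bytes)
    return total

# Hypothetical handlers at levels 4, 5 and 6 (bytes of locals per handler).
handlers = {4: [96, 160], 5: [64], 6: [200, 48]}
budget = worst_case_interrupt_stack(handlers, frame_bytes=64)
fits_in_512 = budget <= 512
```

If the budget exceeds the allocated stack, the nesting case alone is enough to reach the guard page without any hardware fault.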

By the way, there's no parity on the VME bus, so the idea of a
"double bit parity error on the VMEbus" is nonsense.  Someone might
have been referring to "double bit ECC error on the memory bus"; that
results in another type of hardware interrupt.  Normally, Unix will catch
it and panic.  If the processor overflows the stack while trying to take
the ECC trap, then it's double bus fault time, as described above.

Misbehaving hardware can cause other types of interrupts as well,
for example timeout on access to the VMEbus.  These all turn into
processor traps that try to push exception frames on the stack.

The comment in your followup mail:

> The "Watchdog reset" errors seem to occur when both 7053 disk controllers
> are busy.  One can usually generate a "Watchdog reset" in single user mode
> by running fsck(8) in parallel on disks attached to the two controllers.

unfortunately doesn't help resolve whether it's bad hardware or a
software bug.

The only way to deterministically figure out which is to blame is to
hook up a bus analyzer.  Non-deterministically, the usual procedure
is to swap hardware until it seems probable that the problem is generic
rather than a sample defect.

Beau James				beau@Ultra.COM
Ultra Network Technologies		{sun,ames}!ultra.com!beau

poffen@sj.ate.slb.com (Russ Poffenberger) (09/01/89)

In article <1074@brazos.Rice.edu> ehrlich@cs.psu.edu (Daniel Ehrlich) writes:
>X-Sun-Spots-Digest: Volume 8, Issue 105, message 9 of 12
>
>In article <718@brazos.Rice.edu> titan.tsd.arlut.utexas.edu!kam@cs.utexas.edu (Katherine Minister Hosch) writes:
>
>Howdy all.  This has happened twice now, and I'm starting to get a little
>worried.  We have a standalone 4/260 that crashed the other day, leaving
>only the message (on the console):
>
>	   > watchdog reset
>

<Stuff deleted>

>My question is this:  what *causes* a 'watchdog reset', other than pushing
>the 'reset' button on the back of the machine?  In neither case was the
>button pushed.  
>

I have found that a watchdog reset can also occur on other VME errors,
such as a board generating an interrupt but not responding to the
interrupt acknowledge (IACK, controlled by jumper PX04 on the backplane),
or a similar failure involving Bus Grant: if a board requests the bus but
does not respond when it is granted (controlled by jumper PX03 on the
backplane), you will get this. The thing to check is ALL of the jumpers on
your backplane.  If the board in a slot doesn't use a signal, the jumper
must be in place so that the signal is daisy chained to the rest of the
boards after it; if the board does use the signal, the jumper must be out.
(If it is in, the board may ignore it, or worse, two boards may try to
respond at the same time.)  If these are standard Sun boards, check the
requirements in the Sun manual "Cardcage Slot Assignments and Backplane
Configuration Procedures" to make sure your jumpers are correct. If they
are not Sun boards, you will have to look through the manuals for the
boards and decide what they need.  Generally, DMA-type devices need both
IACK and BG3 out. Boards that just generate interrupts need IACK out but
BG3 in.  If you have more questions, e-mail me.
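
The rule of thumb above can be written down as a small table and checked against a record of what is actually installed. This is a sketch only: the "empty" row (both jumpers in, to pass the daisy chains through) is an assumption based on the daisy-chain remark, the slot data is invented, and the Sun manual or the board vendor's manual remains the authority.

```python
# True means "jumper installed".  First two rows follow the rule of thumb in
# the post; the "empty" row is an assumption, not something the post states.
REQUIRED = {
    "dma":       {"IACK": False, "BG3": False},
    "interrupt": {"IACK": False, "BG3": True},
    "empty":     {"IACK": True,  "BG3": True},
}

def check_backplane(slots):
    """slots: iterable of (slot_number, kind, jumpers); return problem strings."""
    problems = []
    for slot, kind, jumpers in slots:
        for signal, want in REQUIRED[kind].items():
            if jumpers[signal] != want:
                state = "in" if want else "out"
                problems.append(f"slot {slot}: {signal} jumper should be {state}")
    return problems

# Hypothetical cardcage: a DMA controller with its BG3 jumper left in,
# and an empty slot whose IACK jumper was pulled.
cage = [
    (3, "dma", {"IACK": False, "BG3": True}),
    (4, "empty", {"IACK": False, "BG3": True}),
    (5, "interrupt", {"IACK": False, "BG3": True}),
]
report = check_backplane(cage)
```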

Russ Poffenberger
Schlumberger Technologies
poffen@sj.ate.slb.com

drears@pica.army.mil (Dennis G. Rears (FSAC)) (02/28/90)

Recently my Sun 4 went down with the mysterious message "watchdog reset".
I looked at all my Sun manuals and found only a few references to it.
They all were for resetting the eeprom or nvram on what action to take
when a "watchdog reset" happens.   Does anybody have any idea what a
watchdog reset is?  In what manual is it fully explained?  Please email
me, as I don't read this list that much.  (I barely have time to browse.)

[[Ed's Note: v8n118 has quite a bit of material some of which I have
included below. -bdg]]

X-Date:    Wed, 30 Aug 89 19:31:47 PDT
X-From:    beau@ultra.com (Beau James {Manager - SW Devel - Ultra Networks})
X-Subject: Re: Watchdog reset

[What is...?]

A "Watchdog reset" occurs when the watchdog timer (a hardware circuit) on
the CPU board detects that the processor is halted.  The processor is
restarted and vectored into the PROMs at the watchdog reset handler, which
prints the "Watchdog reset" message on the system console.

[Causes?]

It can be a bad CPU board.  Or it could be provoked by bad hardware - CPU,
memory, or perhaps a peripheral.  But it can also be a software problem.
Probably the most common cause of processor halts is double bus faults.
That is, the processor gets a fault trap while processing a fault trap.
The most common cause of this is overflowing the stack - especially the
interrupt stack.