[net.unix-wizards] BSD Unix machines hanging

narten@purdue.arpa (Thomas Narten) (10/04/86)

We have been experiencing a rather odd and intermittent problem with
our Unix machines. It is not confined to a particular machine or Unix;
it has happened with 4.2, 4.2 NFS, and 4.3 BSD on VAX 780, 785 and
uVAX II machines.

Symptoms: The machines appear to lock up, users cannot get characters
echoed, console is hung. In short, the machine seems dead. The only
way to recover is a reboot. 

However, the machine is still running in a sense. One can ping the
machine in question, and it responds. One can open a TCP connection to
the machine, and the connection succeeds, but hangs at that point.
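
The "connects, then hangs" state can be checked for from another host.
What follows is a minimal sketch, assuming the probed port belongs to a
service that sends a greeting as soon as a client connects (SMTP is one
such service). The names probe, readable_within and PROBE_MAIN are made
up for illustration. A result of 0 means the TCP handshake completed but
nothing arrived before the timeout, which would be consistent with a
kernel that still answers the network while no user-level server runs.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/select.h>
#include <sys/socket.h>
#include <unistd.h>

/* Return 1 if fd becomes readable within timeout_sec seconds,
 * 0 on timeout, -1 on error.  End-of-file also counts as readable. */
int readable_within(int fd, int timeout_sec)
{
    fd_set rfds;
    struct timeval tv;
    int r;

    assert(timeout_sec >= 0);
    FD_ZERO(&rfds);
    FD_SET(fd, &rfds);
    tv.tv_sec = timeout_sec;
    tv.tv_usec = 0;
    r = select(fd + 1, &rfds, NULL, NULL, &tv);
    return r > 0 ? 1 : r;
}

/* Connect to a dotted-quad address and port, then wait for the server
 * to say something.  Returns -1 if the connection could not be made,
 * 0 if it was made but the server stayed silent, 1 if data (or a
 * close) arrived in time. */
int probe(const char *ip, int port, int timeout_sec)
{
    struct sockaddr_in sa;
    int fd, r;

    memset(&sa, 0, sizeof sa);
    sa.sin_family = AF_INET;
    sa.sin_port = htons((unsigned short)port);
    if (inet_pton(AF_INET, ip, &sa.sin_addr) != 1)
        return -1;
    fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;
    if (connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0) {
        close(fd);
        return -1;
    }
    r = readable_within(fd, timeout_sec);
    close(fd);
    return r > 0 ? 1 : 0;
}

#ifdef PROBE_MAIN   /* build with -DPROBE_MAIN for a command-line tool */
int main(int argc, char **argv)
{
    int r;

    if (argc != 3) {
        fprintf(stderr, "usage: %s ip-address port\n", argv[0]);
        return EXIT_FAILURE;
    }
    r = probe(argv[1], atoi(argv[2]), 10);
    puts(r < 0 ? "connect failed" :
         r == 0 ? "connected, no response" : "connected, server responded");
    return r == 1 ? EXIT_SUCCESS : EXIT_FAILURE;
}
#endif
```

A service that waits for the client to speak first will also look
silent to this probe, so the choice of port matters.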

When this happens, we have halted the cpu, looked at the PC, continued
the system, repeating the above in hopes of finding the machine caught
in a tight loop somewhere. It is not in a tight loop. In fact, when
this nailed one of our idle machines, the system was spending all of
its time in the context switch routine "Swtch". Other attempts at this
have found the PC in unrelated procedures on each halt.

This has hit most of our machines at one time or another, but usually
only gets one at a time. Sometimes it's a month between hangs,
sometimes several times in a day.

I suspect that we are tweaking some sort of networking bug where the
setting of the processor priority level gets messed up, leaving the
machine in a higher priority than it should be, so that user processes
no longer are scheduled.

Evidence to support this is an increase in network traffic on our
Ethernets over the last 6 months. Also, the last time one of the
machines hung, the last message on the console was a "qe0: restart"
message, indicating that the DEQNA Ethernet board had become wedged.
The problem is not restricted to machines with a DEQNA.
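
The suspected class of bug, a code path that raises the priority level
and then returns without restoring it, can be shown with a toy model.
This is only a sketch: the two levels and all names (raise_level,
restore_level, deliverable, enqueue_ok, enqueue_leaky, LEAK_DEMO_MAIN)
are invented for illustration and are not VAX or BSD kernel values. In
the kernel the matching pattern is "s = splimp(); ... splx(s);".

```c
#include <assert.h>
#include <stdio.h>

enum { LVL_NONE = 0, LVL_LOW = 1, LVL_HIGH = 2 };   /* made-up levels */

int cur_level = LVL_NONE;          /* simulated processor priority */

/* Raise the level (never lower it) and return the previous level so
 * that the caller can put it back. */
int raise_level(int lvl)
{
    int old = cur_level;

    assert(lvl >= LVL_NONE);
    if (lvl > cur_level)
        cur_level = lvl;
    return old;
}

void restore_level(int old) { cur_level = old; }

/* An interrupt is delivered only if it outranks the current level. */
int deliverable(int lvl) { return lvl > cur_level; }

/* Correct: every exit path restores the saved level. */
int enqueue_ok(int queue_full)
{
    int s = raise_level(LVL_LOW);
    int rc = queue_full ? -1 : 0;

    restore_level(s);
    return rc;
}

/* Buggy: the rarely taken error path leaks the raised level. */
int enqueue_leaky(int queue_full)
{
    int s = raise_level(LVL_LOW);

    if (queue_full)
        return -1;                 /* BUG: no restore_level(s) here */
    restore_level(s);
    return 0;
}

#ifdef LEAK_DEMO_MAIN   /* build with -DLEAK_DEMO_MAIN to run the demo */
int main(void)
{
    enqueue_leaky(1);              /* take the error path once */
    printf("low-level interrupts deliverable:  %d\n", deliverable(LVL_LOW));
    printf("high-level interrupts deliverable: %d\n", deliverable(LVL_HIGH));
    return 0;
}
#endif
```

Because the leak sits on an error path, such a bug would only show up
occasionally, and more often as load pushes queues toward full.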

Has anyone else run into a similar problem?

Thomas
----------

baccala@USNA.arpa (Brent W Baccala) (10/07/86)

In <23407@gwen.cs.purdue.edu>, you write:

>We have been experiencing a rather odd and intermittent problem with
>our Unix machines. It is not confined to a particular machine or Unix;
>it has happened with 4.2, 4.2 NFS, and 4.3 BSD on VAX 780, 785 and
>uVAX II machines.
>
>Symptoms: The machines appear to lock up, users cannot get characters
>echoed, console is hung. In short, the machine seems dead. The only
>way to recover is a reboot. 
>
>However, the machine is still running in a sense. One can ping the
>machine in question, and it responds. One can open a TCP connection to
>the machine, and the connection succeeds, but hangs at that point.

We had the EXACT same problem with a PDP-11/55 running 2.9 BSD. It was
much more consistent, though - ours would go down regularly every night
(after hours, of course).  I think I fixed it this weekend (it's been
up for more than two nights straight - a major achievement).

The problem appears to have been in a locally written version of
"syslogd".  I, too, suspected the network (though I'm far from a guru),
but only looked briefly at the networking code.  And since only one of
our programs (a port of phone) uses syslog, I didn't think tracking down
the bug justified the downtime that would be involved. Whether this
problem is peculiar to our local syslogd, I don't know.  Nor do I know
exactly what triggers the bug; it may not be peculiar to syslogd
either.

It's interesting, but my experience has been that whenever there's
a problem, turning off syslogd fixes it...maybe ours is just a broken
daemon.

Hope this helps (and let me know if you find the bug)

					-bwb

			- BRENT W. BACCALA -
			Aerospace Engineering Department
			U.S. Naval Academy
			Annapolis, MD

			<baccala@usna.arpa>

	"I do graphics work on an SGI Iris, fun work on a VAX 11/780,
		grunge work on an IBM XT"

hartley@uvm-gen.UUCP (Stephen J. Hartley) (10/07/86)

> We have been experiencing a rather odd and intermittent problem with
> our Unix machines. It is not confined to a particular machine or Unix;
> Symptoms: The machines appear to lock up, users cannot get characters
> echoed, console is hung. In short, the machine seems dead. The only
> way to recover is a reboot. 

We have experienced the same problem with our VAX-11/750 and 780, each
running vanilla 4.3 BSD.  Same symptoms:  characters not echoed, console
dead.  We can only reboot to remove the problem.  However, this has only
happened in the last two months, since we brought up 4.3.  It never
happened that I can remember in the 2 1/2 years we ran 4.2 on these
machines.

rbj@ICST-CMR.arpa (Root Boy Jim) (10/07/86)

In <23407@gwen.cs.purdue.edu>, you write:

>We have been experiencing a rather odd and intermittent problem with
>our Unix machines. It is not confined to a particular machine or Unix;
>it has happened with 4.2, 4.2 NFS, and 4.3 BSD on VAX 780, 785 and
>uVAX II machines.
>
>Symptoms: The machines appear to lock up, users cannot get characters
>echoed, console is hung. In short, the machine seems dead. The only
>way to recover is a reboot. 
>
>However, the machine is still running in a sense. One can ping the
>machine in question, and it responds. One can open a TCP connection to
>the machine, and the connection succeeds, but hangs at that point.

You didn't say what kind(s) of disks/controller(s) you have. We have
an SI 9900 controller connected to a CDC 9766 (rm05) and an eagle.
Every so often the controller goes out to lunch, but is fortunately
awakened by flipping the controller reset switch. Anyone doing I/O
thru this beast will hang until the remedy is applied. Anyone doing
I/O only to our DEC Massbus rm03's will be unaffected. This is annoying,
but not disastrous except when dialing in from home.

	(Root Boy) Jim Cottrell		<rbj@icst-cmr.arpa>

jeff@ntvax.UUCP (10/10/86)

> Symptoms: The machines appear to lock up, users cannot get characters
> echoed, console is hung. In short, the machine seems dead. The only
> way to recover is a reboot. 
> 
> However, the machine is still running in a sense. One can ping the
> machine in question, and it responds. One can open a TCP connection to
> the machine, and the connection succeeds, but hangs at that point.

We are experiencing the same symptoms about once every two weeks (average).
When the gremlin decides to assert him(her)self the terminals die one by
one when they go to disk.  Anybody running an application that doesn't
require a disk access can run forever even though every other terminal
has died (including the console).  Once they exit from that application,
they're history.  We're running 4.2 on a 780 with Massbus rm80s and an rp07.
Any ideas?

Jeff Carruth (the new guy)
North Texas State University

ihnp4!infoswx!ntvax!jeff

robert@aragorn.OZ (Robert Ruge) (10/17/86)

In article <6700001@ntvax> jeff@ntvax.UUCP writes:
>
>We are experiencing the same symptoms about once every two weeks (average).
>When the gremlin decides to assert him(her)self the terminals die one by
>one when they go to disk.  Anybody running an application that doesn't
>require a disk access can run forever even though every other terminal
>has died (including the console).  Once they exit from that application,
>they're history.  We're running 4.2 on a 780 with Massbus rm80s and an rp07.
>Any ideas?
>

I recently experienced this problem on a Gould PN6031 and traced it
down to a bad block on one of the disks. Whenever this block was
accessed the disk controller would hang so that when a program or user
went to access the disk they would also hang waiting for the disk
controller to complete its operation. However if your program is
running in memory then you can execute for as long as you like, until
you either finish execution or perform a disk access. This results in
terminals going out one by one. To find the bad block we wrote a small
program that opened the c partition (whole disk) and sequentially read
each sector and printed its number. Where the program stopped is where
the bad block is. Flagging the block as bad cleared up the whole
problem. I hope that this helps you.
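
The scan described above can be sketched as follows. This is a
reconstruction under assumptions, not the original program: it assumes
512-byte sectors, and the names scan_sectors, SECTOR_SIZE and SCAN_MAIN
are hypothetical. The raw device for the whole-disk partition is passed
as an argument, since its name varies from system to system. Each sector
number is printed and flushed before the read, so that if the controller
hangs, the last number shown is the suspect sector.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define SECTOR_SIZE 512            /* assumption: 512-byte sectors */

/* Read the device at path one sector at a time, logging each sector
 * number before it is read.  Returns the number of sectors read
 * cleanly, or -1 if the device cannot be opened.  At end of device the
 * last number logged is one past the final sector. */
long scan_sectors(const char *path, FILE *log)
{
    char buf[SECTOR_SIZE];
    long sector = 0;
    ssize_t n;
    int fd;

    assert(log != NULL);
    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    for (;;) {
        fprintf(log, "%ld\n", sector);
        fflush(log);               /* make sure it shows before a hang */
        n = read(fd, buf, sizeof buf);
        if (n != (ssize_t)sizeof buf)
            break;                 /* error, short read, or end of device */
        sector++;
    }
    if (n < 0)
        fprintf(log, "read failed at sector %ld\n", sector);
    close(fd);
    return sector;
}

#ifdef SCAN_MAIN   /* build with -DSCAN_MAIN for a command-line tool */
int main(int argc, char **argv)
{
    long n;

    if (argc != 2) {
        fprintf(stderr, "usage: %s raw-device\n", argv[0]);
        return EXIT_FAILURE;
    }
    n = scan_sectors(argv[1], stdout);
    if (n < 0) {
        perror(argv[1]);
        return EXIT_FAILURE;
    }
    printf("%ld sectors read cleanly\n", n);
    return EXIT_SUCCESS;
}
#endif
```

Reading one sector per call is slow, but it pins the failure to a single
sector number, which is what the bad-block flagging step needs.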

Robert Ruge	  | UUCP:   {seismo,mcvax,ukc,
Computing/Maths	  |          hplabs,nttlab}!munnari!aragorn.oz!robert
Deakin University | ARPA:   munnari!aragorn.oz!robert@SEISMO.ARPA
Victoria, 3217	  | CSNET:  robert@aragorn.oz
Australia	  | ACSNET: robert@aragorn.oz  PHONE:  +61 52 471319