narten@purdue.arpa (Thomas Narten) (10/04/86)
We have been experiencing a rather odd and intermittent problem with our Unix machines. It is not confined to a particular machine or Unix; it has happened with 4.2, 4.2 NFS, and 4.3 BSD on VAX 780, 785 and uVAX II machines.

Symptoms: The machines appear to lock up, users cannot get characters echoed, console is hung. In short, the machine seems dead. The only way to recover is a reboot.

However, the machine is still running in a sense. One can ping the machine in question, and it responds. One can open a TCP connection to the machine, and the connection succeeds, but hangs at that point.

When this happens, we have halted the CPU, looked at the PC, and continued the system, repeating this in hopes of finding the machine caught in a tight loop somewhere. It is not in a tight loop. In fact, when this nailed one of our idle machines, the system was spending all of its time in the context switch routine "Swtch". Other attempts at this have found the PC in unrelated procedures on each halt.

This has hit most of our machines at one time or another, but usually only gets one at a time. Sometimes it's a month between hangs, sometimes it happens several times in a day.

I suspect that we are tweaking some sort of networking bug where the setting of the processor priority level gets messed up, leaving the machine at a higher priority than it should be, so that user processes are no longer scheduled. Evidence to support this is an increase in network traffic on our Ethernets over the last 6 months. Also, the last time one of the machines hung, the last message on the console was a "qe0: restart" message, indicating that the DEQNA Ethernet board had become wedged. The problem is not restricted to machines with a DEQNA.

Has anyone else run into a similar problem?

Thomas
----------
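The symptom described above (the connection is accepted, but nothing ever comes back) can be checked from another host. What follows is a minimal sketch in Python, not something from the original post. The function name probe, the timeout value, and the assumption that the service on the chosen port sends something as soon as a client connects (an SMTP or FTP greeting, for example) are all illustrative.

```python
import socket


def probe(host, port, timeout=5.0):
    """Classify a remote service as 'unreachable', 'silent',
    'closed', or 'responsive'.

    Assumes the service normally sends data right after the
    connection is established (a greeting banner).
    """
    try:
        # The TCP handshake is completed by the remote kernel, so this
        # can succeed even if no user process on that host ever runs.
        sock = socket.create_connection((host, port), timeout=timeout)
    except OSError:
        return "unreachable"
    try:
        sock.settimeout(timeout)
        try:
            data = sock.recv(1)
        except socket.timeout:
            # Connected, but nothing arrived within the timeout.
            return "silent"
        except OSError:
            return "unreachable"
        # An empty read means the peer closed the connection.
        return "responsive" if data else "closed"
    finally:
        sock.close()
```

A listening socket completes the handshake in the kernel before any user process calls accept, so a result of "silent" from a host that also answers ping would be consistent with a kernel that still services the network while user processes are not being scheduled. It does not prove that diagnosis; a service that is merely slow, or that waits for the client to speak first, looks the same.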
baccala@USNA.arpa (Brent W Baccala) (10/07/86)
In <23407@gwen.cs.purdue.edu>, you write:
>We have been experiencing a rather odd and intermittent problem with
>our Unix machines. It is not confined to a particular machine or Unix;
>it has happened with 4.2, 4.2 NFS, and 4.3 BSD on VAX 780, 785 and
>uVAX II machines.
>
>Symptoms: The machines appear to lock up, users cannot get characters
>echoed, console is hung. In short, the machine seems dead. The only
>way to recover is a reboot.
>
>However, the machine is still running in a sense. One can ping the
>machine in question, and it responds. One can open a TCP connection to
>the machine, and the connection succeeds, but hangs at that point.

We had the EXACT same problem with a PDP-11/55 running 2.9 BSD. It was much more consistent, though - ours would go down regularly every night (after hours, of course). I think I fixed it this weekend (it's been up for more than two nights straight - a major achievement).

The problem appears to have been in a locally written version of "syslogd". I, too, suspected the network (though I'm far from a guru), but only looked briefly at the networking code. And since only one of our programs (a port of phone) uses syslog, I didn't think tracking down the bug justified the downtime that would be involved.

Whether this problem is peculiar to our local syslogd, I don't know. Nor do I know exactly what triggers the bug; it may not be peculiar to syslogd either. It's interesting, but my experience has been that whenever there's a problem, turning off syslogd fixes it...maybe ours is just a broken daemon.

Hope this helps (and let me know if you find the bug)

-bwb

- BRENT W. BACCALA -
Aerospace Engineering Department
U.S. Naval Academy
Annapolis, MD
<baccala@usna.arpa>

"I do graphics work on an SGI Iris, fun work on a VAX 11/780, grunge work on an IBM XT"
hartley@uvm-gen.UUCP (Stephen J. Hartley) (10/07/86)
> We have been experiencing a rather odd and intermittent problem with
> our Unix machines. It is not confined to a particular machine or Unix;
> Symptoms: The machines appear to lock up, users cannot get characters
> echoed, console is hung. In short, the machine seems dead. The only
> way to recover is a reboot.

We have experienced the same problem with our VAX-11/750 and 780, each running vanilla 4.3 BSD. Same symptoms: characters not echoed, console dead. We can only reboot to remove the problem. However, this has only happened in the last two months, since we brought up 4.3. As far as I can remember, it never happened in the 2 1/2 years we ran 4.2 on these machines.
rbj@ICST-CMR.arpa (Root Boy Jim) (10/07/86)
In <23407@gwen.cs.purdue.edu>, you write:
>We have been experiencing a rather odd and intermittent problem with
>our Unix machines. It is not confined to a particular machine or Unix;
>it has happened with 4.2, 4.2 NFS, and 4.3 BSD on VAX 780, 785 and
>uVAX II machines.
>
>Symptoms: The machines appear to lock up, users cannot get characters
>echoed, console is hung. In short, the machine seems dead. The only
>way to recover is a reboot.
>
>However, the machine is still running in a sense. One can ping the
>machine in question, and it responds. One can open a TCP connection to
>the machine, and the connection succeeds, but hangs at that point.

You didn't say what kind(s) of disks/controller(s) you have. We have an SI 9900 controller connected to a CDC 9766 (rm05) and an eagle. Every so often the controller goes out to lunch, but is fortunately awakened by flipping the controller reset switch. Anyone doing I/O thru this beast will hang until the remedy is applied. Anyone doing I/O only to our DEC Massbus rm03's will be unaffected. This is annoying, but not disastrous except when dialing in from home.

(Root Boy) Jim Cottrell <rbj@icst-cmr.arpa>
jeff@ntvax.UUCP (10/10/86)
> Symptoms: The machines appear to lock up, users cannot get characters
> echoed, console is hung. In short, the machine seems dead. The only
> way to recover is a reboot.
>
> However, the machine is still running in a sense. One can ping the
> machine in question, and it responds. One can open a TCP connection to
> the machine, and the connection succeeds, but hangs at that point.

We are experiencing the same symptoms about once every two weeks (average). When the gremlin decides to assert him(her)self the terminals die one by one when they go to disk. Anybody running an application that doesn't require a disk access can run forever even though every other terminal has died (including the console). Once they exit from that application, they're history. We're running 4.2 on a 780 with Massbus rm80s and an rp07. Any ideas?

Jeff Carruth (the new guy)
North Texas State University
ihnp4!infoswx!ntvax!jeff
robert@aragorn.OZ (Robert Ruge) (10/17/86)
In article <6700001@ntvax> jeff@ntvax.UUCP writes:
>
>We are experiencing the same symptoms about once every two weeks (average).
>When the gremlin decides to assert him(her)self the terminals die one by
>one when they go to disk. Anybody running an application that doesn't
>require a disk access can run forever even though every other terminal
>has died (including the console). Once they exit from that application,
>they're history. We're running 4.2 on a 780 with Massbus rm80s and an rp07.
>Any ideas?
>

I recently experienced this problem on a Gould PN6031 and traced it down to a bad block on one of the disks. Whenever this block was accessed the disk controller would hang so that when a program or user went to access the disk they would also hang waiting for the disk controller to complete its operation. However if your program is running in memory then you can execute for as long as you like, until you either finish execution or perform a disk access. This results in terminals going out one by one.

To find the bad block we wrote a small program that opened the c partition (whole disk) and sequentially read each sector and printed its number. Where the program stopped is where the bad block is. Flagging the block as bad cleared up the whole problem.

I hope that this helps you.

Robert Ruge        | UUCP:   {seismo,mcvax,ukc,
Computing/Maths    |          hplabs,nttlab}!munnari!aragorn.oz!robert
Deakin University  | ARPA:   munnari!aragorn.oz!robert@SEISMO.ARPA
Victoria, 3217     | CSNET:  robert@aragorn.oz
Australia          | ACSNET: robert@aragorn.oz
PHONE: +61 52 471319
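The scanning program described above might look roughly like the following. This is a sketch in Python, not the program that was actually written on the Gould. The function name scan_sectors, the 512-byte sector size, and the device path in the usage note are assumptions for illustration.

```python
import os
import sys

SECTOR_SIZE = 512  # assumed sector size in bytes; check the disk's geometry


def scan_sectors(path, sector_size=SECTOR_SIZE, out=sys.stdout):
    """Read `path` sequentially, one sector per read, printing each
    sector number before it is read.

    Returns the number of reads that returned data. On a clean run the
    last number printed is one past the final sector, because the
    end-of-device read is announced before it comes back empty.
    """
    count = 0
    fd = os.open(path, os.O_RDONLY)
    try:
        while True:
            # Print and flush BEFORE the read: if the controller hangs,
            # the last number on the screen is the sector being accessed.
            out.write("%d\n" % count)
            out.flush()
            data = os.read(fd, sector_size)
            if not data:
                break  # end of device
            count += 1
    finally:
        os.close(fd)
    return count
```

It would be run as root against the raw whole-disk partition, for example scan_sectors("/dev/rxx0c"), where the device name is hypothetical and depends on the system. If the controller hangs on a bad block, the read never returns and the last number printed identifies the suspect sector. If the driver reports the failure instead, os.read raises OSError at that sector.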