bruce@godot.UUCP (Bruce Nemnich) (11/14/84)
I have been having an awful lot of problems with hung lines with 4.2bsd uucp. By hung, I mean a uucp process sitting blocking on a read from the dialout line, but having removed the LCKfile for that line and system (presumably after a timeout). The offending processes must be killed by hand. I briefly looked at the code, but I didn't notice anything too bad. The stack frame at the hung point is often main()/conn()/login()/expect()/read(), though sometimes main()/imsg()/read(). Any quick fixes?
--
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa
wls@astrovax.UUCP (William L. Sebok) (11/15/84)
> I have been having an awful lot of problems with hung lines with 4.2bsd
> uucp. By hung, I mean a uucp process sitting blocking on a read from
> the dialout line, but having removed the LCKfile for that line and
> system (presumably after a timeout). The offending processes must be
> killed by hand. I briefly looked at the code, but I didn't notice
> anything too bad. The stack frame at the hung point is often
> main()/conn()/login()/expect()/read(), though sometimes
> main()/imsg()/read(). Any quick fixes?
> --
> --Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA

I am having this problem too. I would sure as heck like to know what is going on. The hanging seems to occur when the load average is high. I believe that I installed a posted bug fix to alarm() and it didn't cure the problem.
--
Bill Sebok Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls
russg@daemon.UUCP (Russel Gorby) (11/17/84)
In article <449@godot.UUCP> bruce@godot.UUCP (Bruce Nemnich) writes:
>I have been having an awful lot of problems with hung lines with 4.2bsd
>uucp. By hung, I mean a uucp process sitting blocking on a read from
>the dialout line, but having removed the LCKfile for that line and
>system (presumably after a timeout). The offending processes must be
>killed by hand. I briefly looked at the code, but I didn't notice
>anything too bad. The stack frame at the hung point is often
>main()/conn()/login()/expect()/read(), though sometimes
>main()/imsg()/read(). Any quick fixes?
>--
>--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
> ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa

We had that problem here at tektronix also. It only happens when the uucico session was initiated in answer mode, and the answering modem is connected to a data switch. I believe that the problem stems from the fact that the TIOCHPCL call works correctly (or so I've been told) under 4.2BSD. The problem is in the disconnect routine in the xqt module and, depending on what defines are set, will hang on either the first close ( close(0) ) or on the open("/dev/tty",2) which occurs just after the last close.

The reason the hangs occur is that at some point the dataswitch drops the line, and later uucico tries to do I/O to the port. At some point in the code DTR gets toggled (before the disconnect routine), which on a dataswitch will cause the connection to be severed, and DTR to get held low. Any subsequent I/O to the port will hang because the port won't interrupt any more. This will cause a hang on the initial close in disconnect(). If this doesn't get you, then the TIOCHPCL will, because after the last close (in disconnect()), it tries to do the open("/dev/tty",2), which will hang for the same reason.

Our fix (at least for the moment) was to put a timeout mechanism in the disconnect routine, and this seems to have fixed that particular problem.
-- Russ Gorby ucbvax!tektronix!russg (503) 627-1153 Tektronix PO Box 500 M.S. 19-333 Beaverton OR 97077
wls@astrovax.UUCP (William L. Sebok) (11/17/84)
In article <449@godot.UUCP> bruce@godot.UUCP (Bruce Nemnich) writes:
>>I have been having an awful lot of problems with hung lines with 4.2bsd
>>uucp. By hung, I mean a uucp process sitting blocking on a read from
>>the dialout line, but having removed the LCKfile for that line and
>>system (presumably after a timeout). The offending processes must be
>>killed by hand. I briefly looked at the code, but I didn't notice
>>anything too bad. The stack frame at the hung point is often
>>main()/conn()/login()/expect()/read(), though sometimes
>>main()/imsg()/read(). Any quick fixes?

And in <85@daemon.UUCP> (<--huh?) Russ Gorby (tektronix!russg) responds:
> We had that problem here at tektronix also. It only happens when the
> uucico session was initiated in answer mode, and the answering modem
> is connected to a data switch. ...

We have that problem at astrovax also, but 1) we have no data switch, and 2) it only occurs when dialing out, not in answer mode. The problem has occurred both when the dialing modem is a Hayes and when the dialing modem is a Racal/Vadic 3451.

Actually, there is one difference in the symptom here. The lockfile is still there when it is hung. The hang point on the latest hang was main()/conn()/Acuopn()/vadopn()/expect()/read().
--
Bill Sebok Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls
bruce@godot.UUCP (Bruce Nemnich) (11/18/84)
I should also add that it only happens to me on dialout (originate) lines. The disconnect() bug doesn't seem relevant since in most cases it is hung in login().
--
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa
bks@ski.UUCP ( Bruce K. Smith) (11/20/84)
> > I have been having an awful lot of problems with hung lines with 4.2bsd
> > uucp. By hung, I mean a uucp process sitting blocking on a read from
> > the dialout line, but having removed the LCKfile for that line and
> > system (presumably after a timeout). The offending processes must be
> > killed by hand. I briefly looked at the code, but I didn't notice
> > anything too bad. The stack frame at the hung point is often
> > main()/conn()/login()/expect()/read(), though sometimes
> > main()/imsg()/read(). Any quick fixes?
> > --
> > --Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
>
> I am having this problem too. I would sure as heck like to know what is
> going on. The hanging seems to occur when the load average is high. I
> believe that I installed a posted bug fix to alarm() and it didn't cure the
> problem.
> --
> Bill Sebok Princeton University, Astrophysics
> {allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls

We're running a Masscomp MC-530 with their RTU 2.1C ("Real Time Unix", some combination of System III and 4.2BSD, I think). Our uucico's hang if and only if there is no outgoing work for them (i.e., iff they never enter MASTER mode), which can only happen when it's an incoming call. To work around the problem, we had to arrange to mail a dummy message to a fake user on our sole uucp neighbor. We did it by moving /usr/lib/uucp/uucico to .../uucico.actual, and replacing .../uucico with a shell script which reads as follows:

==========================================================================
#!/bin/sh
# by Bruce K. Smith and Jordan K. Hubbard, ski!{bks,jkh}, 24 October 1984
# incoming calls: if no outgoing uucico work, mail a dummy message to work
# around uucico.actual's hangup bug
# outgoing calls: use the "use" script to avoid stomping on anyone logged in on
# tty2 or using another init state
# then call uucico.actual, the original version of uucico

# for security:
PATH=/bin:/usr/bin:/usr/local/bin

if test $# -eq 0
then
    # incoming call
    echo `date`: $0 >> /tmp/uulog
    outs=`echo /usr/spool/uucp/C* /usr/spool/uucp/D*`
    if test "$outs" = "/usr/spool/uucp/C* /usr/spool/uucp/D*"
    then
        mail politik!none < /usr/lib/uucp/dummy.msg
    fi
    /usr/lib/uucp/uucico.actual
else
    # outgoing call, try to grab tty2
    echo outgoing at `date`: $* >> /tmp/uulog
    use tty2 -u uucp /usr/lib/uucp/uucico.actual $*
fi
==========================================================================

The lines with C* and D* are for mailing the message only when there is no work waiting to go out. If your system can be called by more than one other system, you'll probably have to make this section more intelligent, or else just send the message out all the time. The "outgoing call" stuff is for an unrelated problem.

I have no idea whether this workaround works on systems other than the Masscomp.
--
Bruce Smith {sun!texsun, ucbvax!mtxinu, dual!ptsfa} !politik!ski!bks
wls@astrovax.UUCP (William L. Sebok) (11/21/84)
In review: the subject is the hung uucico processes that we have here at astrovax and godot. This is when running rtiuucp under 4.2 BSD. A typical hang point is main()/conn()/Acuopn()/vadopn()/expect()/read(). allegra has also reported such problems with honey danber under 4.2 BSD.

More information. In cico.c the SIGALRM handler is

timeout()
{
    logent(Rmtname, "TIMEOUT");
    systat(Rmtname, SS_FAIL, "TIMEOUT");
    longjmp(Sjbuf, 1);
}

while in pk1.c the SIGALRM handler is

cgalarm()
{
    longjmp(Getjbuf, 1);
}

However, in condevs.c and conn.c the SIGALRM handler is:

alarmtr()
{
    signal(SIGALRM, alarmtr);
    if (next_fd >= 0) {
        if (close(next_fd))
            logent("FAIL", "ACU LINE CLOSE");
        next_fd = -1;
    }
    longjmp(Sjbuf, 1);
}

I have never seen any hung processes when the first two handlers are in effect, only when the last one is in effect. Assuming it's a race, I don't know if it involves the signal() or the close().
--
Bill Sebok Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls
ron@trsvax.UUCP (11/21/84)
Could you pass the code (or diffs) along for the timeout fix you've added? {microsoft,ctvax}!trsvax!ron
muller@sdcc3.UUCP (Keith Muller) (11/23/84)
> In review: the subject is the hung uucico processes that we have here
> at astrovax and godot. This is when running rtiuucp under 4.2 BSD.
> A typical hang point is main()/conn()/Acuopn()/vadopn()/expect()/read().
> allegra has also reported such problems with honey danber under 4.2 BSD.
>

I am not sure that this is your problem, but I had a similar problem with the dialer code in 4.2 condevs.c. The following is extracted from an rlog on my condevs.c.

A problem existed in many dialer routines in that next_fd was NOT set to -1 right after a successful open. The routine getnextfd() that preceded the open attempts to predict the next file descriptor to be allocated. The SIGALRM handler alarmtr would close the file descriptor set in next_fd (set by getnextfd). So when the routine expect timed out during login, it called alarmtr and the file was closed, causing a BAD WRITE abort.

What you need to make sure is that right after the open call in vadopn() you have the following:

    /* open call ....*/
    alarm(0);       /* cancel time out wait on open */
    next_fd = -1;   /* this was missing in many of the dialer code*/

Keith Muller
Academic Computer Center
University of California, San Diego
jpl@allegra.UUCP (John P. Linderman) (11/24/84)
>In review: the subject is the hung uucico processes that we have here
>at astrovax and godot. This is when running rtiuucp under 4.2 BSD.
>A typical hang point is main()/conn()/Acuopn()/vadopn()/expect()/read().
>allegra has also reported such problems with honey danber under 4.2 BSD.

To be more explicit, Phil Karn and I would occasionally find a hung honey danber uucico. The processes were not always in the same place. We found, after tweaking pstat to provide a little extra information, that the common theme was that the processes had an alarm pending, but with alarms masked off. Alarms are not masked off by honey dan-ber, so the problem appears to be a race somewhere in the 4.2 signal code, somehow failing to reset signals following a longjmp out of an alarm handler. We replaced alarm calls with a macro-sequence that explicitly set SIGALRM on in the signal mask before doing the alarm call. We haven't seen a hung uucico since.

In summary, honey danber seems to be guilty only of exercising alarms rather more heavily than typical programs, thereby exposing some problems in the underlying 4.2 operating system. We didn't have problems with honey danber under 4.2, we exposed problems with 4.2 through honey danber.

John P. Linderman Department of Alarming Errors allegra!jpl
jpl@allegra.UUCP (John P. Linderman) (11/25/84)
> Could you pass the code (or diffs) along for the timeout fix you've added?
> {microsoft,ctvax}!trsvax!ron

I think I can do better. Kirk Smith pointed out a bug in sleep(3) that causes alarms to be masked off and never rearmed. Since uucico does both sleeps and alarms, his fix (enclosed below) looks like it will cure the disease whereas mine merely treated the symptoms.

From: pur-ee!ef:ks (Kirk Smith)
Received: by ef.ECN; Mon, 27 Aug 84 10:32:00 est (4.12/5.20)
Message-Id: <8408271532.AA03198@ef.ECN>
Subject: sleep(3) and SIGALRM
Index: /usr/src/lib/libc/gen/sleep.c 4.2BSD
Apparently-To: ucbvax!4bsd-bugs

Description:
    After a call to sleep(), the signal SIGALRM remains blocked.
Repeat-By:
    main()
    {
        sleep(1);
        alarm(1);
    }
    This program will never terminate.
Fix:
    In sleep.c add the following line to set the mask back to the old
    mask before returning as follows:

        (void) sigvec(SIGALRM, &ovec, (struct sigvec *)0);
        (void) setitimer(ITIMER_REAL, &oitv, (struct itimerval *)0);
    +++ (void) sigsetmask(omask);
    }

John P. Linderman (Not alarmed by the slip in sleep) allegra!jpl
bruce@godot.UUCP (Bruce Nemnich) (11/28/84)
In article <2923@allegra.UUCP> jpl@allegra.UUCP (John P. Linderman) writes:
>
>I think I can do better. Kirk Smith pointed out a bug in sleep(3) that
>causes alarms to be masked off and never rearmed. Since uucico does
>both sleeps and alarms, his fix (enclosed below) looks like it will cure
>the disease whereas mine merely treated the symptoms.
>....
> Description:
>     After a call to sleep(), the signal SIGALRM remains blocked.

Gee, I thought we were on to something here, but....

> Repeat-By:
>     main()
>     {
>         sleep(1);
>         alarm(1);
>     }
>     This program will never terminate.

Huh?? How could this not terminate?

>
> Fix:
>     In sleep.c add the following line to set the mask back to the old
>     mask before returning as follows:
>
>         (void) sigvec(SIGALRM, &ovec, (struct sigvec *)0);
>         (void) setitimer(ITIMER_REAL, &oitv, (struct itimerval *)0);
>     +++ (void) sigsetmask(omask);
>     }
>
>John P. Linderman (Not alarmed by the slip in sleep) allegra!jpl

The VAX 4.2bsd sleep(3) I have doesn't ever alter the signal mask, so it shouldn't need restoring. It uses sigpause() to make sure SIGALRM is allowed while it waits for the timer to go off; sigpause() restores the mask before returning.

I actually blindly installed this "fix," but when it didn't help, I took a closer look at the code. Then I ran a bunch of tests on the original, and everything behaved as it should (SIGALRM is *not* blocked after a sleep()). Even if one longjmps out of a SIGALRM handler (during which SIGALRM is blocked), the mask is set back to its value at setjmp time by longjmp.
--
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa
chris@umcp-cs.UUCP (Chris Torek) (11/28/84)
Actually, there *is* a bug in the 4.2BSD sleep; it's just that the bug isn't a missing sigsetmask() but two missing calls (sigblock() and sigsetmask()). It is *always* *always* *always* a mistake to call sigpause() to await a signal when the new mask is trying to unblock a signal that has never been blocked. (Often it works, but it is still a mistake.) The reason is simple: if it's not blocked now, the signal might happen between the C ``sigpause()'' call and the actual entry into system code. System calls are not atomic operations (at user level) until kernel code is entered; that's *why* sigpause is *defined* as ``atomically set signal mask and await signal''. (Gee, maybe we should kludge up the kernel to gripe about sigpause()s that don't release any signals, giving the name of the offending program... :-) ) -- (This line accidently left nonblank.) In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (301) 454-7690 UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris CSNet: chris@umcp-cs ARPA: chris@maryland
wls@astrovax.UUCP (William L. Sebok) (11/29/84)
> In article <2923@allegra.UUCP> jpl@allegra.UUCP (John P. Linderman) writes:
> >
> >I think I can do better. Kirk Smith pointed out a bug in sleep(3) that
> >causes alarms to be masked off and never rearmed. Since uucico does
> >both sleeps and alarms, his fix (enclosed below) looks like it will cure
> >the disease whereas mine merely treated the symptoms.
> >....
> > Description:
> >     After a call to sleep(), the signal SIGALRM remains blocked.
>
> Gee, I thought we were on to something here, but....
>
> > Repeat-By:
> >     main()
> >     {
> >         sleep(1);
> >         alarm(1);
> >     }
> >     This program will never terminate.
>
> Huh?? How could this not terminate?

I tried this myself and it terminated. Then I tried inserting pause() after the alarm(1). The shell reported that the process had been killed by an alarm signal.

> >
> > Fix:
> >     In sleep.c add the following line to set the mask back to the old
> >     mask before returning as follows:
> >
> >         (void) sigvec(SIGALRM, &ovec, (struct sigvec *)0);
> >         (void) setitimer(ITIMER_REAL, &oitv, (struct itimerval *)0);
> >     +++ (void) sigsetmask(omask);
> >     }
> >
> >John P. Linderman (Not alarmed by the slip in sleep) allegra!jpl

> I actually blindly installed this "fix," but when it didn't help, I took
> a closer look at the code. Then I ran a bunch of tests on the original,
> and everything behaved as it should (SIGALRM is *not* blocked after a
> sleep()). Even if one longjmps out of a SIGALRM handler (during which
> SIGALRM is blocked), the mask is set back to its value at setjmp time by
> longjmp.
> --Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA

For the record, I installed this "fix" today myself and we hung again. Load average at that time was about 13. A pstat -p showed that SIGALRM was pending in the hung process.
--
Bill Sebok Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls
jpl@allegra.UUCP (John P. Linderman) (11/29/84)
Right you are. I hadn't tried the suggested sequence (sleep(1);alarm(1);) to see if it really failed to terminate. When I did try it with the original sleep code, it terminated just fine. It looks like the real problem remains to be found.

In the meantime, here's what I did to honey danber to prevent uucico from getting into a state where an alarm was pending, but alarms were masked off. I replaced each alarm() call with a non-zero argument with the following macro:

#define ALARM(X) sigsetmask(sigblock(0)&~(1<<(SIGALRM - 1))), alarm(X)

to explicitly unblock SIGALRM before each alarm() call. That eliminated hung cico's here, but William Sebok said the same fix did not cure the problem at astrovax. Maybe we have just been lucky, or maybe the vanilla 4.2 uucp has additional problems. In any event, the real question still remains: how is SIGALRM getting masked off to begin with? I had hoped Kirk had discovered the cause, but the search will still have to go on.

John P. Linderman allegra!jpl
chris@umcp-cs.UUCP (Chris Torek) (11/30/84)
At high load averages, sleep() may hang randomly waiting for a signal
that has already arrived. I haven't tried this myself, but you might
try changing
    while (!ringring)
        sigpause(omask & ~mask(SIGALRM));
to
    sigblock(mask(SIGALRM));                /* make next test atomic */
    while (!ringring) {                     /* test */
        sigpause(omask & ~mask(SIGALRM));   /* await signal(s) */
        sigsetmask(omask);                  /* allow alarm through */
        sigblock(mask(SIGALRM));            /* make next test atomic */
    }
    sigsetmask(omask);                      /* restore signal mask */
--
(This line accidently left nonblank.)
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (301) 454-7690
UUCP: {seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet: chris@umcp-cs ARPA: chris@maryland
bruce@godot.UUCP (Bruce Nemnich) (12/07/84)
I fixed the window of vulnerability in my sleep(3) as soon as you pointed it out. It doesn't help with the hung line problem though (I didn't expect it to, since it is not getting hung in sleep()). I have exactly the same situation as mentioned before. My outbound uucico occasionally gets hung in a read(2) system call. The read timeout SIGALRM is pending, but SIGALRM is blocked (*only* SIGALRM is blocked). I don't understand how SIGALRM could get blocked at any time in uucico. I grepped through the uucico and libc code, but I noticed no sigblock() or sigsetmask() calls other than the one I recently added to sleep(), and we know that's not the problem. That leads me to suspect a 4.2 kernel bug which is resulting in a spurious mask of SIGALRM. -- --Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa