[net.bugs.uucp] hung line help needed

bruce@godot.UUCP (Bruce Nemnich) (11/14/84)

I have been having an awful lot of problems with hung lines with 4.2bsd
uucp.  By hung, I mean a uucp process sitting blocking on a read from
the dialout line, but having removed the LCKfile for that line and
system (presumably after a timeout).  The offending processes must be
killed by hand.  I breifly looked at the code, but I didn't notice
anything too bad.  The stack frame at the hung point is often
main()/conn()/login()/expect()/read(), though sometimes
main()/imsg()/read().  Any quick fixes?
-- 
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
  ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa

wls@astrovax.UUCP (William L. Sebok) (11/15/84)

> I have been having an awful lot of problems with hung lines with 4.2bsd
> uucp.  By hung, I mean a uucp process sitting blocking on a read from
> the dialout line, but having removed the LCKfile for that line and
> system (presumably after a timeout).  The offending processes must be
> killed by hand.  I breifly looked at the code, but I didn't notice
> anything too bad.  The stack frame at the hung point is often
> main()/conn()/login()/expect()/read(), though sometimes
> main()/imsg()/read().  Any quick fixes?
> -- 
> --Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA

I am having this problem too.  I would sure as heck like to know what is
going on.  The hanging seems to occur when the load average is high.  I
believe that I installed a posted bug fix to alarm() and it didn't cure the
problem.
-- 
Bill Sebok			Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls

russg@daemon.UUCP (Russel Gorby) (11/17/84)

In article <449@godot.UUCP> bruce@godot.UUCP (Bruce Nemnich) writes:
>I have been having an awful lot of problems with hung lines with 4.2bsd
>uucp.  By hung, I mean a uucp process sitting blocking on a read from
>the dialout line, but having removed the LCKfile for that line and
>system (presumably after a timeout).  The offending processes must be
>killed by hand.  I breifly looked at the code, but I didn't notice
>anything too bad.  The stack frame at the hung point is often
>main()/conn()/login()/expect()/read(), though sometimes
>main()/imsg()/read().  Any quick fixes?
>-- 
>--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
>  ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa

We had that problem here at tektronix also. It only happens when the
uucico session was initiated in answer mode, and the answering modem
is connected to a data switch. I believe that the problem stems from
the fact that the TIOCHPCL call works correctly ( or so I've been told)
under 4.2BSD. The problem is in the disconnect routine in the xqt module
and depending on what defines are set, will hang on either the first
close ( close(0) ) or on the open("/dev/tty",2) which occurs just after
the last close. The reason the hangs occur is that at some point the
dataswitch drops the line, and later uucico tries to do I/O to the
port. At some point in the code DTR get toggled (before the disconnect
routine), which on a dataswitch will cause the connection to be
severed, and DTR to get held low. Any subsequent I/O to the port
will hang because the port won't interrupt any more. This will
cause a hang on the initial close in disconnect(). If this doesn't
get you, then the TIOCHPCL will, because after the last
close ( in disconnect() ), it tries to do the open("/dev/tty",2)
which will hang for the same reason. Our fix (at least for the moment)
was to put a timeout mechanism in the disconnect routine, and this
seems to have fixed that particular problem.

-- 
Russ Gorby
ucbvax!tektronix!russg
(503) 627-1153
Tektronix
PO Box 500 M.S. 19-333
Beaverton OR  97077

wls@astrovax.UUCP (William L. Sebok) (11/17/84)

In article <449@godot.UUCP> bruce@godot.UUCP (Bruce Nemnich) writes:

>>I have been having an awful lot of problems with hung lines with 4.2bsd
>>uucp.  By hung, I mean a uucp process sitting blocking on a read from
>>the dialout line, but having removed the LCKfile for that line and
>>system (presumably after a timeout).  The offending processes must be
>>killed by hand.  I breifly looked at the code, but I didn't notice
>>anything too bad.  The stack frame at the hung point is often
>>main()/conn()/login()/expect()/read(), though sometimes
>>main()/imsg()/read().  Any quick fixes?

And in <85@daemon.UUCP> (<--huh?) Russ Gorby (tektronix!russg) responds:

> We had that problem here at tektronix also. It only happens when the
> uucico session was initiated in answer mode, and the answering modem
> is connected to a data switch. ...

We have that problem at astrovax also but 1) we have no data switch, and 2) it
only occurs when dialing out, not in answer mode.  The problem has occured both
when the dialing modem is a Hayes and when the dialing modem is a Racal/Vadic
3451.

Actually, there is one difference in the symptom here.  The lockfile is still
there when it is hung. The hang point on the latest hang was
main()/conn()/Acuopn()/vadopn()/expect()/read().
-- 
Bill Sebok			Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls

bruce@godot.UUCP (Bruce Nemnich) (11/18/84)

I should also add that it only happens to me on dialout (originate)
lines.  The disconnect() bug doesn't seem relevent since in most cases
it is hung in login().
-- 
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
  ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa

bks@ski.UUCP ( Bruce K. Smith) (11/20/84)

> > I have been having an awful lot of problems with hung lines with 4.2bsd
> > uucp.  By hung, I mean a uucp process sitting blocking on a read from
> > the dialout line, but having removed the LCKfile for that line and
> > system (presumably after a timeout).  The offending processes must be
> > killed by hand.  I breifly looked at the code, but I didn't notice
> > anything too bad.  The stack frame at the hung point is often
> > main()/conn()/login()/expect()/read(), though sometimes
> > main()/imsg()/read().  Any quick fixes?
> > -- 
> > --Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
> 
> I am having this problem too.  I would sure as heck like to know what is
> going on.  The hanging seems to occur when the load average is high.  I
> believe that I installed a posted bug fix to alarm() and it didn't cure the
> problem.
> -- 
> Bill Sebok			Princeton University, Astrophysics
> {allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls

We're running a Masscomp MC-530 with their RTU 2.1C ("Real Time Unix", some
combination of System III and 4.2BSD, I think). Our uucico's hang if and only
if there is no outgoing work for them (ie iff they never enter MASTER mode),
which can only happen when it's an incoming call. To work around the problem,
we had to arrange to mail a dummy message to a fake user on our sole uucp
neighbor. We did it by moving /usr/lib/uucp/uucico to .../uucico.actual,
and replacing .../uucico with a shell script which reads as follows:
==========================================================================
#!/bin/sh
# by Bruce K. Smith and Jordan K. Hubbard, ski!{bks,jkh}, 24 October 1984
# incoming calls: if no outgoing uucico work, mail a dummy message to work
#	around uucico.actual's hangup bug
# outgoing calls: use the "use" script to avoid stomping on anyone logged in on
#	tty2 or using another init state
# then call uucico.actual, the original version of uucico

# for security:
PATH=/bin:/usr/bin:/usr/local/bin

if test $# -eq 0
then
	# incoming call
	echo `date`: $0 >> /tmp/uulog
	outs=`echo /usr/spool/uucp/C* /usr/spool/uucp/D*`
	if test "$outs" = "/usr/spool/uucp/C* /usr/spool/uucp/D*"
	then
		mail politik!none < /usr/lib/uucp/dummy.msg
	fi
	/usr/lib/uucp/uucico.actual
else
	# outgoing call, try to grab tty2
	echo outgoing at `date`: $* >> /tmp/uulog
	use tty2 -u uucp /usr/lib/uucp/uucico.actual $*
fi
==========================================================================
The lines with C* and D* are for mailing the message only when there is no
work waiting to go out. If your system can be called by more than one other
system, you'll probably have to make this section more intelligent, or else
just send the message out all the time.
   The "outgoing call" stuff is for an unrelated problem.
   I have no idea whether this workaround works on systems other than the
Masscomp.

-- 
	Bruce Smith
	{sun!texsun, ucbvax!mtxinu, dual!ptsfa} !politik!ski!bks

wls@astrovax.UUCP (William L. Sebok) (11/21/84)

In review: the subject is the hung uucico processes that we have here
at astrovax and godot.  This is when running rtiuucp under 4.2 BSD.
A typical hang point is main()/conn()/Acuopn()/vadopn()/expect()/read().
allegra has also reported such problems with honey danber under 4.2 BSD.

More information.

in cico.c the SIGALRM handler is

	timeout()
	{
		logent(Rmtname, "TIMEOUT");
		systat(Rmtname, SS_FAIL, "TIMEOUT");
		longjmp(Sjbuf, 1);
	}

while in pk1.c  the SIGALRM handler is

	cgalarm() { longjmp(Getjbuf, 1); }

However in condevs.c and conn.c the SIGALRM handler is:

	alarmtr()
	{
		signal(SIGALRM, alarmtr);
		if (next_fd >= 0) {
			if (close(next_fd))
				logent("FAIL", "ACU LINE CLOSE");
			next_fd = -1;
		}
		longjmp(Sjbuf, 1);
	}

I have never seen any hung processes when the first two handlers are
in effect, only when the last one is in effect.  Assuming its a race, I
don't know if it involves the signal() or the close().
-- 
Bill Sebok			Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls

ron@trsvax.UUCP (11/21/84)

Could you pass the code (or diffs) along for the timeout fix you've added?

	{microsoft,ctvax}!trsvax!ron

muller@sdcc3.UUCP (Keith Muller) (11/23/84)

> In review: the subject is the hung uucico processes that we have here
> at astrovax and godot.  This is when running rtiuucp under 4.2 BSD.
> A typical hang point is main()/conn()/Acuopn()/vadopn()/expect()/read().
> allegra has also reported such problems with honey danber under 4.2 BSD.
> 

I am not sure that this might be your problem, but I had a similar problem
with the dialer code in 4.2 condevs.c. The following is extracted from a
rlog on my condevs.c.


A problem exisisted in many dialer routines that next_fd was NOT
set to -1 right after a successful open. The routine getnextfd() that
preceeded the open attempts to predict the next file descriptor
to be allocated. The sigalrm handler alarmtr would close the file
descriptor set in next_fd (set by getnextfd). So when the routine
expect timed out during login it called alarmtr, the file was closed
causing a BAD WRITE abort.

What you need to make sure that right after the open call in vadopn()
you have the following:
		/* open call ....*/
		alarm(0); /* cancel time out wait on open */
		next_fd = -1; /* this was missing in many of the dialer code*/


	Keith Muller
	Academic Computer Center
	University of California, San Diego

jpl@allegra.UUCP (John P. Linderman) (11/24/84)

>In review: the subject is the hung uucico processes that we have here
>at astrovax and godot.  This is when running rtiuucp under 4.2 BSD.
>A typical hang point is main()/conn()/Acuopn()/vadopn()/expect()/read().
>allegra has also reported such problems with honey danber under 4.2 BSD.

To be more explicit, Phil Karn and I would occasionally find a hung
honey danber uucico.  The processes were not always in the same place.
We found, after tweaking pstat to provide a little extra information,
that the common theme was that the processes had an alarm pending, but
with alarms masked off.  Alarms are not masked off by honey dan-ber,
so the problem appears to be a race somewhere in the 4.2 signal code,
somehow failing to reset signals following a longjmp out of an alarm
handler.  We replaced alarm calls with a macro-sequence that explicitly
set SIGALRM on in the signal mask before doing the alarm call.  We haven't
seen a hung uucico since.

In summary, honey danber seems to be guilty only of exercising alarms
rather more heavily than typical programs, thereby exposing some problems
in the underlying 4.2 operating system.  We didn't have problems with
honey danber under 4.2, we exposed problems with 4.2 through honey danber.

John P. Linderman  Department of Alarming Errors  allegra!jpl

jpl@allegra.UUCP (John P. Linderman) (11/25/84)

    Could you pass the code (or diffs) along for the timeout fix you've added?

	{microsoft,ctvax}!trsvax!ron

I think I can do better.  Kirk Smith pointed out a bug in sleep(3) that
causes alarms to be masked off and never rearmed.  Since uucico does
both sleeps and alarms, his fix (enclosed below) looks like it will cure
the disease whereas mine merely treated the symptoms.

    From: pur-ee!ef:ks (Kirk Smith)
    Received: by ef.ECN; Mon, 27 Aug 84 10:32:00 est (4.12/5.20)
    Message-Id: <8408271532.AA03198@ef.ECN>
    Subject: sleep(3) and SIGALRM
    Index: 	/usr/src/lib/libc/gen/sleep.c 4.2BSD
    Apparently-To: ucbvax!4bsd-bugs

    Description:
	    After a call to sleep(), the signal SIGALRM remains blocked.

    Repeat-By:
	    main()
	    {
		    sleep(1);
		    alarm(1);
	    }
	    This program will never terminate.

    Fix:
	    In sleep.c add the following line to set the mask back to the old
	    mask before returning as follows:

		(void) sigvec(SIGALRM, &ovec, (struct sigvec *)0);
		(void) setitimer(ITIMER_REAL, &oitv, (struct itimerval *)0);
	+++	(void) sigsetmask(omask);
	}

John P. Linderman  (Not alarmed by the slip in sleep)  allegra!jpl

bruce@godot.UUCP (Bruce Nemnich) (11/28/84)

In article <2923@allegra.UUCP> jpl@allegra.UUCP (John P. Linderman) writes:
>
>I think I can do better.  Kirk Smith pointed out a bug in sleep(3) that
>causes alarms to be masked off and never rearmed.  Since uucico does
>both sleeps and alarms, his fix (enclosed below) looks like it will cure
>the disease whereas mine merely treated the symptoms.
>....
>    Description:
>	    After a call to sleep(), the signal SIGALRM remains blocked.

Gee, I thought we were on to something here, but....

>    Repeat-By:
>	    main()
>	    {
>		    sleep(1);
>		    alarm(1);
>	    }
>	    This program will never terminate.

Huh??  How could this not terminate?

>
>    Fix:
>	    In sleep.c add the following line to set the mask back to the old
>	    mask before returning as follows:
>
>		(void) sigvec(SIGALRM, &ovec, (struct sigvec *)0);
>		(void) setitimer(ITIMER_REAL, &oitv, (struct itimerval *)0);
>	+++	(void) sigsetmask(omask);
>	}
>
>John P. Linderman  (Not alarmed by the slip in sleep)  allegra!jpl

The VAX 4.2bsd sleep(3) I have doesn't ever alter the signal mask, so it
shouldn't need restoring.  It uses sigpause() to make sure SIGALRM is
allowed while it waits for the timer to go off; sigpause() restores the
mask before returning.

I actually blindly installed this "fix," but when it didn't help, I took
a closer look at the code.  Then I ran a bunch of tests on the original,
and everything behaved as it should (SIGALRM is *not* blocked after a
sleep()).  Even if one longjmps out of a SIGALRM handler (during which
SIGALRM is blocked), the mask is set back to its value at setjmp time by
longjmp.
-- 
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
  ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa

chris@umcp-cs.UUCP (Chris Torek) (11/28/84)

Actually, there *is* a bug in the 4.2BSD sleep; it's just that the bug
isn't a missing sigsetmask() but two missing calls (sigblock() and
sigsetmask()).  It is *always* *always* *always* a mistake to call
sigpause() to await a signal when the new mask is trying to unblock a
signal that has never been blocked.  (Often it works, but it is still a
mistake.)  The reason is simple: if it's not blocked now, the signal
might happen between the C ``sigpause()'' call and the actual entry
into system code.  System calls are not atomic operations (at user
level) until kernel code is entered; that's *why* sigpause is *defined*
as ``atomically set signal mask and await signal''.

(Gee, maybe we should kludge up the kernel to gripe about sigpause()s
that don't release any signals, giving the name of the offending
program... :-) )
-- 
(This line accidently left nonblank.)

In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

wls@astrovax.UUCP (William L. Sebok) (11/29/84)

> In article <2923@allegra.UUCP> jpl@allegra.UUCP (John P. Linderman) writes:
> >
> >I think I can do better.  Kirk Smith pointed out a bug in sleep(3) that
> >causes alarms to be masked off and never rearmed.  Since uucico does
> >both sleeps and alarms, his fix (enclosed below) looks like it will cure
> >the disease whereas mine merely treated the symptoms.
> >....
> >    Description:
> >	    After a call to sleep(), the signal SIGALRM remains blocked.
> 
> Gee, I thought we were on to something here, but....
> 
> >    Repeat-By:
> >	    main()
> >	    {
> >		    sleep(1);
> >		    alarm(1);
> >	    }
> >	    This program will never terminate.
> 
> Huh??  How could this not terminate?

I tried this myself and it terminated.  Then I tried inserting pause()
after the alarm(1).  The shell reported that the process had been killed
by an alarm signal.

> >
> >    Fix:
> >	    In sleep.c add the following line to set the mask back to the old
> >	    mask before returning as follows:
> >
> >		(void) sigvec(SIGALRM, &ovec, (struct sigvec *)0);
> >		(void) setitimer(ITIMER_REAL, &oitv, (struct itimerval *)0);
> >	+++	(void) sigsetmask(omask);
> >	}
> >
> >John P. Linderman  (Not alarmed by the slip in sleep)  allegra!jpl

> I actually blindly installed this "fix," but when it didn't help, I took
> a closer look at the code.  Then I ran a bunch of tests on the original,
> and everything behaved as it should (SIGALRM is *not* blocked after a
> sleep()).  Even if one longjmps out of a SIGALRM handler (during which
> SIGALRM is blocked), the mask is set back to its value at setjmp time by
> longjmp.
> --Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA

For the record I installed this "fix" today myself and we hung again.
Load average at that time was about 13.  A pstat -p  showed that SIGALRM was
pending in the hung process.
-- 
Bill Sebok			Princeton University, Astrophysics
{allegra,akgua,burl,cbosgd,decvax,ihnp4,noao,princeton,vax135}!astrovax!wls

jpl@allegra.UUCP (John P. Linderman) (11/29/84)

Right you are.  I hadn't tried the suggested sequence (sleep(1);alarm(1);)
to see if it really failed to terminate.  When I did try it with the original
sleep code, it terminated just fine.  It looks like the real problem remains
to be found.  In the meantime, here's what I did to honey danber to prevent
uucico from getting into a state where an alarm was pending, but alarms were
masked off.  I replaced each alarm() call with a non-zero argument with the
following macro:

#define ALARM(X) sigsetmask(sigblock(0)&~(1<<(SIGALRM - 1))), alarm(X)

to explicitly unblock SIGALRM before each alarm() call.  That eliminated
hung cico's here, but William Sebok said the same fix did not cure the
problem at astrovax.  Maybe we have just been lucky, or maybe the vanilla
4.2 uucp has additional problems.  In any event, the real question still
remains, how is SIGALRM getting masked off to begin with?  I had hoped
Kirk had discovered the cause, but the search will still have to go on.

John P. Linderman  allegra!jpl

chris@umcp-cs.UUCP (Chris Torek) (11/30/84)

At high load averages, sleep() may hang randomly waiting for a signal
that has already arrived.  I haven't tried this myself, but you might
try changing

	while (!ringring)
		sigpause(omask & ~mask(SIGALRM));

to

	sigblock(mask(SIGALRM));	/* make next test atomic */
	while (!ringring) {		/* test */
		sigpause(omask & ~mask(SIGALRM)); /* await signal(s) */
		sigsetmask(omask);	/* allow alarm through */
		sigblock(mask(SIGALRM));/* make next test atomic */
	}
	sigsetmask(omask);		/* restore signal mask */
-- 
(This line accidently left nonblank.)

In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland

bruce@godot.UUCP (Bruce Nemnich) (12/07/84)

I fixed the window of vulnerability in my sleep(3) as soon as you
pointed it out.  It doesn't help with the hung line problem though (I
didn't expect it to, since it is not getting hung in sleep()).

I have exactly the same situation as mentioned before.  My outbound
uucico occasionally gets hung in a read(2) system call.  The read
timeout SIGALRM is pending, but SIGALRM is blocked (*only* SIGALRM is
blocked).

I don't understand how SIGALRM could get blocked at any time in uucico.
I grepped through the uucico and libc code, but I noticed no sigblock()
or sigsetmask() calls other than the one I recently added to sleep(),
and we know that's not the problem.  That leads me to suspect a 4.2
kernel bug which is resulting in a spurious mask of SIGALRM.
-- 
--Bruce Nemnich, Thinking Machines Corporation, Cambridge, MA
  ihnp4!godot!bruce, bjn@mit-mc.arpa ... soon to be bruce@godot.arpa