[net.bugs.4bsd] hung line help needed

jpl@allegra.UUCP (John P. Linderman) (11/24/84)

>In review: the subject is the hung uucico processes that we have here
>at astrovax and godot.  This is when running rtiuucp under 4.2 BSD.
>A typical hang point is main()/conn()/Acuopn()/vadopn()/expect()/read().
>allegra has also reported such problems with honey danber under 4.2 BSD.

To be more explicit, Phil Karn and I would occasionally find a hung
honey danber uucico.  The processes were not always in the same place.
We found, after tweaking pstat to provide a little extra information,
that the common theme was that the processes had an alarm pending, but
with alarms masked off.  Alarms are not masked off by honey danber,
so the problem appears to be a race somewhere in the 4.2 signal code,
somehow failing to reset signals following a longjmp out of an alarm
handler.  We replaced alarm calls with a macro sequence that explicitly
unblocked SIGALRM in the signal mask before doing the alarm call.  We
haven't seen a hung uucico since.

In summary, honey danber seems to be guilty only of exercising alarms
rather more heavily than typical programs, thereby exposing some problems
in the underlying 4.2 operating system.  We didn't have problems with
honey danber under 4.2, we exposed problems with 4.2 through honey danber.

John P. Linderman  Department of Alarming Errors  allegra!jpl

chris@umcp-cs.UUCP (Chris Torek) (11/28/84)

Actually, there *is* a bug in the 4.2BSD sleep; it's just that the bug
isn't a missing sigsetmask() but two missing calls (sigblock() and
sigsetmask()).  It is *always* *always* *always* a mistake to call
sigpause() to await a signal when the new mask is trying to unblock a
signal that has never been blocked.  (Often it works, but it is still a
mistake.)  The reason is simple: if it's not blocked now, the signal
might happen between the C ``sigpause()'' call and the actual entry
into system code.  System calls are not atomic operations (at user
level) until kernel code is entered; that's *why* sigpause is *defined*
as ``atomically set signal mask and await signal''.

(Gee, maybe we should kludge up the kernel to gripe about sigpause()s
that don't release any signals, giving the name of the offending
program... :-) )
-- 
(This line accidentally left nonblank.)

In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (301) 454-7690
UUCP:	{seismo,allegra,brl-bmd}!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@maryland