[comp.sys.hp] Interrupted library calls

paul@mecazh.UUCP (Paul Breslaw) (01/17/90)

This problem cropped up in the context of Xlib, but could equally apply to 
any Unix library. Hence the posting to more than one group.

Our application (a CAM package on HP9000/3xx machines under HP-UX6.5 X11.R2)
crashes sometimes when we handle a signal and return from 
the signal handler in a different context from the one in which the handler 
was entered. In other words we do a longjmp(3) from inside the handler.

We found that this is an elegant way to design certain features into a program.

[ Those of you who might want to argue this assertion read on. Those
  who are prepared to accept it can skip to the end of this []'ed bit.

  Our CAM package is a monolithic application running as a single
  process. Until Open Look or Motif is declared winner of the current
  X Look and Feel War, our application remains implemented using no tool
  kit, i.e. only pure Xlib calls.

  A user of our package can start a computation/display operation
  that might take a long time to complete. We wanted to allow him to
  hit a key to stop it, which would take him back to an earlier point
  in the dialogue.

  There are a large number of such long operations, so we needed a
  fairly general mechanism. 

  We did not want to sprinkle calls to X arbitrarily in the code
  in the hope that they would provide a frequent enough poll.  

  Neither did we want a signal handler to set a global flag and return
  normally, because that is simply the same polling problem in a different
  guise. You then have to sprinkle calls to check the global flag in the
  hope ... etc etc.

  So we had to have a signal handler to implement the required 
  asynchrony, and it had to exit abnormally to achieve its end.
]
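For readers who have not seen the technique, here is a minimal sketch of the design described above. It uses the POSIX sigsetjmp()/siglongjmp() pair instead of plain longjmp(3), because whether plain longjmp restores the signal mask differs between BSD and System V. long_operation() and run_interruptible() are hypothetical names, and raise(SIGINT) stands in for the user hitting the interrupt key. The loop deliberately makes no library calls apart from raise() itself (which is async-signal-safe). That is exactly the condition the rest of this thread shows is needed.

```c
#define _POSIX_C_SOURCE 200809L
#include <setjmp.h>
#include <signal.h>
#include <string.h>

static sigjmp_buf restart_point;               /* the "earlier point in the dialogue" */
static volatile sig_atomic_t restart_armed = 0;
static volatile long steps_done;

static void on_interrupt(int sig)
{
    (void)sig;
    if (restart_armed)
        siglongjmp(restart_point, 1);          /* abandon the computation */
}

/* Hypothetical long computation; raising SIGINT at step stop_at
   simulates the user hitting the interrupt key (-1 means never). */
static void long_operation(long n, long stop_at)
{
    for (steps_done = 0; steps_done < n; steps_done++) {
        if (steps_done == stop_at)
            raise(SIGINT);
        /* ... pure computation, no other library calls ... */
    }
}

/* Returns 1 if the operation was abandoned, 0 if it ran to completion. */
static int run_interruptible(long n, long stop_at)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_handler = on_interrupt;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGINT, &sa, NULL);

    /* savemask = 1: siglongjmp also restores the signal mask, so SIGINT
       (blocked while its handler runs) is unblocked again afterwards. */
    if (sigsetjmp(restart_point, 1) != 0) {
        restart_armed = 0;
        return 1;
    }
    restart_armed = 1;
    long_operation(n, stop_at);
    restart_armed = 0;
    return 0;
}
```

If the restored mask were omitted, a second interrupt could stay blocked forever on systems whose longjmp does not restore it.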

It is, all the same, a pretty dangerous thing to do.

This is especially so if the signal is allowed to interrupt any old bit of 
code that might be updating some data structure that is subsequently needed. 
And this, of course, is what happened when certain Xlib routines were 
interrupted.

Now good old BSD and friends (like Ultrix and HP-UX) offer a number of
means for dealing with the problem.

1. Interrupted system calls can be identified, and restarted when (if) the
   signal handler returns normally.

2. The application can be defensively programmed so that system calls which
   can be interrupted or partially completed are correctly handled.

3. Critical regions can be created with sigblock(2) and sigsetmask(2) providing
   DISABLE and ENABLE capabilities.
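Options 1 and 2 can be sketched in today's POSIX spelling: sigaction() with SA_RESTART for automatic restart, and a read loop that copes with EINTR and short reads. The function names are illustrative. The BSD-era systems mentioned here expressed option 1 through their own interfaces (sigvec and relatives), so treat this as a modern equivalent, not the code those systems used.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Option 1: ask the kernel to restart system calls interrupted by `sig`
   once the handler returns normally. */
static int install_restarting(int sig, void (*fn)(int))
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = fn;
    sa.sa_flags = SA_RESTART;
    sigemptyset(&sa.sa_mask);
    return sigaction(sig, &sa, NULL);
}

/* Option 2: defensive coding that copes with both EINTR and short reads. */
static ssize_t read_fully(int fd, void *buf, size_t n)
{
    size_t done = 0;
    while (done < n) {
        ssize_t r = read(fd, (char *)buf + done, n - done);
        if (r < 0) {
            if (errno == EINTR)
                continue;          /* interrupted before any data: retry */
            return -1;             /* genuine error */
        }
        if (r == 0)
            break;                 /* end of file */
        done += (size_t)r;         /* partial read: ask for the rest */
    }
    return (ssize_t)done;
}
```

Both work only because the kernel leaves the call in a well-defined state. Nothing comparable exists for a library routine cut off halfway through its own bookkeeping.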

Clearly 1 and 2 are fine for system calls, but useless for libraries.

That leaves 3 - but whose responsibility is it to defend the data in the
library - the implementor or the user?

I suppose someone out there will cry `caveat emptor', but there are 
literally hundreds of X calls. How do I know which ones are critical and 
which ones not? If I bracket all the ones I use, I will end up with
ugly code that runs slowly (remember it's two system calls per X call).

Clearly this is a general problem, but I do not recall seeing anything
about it on the net.

Advice welcomed.


Paul Breslaw.
   

-- 
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Paul Breslaw, Mecasoft SA,          |  telephone :  41 1 362 2040
Guggachstrasse 10, CH-8057 Zurich,  |  e-mail    :  mcsun!chx400!mecazh!paul
Switzerland.                        |               paul@mecazh.UUCP

darrylo@hpnmdla.HP.COM (Darryl Okahata) (01/17/90)

In comp.sys.hp, paul@mecazh.UUCP (Paul Breslaw) writes:

> This problem cropped up in the context of Xlib, but could equally apply to 
> any Unix library. Hence the posting to more than one group.
> 
> Our application (a CAM package on HP9000/3xx machines under HP-UX6.5 X11.R2)
> crashes sometimes when we handle a signal and return from 
> the signal handler in a different context from the one in which the handler 
> was entered. In other words we do a longjmp(3) from inside the handler.
> 
> We found that this is an elegant way to design certain features into a program.

[ You're not going to like this ... ]

     This was one of the problems discussed in comp.windows.x some time
back: you cannot interrupt (via a signal) an X-window library call (at
least this is true of X11R2 and, I believe, X11R3 -- I'm not sure about
X11R4, however).

     As this is a generic problem with X-windows (this is not an "HP"
problem), you have the following choices:

1. Don't use signals (even to just set a flag and return).
2. Temporarily block *ALL* signals during an X-window call.  This is, I
   believe, the approach taken by GNU Emacs.
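Option 2 might look like the following minimal sketch. It uses POSIX sigprocmask() in place of BSD sigblock()/sigsetmask(). fake_x_call() is a hypothetical stand-in for an Xlib routine, not a real X function. The mask_calls counter makes Paul's cost visible: two system calls for every wrapped call.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>

static int mask_calls = 0;   /* counts sigprocmask() calls to make the cost visible */

/* Hypothetical stand-in for an Xlib routine: returns arg only if it finds
   SIGINT blocked while it runs, NULL otherwise. */
static void *fake_x_call(void *arg)
{
    sigset_t cur;
    sigprocmask(SIG_BLOCK, NULL, &cur);        /* query only: set is NULL */
    return sigismember(&cur, SIGINT) ? arg : NULL;
}

/* Run fn(arg) with every blockable signal blocked, then restore the
   caller's mask. SIGKILL and SIGSTOP cannot be blocked; the system
   silently leaves them out. */
static void *call_with_signals_blocked(void *(*fn)(void *), void *arg)
{
    sigset_t all, saved;
    void *result;

    sigfillset(&all);
    sigprocmask(SIG_BLOCK, &all, &saved);
    mask_calls++;
    result = fn(arg);
    sigprocmask(SIG_SETMASK, &saved, NULL);    /* pending signals arrive here */
    mask_calls++;
    return result;
}
```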

     -- Darryl Okahata
	UUCP: {hplabs!, hpcea!, hpfcla!} hpnmd!darrylo
	Internet: darrylo%hpnmd@hpcea.HP.COM

DISCLAIMER: this message is the author's personal opinion and does not
constitute the support, opinion or policy of Hewlett-Packard or of the
little green men that have been following him all day.

gwyn@smoke.BRL.MIL (Doug Gwyn) (01/17/90)

In article <373@node17.mecazh.UUCP> paul@mecazh.UUCP (Paul Breslaw) writes:
>This is especially so if the signal is allowed to interrupt any old bit of 
>code that might be updating some data structure that is subsequently needed. 
>And this, of course, is what happened when certain Xlib routines were 
>interrupted.
>That leaves 3 - but whose responsibility is it to defend the data in the
>library - the implementor or the user?
>Clearly this is a general problem, but I do not recall seeing anything
>about it on the net.

The relevant properties are reentrancy and noninterruptibility.

These issues were recognized by the various standardization groups.
For example, ANSI C requires that signal() be invokable within any
signal handler, and that a signal handler function terminate only
via return, abort(), exit(), or longjmp().  IEEE 1003.1 adds a
large number of ("system call") functions that are required to be
invokable reentrantly or else block signals during their operation
(so that reentrance is not possible).  The X/Open Portability Guide
adds chroot() to this list and imposes these constraints on abort(),
exit(), and longjmp() (which are therefore hard to implement!).
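For contrast, here is a minimal sketch of a handler that stays within what ANSI C lets a handler for an asynchronous signal do portably. It calls signal() for its own signal and assigns to a volatile sig_atomic_t object. The names handler, got_signal and install are illustrative only.

```c
#include <signal.h>

static volatile sig_atomic_t got_signal = 0;

/* About as much as a portable ANSI C handler for an asynchronous signal
   may do: call signal() for its own signal and store to a volatile
   sig_atomic_t object. */
static void handler(int sig)
{
    signal(sig, handler);   /* re-arm: some systems reset to SIG_DFL on entry */
    got_signal = 1;
}

/* Returns nonzero if the handler was installed. */
static int install(int sig)
{
    return signal(sig, handler) != SIG_ERR;
}
```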

Note that stdio functions and other similar library functions were
NOT so constrained, in order to avoid paying a run-time penalty on
each use of these heavily-used functions.  However, some vendors of
multiprocessor implementations of UNIX have decided to go ahead and
use semaphores to protect critical regions within such library
functions, in order to prevent the kind of problem you encountered.

Unless the specification of a library function states that it is
safe to abort or reenter it, you the application programmer should
take steps to avoid doing so.

rer@hpfcdc.HP.COM (Rob Robason) (01/18/90)

> 3. Critical regions can be created with sigblock(2) and sigsetmask(2)
>    providing DISABLE and ENABLE capabilities.
...
> That leaves 3 - but whose responsibility is it to defend the data in the
> library - the implementor or the user?

Having experienced the extreme other end of the spectrum with an
unrelated library that, unknown to me, was IGNoring SIGCLD (thus losing
the event altogether) while doing critical things, I can assure you that
you are sometimes better off controlling things yourself.

It seems as though it would be useful for X to BLOCK signals in critical
regions so that the application writer didn't have to know the internals
of X to get decent performance.  Perhaps the fact that there are so many
possible UN*X signal mechanisms out there has prevented the choice of
ONE to do the job.

> I suppose someone out there will cry `caveat emptor', but there are
> literally hundreds of X calls.  How do I know which ones are critical
> and which ones not?  If I bracket all the ones I use, I will end up with
> ugly code that runs slowly (remember it's two system calls per X call).

I agree with you that surrounding X calls with sigblock's is an
unacceptable solution.  I'm interested in the answer to this too.

> Paul Breslaw.

Rob Robason

stroyan@hpfcdq.HP.COM (Mike Stroyan) (01/18/90)

>> 3. Critical regions can be created with sigblock(2) and sigsetmask(2)
>>    providing DISABLE and ENABLE capabilities.
>...
>> That leaves 3 - but whose responsibility is it to defend the data in the
>> library - the implementor or the user?
>
>Having experienced the extreme other end of the spectrum with an
>unrelated library that, unknown to me, was IGNoring SIGCLD (thus losing
>the event altogether) while doing critical things, I can assure you that
>you are sometimes better off controlling things yourself.
>
>It seems as though it would be useful for X to BLOCK signals in critical
>regions so that the application writer didn't have to know the internals
>of X to get decent performance.  Perhaps the fact that there are so many
>possible UN*X signal mechanisms out there has prevented the choice of
>ONE to do the job.

Not only is there variation in the ways to handle signals, there is also
a high cost to handling signals inside libraries.  An application can
reduce the cost by masking out signals during a group of library calls.
A library implementor has no such option.  The library must return
signal handling to the initial state before returning from each call.

>> I suppose someone out there will cry `caveat emptor', but there are
>> literally hundreds of X calls.  How do I know which ones are critical
>> and which ones not?  If I bracket all the ones I use, I will end up with
>> ugly code that runs slowly (remember it's two system calls per X call).
>
>I agree with you that surrounding X calls with sigblock's is an
>unacceptable solution.  I'm interested in the answer to this too.
>
>> Paul Breslaw.
>
>Rob Robason

If you are going to get your process involved with signals and longjmps,
then you should defend _all_ library calls from receiving signals.
Very few library calls are safe from signals.  Even malloc cannot
handle a longjmp out of a signal handler.

You can reduce the number of system calls by blocking signals for a
longer time.  You should only leave the signals unblocked when your
application code will be running without library calls for a relatively
long time.  In the extreme case your application will begin to use the
unmask/mask call pairs as polling operations.  The advantage over
polling is that you can effectively poll continuously over a range of
code.

Don't treat signals lightly.  Adding a signal handler is the equivalent
of adding thousands of conditional branches to your code.  The number of
potential failures and the difficulty of testing an application grows
enormously.

Mike Stroyan

barmar@think.com (Barry Margolin) (01/18/90)

In article <373@node17.mecazh.UUCP> paul@mecazh.UUCP (Paul Breslaw) writes:
>That leaves 3 - but whose responsibility is it to defend the data in the
>library - the implementor or the user?

I think it *should* be the implementor's responsibility.  However, given
that most library implementors don't do so, it is effectively the user's
responsibility.

The best situation would be for library implementors to protect their
critical regions.  Next best would be for them to document which routines
have critical regions, so that the caller can bracket calls to those
routines with signal masks (unfortunately, this means that signals are
masked for longer than they need to be -- the critical region may be a small
part of the library routine).  Every routine for which such documentation
doesn't exist must be assumed to have critical data, and cannot be aborted.

In addition to maintaining consistent data structures, it's also necessary
for library routines to clean up after themselves.  For instance, if a
subroutine opens and closes a file, that file should always be closed when
the subroutine is exited.  I'm primarily a Lisp programmer, and C and Unix
(among others) are missing a really important facility for systems
programming: UNWIND-PROTECT.  This is a mechanism for insisting that a
particular piece of code be run upon exiting a context, no matter how that
context is exited (either by returning or by non-local transfer).  When I
was a Multics programmer we had a similar thing; a handler could be written
for the "cleanup" condition, and the handler is run when a frame is exited
via non-local transfer.  In C it's possible to implement something like
this using a setjmp/longjmp protocol, but it only works with cooperating
routines; library routines won't obey the protocol, though.
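Here is a minimal sketch of the kind of setjmp/longjmp protocol Barry means, loosely modelled on UNWIND-PROTECT. Everything in it is hypothetical: push_cleanup(), pop_cleanup(), throw_to(), struct catcher, and the "file" counter. As he says, it only protects routines that register their cleanups; an ordinary library routine will not.

```c
#include <setjmp.h>
#include <stddef.h>

/* A cleanup record, pushed by cooperating routines. */
struct cleanup {
    struct cleanup *prev;
    void (*fn)(void *);
    void *arg;
};

static struct cleanup *cleanup_top = NULL;

/* A catch point remembers how deep the cleanup stack was when it was set. */
struct catcher {
    jmp_buf env;
    struct cleanup *depth;
};

static void push_cleanup(struct cleanup *c, void (*fn)(void *), void *arg)
{
    c->prev = cleanup_top;
    c->fn = fn;
    c->arg = arg;
    cleanup_top = c;
}

/* Pop the innermost record and run its cleanup. */
static void pop_cleanup(void)
{
    struct cleanup *c = cleanup_top;
    cleanup_top = c->prev;
    c->fn(c->arg);
}

/* Non-local exit: run every cleanup pushed since the catcher was set, then jump. */
static void throw_to(struct catcher *k)
{
    while (cleanup_top != k->depth)
        pop_cleanup();
    longjmp(k->env, 1);
}

/* --- demo: a hypothetical "file" that must be closed however we leave --- */
static int open_files = 0;
static void close_file(void *arg) { (void)arg; open_files--; }

static struct catcher *abort_point;

static void deep_work(int fail)
{
    if (fail)
        throw_to(abort_point);   /* e.g. the user asked to abandon the operation */
}

static void uses_file(int fail)
{
    struct cleanup c;
    open_files++;                          /* "open" */
    push_cleanup(&c, close_file, NULL);    /* protected region begins */
    deep_work(fail);
    pop_cleanup();                         /* normal exit runs close_file too */
}

/* Returns 1 if uses_file() was abandoned via throw_to(), 0 otherwise. */
static int run_protected(int fail)
{
    struct catcher k;
    k.depth = cleanup_top;
    abort_point = &k;
    if (setjmp(k.env) != 0)
        return 1;
    uses_file(fail);
    return 0;
}
```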

--
Barry Margolin, Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar