[comp.unix.programmer] more on error checking

scs@adam.mit.edu (Steve Summit) (02/10/91)

In article <1991Feb07.013637.6542@convex.com>, Tom Christiansen writes:
>As was mentioned earlier, *every* system call should *always*
>be checked, even if you "know" it can't fail.

Indeed.  However, at the risk of confusing the issue further just
when this basic fact is starting to sink in, it's important to
remember, when you set out to check for every error everywhere,
that the thing to do upon finding one may not be to complain and
abort.  I'll provide several examples.

My own code is usually liberally sprinkled with assertions and
sanity checks.  One fruitful place to put them is in the default
case of a switch, when the default should "never" be reached.
Suppose I have a little routine which decodes the value of some
internal, numeric flag (presumably so it can be printed out
somewhere):

	#include "status.h"

	char *
	status_str(sts)
	int sts;
	{
	switch(sts)
		{
		case STS_GOLDEN:
			return "STS_GOLDEN";
			break;

		case STS_DEMENTED:
			return "STS_DEMENTED";
			break;

		case STS_ANHYDROUS:
			return "STS_ANHYDROUS";
			break;

		default:
			panic("status_str: bad status %d", sts);
		}
	}

The panic is there, of course, so that next week when I add
another status code but forget to update the status_str()
routine, my mistake will be very noticeable, and there will be a
strong reminder and incentive to fix it.

But does it really have to be a fatal error?  Probably not,
especially if there's any chance that the "impossible case" won't
be triggered by me, but by some (hopefully beta-test) user.  A
better way to handle this particular "impossible case" is
nonfatally: 

		default:
			{
			static char tmpbuf[15];
			oddity("status_str: bad status %d", sts);
			sprintf(tmpbuf, "status %d", sts);
			return tmpbuf;
			}

In my code, "oddity" is just like panic (i.e. it prints debugging
messages that are to be construed as the developer's problem, not
the user's) but it's nonfatal.  There's still a noisy message,
there's still incentive and reminder to fix the bug, but at least
the program can continue to run.

There was a case like this in the 4.1bsd tty driver.  (I don't
know if it's still there.)  The driver tries to do CRT erase
processing (i.e. the backspace-space-backspace stuff that wipes
characters off the screen as you press the delete or rubout or
backspace key or whatever) perfectly.  The tricky part is that
you might need 1, 2, or up to 8 backspaces, depending on whether
there is a normal character, a control character (echoed as "^X"),
or a tab on the screen being erased.

The code for figuring out how many backspaces to emit involved a
big array full of action codes.  There were a few impossible
cases.  Sure enough, they were handled with a panic.

I don't know about you, but I think that aesthetically pleasing
crterase processing, though nice, is a bit of a frill, and
probably one of the last things that it's vital for an operating
system kernel to do perfectly.  I would be very, very annoyed
(well, to tell the truth, I'd also be ridiculously amused) if a
machine crashed, dropping several users on the floor and losing
lots of their work, just because some poor schlep tried to hit
the delete key to get rid of some "impossible" character on his
command line.  (No, it never happened, but once I'd seen that
code in the tty driver, I was forever slightly haunted with the
fear that the system was going to crash because of it.  In fact,
hitting the impossible case would have represented a programming
error, not an impossible character.  Still, the panic seemed
needlessly draconian.)

Finally, consider write(1).  On the recipient's screen, it prints
something like

	message from scs on tty23 at 12:34 ...

Now, whatever method it uses to find out what terminal the sender
is logged into (probably ttyname(3)), it's not going to work if
the sender doesn't have one.  How could the sender not have one?
Actually, the burden of proof should be on the author of write(1)
to show why the program can't proceed if the tty is not known,
not on the user to suggest circumstances under which it might be
meaningful not to have one, but to satisfy the skeptics out there
I'll mention a real-world, non-fabricated case: at jobs.
(Actually, this case, or another one involving write(1), came up
on the net just a week ago.)  I sometimes like to send myself
notification that one of my at jobs has finished by including

	echo "at job is done" | write scs

in the at script.  However, at scripts run with no controlling
tty, so write(1) aborts and refuses to send the message, just
because it won't be able to say where it's from.

Just now I tried write(1)ing to myself from within emacs, so that
I could see the exact "message from..." message on my screen.
But since I'm running the Gigantic Neanderthal Ubiquitous emacs,
which runs shell escapes in some kind of weird subshell with
pipes, all I got was the "write: can't find your tty" message in
the *Shell Command Output* window.

Some time ago I wrote my own version of write(1), mostly so that
I could put in the wrinkle that it leave out the "on ttyxx" part,
rather than falling back and punting, if it couldn't determine
it.  (It also prints to several ttys if the user is logged in
multiple times, rather than picking the wrong one.  Now, of
course, in these security-conscious days, private versions of
write(1) don't work well, because they should be setgid.)

So yes, check for all error returns.  And yes, print noisy messages
(unless the error is truly benign, as in the write(1) example).
But think for a minute whether the error really has to be fatal,
whether there's any reason the program can't continue.  If it
can't; if the error means that vital data has been corrupted such
that continuing would be futile or worse; then by all means abort.
But if you can safely continue with little or no effort, please do.

                                            Steve Summit
                                            scs@adam.mit.edu

P.S.  I am aware that there are better ways of decoding numeric
flags than by using a big switch statement as in the first
example.  (In fact, I usually use a lookup table.)

duncan@comp.vuw.ac.nz (Duncan McEwan) (02/12/91)

In article <1991Feb07.013637.6542@convex.com>, Tom Christiansen writes:
>As was mentioned earlier, *every* system call should *always*
>be checked, even if you "know" it can't fail.

I'm not sure if this has been discussed in this thread yet, but
one possibility that I've seen suggested elsewhere is to have
failing system calls generate a signal to the calling process
(SIGSYS2 anyone?).  The default action might be to terminate the
process, though perhaps printing a diagnostic and resuming would
be more compatible with current behaviour.  I can't offhand think
of any particularly nice way of implementing a libc provided default
handler that could do the latter without messing with the compilers
or crt0.o, etc, though.

It would be nice if such a scheme could be implemented entirely
without kernel support (perhaps by putting the signal generating
stuff in the libc wrappers for each system call), but I don't think
that would allow passing a parameter to the handler (which you would
want, to give information about what system call failed).

Hm, thinking about it, this all sounds a bit VMSish, and if its
only aim is to provide protection from programmers who don't check
every syscall, then its probably a non-starter :-)

Comments, suggestions, flames, anyone?

Duncan

hunt@dg-rtp.rtp.dg.com (Greg Hunt) (02/12/91)

In article <1991Feb12.011911.7677@comp.vuw.ac.nz>, duncan@comp.vuw.ac.nz (Duncan McEwan) writes:
> 
> I'm not sure if this has been discussed in this thread yet, but
> one possibility that I've seen suggested elsewhere is to have
> failing system calls generate a signal to the calling process
> (SIGSYS2 anyone?).  The default action might be to terminate the
> process, though perhaps printing a diagnostic and resuming would
> be more compatible with current behaviour.  I can't offhand think
> of any particularly nice way of implementing a libc provided default
> handler that could do the latter without messing with the compilers
> or crt0.o, etc, though.
> 
> It would be nice if such a scheme could be implemented entirely
> without kernel support (perhaps by putting the signal generating
> stuff in the libc wrappers for each system call), but I don't think
> that would allow passing a parameter to the handler (which you would
> want, to give information about what system call failed).
> 
> Hm, thinking about it, this all sounds a bit VMSish, and if its
> only aim is to provide protection from programmers who don't check
> every syscall, then its probably a non-starter :-)
>
> Comments, suggestions, flames, anyone?

I'd rather not see it myself.  I already have code in which I print out
a message when I get SIGSYS (bad argument to system call), and then
ignore the signal, thus returning control to the code that made the
bad call.  It then (supposedly) gets error EINVAL, and since I'm
checking the error returns from my system calls, I then print out
what system call I was trying to do, what error I got, and any other
information I can provide, such as the name of the file I was reading
or writing if it was a file system related call.  I do similar error
reporting for any error from any system call that my code is not
prepared to deal with (you can correctly surmise from this which side
of the "should I check error returns from system calls" debate I'm on).

You mentioned in your second paragraph that you'd want to pass such
information somehow so that the signal handler can print it out for
you.  I think it is easier to handle such errors by checking the
return from the system call.  It would also allow you to print out
a large amount of information that it might be difficult to do with
a single parameter to a signal handler.  For me, coding error handling
where the error occurs certainly makes the code easier to understand.

I agree with your last statement.  It's a non-starter.  That's my
opinion, too.

-- 
Greg Hunt                        Internet: hunt@dg-rtp.rtp.dg.com
DG/UX Kernel Development         UUCP:     {world}!mcnc!rti!dg-rtp!hunt
Data General Corporation
Research Triangle Park, NC, USA  These opinions are mine, not DG's.