[comp.lang.c] Sorting, error handling.

ok@quintus.UUCP (Richard A. O'Keefe) (12/10/87)

Further to the discussion about sorting:
    a couple of people suggested that I should put my money where
    my mouth was and post a sorting routine that they could try.
    Our mailer didn't know how to do this, hence the delay.
    The first step in testing sorting routines is to find some
    suitable test data.  A typical application for a sorting routine
    is sorting a collection of file names or of words.
    I have posted, and have just seen echoed back on
	comp.sources.misc
    a program called "sample", which selects a random sample of its
    standard input.  (It is of course written in C.)  You can
    generate test data by doing e.g.
	sample -1000 </usr/dict/words
    which generates a random sample of 1000 words in random order.

Further to the discussion about error handling in stdio:
    the *real* reason for the delay was the fact that I was trying to
    make this program properly fail-proof before sending it.  I am
    now embarrassed by all my old code that isn't this thorough.

A vote of thanks to Dave Decot at Hewlett Packard, Cupertino,
for sending the manual page for their experimental ERRCTL(2).
It took me two full evenings (well, I had a lot of spare time
while I was waiting for sdb to munch through some scripts), but
it turned out to be *extremely* easy to implement this facility
on top of SunOS 3.2.  I can't go into too many details, because
that would involve giving away information about someone else's
product.  However, V7 UNIX and 4.1BSD used a similar interface
to system calls, so let's pretend I'm talking about that.

A system call wrapper looks like
    LABEL:
	<ensure the arguments are in a standard place>
	<put the system call number in a standard place>
	<issue the supervisor call instruction>
	<if there is an error, branch to error handler>
	<clean up and return results>

    error handler:
	<move result code to errno>
	<return -1>

This is what already exists.  If you read the errctl document, you'll
recall that HP provide a function
	errctl(new handler) --> returns old handler (like signal())
    new handler can be
	ERR_DFL	-- error action is to set errno and return -1
	ERR_IGN	-- error action is to return -1 without setting errno
	function	-- error action is to call
		function(SysCallNo, ErrNo, &RetVal, &ArgumentRecord)
	where SysCallNo identifies the system call, ArgumentRecord
	is probably a varargs pointer to an argument list (it is in
	my implementation), ErrNo is the error being reported, and
	RetVal (initial value -1) can be set to change the result
	the system call will return.  If the handler returns 0, the
	system call will return with value *RetVal; it is up to the
	handler to set errno if it chooses.  If the handler returns
	non-zero, the system call is restarted.
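To make the interface described above concrete, here is a minimal user-space sketch in present-day C. It is a model only, not HP's ERRCTL(2) and not my SunOS code: the type name err_handler_t, the encodings of ERR_DFL and ERR_IGN, and the example handler retry_on_eintr are all assumptions made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdint.h>

/* Assumed handler type: return 0 to finish the call with *retval,
 * non-zero to ask for the system call to be restarted. */
typedef int (*err_handler_t)(int sysno, int err, int *retval, va_list *args);

/* Assumed encodings, in the style of SIG_DFL and SIG_IGN. */
#define ERR_DFL ((err_handler_t)(uintptr_t)0)  /* set errno, return -1 */
#define ERR_IGN ((err_handler_t)(uintptr_t)1)  /* return -1, errno untouched */

static err_handler_t current_handler = ERR_DFL;

/* Install a new handler and hand back the old one, like signal(). */
err_handler_t errctl(err_handler_t new_handler)
{
    err_handler_t old = current_handler;
    current_handler = new_handler;
    return old;
}

/* Hypothetical handler: restart interrupted calls, report the rest. */
int retry_on_eintr(int sysno, int err, int *retval, va_list *args)
{
    (void)sysno;
    (void)args;
    assert(retval != 0);
    if (err == EINTR)
        return 1;        /* non-zero: restart the system call */
    errno = err;         /* setting errno is the handler's choice */
    *retval = -1;
    return 0;            /* zero: the call returns *retval */
}
```

A program would install it with errctl(retry_on_eintr) and restore the previous behaviour by passing back whatever errctl() returned.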

My new wrappers look like
	.short	SysCallNo
    LABEL:
	<ensure the arguments are in a standard place>
	<put the system call number in a standard place>
	<issue the supervisor call instruction>
	<if there is an error, branch to L1>
	<clean up and return results>
    L1:
	lea	LABEL,a0
	jmp	c1error

and the common handler c1error looks at the handler value and works
out what to do.  The space for each function is larger by 6 bytes
(the SysCallNo = 2 and the lea = 4) with NO run-time cost for system
calls which succeed.  This was ***EASY*** to do, and if I had access
to the library sources for a VAX 4.2 or 4.3BSD system (and a SUN to
do the editing on!) I would wager a day's pay that I could make this
work on either of those systems.  (I already know that I could have
done it in VAX 4.1 or PDP-11 V7.)
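The decision that the common error path has to make can be written out in C. What follows is a sketch of the logic only, assuming the handler conventions described earlier; the real c1error is a few lines of assembler, and the names c1error_model, err_handler_t and retry_twice are invented for this illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdint.h>

typedef int (*err_handler_t)(int sysno, int err, int *retval, va_list *args);
#define ERR_DFL ((err_handler_t)(uintptr_t)0)  /* assumed encoding */
#define ERR_IGN ((err_handler_t)(uintptr_t)1)  /* assumed encoding */

static err_handler_t current_handler = ERR_DFL;

err_handler_t errctl(err_handler_t new_handler)
{
    err_handler_t old = current_handler;
    current_handler = new_handler;
    return old;
}

/* Model of the common error path.  Returns 1 if the wrapper should
 * reissue the system call, 0 if it should return *retval. */
int c1error_model(int sysno, int err, int *retval, va_list *args)
{
    assert(retval != 0);
    *retval = -1;                      /* initial value seen by the handler */
    if (current_handler == ERR_DFL) {
        errno = err;                   /* the traditional behaviour */
        return 0;
    }
    if (current_handler == ERR_IGN)
        return 0;                      /* -1, errno left alone */
    return current_handler(sysno, err, retval, args) != 0;
}

/* Hypothetical handler: ask for two restarts, then give up. */
int retry_twice(int sysno, int err, int *retval, va_list *args)
{
    static int attempts = 0;
    (void)sysno;
    (void)args;
    if (attempts++ < 2)
        return 1;
    attempts = 0;
    errno = err;
    *retval = -1;
    return 0;
}
```

None of this code runs when a call succeeds, which is the sense in which the scheme costs nothing at run time for successful calls.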

    I have two comments to make about this error handling scheme.
The first is that I don't quite see the point of ERR_IGN.  And the
reason for that is that I would like to be able to write things
which look to the user like system calls but aren't.  Indeed, a
number of things have moved from section 2 to section 3 in past
years.  I have written a function which lets me write C code that
looks like a system call with respect to errctl().  For argument's
sake, suppose that shmat() were not implemented directly as a
system call, so that someone like me had to put a wrapper around
it.  This is what you'd do:

    #include "errctl.h"

    char *shmatt(shmid, shmaddr, shmflg)
	int shmid;
	char *shmaddr;
	int shmflg;
	{
	    extern char *shmat();
	    int e;
	    char *x;

	    do {
		with_no_errctl(x = shmat(shmid, shmaddr, shmflg));
		if (x != (char*)(-1)) return x;
		e = errno;
	    } while (eraise(SYS_SHMAT, &e, &shmid));
	    return (char*) e;
	}

(The do while loop makes this wrapper restartable.)
Note that I want the existing shmat(), presumably coded assuming
the standard approach to system call errors, to see no change at
all, which is what the with_no_errctl(<expr>) macro does.  But
that means that even if the caller of my shmatt() wrapper has set
errctl(ERR_IGN), errno may yet be changed.  (It just won't be the
result proper to this function.)
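One way with_no_errctl() might be written is sketched below, in present-day C and against a user-space model of errctl(). The macro body is a guess at a reasonable definition, not the one in my errctl.h, and legacy_call() is a made-up stand-in for an existing routine that expects the standard convention.

```c
#include <assert.h>
#include <errno.h>
#include <stdarg.h>
#include <stdint.h>

typedef int (*err_handler_t)(int sysno, int err, int *retval, va_list *args);
#define ERR_DFL ((err_handler_t)(uintptr_t)0)  /* assumed encoding */
#define ERR_IGN ((err_handler_t)(uintptr_t)1)  /* assumed encoding */

static err_handler_t current_handler = ERR_DFL;

err_handler_t errctl(err_handler_t new_handler)
{
    err_handler_t old = current_handler;
    current_handler = new_handler;
    return old;
}

/* Evaluate expr with the default error action in force, then put the
 * caller's handler back.  Used as a statement, as in the wrapper above. */
#define with_no_errctl(expr)                          \
    do {                                              \
        err_handler_t saved_handler_ = errctl(ERR_DFL); \
        (void)(expr);                                 \
        errctl(saved_handler_);                       \
    } while (0)

/* Hypothetical existing routine coded for the standard convention. */
int legacy_call(void)
{
    assert(current_handler == ERR_DFL);  /* it must see no change at all */
    errno = ENOENT;
    return -1;
}
```

With this definition, a caller who has set errctl(ERR_IGN) and then runs with_no_errctl(x = legacy_call()) will still find errno changed afterwards, which is the wrinkle with ERR_IGN noted above.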

    Still, I repeat that it took me two evenings to code and test
this, to the point where I have definitions for all the system
calls in the SunOS 3.2 manual, and have tested that sbrk() and
the basic I/O calls work and do call the handler when expected.
(And that eraise() works.)  This is a very simple interface which
is very easy to implement with essentially 0 cost when you're not
using it.  Even if I never port this to another machine, now I
have a debugging aid which I can use on the Sun to log all failed
system calls.  Thanks again, Dave Decot.

    The problem with error handling in C is that the default is to
IGNORE errors.  This is clearly the wrong thing to do.  ISO Pascal
has no error handling facilities, but anything remotely resembling
a usable Pascal implementation will report errors to you, so that
if something goes wrong in a program, you will know about it.  We
have had problems with a certain source code management system:
when it wrote files across a network it would occasionally lose
blocks WITHOUT TELLING YOU.  (The authors apparently thought that
nothing could ever go wrong with write().)
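For contrast, here is roughly what checking write() looks like. This is a minimal sketch in present-day C; the name write_all and the choice to treat a zero-byte write as EIO are mine, and a real program may want a different policy.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Write exactly len bytes to fd.  Returns 0 on success, or -1 with
 * errno set if the data could not all be written. */
int write_all(int fd, const void *buf, size_t len)
{
    const char *p = buf;

    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted: try again */
            return -1;             /* real failure: tell the caller */
        }
        if (n == 0) {              /* no progress: give up, don't spin */
            errno = EIO;
            return -1;
        }
        p += n;                    /* short write: carry on with the rest */
        len -= (size_t)n;
    }
    return 0;
}
```

The caller still has to look at the result; the point is only that a lost block becomes a visible -1 instead of silence.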

    errctl() is nice to have, but it still has problems.  For
example, there are system calls where "failure" isn't really an
error:  access(2) for one.  And it is a bit of a pain when you
have remembered to test for an error return in your program to
have to go in and put with_no_errctl() around it.  I once used
to use an Algol compiler where if you said (the equivalent of)
	open(foo);
it would generate a run-time report if the open failed, but if
you said (the equivalent of)
	result = open(foo);
the result assigned to your variable was all the notice you got.
What would an approach like this look like in C?
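C cannot tell whether the caller uses a function's result, so the closest approximation I can think of makes the choice explicit. The sketch below is one possibility, in present-day C; open_reporting and its status parameter are invented for illustration and belong to no existing library.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>

/* If status is NULL (the "statement" form), a failed open produces a
 * run-time report on stderr.  If status is non-NULL (the "result ="
 * form), the error number stored there is all the notice you get. */
int open_reporting(const char *path, int flags, int *status)
{
    int fd, err;

    assert(path != NULL);
    fd = open(path, flags);
    err = (fd < 0) ? errno : 0;
    if (status != NULL)
        *status = err;
    else if (fd < 0)
        fprintf(stderr, "open(\"%s\"): %s\n", path, strerror(err));
    return fd;
}
```

So open_reporting(foo, O_RDONLY, NULL) plays the part of the bare call, and open_reporting(foo, O_RDONLY, &result) plays the part of the assignment.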

    At the moment, in C we have
	the hardware signals errors with kill()
	    -- these errors are handled via signal()
	system calls signal errors with -1 and errno
	    -- these errors are handled by explicit tests (or errctl())
	some library functions use the system call interface
	some library functions return NULL for error, or EOF
	    but the value of errno is not defined (e.g. stdio)
	some library functions signal errors
	    -- through a user-defined global-to-the-entire-program
	    -- function called matherr().
This is such an incredible mess that it *almost* persuades me that
I should be using ADA (:-).
	
    When I write code of my own which can detect errors, what is
considered to be the best way of reporting them to the program
which invoked my functions?  Is it considered good style to return
-1 and set errno (often there isn't a good choice of value)?
When should my code write its own messages to stderr, and when
should I just pass an error code to the caller and let it decide?

    I would really like to hear how the experts in this group
recommend that errors should be reported in C.

    Before anyone out there takes me to task for trying to keep this
error handling issue alive:  I am *sick* of utilities which die
mysteriously.  (To avoid offending any particular vendor, I should point
out that we have several different UNIXes here, and that SunOS 3.2 has
fewer bugs than most of the others.  Some of the others are certified
System Vs.)  We haven't got sources for any of these UNIXes, and most
of the object files are stripped, so if a utility dumps core we have
no idea whether we did something wrong or what we can do about it.
Invariably the utility in question is written in C.  What can we do
in our own code to prevent this kind of insult to customers/users?