alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) (11/03/90)
I'm interested in what approaches people use for error handling, particularly in general-purpose function libraries and large software systems. If someone can reference a text or article, that would be good.

Meanwhile, to give an example of the kind of thing I'm thinking of, say you're writing a function library. When you detect an error (the distinction between user and internal errors is ignored here) you can:

1. Set an error code (similar to errno in Unix) and return -1, leaving the application to check for specific errors or print out one of a supplied set of messages.

2. Call an application-specified error handler, and if none, call a default error handler.

3. Print out a message from the package. (Where this should be displayed and how is also a factor. It may not be possible to write to the user's terminal.)

Similarly, say an application discovers that it has some internal error. Should it:

1. Print a message and dump core.

2. Save some transparently-maintained log of user actions to disk along with a message indicating what's happened.

3. Provide a traceback of function calls and ask the user whether or not to continue.

What other approaches are there?
--
-- Alan                    # My aptitude test in high school suggested that
..!ames!elroy!alan         # I should become a forest ranger.  Take my
alan@elroy.jpl.nasa.gov    # opinions on computers with a grain of salt.
rmartin@clear.com (Bob Martin) (11/03/90)
In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>I'm interested in what approaches people use for error handling, particularly
>in general purpose function libraries and large software systems. If someone

Alan:

In a large software system the number of places where the code can detect errors can range into the tens of thousands. Every call to malloc should be checked, every IO operation, every place where something can go wrong. This leads to an enormous amount of error checking and error messages.

What I have done in the past to cope with this is to create an ErrorLog function which writes single-line error messages into an error log file. The messages consist of the time, the process id, and three numbers: <mod>.<loc>(<param>).

<mod> is a unique number which identifies the module in which the error occurred. (By module I mean .c file.)

<loc> is a number unique to that module which identifies the specific call to the ErrorLog function. Thus if a module can detect 20 errors, it will do so with 20 separate <loc> numbers. Errors of similar types should _not_ use the same <loc>!

<param> is a piece of useful information (if any) which can be logged with the error. For example, when malloc returns NULL, I usually put the size that I was attempting to malloc in the <param>. errno would be useful in <param> when appropriate.

Every hour I close the current error log file and open a new one. At the end of the day I compile them into a summary and eyeball them to see if anything horrible went wrong. Software can be written to automatically scan these logs for critical errors.

Hope this helps. I welcome discussion.
----------------------------------------------------------------------
R. Martin
rmartin@clear.com
uunet!clrcom!rmartin
----------------------------------------------------------------------
cox@stpstn.UUCP (Brad Cox) (11/04/90)
In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>I'm interested in what approaches people use for error handling, particularly
>in general purpose function libraries and large software systems. If someone
>can reference a text or article, that would be good.

You can (and should) do a lot better than this even in languages (like Objective-C or C++) that don't (yet) support exception handling. The same tricks should be even easier in Smalltalk, but I'll leave that to Smalltalk people to verify. In what follows, I'm speaking of my own private research system, not the commercial Objective-C release, which I'm trying to get changed to adopt this scheme.

The essence of the idea is to forbid routines that encounter exceptions from signaling the exception by returning normally with an abnormal return value. The rule is this: if a message returns at all, the message has succeeded at its task, and any values returned are valid, useful, and need not be checked. If it failed because of an exception, it *never* returns normally, but instead returns to a handler specifically for that exception.

To adopt this scheme, seek out all old code that signals errors the old way (such as Object factory methods like new, instance methods like doesNotUnderstand:, and utility stuff like assert()), and recode it to invoke a generalized exception handling facility, perhaps by calling raiseException(anExceptionCode) or (as in my case) [AssertionException raiseException], [OutOfMemoryException raiseException], etc. These calls access a stack of exception handlers, initialized at startup time with handlers that treat all exceptions in some usefully generic way, perhaps by printing the exception name and raising an inspector. The handlers form a stack so that your code can override the generic inspector with something more application-specific.
If you lift the hood and peek inside, raiseException is implemented with longjmp(), and the code that pushes a new exception handler is implemented with setjmp(). However it is implemented, though, the scheme you describe, although conventionally supported by nearly all languages, has extremely poor properties for building large robust software libraries. That is why so many 'modern' languages (Ada, Eiffel?, CLU, etc.) feature 'true' exception handling somewhat as outlined above.
--
Brad Cox; cox@stepstone.com; CI$ 71230,647; 203 426 1875
The Stepstone Corporation; 75 Glen Road; Sandy Hook CT 06482
bertrand@eiffel.UUCP (Bertrand Meyer) (11/07/90)
From <1990Nov2.205831.23696@elroy.jpl.nasa.gov> by alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer)
> I'm interested in what approaches people use for error handling, particularly
> in general purpose function libraries and large software systems. If someone
> can reference a text or article, that would be good.

Some of the classic references are the articles by Brian Randell in the seventies on recovery blocks, continued by several people, in particular Flaviu Cristian. (Randell is a professor at the University of Newcastle, and Cristian, who when I last heard was at IBM's Almaden laboratories, did his PhD with him.) Here are two references among many (in Refer format):

%A Brian Randell
%T System Structure for Software Fault Tolerance
%J IEEE Transactions on Software Engineering
%V SE-1
%N 2
%D June 1975
%P 220-232

%A Flaviu Cristian
%T On Exceptions, Failures and Errors
%J Technology and Science of Informatics
%V 4
%N 1
%D January 1985
%K TSI

(Cristian also had a paper in IEEE Transactions on SE, but I don't have the exact reference here. I could find it if needed, though.)

Some of the work around CLU is also interesting, e.g.

%A Barbara A. Liskov
%A Alan Snyder
%T Exception Handling in CLU
%J IEEE Transactions on Software Engineering
%V SE-5
%N 6
%D November 1979
%P 546-558

(I should add that I have strong objections both to the Randell-Cristian approach and to the CLU exception mechanism which, however, is certainly less dangerous than Ada's. But all of the above articles are good reading regardless of whether one agrees with the stand they take.)

Let me also, with a total absence of modesty, point at some of my own work in the context of object-oriented design, in particular the book ``Object-Oriented Software Construction'' (Prentice-Hall): Chapter 7, Systematic Approaches to Software Construction (especially 7.10, Coping with Failure), and section 9.3, Dealing with Abnormal Cases.
The approach expounded there is based on a theory called Programming by Contract, which is further developed in a long article with precisely this title. The article is currently part of the book ``An Eiffel Collection'' published by my company, but will be republished as a chapter of a Prentice-Hall collective book entitled ``Advances in Object-Oriented Software Engineering'', edited by Dino Mandrioli and myself. (That book is in press and should be available in a few months.)
--
-- Bertrand Meyer
bertrand@eiffel.com
rh@smds.UUCP (Richard Harter) (11/11/90)
In article <1990Nov3.153643.26368@clear.com>, rmartin@clear.com (Bob Martin) writes:
> In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
> >I'm interested in what approaches people use for error handling, particularly
> >in general purpose function libraries and large software systems. If someone

[See the referenced article, which is commendably well written.]

In our key product, which we assume is mission critical for our users, we take the strong view that any trapped error is a fatal error. We try to arrange that the software fails gracefully and that it produces as much information about the error as possible. Our view is that the software should not fail, so we don't put any bugs in the code. :-)

Seriously, here are some of the techniques used:

The code is liberally sprinkled with error checks. Failed validity checks are fatal; they generate a call to a universal error handler. The error handler generates an error report (if possible) and exits.

The code maintains a history of the last 128 function calls in a circular buffer; this information is dumped in the error report.

Each error type has its own message number.

System utilities (e.g. storage allocation and file I/O) are wrapped; the wrappers have their own error reports. Other information (ERRNO where relevant, for example) is included.

The general view is that a failure is a bug; they aren't supposed to happen. If a failure should indeed happen, we want as much information as possible to find and eliminate the bug.

I would like to see more contributions on this topic.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
amodeo@dataco.UUCP (Roy Amodeo) (11/12/90)
In article <1990Nov3.153643.26368@clear.com> rmartin@clear.com (Bob Martin) writes:
>In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>>I'm interested in what approaches people use for error handling, particularly
>>in general purpose function libraries and large software systems. If someone
>
>Alan:
>
>In a large software system the number of places where the code can
>detect errors can range into the tens of thousands. ...
>
>What I have done in the past to cope with this is to create an
>ErrorLog function which will write single line error messages into
>an error log file. ...
> ... Errors of similar types should _not_ use the
>same <loc>! ...
>
>Every hour I close the current error log file and open a new one.
>At the end of the day I compile them into a summary and eyeball
>them to see if anything horrible went wrong. Software can be written
>to automatically scan these logs to see if there are critical errors.
>
> Hope this helps.
> I welcome discussion.

You got it. I have some general comments about your scheme.

The <loc> numbers would have to either be stored in one central place, or you would need an allocation scheme that allocates blocks of numbers to various subsystems. Either way, this seems like an awful lot of work to set up.

The manual nature of evaluating whether any serious errors have occurred bothers me (unless you're the only one who runs your software). It would require a rather intimate knowledge of the entire system. It also bothers me that the errors are "hidden" in a logfile (again assuming other people run your software).

Out of curiosity, how big do your logfiles get?

>----------------------------------------------------------------------
>R. Martin
>rmartin@clear.com
>uunet!clrcom!rmartin
>----------------------------------------------------------------------

And, in answer to Alan's original posting: I really like exceptions. I don't use them.
Exceptions in C require writing an exception handling mechanism, which I have never had the time to write for my own "small" programs. There are other systems I use which have used different error handling mechanisms from Day One and are "too big" to change now.

All the code we write returns 0 on error. (By never using '0' as an index, I can usually get away with this.) Usually failures are trickled up to the function level where enough is known that they can be handled.

The macro we use to "fail out" of a routine is called FAILIF and takes a condition, an error number, and an error parm as parameters. If the condition is true, the error number is assigned to the global variable errno, and the error parm is assigned to the global variable errparm. In addition, FAILIF's behaviour can be modified to do a little cleanup before returning, which solves one of the problems of multiple returns, although it is not that elegant.

At higher levels, we will check for errors that we do not wish to handle (like failures from malloc()) by using fatal assertions. A fatal assertion asserts that a condition is true (nonzero); otherwise, it prints the string argument, hex dumps any areas of memory that the user wishes to dump, prints the errno, the name of the errno, the errparm, the line number, the file name, and a function trace. (This varies from the standard UNIX assert() mechanism.) It then terminates the program. A non-fatal assert is also available for conditions that must be reported but need not be acted upon.

Code using FAILIF and fatal assertions reads quite easily and is easy to write. You generally check a condition once, FAILIF or FASSERT it, and continue, secure that you are dealing with only valid values from here on in. Assertions should actually be coded in the interface to the routine because they can be valuable documentation, but we're not that sophisticated yet.
To reduce code overhead, there are a number of functions whose failure is almost never handled (fclose, malloc, write, ...). These functions are generally wrapped in envelopes that assert the success of the call. The user can then use the secure call if he wishes to program safely, or the lower-level interface if he can handle the error himself or if he doesn't care. (Apathy is the only good reason for ignoring return codes.)

One of the problems with the trickle-up method of subroutine failure is that often you do not wish to decide how fatal the error is at the lower level, and so the error trickles up to a much higher level where the severity is understood, but the exact condition which caused the error is lost. There are also cases where no one level contains all the info needed for a meaningful error message.

One solution to this is to use a stack of errnos and errparms instead of single global variables. It also helps to have a user-definable error string that is saved in this stack. As the error gets passed up the call chain, more information is added. If the main program chooses to abort, the entire error stack can be dumped, giving a complete description of the error. Although this generates really nicely detailed error messages with very little coding trouble, I have not used it on any programs that have enough levels of function calling to make it really worthwhile.

Anyway, those are my experiences. And my code is usually a great test suite for error checking mechanisms!

rba iv
alanf@bruce.cs.monash.OZ.AU (Alan Grant Finlay) (11/12/90)
In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
>
> In our key product, which we assume is mission critical for our users,
> we take the strong view that any trapped error is a fatal error. We try

This kind of approach involves either a purist conception of what is an error, or a library package with such a narrow area of application that applications using the package have highly predictable requirements. The problem for the supplier of a (practical) general-purpose library package is that what may be regarded as an error by one application is not so regarded by another. An error such as "device not ready", for example, could mean a fatal error (you forgot to buy the hardware) or a minor interruption (you forgot to turn it on).

A classic example is what to do about a request to perform an operation upon an object which doesn't exist. For many systems the appropriate action is to ignore the operation, while for others it is a fatal error. One can always provide an additional procedure to check that the object exists, but this procedure may do a lot of work which has to be repeated when the original operation is subsequently requested. One way to reduce the wasted effort is to save the work done by the check routine, but this is likely to be messy and to appear complicated to users of the package. The normal solution is to provide some control over what happens when an "error" occurs.

The subject line is "error handling techniques" and the original posting requested information about techniques for library packages. For the reason given above this is better called "exception handling techniques". If you want to get some insight into the difficulty of classifying errors/warnings/exceptions, have a look at the standard functions for the Icon programming language. Icon, which developed from Snobol, has expressions which can evaluate properly or fail.
The language has control constructs such as "repeat <exp>" which repeatedly evaluates the expression until it fails. Most of the standard functions can fail but some also cause program termination (i.e. an error). The decision to classify exceptional circumstances as "fail" or "error" is a difficult and somewhat arbitrary one (as admitted by Ralph Griswold, the author of Icon - sorry I don't have a reference for this statement). The best language for exception handling that I have come across is the functional programming language ML.
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (11/12/90)
In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> The code maintains a history of the last 128 function calls in a circular
> buffer; this information is dumped in the error report.

I wouldn't mind doing something like this, but how do you do it? Have you a set of macros to ease the job, or what? Is _every_ function call included, or only selected ones? I tend to write recursive code; the trouble with that is that when things go wrong, all the calls in the buffer tend to be to the same function (or a small set of mutually recursive functions). Have you found a good way around that?

Something I've done from time to time when procedure calls had to be sequenced carefully (e.g. a program that generated Fortran code was not to call place_label() twice without an intervening end_statement()) was to have a DFA transition table for that kind of object and have the relevant functions do

	if (!permitted_operation[object->state][THISFN])
		error(...);
	object->state = next_state[object->state][THISFN];

--
The problem about real life is that moving one's knight to QB3
may always be replied to with a lob across the net.  --Alasdair Macintyre.
alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) (11/13/90)
In article <238@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> This started out in comp.software-eng, which is where I posted to. Alan's
> comments showed up comp.lang.c. I find this somewhat puzzling. I have
> redirected it back to comp.software-eng.

Actually, I posted the original article to both newsgroups because C is the language I use most and because things can be done in C that may not be possible in all the languages represented by the readers of comp.software-eng.

Meanwhile, Alan Finlay writes:
> In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > In our key product, which we assume is mission critical for our users,
> > we take the strong view that any trapped error is a fatal error. We try
> This kind of approach involves either a purist conception about what is an
> error or a library package which has such a narrow area of application that
> applications using the package have highly predictable requirements.

Actually, I assume (perhaps I'm wrong) that the author is not describing a library package, although such an approach might be appropriate in libraries for very critical applications. There are some times when not doing it at all is much better than doing it wrong.

> The problem for the supplier of a (practical) general purpose library package
> is that what may be regarded an error by one application is not so regarded
> by another.

Excellent point, and the best reason why a lot of simple schemes are really inadequate.

> The subject line is "error handling techniques" and the original posting
> requested information about techniques for library packages. For the reason
> given above this is better called "exception handling techniques".

Actually, the original posting solicited information about techniques for large systems (principally turn-key type applications) as well as libraries.
And it was supposed to address regular user errors (application passes bad parameter to function library, for example) as well as horrible unexpected system errors. It's unclear to me how suitable an exceptions approach is to the former case, although I haven't thought about it a lot.
--
-- Alan                    # My aptitude test in high school suggested that
..!ames!elroy!alan         # I should become a forest ranger.  Take my
alan@elroy.jpl.nasa.gov    # opinions on computers with a grain of salt.
siemsen@sol.usc.edu (Pete Siemsen) (11/13/90)
amodeo@dataco.UUCP (Roy Amodeo) writes:
>One solution to this is to use a stack of errnos and errparms instead of
>single global variables. It also helps to have a user definable error
>string that is saved in this stack. As the error gets passed up the call
>chain more information is added. If the main program chooses to abort,
>the entire error stack can be dumped giving a complete description of the
>error. Although this generates really nicely detailed error messages
>with very little coding trouble, I have not used it on any programs that
>have enough levels of function calling to make it really worthwhile.

I used exactly such a scheme on a project a few years ago. It worked very well, but meant that every subroutine call (this was FORTRAN) took about 6 lines of code. At any level, a subroutine could decide to "handle" errors from below, or pass the errors on up (adding its own idea of what was wrong first). Once we got used to the verbosity (in C, macros would reduce this), we were impressed with how well it worked. Whenever an error occurred, you'd get a message something like

	GETREC: READ failed: illegal logical unit number
	READFILE: unable to read record from file
	COPYFILES: unable to read input file ABC.DEF
	MAIN: unable to copy files

which is about as helpful as it gets.

This was one of the smoothest software projects I have ever worked on. Of course, there were only five of us working on the code (for about 9 months), and we all agreed on the error system before starting. The customer was very pleased.
--
Pete Siemsen                               Pete Siemsen
siemsen@usc.edu                            University of Southern California
645 Ohio Ave. #302    (213) 740-7391 (w)   1020 West Jefferson Blvd.
Long Beach, CA 90814  (213) 433-3059 (h)   Los Angeles, CA 90089-0251
rh@smds.UUCP (Richard Harter) (11/13/90)
In article <4246@goanna.cs.rmit.oz.au>, ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > The code maintains a history of the last 128 function calls in a circular
> > buffer; this information is dumped in the error report.
> I wouldn't mind doing something like this, but how do you do it?
> Have you a set of macros to ease the job, or what?
> Is _every_ function call included, or only selected ones?
> I tend to write recursive code, the trouble with that is that
> when things go wrong all the calls in the buffer tend to be to
> the same function (or a small set of mutually recursive functions),
> have you found a good way around that?

Actually the implementation I use is brutally simple. I set up a global array of char pointers and a global integer used as an index into the array, e.g.

	char *TR[128];
	int  TI;

An initialization routine zeroes out the array and sets TI = 0. The standard include file contains the macro

	#define trace(foo) \
		do { \
			TR[TI--] = (foo); \
			if (TI < 0) TI = 127; \
			TR[TI] = 0; \
		} while (0)

<Note: in this implementation TI is the next location to be filled.> The first statement in each routine (of interest) invokes trace with the name of the routine, e.g.

	trace("somefunc");

It's not terribly sophisticated, but it's very useful. As you note, the buffer can fill up with a few names if you are doing a lot of recursion (or are calling a routine within a loop). A simple technique for dealing with this (which I've never bothered to implement) is to add a count array and increment the count if the pointers are equal.
Something like the following should work:

	#define trace(foo) \
		do { \
			if (TR[TI] != (foo)) { \
				TI--; \
				if (TI < 0) TI = 127; \
				TR[TI] = (foo); \
				TRCNT[TI] = 1; \
			} else { \
				TRCNT[TI]++; \
			} \
		} while (0)

<Note 1: Meaning shift; TI is now the index of the slot most recently filled.>
<Note 2: I haven't checked the above code.>

The error dump routine has the requisite code for going through the strings, getting their lengths, and printing out the array in nice neat columns starting at the right place.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
rh@smds.UUCP (Richard Harter) (11/13/90)
In article <1990Nov12.184217.25361@elroy.jpl.nasa.gov>, alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
> In article <238@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > This started out in comp.software-eng, which is where I posted to. Alan's
> > comments showed up comp.lang.c. I find this somewhat puzzling. I have
> > redirected it back to comp.software-eng.
> Actually, I posted the original article to both newsgroups because C is the
> language I use most and because things can be done in C that may not be
> possible in all the languages represented by the readers of comp.software-eng.

Oh. That makes sense. My apologies for confusing the issue.

Continuing from your remarks: there are actually a number of categories of software to consider, with different error handling issues. For example, there are application programs with human interaction, application programs which are called by other programs, subroutine libraries called by a specific set of programs, and libraries which can be called by arbitrary programs. We can talk about interface errors and internal errors. An application program may be mission critical in the sense that it must continue operating and do the best that it can despite errors; conversely, it may be mission critical in the sense that it must not produce erroneous results. It may not be mission critical at all; in fact there may be an acceptable level of erroneous behaviour. And so on... From this I would conclude that there are a number of possible strategies.

One general remark I would like to make on this topic is that it seems imperative that error handling philosophies be consistent between components being put together. For example, a 'mission critical' program should not use routines from a library that dump core or do something erratic if there is a usage error. This places strong constraints on general purpose libraries; IMHO there should be strong constraints. If a GPL routine or program is unsafe (i.e.
it can behave in unpredictable ways in response to errors in interface usage) it should be labelled as such.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
geoffb@butcombe.inmos.co.uk (Geoff Barrett) (11/13/90)
Does anyone have an example of a properly handled error?