alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) (11/03/90)
I'm interested in what approaches people use for error handling, particularly in general-purpose function libraries and large software systems. If someone can reference a text or article, that would be good.

Meanwhile, to give an example of the kind of thing I'm thinking of, say you're writing a function library. When you detect an error (the distinction between user and internal errors is ignored here) you can:

1. Set an error code (similar to errno in Unix) and return -1, leaving the application to check for specific errors or print out one of a supplied set of messages.

2. Call an application-specified error handler, and if none, call a default error handler.

3. Print out a message from the package. (Where this should be displayed and how is also a factor. It may not be possible to write to the user's terminal.)

Similarly, say an application discovers that it has some internal error. Should it:

1. Print a message and dump core.

2. Save some transparently-maintained log of user actions to disk along with a message indicating what's happened.

3. Provide a traceback of function calls and ask the user whether or not to continue.

What other approaches are there?
--
-- Alan                    # My aptitude test in high school suggested that
..!ames!elroy!alan         # I should become a forest ranger.  Take my
alan@elroy.jpl.nasa.gov    # opinions on computers with a grain of salt.
rmartin@clear.com (Bob Martin) (11/03/90)
In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>I'm interested in what approaches people use for error handling, particularly
>in general purpose function libraries and large software systems. If someone

Alan:

In a large software system the number of places where the code can detect errors can range into the tens of thousands. Every call to malloc should be checked, every IO operation, every place where something can go wrong. This leads to an enormous amount of error checking and error messages.

What I have done in the past to cope with this is to create an ErrorLog function which writes single-line error messages into an error log file. The messages consist of the time, the process id, and three numbers: <mod>.<loc>(<param>).

<mod> is a unique number which identifies the module in which the error occurred. (By module I mean .c file.)

<loc> is a number unique to that module which identifies the specific call to the ErrorLog function. Thus if a module can detect 20 errors, it will do so with 20 separate <loc> numbers. Errors of similar types should _not_ use the same <loc>!

<param> is a piece of useful information (if any) which can be logged with the error. For example, when malloc returns NULL, I usually put the size that I was attempting to malloc in the <param>. errno would be useful in <param> when appropriate.

Every hour I close the current error log file and open a new one. At the end of the day I compile them into a summary and eyeball them to see if anything horrible went wrong. Software can be written to automatically scan these logs for critical errors.

Hope this helps. I welcome discussion.
----------------------------------------------------------------------
R. Martin
rmartin@clear.com
uunet!clrcom!rmartin
----------------------------------------------------------------------
cox@stpstn.UUCP (Brad Cox) (11/04/90)
In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>I'm interested in what approaches people use for error handling, particularly
>in general purpose function libraries and large software systems. If someone
>can reference a text or article, that would be good.

You can (and should) do a lot better than this even in languages (like Objective-C or C++) that don't (yet) support exception handling. The same tricks should be even easier in Smalltalk, but I'll leave that to Smalltalk people to verify. In what follows, I'm speaking of my own private research system, not the commercial Objective-C release, which I'm trying to get changed to adopt this scheme.

The essence of the idea is to forbid routines that encounter exceptions from signaling the exception by returning normally with an abnormal return value. The rule is this: if a message returns at all, the message has succeeded at its task, and any values returned are valid, useful, and need not be checked. If it failed because of an exception, it *never* returns normally, but instead returns to a handler specifically for that exception.

To adopt this scheme, seek out all old code that signals errors the old way (such as Object factory methods like new, instance methods like doesNotUnderstand:, and utility stuff like assert()), and recode it to invoke a generalized exception handling facility, perhaps by calling raiseException(anExceptionCode) or (as in my case) [AssertionException raiseException], [OutOfMemoryException raiseException], etc. These calls access a stack of exception handlers, initialized at startup time with handlers that treat all exceptions in some usefully generic way, perhaps by printing the exception name and raising an inspector. The handlers form a stack so that your code can override the generic inspector with something more application-specific.
If you lift the hood and peek inside, raiseException is implemented with longjmp(), and the code that pushes a new exception handler is implemented with setjmp(). However it is implemented, though, the scheme you describe, although conventionally supported by nearly all languages, has extremely poor properties for building large robust software libraries. That is why so many 'modern' languages (Ada, Eiffel?, CLU, etc.) feature 'true' exception handling somewhat as outlined above.
--
Brad Cox; cox@stepstone.com; CI$ 71230,647; 203 426 1875
The Stepstone Corporation; 75 Glen Road; Sandy Hook CT 06482
bertrand@eiffel.UUCP (Bertrand Meyer) (11/07/90)
From <1990Nov2.205831.23696@elroy.jpl.nasa.gov> by alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer)
> I'm interested in what approaches people use for error handling, particularly
> in general purpose function libraries and large software systems. If someone
> can reference a text or article, that would be good.

Some of the classic references are the articles by Brian Randell in the seventies on recovery blocks, continued by several people, in particular Flaviu Cristian. (Randell is a professor at the University of Newcastle, and Cristian, who when I last heard was at IBM's Almaden laboratories, did his PhD with him.) Here are two references among many (in Refer format):

%A Brian Randell
%T System Structure for Software Fault Tolerance
%J IEEE Transactions on Software Engineering
%V SE-1
%N 2
%D June 1975
%P 220-232

%A Flaviu Cristian
%T On Exceptions, Failures and Errors
%J Technology and Science of Informatics
%V 4
%N 1
%D January 1985
%K TSI

(Cristian also had a paper in IEEE Transactions on SE, but I don't have the exact reference here. I could find it if needed, though.)

Some of the work around CLU is also interesting, e.g.

%A Barbara A. Liskov
%A Alan Snyder
%T Exception Handling in CLU
%J IEEE Transactions on Software Engineering
%V SE-5
%N 6
%D November 1979
%P 546-558

(I should add that I have strong objections both to the Randell-Cristian approach and to the CLU exception mechanism which, however, is certainly less dangerous than Ada's. But all of the above articles are good reading regardless of whether one agrees with the stand they take.)

Let me also, with a total absence of modesty, point at some of my own work in the context of object-oriented design, in particular the book ``Object-Oriented Software Construction'' (Prentice-Hall): Chapter 7, Systematic Approaches to Software Construction (especially 7.10, Coping with Failure), and section 9.3, Dealing with Abnormal Cases.
The approach expounded there is based on a theory called Programming by Contract, which is further developed in a long article with precisely this title. The article is currently part of the book ``An Eiffel Collection'' published by my company, but will be republished as a chapter of a Prentice-Hall collective book entitled ``Advances in Object-Oriented Software Engineering'', edited by Dino Mandrioli and myself. (That book is in press and should be available in a few months.)
--
-- Bertrand Meyer
bertrand@eiffel.com
rh@smds.UUCP (Richard Harter) (11/11/90)
In article <1990Nov3.153643.26368@clear.com>, rmartin@clear.com (Bob Martin) writes:
> In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
> >I'm interested in what approaches people use for error handling, particularly
> >in general purpose function libraries and large software systems. If someone

[See the referenced article, which is commendably well written.]

In our key product, which we assume is mission critical for our users, we take the strong view that any trapped error is a fatal error. We try to arrange that the software fails gracefully and that it produces as much information about the error as possible. Our view is that the software should not fail, so we don't put any bugs in the code. :-)

Seriously, here are some of the techniques used:

The code is liberally sprinkled with error checks. Failed validity checks are fatal; they generate a call to a universal error handler. The error handler generates an error report (if possible) and exits.

The code maintains a history of the last 128 function calls in a circular buffer; this information is dumped in the error report.

Each error type has its own message number.

System utilities (e.g. storage allocation and file I/O) are wrapped; the wrappers have their own error reports. Other information (ERRNO where relevant, for example) is included.

The general view is that a failure is a bug; they aren't supposed to happen. If a failure should indeed happen, we want as much information as possible to find and eliminate the bug.

I would like to see more contributions on this topic.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
amodeo@dataco.UUCP (Roy Amodeo) (11/12/90)
In article <1990Nov3.153643.26368@clear.com> rmartin@clear.com (Bob Martin) writes:
>In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>>I'm interested in what approaches people use for error handling, particularly
>>in general purpose function libraries and large software systems. If someone
>
>Alan:
>
>In a large software system the number of places where the code can
>detect errors can range into the tens of thousands. ...
>
>What I have done in the past to cope with this is to create an
>ErrorLog function which will write single line error messages into
>an error log file. ...
> ... Errors of similar types should _not_ use the
>same <loc>! ...
>
>Every hour I close the current error log file and open a new one.
>At the end of the day I compile them into a summary and eyeball
>them to see if anything horrible went wrong. Software can be written
>to automatically scan these logs to see if there are critical errors.
>
> Hope this helps.
> I welcome discussion.

You got it. I have some general comments about your scheme.

The <loc> numbers would have to either be stored in one central place, or you would need an allocation scheme that allocates blocks of numbers to various subsystems. Either way, this seems like an awful lot of work to set up.

The manual nature of evaluating whether any serious errors have occurred bothers me (unless you're the only one who runs your software). It would require a rather intimate knowledge of the entire system. It also bothers me that the errors are "hidden" in a logfile (again assuming other people run your software).

Out of curiosity, how big do your logfiles get?

>----------------------------------------------------------------------
>R. Martin
>rmartin@clear.com
>uunet!clrcom!rmartin
>----------------------------------------------------------------------

And, in answer to Alan's original posting: I really like exceptions. I don't use them.
Exceptions in C require writing an exception handling mechanism, which I have never had the time to write for my own "small" programs. There are other systems I use which have used different error handling mechanisms from Day One and are "too big" to change now.

All the code we write returns 0 on error. (By never using '0' as an index, I can usually get away with this.) Usually failures are trickled up to the function level where enough is known that they can be handled.

The macro we use to "fail out" of a routine is called FAILIF and takes a condition, an error number, and an error parm as parameters. If the condition is true, the error number is assigned to the global variable errno, and the error parm is assigned to the global variable errparm. In addition, FAILIF's behaviour can be modified to do a little cleanup before returning, which solves one of the problems of multiple returns, although it is not that elegant.

At higher levels, we will check for errors that we do not wish to handle (like failures from malloc()) by using fatal assertions. A fatal assertion asserts that a condition is true (nonzero); otherwise, it prints the string argument, hex dumps any areas of memory that the user wishes to dump, prints the errno, the name of the errno, the errparm, the line number, the file name, and a function trace. (This varies from the standard UNIX assert() mechanism.) It then terminates the program. A non-fatal assert is also available for conditions that must be reported but need not be acted upon.

Code using FAILIF and fatal assertions reads quite easily and is easy to write. You generally check a condition once, FAILIF or FASSERT it, and continue, secure that you are dealing with only valid values from here on in. Assertions should actually be coded in the interface to the routine because they can be valuable documentation, but we're not that sophisticated yet.
To reduce code overhead, there are a number of functions whose failure is almost never handled (fclose, malloc, write, ...). These functions are generally wrapped in envelopes that assert the success of the call. The user can then use the secure call if he wishes to program safely, or the lower-level interface if he can handle the error himself or if he doesn't care. (Apathy is the only good reason for ignoring return codes.)

One of the problems with the trickle-up method of subroutine failure is that often you do not wish to decide how fatal the error is at the lower level, and so the error trickles up to a much higher level where the severity is understood, but the exact condition which caused the error is lost. There are also cases where no one level contains all the info needed for a meaningful error message.

One solution to this is to use a stack of errnos and errparms instead of single global variables. It also helps to have a user-definable error string that is saved in this stack. As the error gets passed up the call chain, more information is added. If the main program chooses to abort, the entire error stack can be dumped, giving a complete description of the error. Although this generates really nicely detailed error messages with very little coding trouble, I have not used it on any programs that have enough levels of function calling to make it really worthwhile.

Anyway, those are my experiences. And my code is usually a great test suite for error checking mechanisms!

rba iv
alanf@bruce.cs.monash.OZ.AU (Alan Grant Finlay) (11/12/90)
In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
>
> In our key product, which we assume is mission critical for our users,
> we take the strong view that any trapped error is a fatal error. We try

This kind of approach involves either a purist conception of what is an error, or a library package with such a narrow area of application that applications using the package have highly predictable requirements. The problem for the supplier of a (practical) general-purpose library package is that what may be regarded as an error by one application is not so regarded by another. An error such as "device not ready", for example, could mean a fatal error (you forgot to buy the hardware) or a minor interruption (you forgot to turn it on).

A classic example is what to do about a request to perform an operation upon an object which doesn't exist. For many systems the appropriate action is to ignore the operation, while for others it is a fatal error. One can always provide an additional procedure to check that the object exists, but this procedure may do a lot of work which has to be repeated when the original operation is subsequently requested. One way to reduce the wasted effort is to save the work done by the check routine, but this is likely to be messy and to appear complicated to users of the package. The normal solution is to provide some control over what happens when an "error" occurs.

The subject line is "error handling techniques" and the original posting requested information about techniques for library packages. For the reason given above this is better called "exception handling techniques". If you want to get some insight into the difficulty of classifying errors/warnings/exceptions, have a look at the standard functions for the Icon programming language. Icon, which developed from Snobol, has expressions which can evaluate properly or fail.
The language has control constructs such as "repeat <exp>" which repeatedly evaluates the expression until it fails. Most of the standard functions can fail but some also cause program termination (i.e. an error). The decision to classify exceptional circumstances as "fail" or "error" is a difficult and somewhat arbitrary one (as admitted by Ralph Griswold, the author of Icon - sorry I don't have a reference for this statement). The best language for exception handling that I have come across is the functional programming language ML.
ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (11/12/90)
In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> The code maintains a history of the last 128 function calls in a circular
> buffer; this information is dumped in the error report.

I wouldn't mind doing something like this, but how do you do it? Have you a set of macros to ease the job, or what? Is _every_ function call included, or only selected ones? I tend to write recursive code; the trouble with that is that when things go wrong, all the calls in the buffer tend to be to the same function (or a small set of mutually recursive functions). Have you found a good way around that?

Something I've done from time to time when procedure calls had to be sequenced carefully (e.g. a program that generated Fortran code was not to call place_label() twice without an intervening end_statement()) was to have a DFA transition table for that kind of object and have the relevant functions do

	if (!permitted_operation[object->state][THISFN])
		error(...);
	object->state = next_state[object->state][THISFN];

--
The problem about real life is that moving one's knight to QB3
may always be replied to with a lob across the net.  --Alasdair Macintyre.
alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) (11/13/90)
In article <238@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> This started out in comp.software-eng, which is where I posted to. Alan's
> comments showed up comp.lang.c. I find this somewhat puzzling. I have
> redirected it back to comp.software-eng.

Actually, I posted the original article to both newsgroups because C is the language I use most and because things can be done in C that may not be possible in all the languages represented by the readers of comp.software-eng.

Meanwhile, Alan Finlay writes:
> In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > In our key product, which we assume is mission critical for our users,
> > we take the strong view that any trapped error is a fatal error. We try
> This kind of approach involves either a purist conception about what is an
> error or a library package which has such a narrow area of application that
> applications using the package have highly predictable requirements.

Actually, I assume (perhaps I'm wrong) that the author is not describing a library package, although such an approach might be appropriate in libraries for very critical applications. There are some times when not doing it at all is much better than doing it wrong.

> The problem for the supplier of a (practical) general purpose library package
> is that what may be regarded an error by one application is not so regarded
> by another.

Excellent point, and the best reason why a lot of simple schemes are really inadequate.

> The subject line is "error handling techniques" and the original posting
> requested information about techniques for library packages. For the reason
> given above this is better called "exception handling techniques".

Actually, the original posting solicited information about techniques for large systems (principally turn-key type applications) as well as libraries.
And it was supposed to address regular user errors (application passes bad parameter to function library, for example) as well as horrible unexpected system errors. It's unclear to me how suitable an exceptions approach is to the former case, although I haven't thought about it a lot.
--
-- Alan                    # My aptitude test in high school suggested that
..!ames!elroy!alan         # I should become a forest ranger.  Take my
alan@elroy.jpl.nasa.gov    # opinions on computers with a grain of salt.
siemsen@sol.usc.edu (Pete Siemsen) (11/13/90)
amodeo@dataco.UUCP (Roy Amodeo) writes:
>One solution to this is to use a stack of errnos and errparms instead of
>single global variables. It also helps to have a user definable error
>string that is saved in this stack. As the error gets passed up the call
>chain more information is added. If the main program chooses to abort,
>the entire error stack can be dumped giving a complete description of the
>error. Although this generates really nicely detailed error messages
>with very little coding trouble, I have not used it on any programs that
>have enough levels of function calling to make it really worthwhile.

I used exactly such a scheme on a project a few years ago. It worked very well, but meant that every subroutine call (this was FORTRAN) took about 6 lines of code. At any level, a subroutine could decide to "handle" errors from below, or pass the errors on up (adding its own idea of what was wrong first). Once we got used to the verbosity (in C, macros would reduce this), we were impressed with how well it worked. Whenever an error occurred, you'd get a message something like

	GETREC: READ failed: illegal logical unit number
	READFILE: unable to read record from file
	COPYFILES: unable to read input file ABC.DEF
	MAIN: unable to copy files

which is about as helpful as it gets.

This was one of the smoothest software projects I have ever worked on. Of course, there were only five of us working on the code (for about 9 months), and we all agreed on the error system before starting. The customer was very pleased.
--
Pete Siemsen                               Pete Siemsen
siemsen@usc.edu                            University of Southern California
645 Ohio Ave. #302    (213) 740-7391 (w)   1020 West Jefferson Blvd.
Long Beach, CA 90814  (213) 433-3059 (h)   Los Angeles, CA 90089-0251
rh@smds.UUCP (Richard Harter) (11/13/90)
In article <4246@goanna.cs.rmit.oz.au>, ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) writes:
> In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > The code maintains a history of the last 128 function calls in a circular
> > buffer; this information is dumped in the error report.
> I wouldn't mind doing something like this, but how do you do it?
> Have you a set of macros to ease the job, or what?
> Is _every_ function call included, or only selected ones?
> I tend to write recursive code, the trouble with that is that
> when things go wrong all the calls in the buffer tend to be to
> the same function (or a small set of mutually recursive functions),
> have you found a good way around that?

Actually the implementation I use is brutally simple. I set up a global array of char pointers and a global integer used as an index into the array, e.g.

	char *TR[128];
	int  TI;

An initialization routine zeroes out the array and sets TI = 0. The standard include file contains the macro

	#define trace(foo) \
		do { \
			TR[TI--] = (foo); \
			if (TI < 0) TI = 127; \
			TR[TI] = 0; \
		} while (0)

<Note: in this implementation TI is the next location to be filled.> The first statement in each routine (of interest) invokes trace with the name of the routine, e.g.

	trace("somefunc");

It's not terribly sophisticated, but it's very useful. As you note, the buffer can fill up with a few names if you are doing a lot of recursion (or are calling a routine within a loop). A simple technique for dealing with this (which I've never bothered to implement) is to add a count array and increment the count if the pointers are equal.
Something like the following should work:

	#define trace(foo) \
		do { \
			if (TR[TI] != (foo)) { \
				TI--; \
				if (TI < 0) TI = 127; \
				TR[TI] = (foo); \
				TRCNT[TI] = 1; \
			} else { \
				TRCNT[TI]++; \
			} \
		} while (0)

<Note 1: Meaning shift; TI is now the index of the slot most recently filled.>
<Note 2: I haven't checked the above code.>

The error dump routine has the requisite code for going through the strings, getting their lengths, and printing out the array in nice neat columns starting at the right place.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
rh@smds.UUCP (Richard Harter) (11/13/90)
In article <1990Nov12.184217.25361@elroy.jpl.nasa.gov>, alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
> In article <238@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > This started out in comp.software-eng, which is where I posted to. Alan's
> > comments showed up comp.lang.c. I find this somewhat puzzling. I have
> > redirected it back to comp.software-eng.
> Actually, I posted the original article to both newsgroups because C is the
> language I use most and because things can be done in C that may not be
> possible in all the languages represented by the readers of comp.software-eng.

Oh. That makes sense. My apologies for confusing the issue.

Continuing from your remarks: there are actually a number of categories of software to consider, with different error handling issues. For example, there are application programs with human interaction, application programs which are called by other programs, subroutine libraries called by a specific set of programs, and libraries which can be called by arbitrary programs. We can talk about interface errors and internal errors. An application program may be mission critical in the sense that it must continue operating and do the best that it can despite errors; conversely, it may be mission critical in the sense that it must not produce erroneous results. It may not be mission critical at all; in fact there may be an acceptable level of erroneous behaviour. And so on... From this I would conclude that there are a number of possible strategies.

One general remark I would like to make on this topic is that it seems imperative that error handling philosophies be consistent between components being put together. For example, a 'mission critical' program should not use routines from a library that dump core or do something erratic if there is a usage error. This places strong constraints on general purpose libraries; IMHO there should be strong constraints. If a GPL routine or program is unsafe (i.e.
it can behave in unpredictable ways in response to errors in interface usage) it should be labelled as such.
--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh   Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
geoffb@butcombe.inmos.co.uk (Geoff Barrett) (11/13/90)
Does anyone have an example of a properly handled error?