alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) (11/03/90)
I'm interested in what approaches people use for error handling, particularly in general purpose function libraries and large software systems. If someone can reference a text or article, that would be good. Meanwhile, to give an idea of the type of stuff I'm thinking of, here are some examples.

Say you're writing a function library. When you detect an error (the distinction between user and internal errors is ignored here) you can

1. Set an error code (similar to errno in Unix) and return -1, leaving the application to check for specific errors or print out one of a supplied set of messages.
2. Call an application-specified error handler, and if none, call a default error handler.
3. Print out a message from the package. (Where this should be displayed and how is also a factor. It may not be possible to write to the user's terminal.)

Similarly, say an application discovers that it has some internal error. Should it

1. Print a message and dump core.
2. Save some transparently-maintained log of user actions to disk along with a message indicating what's happened.
3. Provide a traceback of function calls and ask the user whether or not to continue.

What other approaches are there?

--
-- Alan                     # My aptitude test in high school suggested that
..!ames!elroy!alan          # I should become a forest ranger.  Take my
alan@elroy.jpl.nasa.gov     # opinions on computers with a grain of salt.
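Editor's sketch of the second library option in the post above (an application-specified handler with a default fallback), combined with the "return -1" convention from the first option. All names here (`error_handler`, `set_error_handler`, `report_error`, `recording_handler`) are hypothetical and do not come from any real library; this is one possible shape, not a recommendation.

```c
#include <stdio.h>

/* Hypothetical handler signature; not taken from any real library. */
typedef void (*error_handler)(int code, const char *message);

static error_handler current_handler = NULL;  /* NULL means "use the default" */

/* Example application handler: it only remembers the last code it was given. */
int last_seen_code = 0;

void recording_handler(int code, const char *message)
{
    (void)message;
    last_seen_code = code;
}

/* Default used when the application has installed nothing. */
static void default_handler(int code, const char *message)
{
    fprintf(stderr, "library error %d: %s\n", code, message);
}

/* Install a handler (or NULL for the default); returns the previous one. */
error_handler set_error_handler(error_handler handler)
{
    error_handler previous = current_handler;
    current_handler = handler;
    return previous;
}

/* Called inside the library when an error is detected; always returns -1
 * so a library routine can write `return report_error(...)`. */
int report_error(int code, const char *message)
{
    if (current_handler != NULL)
        current_handler(code, message);
    else
        default_handler(code, message);
    return -1;
}
```

Returning the previous handler from `set_error_handler` lets an application install its own handler temporarily and restore the old one afterwards.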
rmartin@clear.com (Bob Martin) (11/03/90)
In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>I'm interested in what approaches people use for error handling, particularly
>in general purpose function libraries and large software systems. If someone

Alan:

In a large software system the number of places where the code can detect errors can range into the tens of thousands. Every call to malloc should be checked, every IO operation, every place where something can go wrong should be checked... This leads to an enormous amount of error checking and error messages.

What I have done in the past to cope with this is to create an ErrorLog function which will write single line error messages into an error log file. The messages consist of the time, process id, and three numbers: mod,loc(param).

<mod> is a unique number which identifies the module in which the error occurred. (By module I mean .c file.)

<loc> is a number unique to that module which identifies the specific call to the ErrorLog function. Thus if a module can detect 20 errors, it will do so with 20 separate <loc> numbers. Errors of similar types should _not_ use the same <loc>!

<param> is a piece of useful information (if any) which can be logged with the error. For example, when a malloc returns NULL, I usually put the size that I was attempting to malloc in the <param>. errno would be useful in <param> when appropriate....

Every hour I close the current error log file and open a new one. At the end of the day I compile them into a summary and eyeball them to see if anything horrible went wrong. Software can be written to automatically scan these logs to see if there are critical errors.

Hope this helps. I welcome discussion.

----------------------------------------------------------------------
R. Martin
rmartin@clear.com
uunet!clrcom!rmartin
----------------------------------------------------------------------
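Editor's sketch of the ErrorLog scheme described in the post above: one line per error holding the time, the process id, and mod,loc(param). The exact line layout, the function names `format_error_line` and `error_log`, and the module number `MOD_ALLOC` are assumptions made for illustration; the post does not give them. `getpid()` assumes a POSIX system. Hourly log rotation is left out.

```c
#include <stdio.h>
#include <time.h>
#include <unistd.h>   /* getpid(); POSIX assumption */

/* Hypothetical module number for one .c file; the numbering is illustrative. */
#define MOD_ALLOC 17

/* Format one log line as "time pid mod,loc(param)".
 * Returns what snprintf returns: the length the full line needs. */
int format_error_line(char *buf, size_t size, time_t when, long pid,
                      int mod, int loc, long param)
{
    return snprintf(buf, size, "%ld %ld %d,%d(%ld)",
                    (long)when, pid, mod, loc, param);
}

/* Append one line to an already-open log stream.
 * Returns 0 on success, -1 if the write or flush failed. */
int error_log(FILE *log, int mod, int loc, long param)
{
    char line[128];

    format_error_line(line, sizeof line, time(NULL), (long)getpid(),
                      mod, loc, param);
    if (fprintf(log, "%s\n", line) < 0)
        return -1;
    return fflush(log) == EOF ? -1 : 0;
}
```

A failed allocation of `size` bytes might then be logged as `error_log(log, MOD_ALLOC, 3, (long)size)`, where 3 is the <loc> unique to that call site.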
cox@stpstn.UUCP (Brad Cox) (11/04/90)
In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>I'm interested in what approaches people use for error handling, particularly
>in general purpose function libraries and large software systems. If someone
>can reference a text or article, that would be good.
>

You can (and should) do a lot better than this even in languages (like Objective-C or C++) that don't (yet) support exception handling. The same tricks should be even easier in Smalltalk, but I'll leave this to Smalltalk people to verify. In what follows, I'm speaking of my own private research system, not the commercial Objective-C release, which I'm trying to get changed to adopt this scheme.

The essence of the idea is to forbid routines that encounter exceptions from signaling the exception by returning normally with an abnormal return value. The idea is the following rule: if a message returns at all, the message has succeeded at its task and any values returned are valid, useful, and need not be checked. If it failed because of an exception, it *never* returns normally, but instead returns to a handler specifically for that exception.

To adopt this scheme, seek out all old code that does this (such as Object factory methods like new, instance methods like doesNotUnderstand:, and utility stuff like assert()), and recode it to invoke a generalized exception handling facility, perhaps by calling raiseException(anExceptionCode) or (as in my case) [AssertionException raiseException], [OutOfMemoryException raiseException], etc.

These calls access a stack of exception handlers, initialized at startup time with handlers that treat all exceptions in some usefully generic way, perhaps by printing the exception name and raising an inspector. This is a stack so that your code can override the generic inspector with something more application specific.

If you lift the hood and peek inside, raiseException is implemented with longjmp(), and the code that pushes a new exception handler is implemented with setjmp().

However it is implemented, the scheme you describe, although standardly supported by nearly all languages, has extremely poor properties for building large robust software libraries. That is why so many 'modern' languages (Ada, Eiffel?, CLU, etc) feature 'true' exception handling somewhat as outlined above.

--
Brad Cox; cox@stepstone.com; CI$ 71230,647; 203 426 1875
The Stepstone Corporation; 75 Glen Road; Sandy Hook CT 06482
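Editor's sketch of a handler stack built on setjmp()/longjmp(), in plain C rather than Objective-C. This is not Cox's implementation, which the post does not show; `raise_exception`, `run_guarded`, the exception codes, and the fixed stack depth are all assumptions. The point it illustrates is the rule from the post: a routine that fails never returns normally to its caller.

```c
#include <setjmp.h>
#include <stdio.h>
#include <stdlib.h>

enum { EXC_ASSERTION = 1, EXC_OUT_OF_MEMORY = 2 };  /* hypothetical codes */

#define MAX_HANDLERS 16

static jmp_buf handler_stack[MAX_HANDLERS];
static int handler_depth = 0;
static int last_exception = 0;

/* Reserve the next handler slot; setjmp() is then applied to it. */
static jmp_buf *push_handler(void)
{
    if (handler_depth == MAX_HANDLERS) {
        fprintf(stderr, "handler stack overflow\n");
        abort();
    }
    return &handler_stack[handler_depth++];
}

/* Never returns normally: pops the innermost handler and jumps to it. */
void raise_exception(int code)
{
    if (handler_depth == 0) {
        fprintf(stderr, "unhandled exception %d\n", code);
        abort();
    }
    last_exception = code;
    longjmp(handler_stack[--handler_depth], 1);
}

/* Runs body under a fresh handler.
 * Returns 0 if body completed, or the exception code if it raised. */
int run_guarded(void (*body)(void))
{
    if (setjmp(*push_handler()) == 0) {
        body();
        handler_depth--;          /* normal completion: discard our handler */
        return 0;
    }
    return last_exception;        /* reached via longjmp; already popped */
}

/* Small bodies used to exercise the mechanism. */
void succeeds(void) { }

void always_fails(void) { raise_exception(EXC_ASSERTION); }

void nested_failure(void)
{
    /* The inner handler catches one exception, then a different one
     * is raised to whatever handler is outside this function. */
    if (run_guarded(always_fails) == EXC_ASSERTION)
        raise_exception(EXC_OUT_OF_MEMORY);
}
```

Because `run_guarded` modifies no local variables between `setjmp` and the jump back, it sidesteps the usual requirement to declare such locals `volatile`. Resources acquired by the interrupted body are not released; a real facility would need a cleanup mechanism.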
bertrand@eiffel.UUCP (Bertrand Meyer) (11/07/90)
From <1990Nov2.205831.23696@elroy.jpl.nasa.gov> by alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer)
> I'm interested in what approaches people use for error handling, particularly
> in general purpose function libraries and large software systems. If someone
> can reference a text or article, that would be good.

Some of the classic references are the articles by Brian Randell in the seventies on recovery blocks, continued by several people, in particular Flaviu Cristian. (Randell is a professor at the University of Newcastle, and Cristian, who when I last heard was at IBM's Almaden laboratories, did his PhD with him.) Here are two references among many (in Refer format):

%A Brian Randell
%T System Structure for Software Fault Tolerance
%J IEEE Transactions on Software Engineering
%V SE-1
%N 2
%D June 1975
%P 220-232

%A Flaviu Cristian
%T On Exceptions, Failures and Errors
%J Technology and Science of Informatics
%V 4
%N 1
%D January 1985
%K TSI

(Cristian also had a paper in IEEE Transactions on SE, but I don't have the exact reference here. I could find it if needed, though.)

Some of the work around CLU is also interesting, e.g.

%A Barbara A. Liskov
%A Alan Snyder
%T Exception Handling in CLU
%J IEEE Transactions on Software Engineering
%V SE-5
%N 6
%D November 1979
%P 546-558

(I should add that I have strong objections both to the Randell-Cristian approach and to the CLU exception mechanism which, however, is certainly less dangerous than Ada's. But all of the above articles are good reading regardless of whether one agrees with the stand they take.)

Let me also, with a total absence of modesty, point at some of my own work in the context of object-oriented design, in particular the book ``Object-Oriented Software Construction'' (Prentice-Hall): Chapter 7, Systematic Approaches to Software Construction (especially 7.10, Coping with Failure), and section 9.3, Dealing with Abnormal Cases.
The approach expounded there is based on a theory called Programming by Contract, which is further developed in a long article with precisely this title. The article is currently part of the book ``An Eiffel Collection'' published by my company, but will be republished as a chapter of a Prentice-Hall collective book entitled ``Advances in Object-Oriented Software Engineering'', edited by Dino Mandrioli and myself. (That book is in press and should be available in a few months.) -- -- Bertrand Meyer bertrand@eiffel.com
rh@smds.UUCP (Richard Harter) (11/11/90)
In article <1990Nov3.153643.26368@clear.com>, rmartin@clear.com (Bob Martin) writes: > In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes: > >I'm interested in what approaches people use for error handling, particularly > >in general purpose function libraries and large software systems. If someone [See the referenced article, which is commendably well written.] In our key product, which we assume is mission critical for our users, we take the strong view that any trapped error is a fatal error. We try to arrange that the software fails gracefully and that it produces as much information about the error as possible. Our view is that the software should not fail, so we don't put any bugs in the code. :-) Seriously, here are some of the techniques used: The code is liberally sprinkled with error checks. Failed validity checks are fatal; they generate a call to a universal error handler. The error handler generates an error report (if possible) and exits. The code maintains a history of the last 128 function calls in a circular buffer; this information is dumped in the error report. Each error type has its own message number. System utilities (e.g. storage allocation and file I/O) are wrapped; the wrappers have their own error service reports. Other information (ERRNO where relevant, for example) is included. The general view is that a failure is a bug; they aren't supposed to happen. If a failure should indeed happen, we want as much information as possible to find and eliminate the bug. I would like to see more contributions on this topic. -- Richard Harter, Software Maintenance and Development Systems, Inc. Net address: jjmhome!smds!rh Phone: 508-369-7398 US Mail: SMDS Inc., PO Box 555, Concord MA 01742 This sentence no verb. This sentence short. This signature done.
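Editor's sketch of the call-history technique mentioned in the post above: a circular buffer that keeps the names of the last 128 functions entered, for dumping in an error report. The post does not say how the product records calls; the `ENTER()` macro, the use of C99 `__func__`, and every function name below are assumptions.

```c
#include <stdio.h>

#define HISTORY_SIZE 128

static const char *history[HISTORY_SIZE];
static unsigned long calls_recorded = 0;

/* Overwrites the oldest entry once the buffer is full. */
void record_call(const char *name)
{
    history[calls_recorded % HISTORY_SIZE] = name;
    calls_recorded++;
}

/* Placed as the first statement of each instrumented function. */
#define ENTER() record_call(__func__)

/* Number of entries currently held (at most HISTORY_SIZE). */
static unsigned long retained(void)
{
    return calls_recorded < HISTORY_SIZE ? calls_recorded : HISTORY_SIZE;
}

/* age 0 is the most recent call; returns NULL if that entry is gone. */
const char *recent_call(unsigned long age)
{
    if (age >= retained())
        return NULL;
    return history[(calls_recorded - 1 - age) % HISTORY_SIZE];
}

/* Writes the retained history, oldest first; returns the entry count. */
unsigned long dump_history(FILE *out)
{
    unsigned long kept = retained();

    for (unsigned long age = kept; age > 0; age--)
        fprintf(out, "%s\n", recent_call(age - 1));
    return kept;
}

/* Two instrumented functions, for illustration only. */
void open_database(void) { ENTER(); }
void read_record(void)   { ENTER(); }
```

A universal error handler of the kind described would call `dump_history` on its report stream just before exiting.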
amodeo@dataco.UUCP (Roy Amodeo) (11/12/90)
In article <1990Nov3.153643.26368@clear.com> rmartin@clear.com (Bob Martin) writes: >In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes: >>I'm interested in what approaches people use for error handling, particularly >>in general purpose function libraries and large software systems. If someone > >Alan: > >In a large software system the number of places where the code can >detect errors can range into the tens of thousands. ... > >What I have done in the past to cope with this is to create an >ErrorLog function which will write single line error messages into >an error log file. ... > ... Errors of similar types should _not_ use the >same <loc>! ... > >Every hour I close the current error log file and open a new one. >At the end of the day I compile then into a summary and eyeball >them to see if anything horrible went wrong. Software can be written >to automatically scan these logs to see if there are critical errors. > > Hope this helps. > I welcome discussion. You got it. I have some general comments about your scheme. The <loc> numbers would have to either be stored in one central place or you would need an allocation scheme that allocates blocks of numbers to various subsystems. Either way, this seems like an awful lot of work to set up. The manual nature of evaluating whether any serious errors have occurred bothers me ( unless you're the only one that runs your software ). It would require a rather intimate knowledge of the entire system. It also bothers me that the errors are "hidden" in a logfile (again assuming other people run your software). Out of curiosity, how big do your logfiles get? > >---------------------------------------------------------------------- >R. Martin >rmartin@clear.com >uunet!clrcom!rmartin >---------------------------------------------------------------------- And, in answer to Alan's original posting: I really like exceptions. I don't use them. 
Exceptions in C require writing an exception handling mechanism which I have never had the time to write for my own "small" programs. There are other systems I use which have used different error handling mechanisms from Day One and are "too big" to change now. All the code we write returns 0 on error. ( By never using '0' as an index, I can usually get away with this. ) Usually failures are trickled up to the function level where enough is known that they can be handled. The macro we use to "fail out" of a routine is called "FAILIF" and takes a condition, an error number and an error parm as parameters. If the condition is true, the error number is assigned to the global variable errno, the error parm is assigned to the global variable errparm, and the routine returns 0. In addition, FAILIF's behaviour can be modified to do a little cleanup before returning, which solves one of the problems of multiple returns, although it is not that elegant. At higher levels, we will check for errors that we do not wish to handle ( like failures from malloc() ) by using fatal assertions. A fatal assertion asserts that a condition is true (nonzero); otherwise, it prints the string argument, hex dumps any areas of memory that the user wishes to dump, prints the errno, the name of the errno, the errparm, the line number, the file name, and function trace. ( This varies from the standard UNIX assert() mechanism. ) It then terminates the program. A non-fatal assert is also available for conditions that must be reported but need not be acted upon. Code using FAILIF and fatal assertions reads quite easily and is easy to write. You generally check a condition once, FAILIF or FASSERT it, and continue, secure that you are dealing with only valid values from here on in. Assertions should actually be coded in the interface to the routine because they can be valuable documentation, but we're not that sophisticated yet. 
To reduce code overhead, there are a number of functions whose failure is almost never handled (fclose, malloc, write, ... ). These functions are generally wrapped in envelopes that assert the success of the call. The user can then use the secure call if he wishes to program safely, or the lower level interface if he can handle the error himself or if he doesn't care. ( Apathy is the only good reason for ignoring return codes. ) One of the problems with the trickle-up method of subroutine failure, is that, often, you do not wish to decide on how fatal the error is at the lower level, and so the error trickles up to a much higher level where the severity is understood, but the exact condition which caused the error is lost. There are also cases where no one level contains all the info needed for a meaningful error message. One solution to this is to use a stack of errnos and errparms instead of single global variables. It also helps to have a user definable error string that is saved in this stack. As the error gets passed up the call chain more information is added. If the main program chooses to abort, the entire error stack can be dumped giving a complete description of the error. Although this generates really nicely detailed error messages with very little coding trouble, I have not used it on any programs that have enough levels of function calling to make it really worthwhile. Anyway, those are my experiences. And my code is usually a great test suite for error checking mechanisms! rba iv
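Editor's sketch of the FAILIF and fatal-assertion style described in this post. The macro names come from the post, but their bodies, the type of `errparm`, and the helper functions `find_slot` and `safe_malloc` are guesses; the hex dump and function trace are omitted. `EINVAL` and `ENOENT` assume a POSIX `<errno.h>`.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>

long errparm = 0;   /* companion to errno, as in the post; the type is assumed */

/* Fail out of the enclosing routine: record the error, then return 0. */
#define FAILIF(cond, errnum, parm)        \
    do {                                  \
        if (cond) {                       \
            errno = (errnum);             \
            errparm = (long)(parm);       \
            return 0;                     \
        }                                 \
    } while (0)

/* Fatal assertion: report what is known and terminate the program. */
#define FASSERT(cond, msg)                                            \
    do {                                                              \
        if (!(cond)) {                                                \
            fprintf(stderr, "%s:%d: %s (errno=%d, errparm=%ld)\n",    \
                    __FILE__, __LINE__, (msg), errno, errparm);       \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

/* Returns the 1-based position of wanted in table, or 0 on error.
 * Positions start at 1 so that 0 stays free to mean failure. */
int find_slot(const int *table, int n, int wanted)
{
    int i = 0;

    FAILIF(table == NULL, EINVAL, 0);
    FAILIF(n <= 0, ERANGE, n);
    while (i < n && table[i] != wanted)
        i++;
    FAILIF(i == n, ENOENT, wanted);
    return i + 1;
}

/* "Envelope" around malloc: callers never see a failed allocation. */
void *safe_malloc(size_t size)
{
    void *p = malloc(size);

    if (p == NULL)
        errparm = (long)size;
    FASSERT(p != NULL || size == 0, "malloc failed");
    return p;
}
```

The `do { ... } while (0)` wrapper makes each macro behave as a single statement, so it is safe inside an unbraced `if`.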
rh@smds.UUCP (Richard Harter) (11/12/90)
In article <3335@bruce.cs.monash.OZ.AU>, alanf@bruce.cs.monash.OZ.AU (Alan Grant Finlay) writes:
> In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > In our key product, which we assume is mission critical for our users,
> > we take the strong view that any trapped error is a fatal error.
> This kind of approach involves either a purist conception about what is an
> error or a library package which has such a narrow area of application that
> applications using the package have highly predictable requirements.

This started out in comp.software-eng, which is where I posted to. Alan's comments showed up in comp.lang.c. I find this somewhat puzzling. I have redirected it back to comp.software-eng.

----

Alan continues with an extensive discussion of exception handling problems in library packages with references to ICON and ML. I'm afraid I find the quoted remarks a trifle puzzling. There is nothing particularly startling about the concept that a program or collection of programs *must not* produce incorrect results. (Writing such programs is left as an exercise for the reader. :-))

--
Richard Harter, Software Maintenance and Development Systems, Inc.
Net address: jjmhome!smds!rh Phone: 508-369-7398
US Mail: SMDS Inc., PO Box 555, Concord MA 01742
This sentence no verb. This sentence short. This signature done.
alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) (11/13/90)
In article <238@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> This started out in comp.software-eng, which is where I posted to. Alan's
> comments showed up comp.lang.c. I find this somewhat puzzling. I have
> redirected it back to comp.software-eng.

Actually, I posted the original article to both newsgroups because C is the language I use most and because things can be done in C that may not be possible in all the languages represented by the readers of comp.software-eng.

Meanwhile, Alan Finlay writes:
> In article <234@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > In our key product, which we assume is mission critical for our users,
> > we take the strong view that any trapped error is a fatal error. We try
> This kind of approach involves either a purist conception about what is an
> error or a library package which has such a narrow area of application that
> applications using the package have highly predictable requirements.

Actually, I assume (perhaps I'm wrong) that the author is not describing a library package, although such an approach might be appropriate in libraries for very critical applications. There are some times when to not do it at all is much better than to do it wrong.

> The problem for the supplier of a (practical) general purpose library package
> is that what may be regarded an error by one application is not so regarded
> by another.

Excellent point and the best reason why a lot of simple schemes are really inadequate.

> The subject line is "error handling techniques" and the original posting
> requested information about techniques for library packages. For the reason
> given above this is better called "exception handling techniques".

Actually, the original posting solicited information about techniques for large systems (principally turn-key type applications) as well as libraries. And it was supposed to address regular user errors (application passes bad parameter to function library, for example) as well as horrible unexpected system errors. It's unclear to me how suitable an exceptions approach is to the former case, although I haven't thought about it a lot.

--
-- Alan                     # My aptitude test in high school suggested that
..!ames!elroy!alan          # I should become a forest ranger.  Take my
alan@elroy.jpl.nasa.gov     # opinions on computers with a grain of salt.
siemsen@sol.usc.edu (Pete Siemsen) (11/13/90)
amodeo@dataco.UUCP (Roy Amodeo) writes:

>One solution to this is to use a stack of errnos and errparms instead of
>single global variables. It also helps to have a user definable error
>string that is saved in this stack. As the error gets passed up the call
>chain more information is added. If the main program chooses to abort,
>the entire error stack can be dumped giving a complete description of the
>error. Although this generates really nicely detailed error messages
>with very little coding trouble, I have not used it on any programs that
>have enough levels of function calling to make it really worthwhile.

I used exactly such a scheme on a project a few years ago. It worked very well, but meant that every subroutine call (this was FORTRAN) took about 6 lines of code. At any level, a subroutine could decide to "handle" errors from below, or pass the errors on up (adding its own idea of what was wrong first). Once we got used to the verbosity (in C, macros would reduce this), we were impressed with how well it worked. Whenever an error occurred, you'd get a message something like

    GETREC: READ failed: illegal logical unit number
    READFILE: unable to read record from file
    COPYFILES: unable to read input file ABC.DEF
    MAIN: unable to copy files

which is about as helpful as it gets.

This was one of the smoothest software projects I have ever worked on. Of course, there were only five of us working on the code (for about 9 months), and we all agreed on the error system before starting. The customer was very pleased.

--
Pete Siemsen                         Pete Siemsen            siemsen@usc.edu
University of Southern California    645 Ohio Ave. #302      (213) 740-7391 (w)
1020 West Jefferson Blvd.            Long Beach, CA 90814    (213) 433-3059 (h)
Los Angeles, CA 90089-0251
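Editor's sketch of the error-stack scheme discussed in the two posts above, in C rather than FORTRAN. Each level that sees a failure pushes its own message and passes the failure upward; whoever decides to give up dumps the stack. The function names, the fixed stack and message sizes, and the three-routine call chain are all assumptions made for illustration.

```c
#include <stdio.h>

#define ERR_STACK_MAX 16
#define ERR_MSG_MAX   96

static char err_stack[ERR_STACK_MAX][ERR_MSG_MAX];
static int err_depth = 0;

/* Push "ROUTINE: message". Always returns 0 (the failure value) so a
 * caller can write `return err_push(...)`. Silently drops on overflow. */
int err_push(const char *routine, const char *message)
{
    if (err_depth < ERR_STACK_MAX) {
        snprintf(err_stack[err_depth], ERR_MSG_MAX, "%s: %s",
                 routine, message);
        err_depth++;
    }
    return 0;
}

void err_clear(void) { err_depth = 0; }

int err_count(void) { return err_depth; }

/* Entry 0 is the deepest (first pushed) message; NULL if out of range. */
const char *err_at(int i)
{
    return (i >= 0 && i < err_depth) ? err_stack[i] : NULL;
}

/* Print the stack from the lowest level upward. */
void err_dump(FILE *out)
{
    for (int i = 0; i < err_depth; i++)
        fprintf(out, "%s\n", err_stack[i]);
}

/* Illustrative call chain: each routine returns 1 on success, 0 on failure. */
int get_rec(int unit)
{
    if (unit < 0)
        return err_push("GETREC", "READ failed: illegal logical unit number");
    return 1;
}

int read_file(int unit)
{
    if (!get_rec(unit))
        return err_push("READFILE", "unable to read record from file");
    return 1;
}

int copy_files(int unit)
{
    if (!read_file(unit))
        return err_push("COPYFILES", "unable to read input file");
    return 1;
}
```

A routine that decides to handle an error from below would call `err_clear()` instead of pushing. As a later post in this thread points out, a single shared stack like this one is not safe in a multi-threaded program.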
rh@smds.UUCP (Richard Harter) (11/13/90)
In article <1990Nov12.184217.25361@elroy.jpl.nasa.gov>, alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
> In article <238@smds.UUCP>, rh@smds.UUCP (Richard Harter) writes:
> > This started out in comp.software-eng, which is where I posted to. Alan's
> > comments showed up comp.lang.c. I find this somewhat puzzling. I have
> > redirected it back to comp.software-eng.
> Actually, I posted the original article to both newsgroups because C is the
> language I use most and because things can be done in C that may not be
> possible in all the languages represented by the readers of comp.software-eng.

Oh. That makes sense. My apologies for confusing the issue.

Continuing from your remarks. There are actually a number of categories of software to consider with different error handling issues. For example, there are application programs with human interaction, application programs which are called by other programs, subroutine libraries called by a specific set of programs, and libraries which can be called by arbitrary programs. We can talk about interface errors and internal errors. An application program may be mission critical in the sense that it must continue operating and do the best that it can despite errors; conversely it may be mission critical in the sense that it must not produce erroneous results. It may not be mission critical; in fact there may be an acceptable level of erroneous behaviour. And so on... From this I would conclude that there are a number of possible strategies.

One general remark on this topic that I would like to make is that it seems to me that it is imperative that error handling philosophies be consistent between components being put together. For example a 'mission critical' program should not use routines from a library that dump core or do something erratic if there is a usage error. This places strong constraints on general purpose libraries; IMHO there should be strong constraints. If a GPL routine or program is unsafe (i.e.
it can behave in unpredictable ways in response to errors in interface usage) it should be labelled as such. -- Richard Harter, Software Maintenance and Development Systems, Inc. Net address: jjmhome!smds!rh Phone: 508-369-7398 US Mail: SMDS Inc., PO Box 555, Concord MA 01742 This sentence no verb. This sentence short. This signature done.
geoffb@butcombe.inmos.co.uk (Geoff Barrett) (11/13/90)
Does anyone have an example of a properly handled error?
martin@mwtech.UUCP (Martin Weitzel) (11/15/90)
In article <234@smds.UUCP> rh@smds.UUCP (Richard Harter) writes:
>In article <1990Nov3.153643.26368@clear.com>, rmartin@clear.com (Bob Martin) writes:
>> In article <1990Nov2.205831.23696@elroy.jpl.nasa.gov> alan@cogswell.Jpl.Nasa.Gov (Alan S. Mazer) writes:
>> >I'm interested in what approaches people use for error handling, particularly
>> >in general purpose function libraries and large software systems. If someone
>
> [See the referenced article, which is commendably well written.]
>
>In our key product, which we assume is mission critical for our users,
>we take the strong view that any trapped error is a fatal error. We try
>to arrange that the software fails gracefully and that it produces as
>much information about the error as possible.
[...]

I share this view. But I think it is an oversight (which also made it into ANSI-C), that there is no (portable) way to fail gracefully if the stack runs into the heap. (You can detect if the heap would run into the stack, as malloc and friends fail in this case, but not the other way round.) Any solutions?

>I would like to see more contributions on this topic.

You've just read mine :-)

P.S.: If you supply `Followup-To:'-lines, please double check for typos. "inews" just complained about the missing news group comp.softwar-eng and didn't fail in the most "graceful" way :-/.

--
Martin Weitzel, email: martin@mwtech.UUCP, voice: 49-(0)6151-6 56 83
baeder@shamu.cadence.com (D. Scott Baeder; x299) (11/16/90)
In article <28078@usc>, siemsen@sol.usc.edu (Pete Siemsen) writes:
|> amodeo@dataco.UUCP (Roy Amodeo) writes:
|>
|> >One solution to this is to use a stack of errnos and errparms instead of
|> >single global variables. It also helps to have a user definable error
|>
|> (in C, macros would reduce this), we were impressed with how well it
|> worked. Whenever an error occurred, you'd get a message something
|> like
|>

I also worked (in fortran) on a large physical design (CAD) program that used a similar process, but only had error numbers. We even went so far as to have a coding scheme for the numbers, and a DB of text for each number that could optionally be printed. Made updates to the text easier.

The error number scheme was of the form "SMMMMEE" where S=severity (0-9), MMMM=module number (each subroutine had a unique number assigned), and EE was the error number within the routine.

This can add overhead, etc. BUT we also had a fortran macro processor which also aids portability, etc. This WAS a large system with over 250K LOC, and about 1000 routines broken into about 10 individual programs, and library routines (like error lookup, etc.). The key is discipline, and making this type of decision during the planning and design stage.

Anyone recognize the system??

scott
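Editor's sketch of the "SMMMMEE" numbering described in the post above: one severity digit, four module digits, two error digits, packed into a single decimal number. The post gives only the layout; the function names and the choice to return -1 for out-of-range fields are assumptions.

```c
/* Pack severity (0-9), module (0-9999) and error (0-99) as SMMMMEE.
 * Returns -1 if any field is out of range. */
long encode_error(int severity, int module, int error)
{
    if (severity < 0 || severity > 9 ||
        module < 0 || module > 9999 ||
        error < 0 || error > 99)
        return -1;
    return severity * 1000000L + module * 100L + error;
}

/* Field extractors; each assumes code came from encode_error. */
int error_severity(long code) { return (int)(code / 1000000L); }

int error_module(long code)   { return (int)((code / 100L) % 10000L); }

int error_number(long code)   { return (int)(code % 100L); }
```

For example, severity 3 in module 412 with error 7 encodes as 3041207, which also serves as the key into the database of message text that the post mentions.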
emery@linus.mitre.org (David Emery) (11/16/90)
One of the major problems with stack-based error systems (and with Unix errno) is that they don't work in a multi-threaded environment. Consider errno as a simple example, and look at what can (and often does, if you're not careful) happen (actions happen in the "linear order" shown):

    thread 1                        thread 2

    fd = open ("afile");
                                    result = open ("not_afile");
                                    /* open fails */
                                    /* errno is EBADF */
                                    if (result < 0) {
    result = write (fd, buf);
    /* write fails */
    /* errno is EFBIG */
                                        switch (errno) {
                                        /* now errno is EFBIG,
                                         * a value not recognized
                                         * as a result from open()!! */

This has been discovered (and rediscovered) by people writing multitasking Ada programs on Unix, as well as by people using various C threads packages. This is one of the major reasons that the POSIX Ada binding uses exceptions rather than errno. The alternative is to keep errno (or the error stack) as thread/task-specific data, but that gets messy.

In general, it's a good argument against keeping error information as state, rather than returning it when a function call returns, particularly when designing software packages that might be called from Ada or other concurrent systems.

dave emery
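Editor's sketch of the alternative recommended at the end of the post above: return the error status from the call itself and deliver the result through an out-parameter, so no shared variable can be overwritten by another thread between the failure and the check. The `status_t` type and `store_bytes` function are hypothetical, chosen only to show the calling convention.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical status codes; the error travels in the return value. */
typedef enum { ST_OK = 0, ST_BAD_ARGUMENT, ST_TOO_BIG } status_t;

/* Copy len bytes of src into dst, which holds capacity bytes.
 * On success, *stored is set to len. On failure nothing is modified,
 * and no global or static state is touched in either case. */
status_t store_bytes(char *dst, size_t capacity,
                     const char *src, size_t len, size_t *stored)
{
    if (dst == NULL || src == NULL || stored == NULL)
        return ST_BAD_ARGUMENT;
    if (len > capacity)
        return ST_TOO_BIG;
    memcpy(dst, src, len);
    *stored = len;
    return ST_OK;
}
```

A caller switches on the returned `status_t` directly; the value belongs to that one call, so an interleaving like the one shown above cannot change it.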
perry@apollo.HP.COM (Jim Perry) (11/17/90)
In article <EMERY.90Nov15160448@aries.linus.mitre.org> emery@linus.mitre.org (David Emery) writes:
>One of the major problems with stack-based error systems (and with
>Unix errno) is that they don't work in a multi-threaded environment.
> [example deleted]
>This has been discovered (and rediscovered) by people writing
>multitasking Ada programs on Unix, as well as by people using various
>C threads packages. This is one of the major reasons that the POSIX
>Ada binding uses exceptions rather than errno. The alternative is to
>keep errno (or the error stack) as thread/task-specific data, but that
>gets messy. In general, it's a good argument against keeping error
>information as state, rather than returning it when a function call
>returns, particularly when designing software packages that might be
>called from Ada or other concurrent systems.

Good point, but if you're writing code intended to be thread-safe under UNIX you have to assume that the system beneath you supports you; in particular you either have to know that the threading system task-swaps errno, or be very careful in all contexts that require access to it, e.g. copying it to a thread-local shadow yourself (under a mutex lock). The latter of course does get messy, but by the same token if you're trying to be thread-safe in a world where you can't trust the environment (malloc for one is not generally reentrant), you're going to be doing messy code (or at least reinventing wheels). On the whole UNIX is currently hostile to multithreaded packages on many fronts.

I agree, errno was/is not a particularly far-sighted mechanism, but it's probably late in the game to start expecting write() to throw exceptions (and I far prefer exceptions to passing error information up and down on function calls).

-
Jim Perry perry@apollo.hp.com HP/Apollo, Chelmsford MA
This particularly rapid, unintelligible patter
Isn't generally heard, and if it is it doesn't matter!
gefrank@sdrc.UUCP (Frank Glandorf) (11/21/90)
What's the opinion on global vs. local error codes? By global I mean a central list of codes in an include file. By local I mean a list of codes which is meaningful only in the context of the procedure which returns the code. A central list can be difficult to maintain but it is nice to have only one place to look for error codes. On the other hand local codes can be sequential and allow one to use "FORTRAN: computed gotos" :-) -Frank