bernhold@red8 (David E. Bernholdt) (05/29/91)
In article <28787@uflorida.cis.ufl.EDU> I wrote:
>How do you handle errors in large (Fortran) codes?
>I'm interested in everything from the grand philosophy to the small
>details. For example:

I thought of another question or two:

o Do you make any attempt to deal with I/O errors beyond a
  success/failure/end-of-file level? Portably? Do you attempt to
  analyze the source of the error and recover (if possible) or do you
  just try to give a more informative error message?

o Did you adopt a particular style for error messages -- like, say VMS
  or unix or something else? Why?

o Do you make any attempt to use existing error codes/routines -- for
  example the C-preprocessor include file /usr/include/sys/errno.h on
  unix machines, or the error handling routines in VMS?

[Sorry if the examples are VMS & unix biased -- that's what I have
experience with. How do other systems help deal with these things?]

thanks again...
--
David Bernholdt bernhold@qtp.ufl.edu
Quantum Theory Project bernhold@ufpine.bitnet
University of Florida
Gainesville, FL 32611 904/392 6365
ralph@uhheph.phys.hawaii.edu (Ralph Becker-Szendy) (05/29/91)
In article <28791@uflorida.cis.ufl.EDU> bernhold@red8 (David E.
Bernholdt) asks a lot of questions about error handling. Here are a
few things I do. Hope it brings up a lively discussion ...

I just wrote (over the last two years) a graphics display program with
>23000 lines of FORTRAN (not counting existing subroutine libraries
which only had to be fixed up a little bit to be linked against). It
is used by about a dozen people, with at least some computer
knowledge. It is used only on VMS, because I haven't got any time to
port it. For that reason, the mechanism of error handling uses VMS
system routines, and the format of some messages is modelled after VMS
messages (not that I like them, but I like consistency).

> How do you handle errors in large (Fortran) codes?

If the program is short and simple, don't care. Just blow up right
away. The typical case is a program which is written, used twice, and
deleted 10 minutes later. Total waste of time to worry about error
handling.

If the program is large (as in: error can not be tracked immediately),
or if it is used for a long time (as in: I have forgotten what the
code does), or is used by a large number of people (large is >1), or
is used by the non-initiated, trap and gracefully handle errors which
can be trapped, and dump on the others.

The best approach is: Make sure they don't happen. A stupid example:
Never divide by anything without explicitly checking for divide by
zero first. Or (if you don't care about the result and don't want any
abort to happen) do something like Z = X / MAX(Y, 0.0001). Same for
exponentiation and inverse angle functions.

If an error happens, try to localize the damage. Avoid just blowing
up. The user may have invested a lot of time in that run of the
program. Example: My program is command-line driven. A certain command
needs to read a calibration file the first time it is invoked, then
sets a flag that the file has been read.
If it gets an error when reading the file, it gives the user as much
information as possible (message: Can't read blabla, system error
message is foobar), and then informs the user "This command has
failed, and will probably fail if you try it again, unless you quickly
fix up the calibration file named SCREWY.CAL", but doesn't set the
flag that the file has been read successfully. Then it pops him back
to the command-line input stage. If the user absolutely NEEDS that
command, he can abort and do something about the problem, or maybe
live without that command.

Something I find useful is to classify errors as:

WARNING: This may or may not be what you wanted. Think about it.
Please confirm the command. Example: You ask for the mean of a
histogram which has no entries. You'll get the answer zero, and a
warning that the answer makes no sense anyhow. In the same category:
Most FORTRAN compilers treat output overflow in a numeric FORMAT entry
as an error. I prefer to just output stars. The user will see the
stars, and know that something is awfully big. That's the point where
using the system- or language-supplied error handler is important.

TRAPPABLE PROBLEMS: End-of-input-file is the classical example.
Non-existing files, commands issued when they can not be executed.
Those just have to be taken care of. In many cases the trick is to
write a function which requires user input in a loop; example: Get
input file name from user, try to OPEN it. If the open fails, loop
back and get another file name. Also in this category is interrupt
(control-C on VMS for example), which I trap and handle gracefully but
only in commands which take very long. In these cases I also give an
advance warning like "This fit may take several minutes, if you hit
Control C it will abort and you will be thrown back to the previous
result.". Control-C which is not trapped by one of the commands is
eventually ignored with a message like "Control-C hit but ignored, use
Control-Y to abort".
ERROR: Something happened while executing a subfunction (a command in
my case). Tell the user everything you know about the error, and tell
him what actions were NOT taken. Then continue. If the user doesn't
like that behaviour he can always hit Control-Y or something like
that. If the error is expected once in a while (like disk full on
write, or no memory when allocating) a short message is sufficient,
otherwise give as much information as possible.

FATAL: The case where a known or unexpected error can not be handled.
Examples: Can't allocate the display in a graphical program. Read
error on an important input file. If the error is expected to happen
once in a while (like IO errors on tape input), I try to have a
customized error message "Input error when reading input data from
tape", followed by as much of an error dump as I can get from the OS.
But always output as much information as possible in these cases.

BUG: Things which shall not happen. My programs are littered with
checks which shall never fire, but sometimes do. In this case the best
error message is simply "implementation error, call Ralph", followed
by as much dump as possible.

I find the VMS error handler to be very useful for this. I have my own
error message files (linked into the program), and like this I can
handle application errors just like OS problems. Typical examples
(MSG_xxx are codes for my own error messages, some particulars of the
VMS routines were omitted for simplicity):

      STATUS = SYS$xxx(yyy)  ! Some system function just shall not fail
      IF (.NOT.STATUS) THEN
         CALL LIB$SIGNAL(MSG_INTERNAL, 'xxxing the yyy failed')
         CALL LIB$SIGNAL(STATUS)
         STOP 'Everything has gone to hell'
      ENDIF

    1 STATUS = MYxxx(yyy)  ! A subroutine of my own, which may sometimes fail
      IF (STATUS.EQ.MSG_BAD) THEN
         PRINT *, 'Please enter another file name, anchovy-breath'
         do something
         GOTO 1  ! I don't actually use GOTO, instead a DO WHILE (.NOT.SUCCESS) ...
      ELSEIF (.NOT.STATUS) THEN
         the internal error output, like above
      ENDIF

      IF (ABS(Y).LT.0.0001) THEN
         CALL LIB$SIGNAL(MSG_BLOWUP, 'Internal error, Y is zero but shouldnt')
         PRINT *, Y and anything else you might think is important
         CALL LIB$STOP(MSG_INTERNAL, 'Internal error, shoot the programmer.')
      ENDIF
      Z = X / Y

Now answers to the questions David asked:

>o Should subprograms which return error codes be subroutines with the
>  error code as one of the arguments, or integer functions which return
>  the error code?

I prefer an integer function which returns an error code. Like that,
if I don't care about the error (sometimes one really doesn't) one can
call it as a subroutine. Also, it is consistent with most OS routines
(both VMS and Unix, except they have different conventions). In many
cases a routine can just pass the error right through, like

      INTEGER FUNCTION DOxxx
      STATUS = DOyyy(...)
      IF (.NOT.STATUS) THEN
         DOxxx = STATUS
         RETURN  ! Let the caller worry about it
      ENDIF

>o What if it is logical for the computational result of a subprogram to
>  be returned by a function mechanism -- which gets precedence?

I think still return an integer error code, for consistency. It is a
typing chore, but it really helps to check the return code after every
call. Exception: Simple numerical routines, analogous to sine and
cosine, which just shall not fail.

>o Should routines even return error codes, or should they just abort on
>  the spot? Or perhaps they should all call a single error handler --
>  such as BLAS-2, BLAS-3, LAPACK calling XERBLA? Or should it return an
>  error code to the caller and let the caller do error recovery or
>  graceful shutdown if it wants -- possibly all the way back up to the
>  main program?

Absolutely return error codes. Or at least leave them in COMMON.
Everything else is anti-social. If every simple error leads to an
abort the users will soon hate the program.
If every package has its own error handler I'll end up using none of
them because it's too much of a pain.

>o How do you deal with operating systems that don't provide any way to
>  print a traceback of the subroutine stack?

Boycott them. Buy better ones. Failing that, output an error message
which states where you were and what the call parameters were. Maybe
the programmer can figure it out ...

>o Do you have different severity levels to warning/error messages?

Absolutely. And let the user know what kind of error he is dealing
with. The words "WARNING", "PLEASE BE CAREFUL", "SIDE EFFECT" and
"FATAL INTERNAL ERROR" let the user know what he is dealing with.

>o Should all error message be generated by a single routine -- perhaps
>  referenced by numeric codes, as many of the unix networking codes do
>  (see nntp or ftp, for example)? What about printing of relevant
>  variables if you do this?

I think one can do both. First print everything you think is
important. Then call the centralized handler with the error code.

>o How do these ideas correlate with the level of support you provide for
>  the software and/or the knowledge level of the user community?

Obviously, if the user community is (1) large (2) uneducated, one has
to be a lot more picky in handling errors. For experts, I can just say
"Hessian matrix not positive definite, treat result with suspicion."

>o What references discuss these aspects of programming?

I wonder about that myself.

>o Do you make any attempt to deal with I/O errors beyond a
>  success/failure/end-of-file level? Portably? Do you attempt to
>  analyze the source of the error and recover (if possible) or do you
>  just try to give a more informative error message?

As outlined above, depends. On vitally important IO, there is usually
nothing one can do. Except "premature EOF on input", "file does not
exist" and "disk full / quota error on output" there is nothing one
can recover from, usually.
On tapes one can always print a warning, and retry a few more times.
Usually that munches some input data. I know no portable way to trap
IO errors.

>o Did you adopt a particular style for error messages -- like, say VMS
>  or unix or something else? Why?

One-liner VMS-style, just for consistency, since the program only runs
under VMS. Usually there is more information before the VMS-style
message, indicating the details of what happened. Should I port the
program to Unix, I would probably remove the VMS-ism, but leave the
structure alone.

>o Do you make any attempt to use existing error codes/routines -- for
>  example the C-preprocessor include file /usr/include/sys/errno.h on
>  unix machines, or the error handling routines in VMS?

I use the VMS error handling routines extensively: LIB$ESTABLISH,
LIB$SIGNAL, LIB$STOP, LIB$MATCH_COND, plus ASTs on Control C and
Control W. The ASTs set a flag in common. Routines which know how to
be interrupted (or how to refresh one of the screens) check these
flags periodically, with a fallback at the command line input level.

I have my own messages compiled with SET MESSAGE. Makes life much
easier, and allows me to treat system/library routines just like my
own routines. It is however a major hurdle preventing me from porting
to a >>>faster<<< platform (the other one is the absence of something
as nifty as the CLI$ routines under Unix!).

--
Ralph Becker-Szendy UHHEPG=24742::RALPH (HEPNet,SPAN)
University of Hawaii RALPH@UHHEPG.PHYS.HAWAII.EDU
High Energy Physics Group RALPH@UHHEPG.BITNET
Watanabe Hall #203, 2505 Correa Road, Honolulu, HI 96822 (808)956-2931
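
The open-and-retry loop described in the post above for trappable
problems can be sketched with the IOSTAT= specifier, which is part of
standard Fortran 77. This is only a minimal sketch under stated
assumptions, not the VMS code from the post: the function TRYOPN, the
unit numbers and the file names are all hypothetical, and a fixed list
of names stands in for prompting the user. Only "zero means success"
is portable; the nonzero IOSTAT values are processor-dependent, so
they are printed but not interpreted.

```fortran
! Sketch only: TRYOPN, units 10/11 and the file names are hypothetical.
      PROGRAM RETRY
      IMPLICIT NONE
      INTEGER TRYOPN
      CHARACTER*16 NAMES(2)
      INTEGER I, ISTAT

! Create one file so that the second candidate can be opened.
      OPEN (UNIT=11, FILE='good.cal', STATUS='UNKNOWN')
      WRITE (11, *) 1.0
      CLOSE (11)

! 'missing.cal' is assumed not to exist in the current directory.
      NAMES(1) = 'missing.cal'
      NAMES(2) = 'good.cal'

! Loop until an OPEN succeeds or the candidates run out. In an
! interactive program the name would come from a prompt instead.
      I = 0
      ISTAT = 1
      DO WHILE (ISTAT .NE. 0 .AND. I .LT. 2)
         I = I + 1
         ISTAT = TRYOPN(10, NAMES(I))
         IF (ISTAT .NE. 0) THEN
            PRINT *, 'Cannot open ', NAMES(I), ' IOSTAT =', ISTAT
         ENDIF
      ENDDO

      IF (ISTAT .EQ. 0) THEN
         PRINT *, 'Opened ', NAMES(I)
         CLOSE (10)
      ELSE
         PRINT *, 'No usable file; returning to the prompt.'
      ENDIF

! Remove the scratch file created at the top of this sketch.
      OPEN (UNIT=11, FILE='good.cal', STATUS='OLD')
      CLOSE (11, STATUS='DELETE')
      END

! Return 0 on success, else the processor-dependent IOSTAT value.
      INTEGER FUNCTION TRYOPN(LUN, FNAME)
      IMPLICIT NONE
      INTEGER LUN, IOS
      CHARACTER*(*) FNAME
      OPEN (UNIT=LUN, FILE=FNAME, STATUS='OLD', IOSTAT=IOS)
      TRYOPN = IOS
      END
```

The sketch uses 0 for success, which differs from the VMS convention
in the post, where a status is tested with .NOT.STATUS.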
maine@altair.dfrf.nasa.gov (Richard Maine) (05/30/91)
On 29 May 91 05:26:29 GMT, ralph@uhheph.phys.hawaii.edu (Ralph
Becker-Szendy) said:

...[discussion of error handling approach]...

Ralph> I prefer an integer function which returns an error code. Like
Ralph> that, if I don't care about the error (sometimes one really
Ralph> doesn't) one can call it as a subroutine.

I just want to point out that calling a function as a subroutine is
neither standard nor portable. Probably "everyone" including Ralph
knows this, and it was pretty explicit that Ralph's approach was VMS
oriented in general, but this particular construct was not explicitly
labelled as non-portable.

I don't mean this followup as a criticism of the approach. I just want
to make sure that it is explicitly clear which parts are
system-dependent and which parts are portable. The VMS system calls
are pretty obvious without saying, but this part might be missed by
some.

--
Richard Maine
maine@altair.dfrf.nasa.gov
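
The portability point above can be illustrated with a short sketch. A
status-returning function can be referenced as a function even when
the caller does not care about the result, instead of being invoked
with CALL. The function DOSTEP and its convention (0 for success,
nonzero for failure) are hypothetical and chosen only for
illustration; they are not taken from either post.

```fortran
! Sketch only: DOSTEP and its status convention are hypothetical.
      PROGRAM IGNORE
      IMPLICIT NONE
      INTEGER DOSTEP
      INTEGER ISTAT

! Non-portable: CALL DOSTEP(5) would invoke a function as a
! subroutine. Portable: reference it as a function and test
! the result ...
      ISTAT = DOSTEP(5)
      IF (ISTAT .NE. 0) PRINT *, 'DOSTEP failed, status ', ISTAT

! ... or deliberately leave the returned status unused.
      ISTAT = DOSTEP(-1)
      END

! Return 0 on success, 1 if the argument is negative.
      INTEGER FUNCTION DOSTEP(N)
      IMPLICIT NONE
      INTEGER N
      DOSTEP = 0
      IF (N .LT. 0) DOSTEP = 1
      END
```

The cost is one throwaway assignment per call, in exchange for code
that does not depend on how a particular compiler passes function
results.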