[comp.lang.fortran] Error handling in large programs -- another question

bernhold@red8 (David E. Bernholdt) (05/29/91)

In article <28787@uflorida.cis.ufl.EDU> I wrote:
>How do you handle errors in large (Fortran) codes?

>I'm interested in everything from the grand philosophy to the small
>details.  For example:

I thought of another question or two:

o  Do you make any attempt to deal with I/O errors beyond a
   success/failure/end-of-file level?  Portably?  Do you attempt to
   analyze the source of the error and recover (if possible) or do you
   just try to give a more informative error message?

o  Did you adopt a particular style for error messages -- like, say VMS
   or unix or something else?  Why?

o  Do you make any attempt to use existing error codes/routines -- for
   example the C-preprocessor include file /usr/include/sys/errno.h on
   unix machines, or the error handling routines in VMS?

[Sorry if the examples are VMS & unix biased -- that's what I have
experience with.  How do other systems help deal with these things?]

thanks again...
-- 
David Bernholdt			bernhold@qtp.ufl.edu
Quantum Theory Project		bernhold@ufpine.bitnet
University of Florida
Gainesville, FL  32611		904/392 6365

ralph@uhheph.phys.hawaii.edu (Ralph Becker-Szendy) (05/29/91)

In article <28791@uflorida.cis.ufl.EDU> bernhold@red8 (David E. Bernholdt)
asks a lot of questions about error handling. Here are a few things I do. Hope
it brings up a lively discussion ...

I just wrote (over the last two years) a graphics display program with >23000
lines of FORTRAN (not counting existing subroutine libraries which only had to
be fixed up a little bit to be linked against). It is used by about a dozen
people, with at least some computer knowledge. It is used only on VMS, because
I haven't got any time to port it. For that reason, the mechanism of error
handling uses VMS system routines, and the format of some messages is modelled
after VMS messages (not that I like them, but I like consistency).

> How do you handle errors in large (Fortran) codes?
If the program is short and simple, don't care. Just blow up right away. The
typical case is a program which is written, used twice, and deleted 10 minutes
later. Total waste of time to worry about error handling. 

If the program is large (as in: error can not be tracked immediately), or if
it is used for a long time (as in: I have forgotten what the code does), or is
used by a large number of people (large is >1), or is used by the non-
initiated, trap and gracefully handle errors which can be trapped, and dump on
the others. The best approach is: Make sure they don't happen. A stupid
example: Never divide by anything without explicitly checking for divide by
zero first. Or (if you don't care about the result and don't want any abort to
happen) do something like Z = X / MAX(Y, 0.0001). Same for exponentiation and
inverse angle functions.
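
One caveat on the MAX form: MAX(Y, 0.0001) replaces every negative Y by 0.0001 as well, not only values near zero. A minimal sketch of a guard that keeps the sign of Y, in standard Fortran; the names SAFDIV and EPS and the tolerance value are illustrative assumptions, not taken from the program described here:

```fortran
! Guarded division: bounds the magnitude of Y away from zero but keeps
! its sign. SAFDIV and EPS are illustrative names only.
REAL FUNCTION SAFDIV(X, Y)
  IMPLICIT NONE
  REAL X, Y
  REAL EPS
  PARAMETER (EPS = 0.0001)
  ! SIGN(A, B) returns ABS(A) with the sign of B, so a small negative Y
  ! becomes -EPS instead of +EPS.
  SAFDIV = X / SIGN(MAX(ABS(Y), EPS), Y)
END
```

As in the original one-liner, this only makes sense when the result does not matter much; where it does matter, an explicit check is the safer route.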

If an error happens, try to localize the damage. Avoid just blowing up. The
user may have invested a lot of time in that run of the program. Example: My
program is command-line driven. A certain command needs to read a calibration
file the first time it is invoked, then sets a flag that the file has been
read. If it gets an error when reading the file, it gives the user as much
information as possible (message: Can't read blabla, system error message is
foobar), and then informs the user "This command has failed, and will probably
fail if you try it again, unless you quickly fix up the calibration file named
SCREWY.CAL", but doesn't set the flag that the file has been read
successfully. Then it pops him back to the command-line input stage.  If the
user absolutely NEEDS that command, he can abort and do something about the
problem, or maybe live without that command.

Something I find useful is to classify errors as:
WARNING: This may or may not be what you wanted. Think about it. Please
  confirm the command. Example: You ask for the mean of a histogram which has
  no entries. You'll get the answer zero, and a warning that the answer makes
  no sense anyhow. In the same category: Most FORTRAN compilers treat output
  overflow in a numeric FORMAT entry as an error. I prefer to just output
  stars. The user will see the stars, and know that something is awfully big.
  That's the point where using the system- or language-supplied error handler
  is important.
TRAPPABLE PROBLEMS: End-of-input-file is the classical example. Non-existing
  files, commands issued when they can not be executed. Those just have to
  be taken care of. In many cases the trick is to write a function which
  requires user input in a loop; example: Get input file name from user,
  try to OPEN it. If the open fails, loop back and get another file name.
  Also in this category is interrupt (control-C on VMS for example), which
  I trap and handle gracefully but only in commands which take very long.
  In these cases I also give an advance warning like "This fit may take
  several minutes, if you hit Control C it will abort and you will be thrown
  back to the previous result.". Control-C which is not trapped by one
  of the commands is eventually ignored with a message like "Control-C hit
  but ignored, use Control-Y to abort".
ERROR: Something happened while executing a subfunction (a command in my
  case). Tell the user everything you know about the error, and tell him what
  actions were NOT taken. Then continue. If the user doesn't like that
  behaviour he can always hit Control-Y or something like that. If the error
  is expected once in a while (like disk full on write, or no memory when
  allocating) a short message is sufficient, otherwise give as much information
  as possible.
FATAL: The case where a known or unexpected error can not be handled.
  Examples: Can't allocate the display in a graphical program. Read error
  on an important input file. If the error is expected to happen once in a
  while (like IO errors on tape input), I try to have a customized error
  message "Input error when reading input data from tape", followed by
  as much of an error dump as I can get from the OS. But always output
  as much information as possible in these cases.
BUG: Things which shall not happen. My programs are littered with checks which
  shall never fire, but sometimes do. In this case the best error message
  is simply "implementation error, call Ralph", followed by as much dump
  as possible.
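
The "get a file name, try to OPEN it, loop back on failure" trick from the trappable category might be sketched as follows. This is only a sketch under assumptions: ASKOPN and its arguments are hypothetical names, and the IOSTAT value printed is processor-dependent.

```fortran
! Prompt-and-retry loop for opening an existing input file.
! ASKOPN, LUNIN (unit the names are read from) and LUNDAT (unit to
! open the data file on) are illustrative names only.
SUBROUTINE ASKOPN(LUNIN, LUNDAT, OK)
  IMPLICIT NONE
  INTEGER LUNIN, LUNDAT
  LOGICAL OK
  CHARACTER*255 FNAME
  INTEGER IOS
  OK = .FALSE.
  DO WHILE (.NOT. OK)
    PRINT *, 'Input file name?'
    READ (LUNIN, '(A)', IOSTAT=IOS) FNAME
    ! Nonzero IOSTAT here means end of file or an error on the prompt
    ! input itself; give up and return with OK still false.
    IF (IOS .NE. 0) RETURN
    OPEN (UNIT=LUNDAT, FILE=FNAME, STATUS='OLD', IOSTAT=IOS)
    IF (IOS .EQ. 0) THEN
      OK = .TRUE.
    ELSE
      PRINT *, 'Cannot open ', TRIM(FNAME), ', IOSTAT = ', IOS
    ENDIF
  ENDDO
END
```

The caller decides what a false OK means; in a command-driven program that would typically be "command failed, back to the prompt".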

I find the VMS error handler to be very useful for this. I have my own error
message files (linked into the program), and like this I can handle
application errors just like OS problems. Typical examples (MSG_xxx are codes
for my own error messages, some particulars of the VMS routines were omitted
for simplicity):

STATUS = SYS$xxx(yyy)    ! Some system function just shall not fail
IF (.NOT.STATUS) THEN
     CALL LIB$SIGNAL(MSG_INTERNAL, 'xxxing the yyy failed')
     CALL LIB$SIGNAL(STATUS)
     STOP 'Everything has gone to hell'
ENDIF

1 STATUS = MYxxx(yyy) ! A subroutine of my own, which may sometimes fail
IF (STATUS.EQ.MSG_BAD) THEN
     PRINT *, 'Please enter another file name, anchovy-breath'
     ! ... do something ...
     GOTO 1 ! I don't actually use GOTO, instead a DO WHILE (.NOT.SUCCESS) ...
ELSEIF (.NOT.STATUS) THEN
     ! ... the internal error output, like above ...
ENDIF

IF (ABS(Y).LT.0.0001) THEN
     CALL LIB$SIGNAL(MSG_BLOWUP, 'Internal error, Y is zero but shouldnt')
     PRINT *, Y   ! and anything else you might think is important
     CALL LIB$STOP(MSG_INTERNAL, 'Internal error, shoot the programmer.')
ENDIF
Z = X / Y

Now answers to the questions David asked:
>o  Should subprograms which return error codes be subroutines with the
>   error code as one of the arguments, or integer functions which return
>   the error code?  
I prefer an integer function which returns an error code. Like that, if I
don't care about the error (sometimes one really doesn't) one can call it
as a subroutine. Also, it is consistent with most OS routines (both VMS
and Unix, except they have different conventions). In many cases a routine
can just pass the error right through, like
INTEGER FUNCTION DOxxx
STATUS = DOyyy(...)
IF (.NOT.STATUS) THEN
     DOxxx = STATUS
     RETURN ! Let the caller worry about it
ENDIF

>o  What if it is logical for the computational result of a subprogram to
>   be returned by a function mechanism -- which gets precedence?
I think one should still return an integer error code, for consistency. It is a typing
chore, but it really helps to check the return code after every call.
Exception: Simple numerical routines, analogous to sine and cosine, which just
shall not fail.

>o  Should routines even return error codes, or should they just abort on
>   the spot?  Or perhaps they should all call a single error handler --
>   such as BLAS-2, BLAS-3, LAPACK calling XERBLA? Or should it return an
>   error code to the caller and let the caller do error recovery or
>   graceful shutdown if it wants -- possibly all the way back up to the
>   main program? 
Absolutely return error codes. Or at least leave them in COMMON. Everything
else is anti-social. If every simple error leads to an abort the users will
soon hate the program. If every package has its own error handler I'll end
up using none of them because it's too much of a pain.

>o  How do you deal with operating systems that don't provide any way to
>   print a traceback of the subroutine stack?
Boycott them. Buy better ones. Failing that, output an error message which
states where you were and what the call parameters were. Maybe the programmer
can figure it out ...

>o  Do you have different severity levels to warning/error messages?
Absolutely. And let the user know what kind of error he is dealing with.
The words "WARNING", "PLEASE BE CAREFUL", "SIDE EFFECT" and "FATAL INTERNAL
ERROR" let the user know what he is dealing with.

>o  Should all error messages be generated by a single routine -- perhaps
>   referenced by numeric codes, as many of the unix networking codes do
>   (see nntp or ftp, for example)?  What about printing of relevant
>   variables if you do this?
I think one can do both. First print everything you think is important. Then
call the centralized handler with the error code.
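
A portable sketch of such a centralized handler, assuming four severity levels numbered 1 to 4. The routine name ERRMSG, the numbering and the message layout are hypothetical, not the VMS mechanism described above:

```fortran
! Central message routine. Callers print their own relevant variables
! first, then pass a severity and a numeric code here.
! ERRMSG, the 1..4 severity numbering and the layout are assumptions.
SUBROUTINE ERRMSG(SEV, CODE, TEXT, LINE)
  IMPLICIT NONE
  INTEGER SEV, CODE
  CHARACTER*(*) TEXT, LINE
  CHARACTER*7 LABEL(4)
  INTEGER S, IOS
  DATA LABEL /'WARNING', 'ERROR', 'FATAL', 'BUG'/
  ! An out-of-range severity is itself a programming error, so it is
  ! reported as BUG rather than indexing outside the table.
  S = SEV
  IF (S .LT. 1 .OR. S .GT. 4) S = 4
  ! Build the message in LINE so the caller can also log it.
  WRITE (LINE, '(A,1X,I4.4,2A)', IOSTAT=IOS) TRIM(LABEL(S)), CODE, ': ', TEXT
  ! If LINE is too short for the full message, fall back to the bare text.
  IF (IOS .NE. 0) LINE = TEXT
  PRINT '(A)', TRIM(LINE)
END
```

A call such as CALL ERRMSG(2, 17, 'disk full', LINE) would follow whatever PRINT statements the caller used to dump its own variables.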

>o  How do these ideas correlate with the level of support you provide for
>   the software and/or the  knowledge level of the user community?
Obviously, if the user community is (1) large (2) uneducated, one has to be
a lot more picky in handling errors. For experts, I can just say "Hessian
matrix not positive definite, treat result with suspicion."

>o  What references discuss these aspects of programming?
I wonder about that myself.

>o  Do you make any attempt to deal with I/O errors beyond a
>   success/failure/end-of-file level?  Portably?  Do you attempt to
>   analyze the source of the error and recover (if possible) or do you
>   just try to give a more informative error message?
As outlined above, depends. On vitally important IO, there is usually nothing
one can do. Except "premature EOF on input", "file does not exist" and "disk
full / quota error on output" there is nothing one can recover from, usually.
On tapes one can always print a warning, and retry a few more times. Usually
that munches some input data. I know no portable way to trap IO errors. 
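
For what it is worth, standard FORTRAN 77 does define the IOSTAT=, ERR= and END= specifiers on I/O statements. Only the sign of the IOSTAT value is portable (zero for success, negative for end of file, positive for an error); the specific positive values are processor-dependent, so finding the cause of an error still needs system-specific help. A minimal sketch, with RDVAL and its return codes as illustrative assumptions:

```fortran
! Classify the outcome of a list-directed READ using only the sign of
! IOSTAT. RDVAL and its return codes 0/1/2 are illustrative names.
INTEGER FUNCTION RDVAL(LUN, VAL)
  IMPLICIT NONE
  INTEGER LUN
  REAL VAL
  INTEGER IOS
  READ (LUN, *, IOSTAT=IOS) VAL
  IF (IOS .EQ. 0) THEN
    RDVAL = 0      ! success, VAL is defined
  ELSE IF (IOS .LT. 0) THEN
    RDVAL = 1      ! end of file
  ELSE
    RDVAL = 2      ! error condition; the exact IOS value is system-specific
  ENDIF
END
```

After an error condition the position of the file is indeterminate according to the standard, so a retry strategy cannot rely on where the next READ will start.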

>o  Did you adopt a particular style for error messages -- like, say VMS
>   or unix or something else?  Why?
One-liner VMS-style, just for consistency, since the program only runs under
VMS. Usually there is more information before the VMS-style message,
indicating the details of what happened. Should I port the program to Unix, I
would probably remove the VMS-ism, but leave the structure alone.

>o  Do you make any attempt to use existing error codes/routines -- for
>   example the C-preprocessor include file /usr/include/sys/errno.h on
>   unix machines, or the error handling routines in VMS?

I use the VMS error handling routines extensively: LIB$ESTABLISH, LIB$SIGNAL,
LIB$STOP, LIB$MATCH_COND, plus ASTs on Control C and Control W. The ASTs set a
flag in common. Routines which know how to be interrupted (or how to refresh
one of the screens) check these flags periodically, with a fallback at the
command line input level. I have my own messages compiled with SET MESSAGE.

Makes life much easier, and allows me to treat system/library routines just
like my own routines. It is however a major hurdle preventing me from porting
to a >>>faster<<< platform (the other one is the absence of something as nifty
as the CLI$ routines under Unix!).

-- 
Ralph Becker-Szendy                          UHHEPG=24742::RALPH (HEPNet,SPAN)
University of Hawaii                              RALPH@UHHEPG.PHYS.HAWAII.EDU
High Energy Physics Group                                  RALPH@UHHEPG.BITNET
Watanabe Hall #203, 2505 Correa Road, Honolulu, HI 96822         (808)956-2931

maine@altair.dfrf.nasa.gov (Richard Maine) (05/30/91)

On 29 May 91 05:26:29 GMT, ralph@uhheph.phys.hawaii.edu (Ralph Becker-Szendy) said:
     ...[discussion of error handling approach]...

Ralph> I prefer an integer function which returns an error code. Like
Ralph> that, if I don't care about the error (sometimes one really
Ralph> doesn't) one can call it as a subroutine.

I just want to point out that calling a function as a subroutine is
neither standard nor portable.  Probably "everyone" including Ralph
knows this, and it was pretty explicit that Ralph's approach was
VMS oriented in general, but this particular construct was not
explicitly labelled as non-portable.

I don't mean this followup as a criticism of the approach.  I just
want to make sure that it is explicitly clear which parts are
system-dependent and which parts are portable.  The VMS system
calls are pretty obvious without saying, but this part might be
missed by some.
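
A portable way to get the same effect is to reference the function and assign its result to a variable that is never used. A small sketch; DOSTEP and RUNALL are hypothetical names, and zero means success here, which is not the VMS convention:

```fortran
! DOSTEP and RUNALL are illustrative names. DOSTEP returns 0 on
! success and nonzero on failure (not the VMS odd-is-success rule).
INTEGER FUNCTION DOSTEP(N)
  IMPLICIT NONE
  INTEGER N
  DOSTEP = 0
  IF (N .LT. 0) DOSTEP = 1
END

SUBROUTINE RUNALL(N, NFAIL)
  IMPLICIT NONE
  INTEGER N, NFAIL
  INTEGER DOSTEP, IDUMMY
  EXTERNAL DOSTEP
  ! Status deliberately ignored: assign it to a throwaway variable
  ! instead of writing the non-standard CALL DOSTEP(N).
  IDUMMY = DOSTEP(N)
  ! Status checked in the usual way.
  NFAIL = 0
  IF (DOSTEP(-N) .NE. 0) NFAIL = 1
END
```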


--
Richard Maine
maine@altair.dfrf.nasa.gov