zhang@buast7.bu.edu (Zhang Yun Fei) (01/25/91)
This is a summary of the replies to the question I posted to the net
yesterday. I got solutions and suggestions from the following netters:

    khb@Eng.Sun.COM
    borcherb@turing.cs.rpi.edu
    mckie@sky.arc.nasa.gov
    larry@pylos.cchem.berkeley.edu
    carlo@nu.uchicago.edu

As this seems to be a rather popular problem among number crunchers, I
summarize the answers in the rest of this message. Thanks to all of you
who kindly responded to my question. I appreciate it.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Yun Fei Zhang           % E-mail:                                        %
% Astronomy Department    % SPAN:   east::"zhang@buast0.bu.edu"            %
% Boston University       % BITNET: zhang@buasta                           %
% 725 Commonwealth Ave.   % INTnet: zhang@buast0.bu.edu                    %
% Boston, MA 02215        %         zhang@bu-ast.bu.edu                    %
%--------------------------------------------------------------------------%
% TEL: (617)-353-8917                                                      %
% TELEX: 95-1289 BOS UNIV BSN                                              %
% TELEFAX: (617)-353-5704                                                  %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

-----------------------------------------------------------------------------
ORIGINAL QUESTION POSTED:

I have a question about the error handler of SUN FORTRAN. The question
is how to locate the place where an arithmetic error occurs in a
program. This is especially helpful as I am writing computation-intensive
code. On VAX/VMS machines, the code crashes when it encounters such an
arithmetic error and tells the user where it occurred. However, on SUNs,
it only shows a message at the end of the job that says something like
the following:

> Warning: the following IEEE floating-point arithmetic exceptions
> occurred in this program and were never cleared:
> Inexact; Division by Zero; Invalid Operand; .....

My question is how to determine the point(s) in the code where the
indicated arithmetic exceptions happened. Modifying the IEEE error
handler (e.g. sigfpe_ieee, etc.) seems a possible approach, but it
involves changes to these lower-level routines, which I am reluctant to
try.
Is there any other option for achieving the same goal (e.g.,
compilation/linking options or software tools)?

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
SOLUTIONS:

khb@Eng.Sun.COM pointed in the right direction for a solution in the
first response received:

From f77 code, as mentioned in the Numerical Computation Guide and the
Fortran User's Guide,

      i = ieee_handler("set","common",%val(2))   ! aka SIGFPE_ABORT
                                                 ! if you use the .h
                                                 ! file with mathincludes

will cause execution to stop on divide by zero, operations on NaNs etc.
If you want to catch inexact, you can do that too (ask for it by name).

####

And borcherb@turing.cs.rpi.edu pointed out the following:

Try

    man f77_ieee_environment

for documentation on how to do this. Unfortunately, I believe that on
SUN-4's, the handler doesn't actually report the address at which the
exception happened. I'm told that this is because of the pipelined
nature of the SPARC processor. At any rate, I was unable to get this
working on a SUN-4. However, my code does stop the program as soon as
the SIGFPE signal is sent to it.

####

% The most pragmatic approach, I think, is from mckie@sky.arc.nasa.gov,
% who wrote:

On our Sun, there were so many people who had the same question as you
about how to find ieee errors in a fortran program that I set up a man
page to try to explain it. I'll include a part of that man page below.

It seems a bit strange in comparison to your vms experiences, but after
you've done it a few times, it's not too bad. And the ieee approach is
more flexible & more under your control.

-Bill McKie
 NASA Ames Research Center
 mckie@sky.arc.nasa.gov

=============================================================

SYNOPSIS
     The following is an abbreviated description of how to use the
     Sun DBX debugger to find where floating point errors are
     occurring in a fortran program.

DESCRIPTION
     Step 1.
          Add the following statements to the program's main module:

               external handler
               call ieee_handler('set','common',handler)

          The "external" statement is a declaration, and should
          appear in the preliminary non-executable statements
          section of the main program source code. The call to
          "ieee_handler" should be placed into the main program as
          one of the first executable statements.

     Step 2.
          Add the following "stub" subroutine to the main program
          source code:

               subroutine handler(i1,i2,i3,i4)
               end

          The subroutine name "handler" is arbitrary, but must be
          the same in the subroutine statement, the external
          statement, and the call ieee_handler statement's 3rd
          argument.

     Step 3.
          Compile the program as usual, but everywhere the f77
          command is used, include the "-g" option in the f77
          command line. E.g. if the program is entirely in the file
          prog.f (including the handler subroutine), then the
          following could be used:

               f77 -g -o prog prog.f

     Step 4.
          The program is now ready to run, and it could simply be
          run as usual using the "prog" command. However, to find
          the place where a floating point error is occurring, the
          dbx debugger utility is used to control execution of the
          program. This is how it is run:

               dbx
               dbx> debug prog
               dbx> catch FPE
               dbx> run
               signal FPE in <routine name> at line n in file <file_name>
               dbx> quit

          The "dbx>" are prompts from the debugger. The line
          following the "run" command is output by the debugger,
          and is a clue as to where the error occurred.

     Step 5.
          Edit the file <file_name> and move to line n to see where
          the error was occurring.

SEE ALSO
     dbx(1) dbxtool(1) f77(1)

LIMITATIONS
     The above description demonstrates only a small subset of the
     dbx debugger's capabilities. See the dbx user's manual for
     more information on what dbx can do.
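
For anyone who wants to try Steps 1 to 5 end to end, the following is a
minimal sketch of what prog.f might look like with the two pieces from
the man page in place. It assumes Sun f77 with ieee_handler available.
The variables x and y and the deliberate division by zero are
hypothetical test material, not part of the man page, and the sketch
uses the integer-function form of ieee_handler shown in the other
replies rather than the "call" form above.

```fortran
      program prog
c     Step 1: declare the handler and install it for the 'common'
c     exceptions (invalid, overflow, division by zero).
      external handler
      integer ieeer, ieee_handler
      real x, y
      ieeer = ieee_handler('set','common',handler)
c     Hypothetical bug for testing: divide by a variable that is
c     zero at run time.
      x = 0.0
      y = 1.0/x
      write(*,*) 'y = ', y
      end

c     Step 2: the stub handler.  Its name must match the external
c     statement and the 3rd argument of ieee_handler.
      subroutine handler(i1,i2,i3,i4)
      end
```

Compiled with "f77 -g -o prog prog.f" and run under dbx with "catch FPE"
as in Step 4, the debugger would be expected to stop at or near the
line containing the division, subject to the SPARC caveat mentioned in
the earlier reply.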
####

% larry@pylos.cchem.berkeley.edu contributed another way around:

I'm not sure that I understand what you are asking, but here is the
code that I use to make the default behavior similar to what you
describe - die on divide by zero, etc. It requires the use of the
C preprocessor on a Sun to make it a 'compile time option'.

#if ERROR && SUN
#include <f77/f77_floatingpoint.h>
      external error_handler
      integer ieeer,ieee_handler,error_handler
#endif
cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc
c set up the error handler to barf on all exceptions and then clear
c the inexact exceptions which occur with any floating point operation
c these should be the first executable lines in your code.
#if SUN && ERROR
      ieeer=ieee_handler('set','all',error_handler)
      ieeer=ieee_handler('clear','inexact',error_handler)
#endif

c a separate function
#if SUN && ERROR
c error handler that is called by the IEEE package on the sun
      integer function error_handler(sig,code,sigcontext)
      integer sig,code,sigcontext(5)
      character label*16
c default label in case the code matches none of the values below
      label='unknown'
      if (loc(code).eq.212) label='overflow'
      if (loc(code).eq.208) label='invalid'
      if (loc(code).eq.204) label='underflow'
      if (loc(code).eq.200) label='division'
      if (loc(code).eq.196) label='inexact'
      write(*,*) 'IEEE exception code ',loc(code),
     2 ' ( ',label(1:lnblnk(label)),' ) occurred at pc ',sigcontext(4)
c any error processing can be done here.  I just choose to kill the
c program gracefully
      call abort(' IEEE exception code - Program Halted')
      stop
      end
#endif

########

% carlo@nu.uchicago.edu showed me a similar trick, which can track the
% chain of calling routines down to the break point:

Hi. The following method is the one that I have had to resort to.
It's a bit of a kludge, but at least it allows identification of the
routine within which the exception occurred, and the problem can
usually be identified using dbx:

      program foobawooba
      external handler
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      data nlevel/1/stack/'main',19*''/
      ieeer=ieee_handler('set','common',handler)
c 'common' handles invalid, overflow, and division exceptions --- see
c "man ieee_handler".  handler is an external routine shown at the end
c of this example.  This line will call handler whenever a 'common'
c exception occurs.
      [ some code here]
      call haha1
      [ more code]
      stop
      end

      subroutine haha1
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
c push this routine's name on entry, as in haha2
      nlevel=nlevel+1
      stack(nlevel)='haha1'
      .
      call haha2
      .
      stack(nlevel)=''
      nlevel=nlevel-1
      return
      end

      subroutine haha2
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      nlevel=nlevel+1
      stack(nlevel)='haha2'
      .
      stack(nlevel)=''
      nlevel=nlevel-1
      return
      end

      integer function handler ( sig, code, sigcontext )
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      integer sig
      integer code
      integer sigcontext(5)
      write(6,*) 'Bomb!  Here comes a stack dump:'
      do 1 i=1,nlevel
         write(6,*) stack(i)
    1 continue
      write(6,*) 'Number of levels:',nlevel
      call abort
      end

The effect of all these (admittedly ugly and machine specific)
gymnastics is that the routine in which the exception occurred is
pinpointed by the array 'stack' and the variable 'nlevel'. Since
execution is halted by means of 'call abort', all the debugging
information is still available, and the problem may be identified (if
the debugger was active) by examining the guilty routine.

The pain-in-the-ass part of all this is that the lines affecting
'stack' and 'nlevel' must be included in every routine in the program.
I'm not that happy with it, but it's the best I've been able to do. One
might wish that it would dawn on the *%&#!!!?
C programmers who developed Fortran for the Sun that for the purposes
of scientific programming, failure to *automatically* halt on division
by zero is a bug, not a feature :-(>.

I hope this helps.

Carlo Graziani

#####
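
One way to reduce the per-routine boilerplate in Carlo's method is to
move the 'stack' and 'nlevel' lines into a pair of small helper
routines. The sketch below is only an illustration: the names dbpush
and dbpop are hypothetical and do not come from Carlo's post, and like
the original code it assumes the call depth never exceeds the 20
entries of 'stack'.

```fortran
c     Hypothetical helper: record a routine name on entry.
      subroutine dbpush(name)
      character name*(*)
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      nlevel = nlevel + 1
      stack(nlevel) = name
      return
      end

c     Hypothetical helper: drop the most recent name before return.
      subroutine dbpop
      common /debug1/nlevel
      common /debug2/ stack(20)
      character stack*25
      stack(nlevel) = ' '
      nlevel = nlevel - 1
      return
      end
```

Each routine would then need only "call dbpush('haha2')" as its first
executable statement and "call dbpop" before each return. The handler
and the common blocks in the main program stay as in Carlo's example.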