pabong@gonzo.eta.com (Paul A. Bong) (12/03/88)
Occasionally, a program will die with the following error:

?(sh) "./fs_dump" - unable to unwind stack because of invalid stack frame
      (process manager/process fault manager)

I am running Aegis 9.7 on 3000's and 4000's. Some of the programs are compiled at Aegis 9.2 level but some are at 9.7 when this problem occurs.

Any help would be appreciated greatly.

Thanks
-pab

pabong@gonzo.eta.com
pabong@aring.eta.com
pabong@uring.eta.com

To Err is Human, to Fubar Requires a Supercomputer!!
krowitz@RICHTER.MIT.EDU (David Krowitz) (12/10/88)
That particular error message means that your program has managed to trash its stack (the memory area where temporary variables and subroutine return addresses are kept). Normally, when a program dies, you will get a traceback of where the program was executing when it trapped the error. The OS does this by taking hold of the stack of the process that got the error and looking at the entries in the stack to find the route the program took to reach the point where the error occurred. A sample traceback (for a program which I control-Q'd out of) looks like this:

    $ dspst -a
    ?(sh) "/com/dspst" - process quit (OS/fault handler)
    In routine "GDM_$$DN3000_VALIDATE".
    $ tb
    process quit (OS/fault handler)
    In routine "GDM_$$DN3000_VALIDATE"
    Called from "SHOW_VALUE" line 569
    Called from "SHOW_LINE" line 645
    Called from "BAR_CHART_$DISPLAY" line 1740
    Called from "DSPST" line 816
    Called from "PM_$CALL"

It is possible for a program to overwrite its own stack area, thereby destroying the list of subroutine return addresses and making it impossible to trace where the error occurred. Common methods of trashing your stack are:

1) calling a subroutine which allocates a very large array or arrays [the stack contains 256KB of space],
2) overflowing the bounds of an existing array,
3) calling a subroutine with 16-bit integer arguments when the subroutine actually takes (and modifies) 32-bit arguments [in some rare instances].

#1 and #2 are the most common causes, and they will only show up when your input data just happens (by chance) to make the program write to memory locations that overflow into one of the subroutine return addresses on the stack. If the overflow lands on other variables on the stack instead, you may never see this error.

-- David Krowitz

krowitz@richter.mit.edu (18.83.0.109)
krowitz%richter@eddie.mit.edu
krowitz%richter@athena.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)
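Cause 3 in the list above (a 16-bit argument passed to a routine that stores 32 bits) can be illustrated with a small simulation. This is only a sketch, not Aegis code: the 8-byte `frame` buffer, the `0xBEEF` marker, the function name `callee_sets_int32`, and the big-endian byte order are all assumptions chosen for the illustration.

```python
import struct

# Hypothetical slice of a stack frame: a 16-bit integer argument at offset 0,
# followed by 2 bytes that belong to something else (here a marker value).
# Big-endian layout is an assumption of this sketch.
frame = bytearray(8)
struct.pack_into(">h", frame, 0, 7)        # the caller's 16-bit argument
struct.pack_into(">H", frame, 2, 0xBEEF)   # neighbouring data on the stack

def callee_sets_int32(buf, offset, value):
    """Stand-in for a subroutine that believes its argument is 32 bits wide."""
    struct.pack_into(">i", buf, offset, value)   # writes 4 bytes, not 2

neighbour_before = struct.unpack_from(">H", frame, 2)[0]
callee_sets_int32(frame, 0, 1)             # callee "returns" the value 1
neighbour_after = struct.unpack_from(">H", frame, 2)[0]
arg_as_seen_by_caller = struct.unpack_from(">h", frame, 0)[0]

print(hex(neighbour_before), hex(neighbour_after), arg_as_seen_by_caller)
```

In this model the extra two bytes written by the callee replace the neighbouring value, and the caller reads back the high half of the 32-bit result instead of the value the callee stored. Whether the neighbouring bytes happen to be a return address or an ordinary variable decides whether the damage is ever noticed.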
dbfunk@ICAEN.UIOWA.EDU (David B. Funk) (12/10/88)
WRT posting <914@nic.MR.NET> from pabong@gonzo.eta.com (Paul A. Bong):

> Occasionally, a program will die with the following error:
>
>?(sh) "./fs_dump" - unable to unwind stack because of invalid stack frame
>      (process manager/process fault manager)

When a process faults (access violation, odd address trap, divide by zero, etc.) the process manager tries to walk back through the process's stack so that it can store a traceback, look for resources that need to be released, and do general cleaning up. It does this by looking for various landmarks that are stored on the stack: return addresses, stack base pointers, etc. If the program was very badly behaved and wiped out the information stored on its stack before dying, then this information can be lost. When this happens, the fault manager gives you the above error message.

So when you see that error message, it indicates a process that was writing values into memory in places where it shouldn't have. There are many ways this can happen; here are some of the more common:

1) The number of arguments in a procedure call doesn't match the number of arguments in the procedure's parameter list. For example, calling a subroutine with 3 arguments when it expects 4.

2) The types of the arguments in a procedure call don't match the types in the procedure's parameter list. For example, calling a subroutine with simple variables when it expects arrays.

3) Indexing outside the bounds of an array. For example, trying to store 20 values in an array that was dimensioned with 10 elements, or using a negative index in an array.

4) Using an uninitialized or corrupted pointer. This is usually caused by a coding error in pointer usage, but good pointer code can go wrong when a pointer is damaged. Pointer corruption can be caused by one of the preceding problems, or by using a pointer returned by a system call without first checking the returned status to see if the call worked.

5) The stack frame was already messed up by the previous running of a miscreant program. This is very rare but possible. Try running the offending program in a fresh shell.

These kinds of problems can be very hard to catch because they destroy the incriminating evidence.

Dave Funk
University of Iowa
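The stack walk described in this reply, and the way an out-of-bounds array store defeats it, can be sketched with a toy model. Everything here is an assumption made for illustration and not the actual Aegis process manager: the list-based stack, the two-word frame layout (saved frame pointer, then a routine name standing in for the return address), the routine names, and the `InvalidFrame` exception.

```python
END = -1  # sentinel link stored in the outermost frame


class InvalidFrame(Exception):
    pass


def call(stack, fp, sp, name, n_locals):
    """Push a frame; the stack grows toward index 0. Returns (new_fp, new_sp)."""
    sp -= 2
    stack[sp] = fp          # landmark 1: link to the caller's frame
    stack[sp + 1] = name    # landmark 2: stand-in for the return address
    new_fp = sp
    sp -= n_locals          # locals sit just below the landmarks
    for i in range(sp, new_fp):
        stack[i] = 0
    return new_fp, sp


def unwind(stack, fp):
    """Follow the frame links outward, checking each landmark on the way."""
    trace = []
    while fp != END:
        if not (isinstance(fp, int) and 0 <= fp <= len(stack) - 2):
            raise InvalidFrame("unable to unwind stack because of invalid stack frame")
        saved_fp, name = stack[fp], stack[fp + 1]
        link_ok = isinstance(saved_fp, int) and (saved_fp == END or saved_fp > fp)
        if not link_ok or not isinstance(name, str):
            raise InvalidFrame("unable to unwind stack because of invalid stack frame")
        trace.append(name)
        fp = saved_fp
    return trace


stack = [None] * 32
fp, sp = END, len(stack)
fp, sp = call(stack, fp, sp, "MAIN", 0)
fp, sp = call(stack, fp, sp, "READ_DATA", 2)
fp, sp = call(stack, fp, sp, "FILL_BUFFER", 4)   # 4-element local array at sp

good_trace = unwind(stack, fp)

# Store 6 values into the 4-element array: the last two land on the landmarks.
for i in range(6):
    stack[sp + i] = 99

try:
    unwind(stack, fp)
    outcome = "unwound"
except InvalidFrame as exc:
    outcome = str(exc)

print(good_trace)
print(outcome)
```

Before the overflow the walk yields the full call chain; afterwards the first frame it inspects no longer holds a plausible link or return address, so the walk gives up. Had the loop stored only four or five values, the damage would have been confined to locals or to the link word alone, which matches the observation that these bugs show up only with some input data.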