[comp.sys.apollo] Problems with invalid stack frame

pabong@gonzo.eta.com (Paul A. Bong) (12/03/88)

	Occasionally, a program will die with the following error:

?(sh) "./fs_dump" - unable to unwind stack because of invalid stack frame
  (process manager/process fault manager)

  I am running Aegis 9.7 on 3000s and 4000s.  The problem occurs both with
programs compiled at the Aegis 9.2 level and with programs compiled at 9.7.

	Any help would be appreciated greatly.

Thanks

-pab


pabong@gonzo.eta.com	pabong@aring.eta.com	pabong@uring.eta.com

     To Err is Human, to Fubar Requires a Supercomputer!!

krowitz@RICHTER.MIT.EDU (David Krowitz) (12/10/88)

That particular error message means that your program has managed to
trash its stack (the memory area where temporary variables and
subroutine return addresses are kept). Normally, when a program
dies, you will get a traceback of where the program was executing
when it trapped the error. This is done by the OS taking hold of
the stack of the process that got the error and then looking at
the entries in the stack to find the route which the program
took to get to the point where the error occurred.

A sample traceback (for a program which I control-Q'd out of) looks
like this:

$ dspst -a
?(sh) "/com/dspst" - process quit (OS/fault handler)
In routine "GDM_$$DN3000_VALIDATE".
$ tb
process quit (OS/fault handler)
In routine "GDM_$$DN3000_VALIDATE"
Called from "SHOW_VALUE" line 569
Called from "SHOW_LINE" line 645
Called from "BAR_CHART_$DISPLAY" line 1740
Called from "DSPST" line 816
Called from "PM_$CALL"

It is possible for a program to overwrite its own stack area,
thereby destroying the list of subroutine return addresses
and making it impossible to trace where the error occurred.
Common methods of trashing your stack are: 1) calling a
subroutine which allocates a very large array or arrays
[the stack contains 256KB of space], 2) overflowing the
bounds of an existing array, 3) calling a subroutine
with 16-bit integer arguments when the subroutine actually
takes (and modifies) 32-bit arguments [in some rare
instances]. #1 and #2 are the most common causes, and
they show up only when your input data happens (by chance)
to make the program write past its own storage and into one
of the subroutine return addresses on the stack. If the
overflow lands on other variables on the stack instead,
you may never see the error.
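
Cause 3) above can be imitated safely in C. What follows is only a
minimal sketch, not Apollo-specific code: the names callee_store32 and
demo_width_mismatch are made up for illustration, and a two-element
array of 16-bit slots stands in for two neighbouring stack locations so
that the stray write stays inside memory the program owns.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical callee: it believes its argument is 32 bits wide and
 * stores a 32-bit zero through the address it was handed. */
void callee_store32(void *arg)
{
    int32_t zero = 0;
    memcpy(arg, &zero, sizeof zero);    /* writes 4 bytes */
}

/* Returns 1 if the 16-bit value sitting next to the argument was
 * clobbered. slots[0] plays the caller's 16-bit argument; slots[1]
 * plays whatever happens to live beside it (on a real stack this
 * could be part of a return address). */
int demo_width_mismatch(void)
{
    int16_t slots[2] = { 7, 0x1234 };

    /* The 4-byte store exactly fills the array, so it is well defined. */
    assert(sizeof slots == sizeof(int32_t));

    callee_store32(slots);              /* caller "passes" a 16-bit variable */
    return slots[1] != 0x1234;
}
```

Calling demo_width_mismatch() returns 1: the callee meant to update only
its argument, but the neighbouring slot was zeroed as well. On a real
stack nothing confines the damage to harmless data.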


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter@eddie.mit.edu
krowitz%richter@athena.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

dbfunk@ICAEN.UIOWA.EDU (David B. Funk) (12/10/88)

WRT posting <914@nic.MR.NET> from pabong@gonzo.eta.com (Paul A. Bong):

>	Occasionally, a program will die with the following error:
>
>?(sh) "./fs_dump" - unable to unwind stack because of invalid stack frame
>  (process manager/process fault manager)

When a process faults (access violation, odd address trap, divide by zero, etc)
the process manager tries to walk back thru the process' stack so as to be
able to store a trace-back, look for resources that need to be released, and
do general cleaning up. It does this by looking for various landmarks that are
stored on the stack: return addresses, stack base pointers, etc. If the program
was very badly behaved and wiped out the information stored on its stack
before dying, then this information can be lost. When this happens, the fault
manager gives you the above error message.
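
The walk described above can be sketched in C with a toy model. This is
an illustration under loud assumptions, not the real Aegis fault
manager: struct frame, its fields, and walk_frames are all hypothetical,
and a table of records linked by index stands in for real stack frames.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified frame record; real frames look different. */
struct frame {
    int caller;           /* index of the caller's frame, -1 at the base */
    const char *routine;  /* name to report in the traceback */
};

/* Walks from frame `start` toward the base, storing routine names in
 * `out` (capacity `max`). Returns the number of frames visited, or -1
 * if a link is invalid or `out` is too small. The invalid-link case
 * corresponds to "unable to unwind stack". */
int walk_frames(const struct frame *frames, int nframes, int start,
                const char **out, int max)
{
    int count = 0;
    int i = start;

    assert(frames != NULL && out != NULL);

    while (i != -1) {
        /* The current index must lie inside the table. */
        if (i < 0 || i >= nframes || count >= max)
            return -1;
        /* A sane link points strictly toward the base (a lower
         * index); this check also rules out loops. */
        if (frames[i].caller >= i)
            return -1;
        out[count++] = frames[i].routine;
        i = frames[i].caller;
    }
    return count;
}
```

With an intact table the function returns the chain of routine names,
innermost first, much like the output of tb. Overwrite a single caller
field with garbage and it gives up with -1, because it can no longer
tell where the next frame is.
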
    So when you see that error message, it indicates a process that was writing
values in memory in places that it shouldn't have. There are many ways this can
happen; here are some of the more common:

1)  The number of arguments in a procedure call doesn't match the number
        of arguments in the procedure's parameter list, e.g. calling a
        subroutine with 3 arguments when it expects 4.

2)  The types of the arguments in a procedure call don't match the types of
        the arguments in the procedure's parameter list, e.g. calling a
        subroutine with simple variables when it expects arrays.

3)  Indexing outside the bounds of an array, e.g. trying to store 20 values
        in an array that was dimensioned with 10 elements, or using a
        negative index in an array.

4)  Using an uninitialized or corrupted pointer. Usually caused by a coding
        error in pointer usage, but good pointer code can go wrong when a
        pointer is damaged. Pointer corruption can be caused by one of the
        preceding problems or by using a pointer returned by a system call
        without first checking the returned status to see if the call worked.

5)  The stack frame was already messed up by the previous running of a
        miscreant program. This is very rare but possible. Try running the
        offending program in a fresh shell.
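
Items 3 and 4 are the easiest to guard against in code. The C fragment
below is a hedged sketch of the two habits involved, not code from any
Apollo library: store_checked, find_value, read_or_default, and the
10-element size are hypothetical names and values chosen to mirror the
examples in the list.

```c
#include <assert.h>
#include <stddef.h>

#define NVALUES 10

/* Item 3: refuse any index outside 0..NVALUES-1 instead of writing
 * past the array. Returns 0 on success, -1 if the index is rejected. */
int store_checked(int values[NVALUES], long index, int v)
{
    if (index < 0 || index >= NVALUES)
        return -1;
    values[index] = v;
    return 0;
}

/* Item 4: a hypothetical call that reports success through a status
 * value and sets *out only when it succeeds. */
int find_value(int values[NVALUES], long index, int **out)
{
    if (index < 0 || index >= NVALUES)
        return -1;              /* *out is left untouched on failure */
    *out = &values[index];
    return 0;
}

/* The caller checks the status before trusting the pointer. */
int read_or_default(int values[NVALUES], long index, int fallback)
{
    int *p = NULL;

    if (find_value(values, index, &p) != 0)
        return fallback;        /* never dereference p after a failure */
    assert(p != NULL);
    return *p;
}
```

Storing the 20th value with store_checked fails with -1 rather than
overwriting whatever follows the array, and read_or_default falls back
to a default instead of following a pointer that was never set.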

These kinds of problems can be very hard to catch because they destroy
the incriminating evidence.

Dave Funk
University of Iowa