[comp.sys.apollo] Mystery Error

dave@jplopto.uucp (Dave Hayes) (04/21/89)

I get this sporadically unrepeatable message from various programs I run 
under the AEGIS 9.7 shell. Specifically:

- unable to unwind stack because of invalid stack frame

This occurs rarely, and is not a major problem, but it IS an annoyance. Any
of you folks have any idea what this really means? 

============================================================================
Opinions expressed here are my own and not necessarily those of my employer. 
=-=-=-=-=-=-=-=-=-=-=-=-=<<<<<([Dave Hayes])>>>>>=-=-=-=-=-=-=-=-=-=-=-=-=-=
dave%jplopto@jpl-mil.jpl.nasa.gov | Jet Propulsion Laboratory   M/S 300-329                
{cit-vax,ames}!elroy!jplopto!dave | 4800 Oak Grove Drive Pasadena, CA 91109   
BIX:  dhayes                      | (818) 354-1910
      "Self justification is worse than the original transgression."
============================================================================

krowitz@RICHTER.MIT.EDU (David Krowitz) (04/21/89)

yup! "unable to unwind stack" means that somehow your program has
trashed its stack, gotten a fatal error, and then been unable to 
decipher the stack (because you trashed it) in order to give you
a traceback of where the error occurred. Normally, when a program
gets an error you get a traceback that looks something like:

process quit (OS/fault handler)
In routine "VFMT_$FORMAT_NUMBER" line 171
Called from "VFMT_$MAIN" line 552
Called from "VFMT_$S" line 67
Called from "VFMT_STREAMS_ASM:PROCEDURE$"
Called from "LIST_DIR" line 1871
Called from "LIST_DIRECTORY" line 1946
Called from "PM_$CALL"

which tells you where the error occurred. The system gets this by
examining the stack and finding all of the subroutine calls that
were stored on it. The stack also contains local variables (ie.
variables which are not passed in the subroutine call and which
are not in global memory (ie. not in a COMMON block for you
Fortran programmers)). If your program accidentally overflows an
array you can wind up clobbering your list of subroutine calls in
addition to your local variables.
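
A minimal sketch, in C, of the defensive habit this implies. The
helper name checked_store is hypothetical (it is not an Apollo or
library routine): it refuses any write whose index falls outside the
array, so a bad index is reported instead of silently landing on
whatever sits next to the array.

```c
#include <stddef.h>

/* Hypothetical helper: store `value` at `index` only if the index is
   inside the array. Returns 0 on success, -1 if the write was refused. */
int checked_store(int *array, size_t length, size_t index, int value)
{
    if (array == NULL || index >= length) {
        return -1;              /* out of bounds: do not touch memory */
    }
    array[index] = value;
    return 0;
}
```

A caller would test the return value, e.g. treat
checked_store(buf, 4, 4, 9) returning -1 as a bug to report at once.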


 -- David Krowitz

krowitz@richter.mit.edu   (18.83.0.109)
krowitz%richter@eddie.mit.edu
krowitz%richter@athena.mit.edu
krowitz%richter.mit.edu@mitvma.bitnet
(in order of decreasing preference)

dente@s2.uucp (Colin Dente) (04/22/89)

In article <15461@elroy.Jpl.Nasa.Gov> dave%jplopto@jpl-mil.jpl.nasa.gov writes:
>I get this sporadically unrepeatable message from various programs I run 
>under the AEGIS 9.7 shell. Specifically:
>
>- unable to unwind stack because of invalid stack frame
>
>This occurs rarely, and is not a major problem, but it IS an annoyance. Any
>of you folks have any idea what this really means? 

Well - to put in my 2p's worth (this is England, after all...)

I usually get this error running DPCE (the IBM PC emulator) on a node without
much disk space - and I usually stop getting it if I either move to a node with
more free space, or clear out some junk on the node I got problems on.  For 
this reason I've always assumed that it means you've run out of virtual memory
(i.e. no more room in /sys/node_data/proc_dir (or whatever)) - and this seems
to be consistent with what the error says it is - i.e. stack gets garbaged 'cos
there ain't no room for it - though, having said that - something like:

_Stuff this - I'm off - I can't allocate any more stack space_

might seem more appropriate.

The important thing is that this will be intermittent if you've got a 
reasonable amount of free space on the disk - as it'll only occur if you've
got a lot of processes gobbling up swap space.

Anyone who knows what they're talking about (as opposed to me who just babbles
incoherently) care to comment?

>      "Self justification is worse than the original transgression."
How true!

Colin

=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
| Colin Dente                      | JANET: dente%s2@uk.ac.man.cs.ux          |
| Dept. of Electrical Engineering  | ARPA:  dente%s2%man.cs.ux@ukacrl.BITNET  |
| University of Manchester         | UUCP:  ...!mcvax!ukc!man.cs.ux!s2!dente  |
| England                          |                                          |
|-----------------------------------------------------------------------------|
|   =======================================================================   |
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=

markl@neptune.AMD.COM (Mark Luedtke) (04/22/89)

In article <15461@elroy.Jpl.Nasa.Gov> dave%jplopto@jpl-mil.jpl.nasa.gov writes:
>- unable to unwind stack because of invalid stack frame

I ran into this while debugging some code the other day.  It turns out that I
was attempting to delete a node that was pointed to by a pointer which was off
in never-never land.  The OS error made it impossible to execute a trace-back
and made the bug very hard to find.

Of course, fixing my buggy code made it go away :-)

I'm also interested in what happened here.
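
A hedged sketch of one way to guard against this class of bug in C.
The names node_new and node_delete are hypothetical. The idea is that
a pointer is either NULL or valid: deletion goes through the address
of the pointer and resets it to NULL, so a second delete is refused
rather than freeing garbage. Note that this cannot detect a pointer
that was never initialized at all; that still has to be set to NULL by
hand when it is declared.

```c
#include <stdlib.h>

struct node {
    int value;
    struct node *next;
};

/* Allocates a node with no successor; returns NULL if malloc fails. */
struct node *node_new(int value)
{
    struct node *n = malloc(sizeof *n);
    if (n != NULL) {
        n->value = value;
        n->next = NULL;
    }
    return n;
}

/* Frees the node that *slot points to and resets *slot to NULL.
   Returns 0 if a node was freed, -1 if there was nothing to free. */
int node_delete(struct node **slot)
{
    if (slot == NULL || *slot == NULL) {
        return -1;
    }
    free(*slot);
    *slot = NULL;
    return 0;
}
```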

Mark Luedtke    markl@neptune.amd.com    (512)462-5278

r_miller@apollo.COM (Roger Miller) (04/25/89)

David Krowitz has correctly explained the "unable to unwind stack"
status as resulting from a trashed stack.  This usually results from
writing outside array bounds or through bogus pointers.  Sooner or
later this will generally cause a program fault, and it is during
the processing of this fault that the trashed stack is noticed and
reported.

Unfortunately the unwinding problem hides the first fault, which is
the one you are really interested in.  But all is not lost.  Try
"tb -full":

    $ tb -f
    Process        672 (parent 450, group 0)
    Time           89/04/24.17:38(EDT)
    Program        /r_miller/tb/test.bin
    Status         0304000B: unable to unwind stack because of invalid stack frame (process manager/process fault manager)
    No traceback available
    
    Proc2 Uid       42D4A42B.B0008B6C
    Parent Process  450 (42D2BF40.80008B6C)
    Process Group   0 (42D2BF40.80008B6C)
    Fault Status    00120011: access violation (OS/fault handler)   <-----
    Access Addr     00000000                                        <-----
    IR              C001
    Acc. Info       B041
    User Fault PC   000081EA                                        <-----
    D0-D3:          FFFFFFFF 00004000 00000000 00000000
    D4-D7:          031CFE00 FFFF0000 00000000 035012CC
    A0-A3:          000081EC 0320001C 031D8C04 03180094
    A4-A7:          031CFDF4 00010000 00000000 031CF600
    Supervisor ECB  00000000
    Supervisor SR   0000
    Supervisor PC   00000000

and, lo and behold, there is the original fault.  It was an access violation
at address 81EA, attempting to access address 0.  Where is 81EA?  Ask dde:

    dde> des -loc `va(81ea)
    statement "\\`image(1)\test\trash\15"
    start address: 000081E8
    end address: 000081EB
    file: "//divali/r_miller/tb/test.pas" (89/04/24.16:59)

Now you're well on your way to debugging the problem.

dave@jplopto.uucp (Dave Hayes) (04/26/89)

This is a multiple response. Thanks to all who have responded...I'm really
glad to know that I'm not alone!

David Krowitz:
> yup! "unable to unwind stack" means that somehow your program has
> trashed its stack, gotten a fatal error, and then been unable to 
> decipher the stack (because you trashed it) 

That would explain the errors that my users keep getting, tho. We are 
using Mentor Graphics tools, and their EXPAND tool is the only one that they
see this error in. I guess I should take that up with them.

But how does this explain that I've gotten this error from shell commands 
like ARGS and WHILE? Sometimes some of the commands in /com give me this 
error. Aren't they debugged sufficiently to not overrun their stacks?

Roger Miller:
> and, lo and behold, there is the original fault.  It was an access violation
> at address 81EA, attempting to access address 0.  Where is 81EA?  Ask dde:
> [...bunch of stuff...]
> Now you're well on your way to debugging the problem.

Am I? What is DDE? What does all that garbage mean? 

============================================================================
Opinions expressed here are my own and not necessarily those of my employer. 
=-=-=-=-=-=-=-=-=-=-=-=-=<<<<<([Dave Hayes])>>>>>=-=-=-=-=-=-=-=-=-=-=-=-=-=
dave%jplopto@jpl-mil.jpl.nasa.gov | Jet Propulsion Laboratory   M/S 300-329                
{cit-vax,ames}!elroy!jplopto!dave | 4800 Oak Grove Drive Pasadena, CA 91109   
BIX:  dhayes                      | (818) 354-1910
      "What can the tiger catch in the dark recesses of his own lair?"
============================================================================

markl@neptune.AMD.COM (Mark Luedtke) (04/26/89)

In article <15461@elroy.Jpl.Nasa.Gov> dave%jplopto@jpl-mil.jpl.nasa.gov writes:
>- unable to unwind stack because of invalid stack frame

I got this message (same OS) while trying to debug some buggy code the other
day.  I was trying to delete an object but the pointer to it was in
never-never land.  This caused the traceback to not operate, and the bug
was very hard to find because of it.

A pretty ungraceful handling of whatever situation caused it, I'd say.

Mark Luedtke    markl@neptune.amd.com    (512)462-5278

wescott@LNIC1.HPRC.UH.EDU (Andrew M. Wescott) (04/26/89)

I'll bet DDE is the Distributed Debugging Environment.  Do a
man on dde and get "Domain Distributed Debugging Environment
Reference" (011024-A00).

Andrew Wescott
University of Houston 
Department of Chemical Engineering

r_miller@apollo.COM (Roger Miller) (04/27/89)

In reply to Dave Hayes (and of interest to other SR9.7 users):

>   > Now you're well on your way to debugging the problem.
>   Am I? What is DDE? What does all that garbage mean?

Sorry, I should have pointed out that this was all based on SR10.  DDE
(Distributed Debugging Environment) is the SR10 debugger; it replaces
debug.  The "-full" option to tb is also new at SR10.

Under SR9.7 the tools aren't as sharp.  You can run the program under
debug to catch the original fault before the stack unwinder is invoked.
But if the stack has been corrupted the debugger won't be able to figure
out where it is either.  It will probably say something like

    (process_5) access violation (OS/fault handler)
    *** Error: Stack pointers invalid; cannot establish valid environment.

Your best bet is probably to use db to find the address where the fault
occurs:

    $ db -g test.bin
    ... startup stuff deleted ...
    access violation (OS/fault handler)
    F     120011
    8076: 4E75                              <----

This tells you that the access violation occurred at address 8076. (The
contents of this address, 4E75, is an RTS instruction.  A routine probably
overwrote its return address on the stack, then faulted trying to return.)
To locate address 8076 go back to debug.  With the VA command and some trial
and error you should be able to find the routine containing it.  Then
"BREAK -VA 16#8076" and look in the source display for the line marked with
a breakpoint symbol.

>   But how does this explain that I've gotten this error from shell commands
>   like ARGS and WHILE?

This is just a guess, but since programs run in the same process as the
shell at SR9.7 it is possible for a stray write to clobber some of the
shell's part of the stack.  The offending program may be long gone by the
time the problem shows up.  (SR10 adopts the Unix model of invoking each
program in a separate process, so this is no longer a problem.)
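
To illustrate the separate-process model in general terms, here is a
sketch using the POSIX calls fork and waitpid. It says nothing about
how SR10 implements program invocation; run_isolated, well_behaved
and faulty are hypothetical names. The point is only that a program
which dies in its own process cannot damage the stack of whoever
launched it.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Runs `program` in a child process.
   Returns 0 if the child exited normally with status 0, the signal
   number if a signal killed it, and -1 for any other outcome. */
int run_isolated(void (*program)(void))
{
    int status;
    pid_t child = fork();

    if (child < 0) {
        return -1;                      /* fork failed */
    }
    if (child == 0) {
        program();                      /* child: run the program */
        _exit(0);
    }
    if (waitpid(child, &status, 0) < 0) {
        return -1;
    }
    if (WIFSIGNALED(status)) {
        return WTERMSIG(status);        /* child died from a signal */
    }
    if (WIFEXITED(status) && WEXITSTATUS(status) == 0) {
        return 0;
    }
    return -1;
}

/* Two stand-in "programs" for demonstration. */
void well_behaved(void) { }
void faulty(void) { abort(); }          /* dies with SIGABRT */
```

The launcher keeps running after run_isolated(faulty) returns; only
the child is lost.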

danny@idacom.UUCP (Danny Wilson) (04/27/89)

In article <15546@elroy.Jpl.Nasa.Gov>, dave@jplopto.uucp (Dave Hayes) writes:
> 
> That would explain the errors that my users keep getting, tho. We are 
> using Mentor Graphics tools, and their EXPAND tool is the only one that they
> see this error in. I guess I should take that up with them.

A stack frame error *may* happen if the sheets that are being compiled
into the design file have buggy BLM's (Behavior language models).
That is, if your people are writing their own BLM code and it is
not quite up to snuff. Another possibility is if you use third-party
models (like Logic Automation or Quadtree) and there are bugs in them.

I have *never* seen a stack frame error during Expand in over 
5 years of using expand... I would expect it more during a simulation
run with QuickSim.

> 
> But how does this explain that I've gotten this error from shell commands 
> like ARGS and WHILE? Sometimes some of the commands in /com give me this 
> error. Aren't they debugged sufficiently to not overrun their stacks?
> 

This sounds like something is seriously flaky in your system - not
the application programs you are using. I'd give the hotline a call.

> Am I? What is DDE? What does all that garbage mean? 

DDE is a debugger that is very useful for writing programs. Of course,
the gentleman was assuming you were developing software rather
than hardware. (if you really get into custom BLM development
the debugger is *quite* useful [Mentor Graphics sites before 
version 7.0 must use the Apollo SR9.7 - 'debug' program rather
than DDE])

-- 
Danny Wilson
IDACOM Electronics		danny@idacom.uucp
Edmonton, Alberta		alberta!idacom!danny
C A N A D A

rtw@lzfmd.att.com (R. T. Wurth) (04/27/89)

In article <633@idacom.UUCP>, danny@idacom.UUCP (Danny Wilson) writes:
> In article <15546@elroy.Jpl.Nasa.Gov>, dave@jplopto.uucp (Dave Hayes) writes:
> > 
> > That would explain the errors that my users keep getting, tho. We are 
> > using Mentor Graphics tools, and their EXPAND tool is the only one that they
> > see this error in. I guess I should take that up with them.
> 
> A stack frame error *may* happen if the sheets that are being compiled
> into the design file have buggy BLM's (Behavior language models).
> That is, if your people are writing their own BLM code and it is
> not quite up to snuff. Another possibility is if you use third-party
> models (like Logic Automation or Quadtree) and there are bugs in them.
> 
> I have *never* seen a stack frame error during Expand in over 
> 5 years of using expand... I would expect it more during a simulation
> run with QuickSim.
> Danny Wilson
 ...
> IDACOM Electronics		danny@idacom.uucp
> Edmonton, Alberta		alberta!idacom!danny
> C A N A D A

One particularly bad problem that burns us every time is that the
supplied models for programmable logic (PALs, (PAL is a trademark of
MMI/AMD) in particular) blow up when confronted with "skinny" JEDEC
formatted fuse information.  "Fat JEDEC" explicitly specifies every
fuse.  "Skinny JEDEC" uses a special operator at the start of the
JEDEC file to specify a default state and then includes only those
lines of a "fat JEDEC" file containing at least one fuse not in the
default state.

The versions of PALASM (PALASM is a trademark of MMI/AMD) that we
use (on a PC and on our VAX host running the UNIX System V
operating system (UNIX is a trademark of AT&T)) all produce skinny
JEDEC.  Someone hacked up the PALASM source code from a very old
PALASM release to produce "fat" JEDEC on our workstations, but it is a
very old version that doesn't support all devices.  Our current
workaround is to download the "skinny" JEDEC to our programmer, and
then upload it from the programmer, since the programmer (DATA I/O)
always puts out "fat" JEDEC.

I don't recall which tool blows up, but when it does, it produces no
error message, it just faults back to the shell.

	Rich Wurth / lzfmd!rtw OR rtw@lzfmd.ATT.COM
	AT&T-Bell Labs / LZ 1H-303 / 201 576 6332
	307 Middletown-Lincroft Rd. / Lincroft, NJ  07738

krowitz@RICHTER.MIT.EDU (David Krowitz) (04/27/89)

Ok, guys, here's the scoop in full public view ...

If your program trashes its stack, or if another program
running in the same process (ala SR9 or SR10 with the
inprocess switch set) trashes the stack, /com/tb will
fail. There is no way around it. The information that
tb prints out is stored on the stack, and once you've
trashed the stack there ain't noth'un to print. Debugging
programs which have messed up the stack is intrinsically
difficult because of the nature of the error -- it erases
the debugging info. The common causes of a bad stack are:

1) referencing a variable via a pointer which has not
   been set up correctly. Both reading *and* writing to
   variables via a bad pointer can result in trashing the
   stack (because not only can a bad pointer result in
   your program writing over existing info on the stack,
   but your program will read values that are not legitimate,
   and will use those values (pointers, loop indices, etc)
   in further computation which can go out of control).

2) mismatched argument lists in calls to external subroutines
   and functions. If you pass a 16-bit integer to a
   subroutine which thinks it is a 32-bit integer and the
   subroutine stores a value into that variable, then you
   will have trashed your stack. Subroutine and function
   arguments are passed on the stack. In C, they are passed
   by value (ie. the actual value of the variable is on the
   stack), so if you write a 32-bit value into a 16-bit space,
   you wipe out whatever followed it on the stack. Since
   the return address for the subroutine is also stored on
   the stack, you can destroy it and your subroutine will
   return to never-never land rather than to the calling
   program. With Pascal and Fortran, subroutine arguments
   are passed by reference (ie. the address of the variable
   is on the stack), so if you write a 32-bit value to a
   16-bit space, you don't trash the stack immediately. BUT!
   Subroutines store their local variables on the stack in
   addition to the arguments that were passed into them. If
   a subroutine calls a second subroutine passing some local
   variables to that 2nd subroutine, and if that second
   subroutine then writes a 32-bit value to one of those
   local variables, and if that local variable was actually
   a 16-bit value, THEN the 2nd subroutine will trash the
   stack of the FIRST subroutine! (and the first subroutine
   will return to never-never land as a result of an action
   by the second subroutine ... nifty error, eh?)

3) of course, there's always the good old standby of
   allocating an array which is smaller than what you actually
   use (or of messing up the calculation of a loop index and
   simply writing outside of the allocated array). If the
   array is a local variable in a subroutine, then it was
   allocated on the stack, and writing outside of the bounds
   of the array will overwrite the stack (and of course
   reading outside of the bounds will give you garbage values
   which can cause something else to go haywire). If the
   array is statically allocated (a global variable, a Fortran
   COMMON block, etc), you can still trash the stack if you
   go far enough outside the bounds of the array.
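
The size mismatch in item 2 can be simulated safely in C. This is a
sketch, not a reproduction of any real call sequence: store_32 and
neighbor_clobbered are hypothetical names, and the two adjacent
16-bit "locals" are modelled as a two-element array so that the
oversized write stays inside memory the example owns. The callee
believes it was handed the address of a 32-bit integer and writes
four bytes; the second 16-bit slot is destroyed as a side effect.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical callee: assumes `target` is a 32-bit integer. */
void store_32(void *target, int32_t value)
{
    memcpy(target, &value, sizeof value);   /* always writes 4 bytes */
}

/* Returns 1 if the slot next to the intended target was overwritten. */
int neighbor_clobbered(void)
{
    int16_t locals[2];

    locals[0] = 0;          /* the 16-bit variable the caller passes */
    locals[1] = 0x0BAD;     /* stands in for whatever lies next to it */

    store_32(locals, 0x12345678);   /* caller passes a 16-bit slot */

    return locals[1] != 0x0BAD;
}
```

In a real stack frame the neighboring slot could just as well be a
saved register or the return address, which is how the traceback
information gets lost.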

How do you debug one of these monsters? First, you recognize
the symptoms (can't unwind stack ...). Then you start putting
print statements (or breakpoints with the debugger) scattered
about your program so you can see how far the program got
before it blew up. Use the -dba or -dbs switches along with
the -subchk switch to look for array overflows. Note that
turning debugging on/off can change the error because the
addition of the debugging info to the executable program will
change where the various variables are located in memory. You
can cause the error to go away altogether by turning on the
debugging info (because the program winds up trashing some
debugging info rather than one of its own variables or the
stack). Once you have the problem narrowed down to a subroutine,
check everything listed above. PRINT OUT the values of ALL of
your critical variables (pointer values, loop indices, etc).
Check them just prior to their use. They may be ok when you
enter the routine, but get clobbered by some other code before
they are used. Adding print statements to your code to track its
progress and to check your variables and pointers is crude, but
effective. Some of the nastier error sequences will even confuse
the debugger into thinking that the error is elsewhere.
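
One way to automate "check them just prior to their use" is to
bracket a suspect buffer with guard words and test the guards at
interesting points. The sketch below is an illustration in C, with
hypothetical names (guarded_buffer, guarded_ok, simulate_overrun)
and an arbitrary marker value. The overrun is simulated by writing
through a byte pointer to the whole structure, so the example never
touches memory it does not own.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GUARD_VALUE 0xDEADBEEFu     /* arbitrary marker pattern */

/* A buffer bracketed by two guard words. */
struct guarded_buffer {
    uint32_t front_guard;
    char     data[8];
    uint32_t back_guard;
};

void guarded_init(struct guarded_buffer *g)
{
    g->front_guard = GUARD_VALUE;
    memset(g->data, 0, sizeof g->data);
    g->back_guard = GUARD_VALUE;
}

/* Returns 1 while both guards are intact, 0 once either has changed. */
int guarded_ok(const struct guarded_buffer *g)
{
    return g->front_guard == GUARD_VALUE && g->back_guard == GUARD_VALUE;
}

/* Fills `data` and then `extra` further bytes, clamped to the end of
   the structure, to mimic a loop that runs past the array. */
void simulate_overrun(struct guarded_buffer *g, size_t extra)
{
    unsigned char *bytes = (unsigned char *)g;
    size_t start = offsetof(struct guarded_buffer, data);
    size_t end = start + sizeof g->data + extra;

    if (end > sizeof *g) {
        end = sizeof *g;
    }
    memset(bytes + start, 'X', end - start);
}
```

Calling guarded_ok before and after each suspect routine narrows
down which one did the damage, in the same spirit as the print
statements.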


 -- David Krowitz

Maybe this should be a talk at the next ADUS conference?

markl%neptune.AMD.COM%amdcad%ames@mailrus.UUCP (Mark Luedtke) (04/27/89)

Thanks for the clear explanation.  The thing that makes this a 'mystery
error' is that I have done these things before without that problem.  It
only occurs when the specific operation runs into the stack, which is not
common, but when it occurs, everything kills the stack (the old theory of
localization).  My program bug, found just as you mentioned, didn't take long
to find, it was just ugly, and repetitive.  I'm sure there is some way Apollo
could handle this more elegantly.

Mark Luedtke    markl@neptune.amd.com    (512)462-5278

dmuntz@caen.engin.umich.edu (Dan Muntz) (04/29/89)

To generate your own 'Mystery Error' compile and run this program:

main ()
{
 main ();
}

The resulting error: "unable to unwind..."
is a bit cryptic and difficult to track down due to reasons
mentioned by others.

  -Dan M.
  dmuntz@caen.engin.umich.edu

krowitz@RICHTER.MIT.EDU (David Krowitz) (05/03/89)

If I understand your C code (and I am not much of a C 
programmer), you are calling the main program recursively
with no ending condition. Each recursive call to the
program will "push" the return address of the call onto
the stack. Since there is no ending condition to the
recursive calls (ie. the subroutine calls never return),
the stack eventually gets completely filled with these
return addresses, and the stack overflows and trashes
something else (like maybe the program itself) for a
change. 

It's the same basic problem ... a program which writes
over itself destroys the debugging info along with
everything else.
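
For contrast, a sketch of recursion with an explicit depth budget,
which stops with an error code instead of running the stack out. The
function descend and its parameters are hypothetical names chosen for
the illustration.

```c
/* Recurses from `level` toward `wanted`. Returns the level reached,
   or -1 if going deeper would exceed `max_depth`. */
int descend(int level, int wanted, int max_depth)
{
    if (level > max_depth) {
        return -1;          /* budget exhausted: refuse to go deeper */
    }
    if (level == wanted) {
        return level;       /* ending condition */
    }
    return descend(level + 1, wanted, max_depth);
}
```

With a budget of 100, descend(0, 10, 100) returns 10, while
descend(0, 1000, 100) returns -1 rather than recursing a thousand
frames deep.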


 -- David Krowitz

squires@hpcvlx.HP.COM (Matt Squires) (05/04/89)

/ comp.sys.apollo / krowitz@RICHTER.MIT.EDU (David Krowitz) /

> If I understand your C code (and I am not much of a C 
> programmer) you are calling the main program recursively
> with no ending condition. Each recursive call to the
> program will "push" the return address of the call onto
> the stack. Since there is no ending condition to the
> recursive calls (ie. the subroutine calls never return),
> the stack eventually gets completely filled with these
> return addresses, and the stack overflows and trashes
> something else (like maybe the program itself) for a
> change. 

If I understand machine architectures (and I am not much of
a machine archeologist) then ``sufficiently powerful (1)'' processors
cause a hardware trap on stack overflow (2).  One should hope that
processes would not be able to overflow their stack and then scribble on
data, bss, or text segments.

mcs

(1)  The Intel 8088 is not sufficiently powerful, while I assume Hpollo
	workstation processors are.

(2)  Tanenbaum, Andrew S., _Operating Systems_ Copyright 1987
	Prentice-Hall, Inc.  p. 229

GBOPOLY1@NUSVM.BITNET (fclim) (05/09/89)

Hi,
     Has anybody successfully ported GCC to Apollo boxes?  If so, what
do you get (how corny) if you compile and run the code
             main() { main(); }
This piece of code is tail-recursive and from what I heard, GCC is a
tail-recursive compiler.  Will the stack still overflow?
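
To make the term concrete, here is a sketch of a tail-recursive
function next to the loop that tail-call elimination would
effectively turn it into. The function names are hypothetical, and
nothing here is a claim about what GCC does on Apollo hardware: the C
language does not require a compiler to perform this transformation,
so without it the recursive form still uses one stack frame per call.

```c
/* Tail-recursive sum of 1..n: the recursive call is the very last
   action, so the current frame is no longer needed when it is made. */
unsigned long sum_to_recursive(unsigned long n, unsigned long acc)
{
    if (n == 0) {
        return acc;
    }
    return sum_to_recursive(n - 1, acc + n);
}

/* The same computation as a loop, using a single stack frame. */
unsigned long sum_to_loop(unsigned long n)
{
    unsigned long acc = 0;

    while (n != 0) {
        acc += n;
        n--;
    }
    return acc;
}
```

Both sum_to_recursive(100, 0) and sum_to_loop(100) give 5050. A
main() that only calls itself is the degenerate case: turned into a
loop it would spin forever instead of overflowing.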


fclim          --- gbopoly1 % nusvm.bitnet @ cunyvm.cuny.edu
computer centre
singapore polytechnic
dover road
singapore 0513.