dave@jplopto.uucp (Dave Hayes) (04/21/89)
I get this sporadically unrepeatable message from various programs I run under the AEGIS 9.7 shell. Specifically: - unable to unwind stack bacause of invalid stack frame This occurs rarely, and is not a majro problem, but it IS an annoyance. Any of you folks have any idea what this really means? ============================================================================ Opinions expressed here are my own and not necessarily those of my employer. =-=-=-=-=-=-=-=-=-=-=-=-=<<<<<([Dave Hayes])>>>>>=-=-=-=-=-=-=-=-=-=-=-=-=-= dave%jplopto@jpl-mil.jpl.nasa.gov | Jet Propulsion Laboratory M/S 300-329 {cit-vax,ames}!elroy!jplopto!dave | 4800 Oak Grove Drive Pasadena, CA 91109 BIX: dhayes | (818) 354-1910 "Self justification is worse than the original transgression." ============================================================================
krowitz@RICHTER.MIT.EDU (David Krowitz) (04/21/89)
yup! "unable to unwind stack" means that somehow your program has trashed its stack, gotten a fatal error, and then been unable to decipher the stack (because you trashed it) in order to give you a traceback of where the error occurred. Normally, when a program gets an error you get a traceback that looks something like: process quit (OS/fault handler) In routine "VFMT_$FORMAT_NUMBER" line 171 Called from "VFMT_$MAIN" line 552 Called from "VFMT_$S" line 67 Called from "VFMT_STREAMS_ASM:PROCEDURE$" Called from "LIST_DIR" line 1871 Called from "LIST_DIRECTORY" line 1946 Called from "PM_$CALL" which tells you where the error occurred. The system gets this by examining the stack and finding all of the subroutine calls that were stored on it. The stack also contains local variables (ie. variables which are not passed in the subroutine call and which are not in global memory (ie. not in a COMMON block for you Fortran programmers)). If your program accidentatlly overflows an array you can wind up clobbering your list of subroutine calls in addition to your local variables. -- David Krowitz krowitz@richter.mit.edu (18.83.0.109) krowitz%richter@eddie.mit.edu krowitz%richter@athena.mit.edu krowitz%richter.mit.edu@mitvma.bitnet (in order of decreasing preference)
dente@s2.uucp (Colin Dente) (04/22/89)
In article <15461@elroy.Jpl.Nasa.Gov> dave%jplopto@jpl-mil.jpl.nasa.gov writes: >I get this sporadically unrepeatable message from various programs I run >under the AEGIS 9.7 shell. Specifically: > >- unable to unwind stack bacause of invalid stack frame > >This occurs rarely, and is not a majro problem, but it IS an annoyance. Any >of you folks have any idea what this really means? Well - to put in my 2p's worth (this is England, after all...) I usually get this error running DPCE (the IBM PC emulator) on a node without much disk space - and I usually stop getting it if I either move to a node with more free space, or clear out some junk on the node I got problems on. For this reason I've always assumed that it means you've run out of virtual memory (i.e. no more room in /sys/node_data/proc_dir (or whatever)) - and this seems to be consistent with what the error says it is - i.e. stack gets garbaged 'cos there ain't no room for it - though, having said that - something like: _Stuff this - I'm off - I can't allocate any more stack space_ might seem more appropriate. The important thing is that this will be intermittent if you've got a reasonable amount of free space on the disk - as it'll only occur if you've got a lot of processes gobbling up swap space. Anyone who knows what they're talking about (as opposed to me who just babbles incoherently) care to comment? > "Self justification is worse than the original transgression." How true! Colin =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-= | Colin Dente | JANET: dente%s2@uk.ac.man.cs.ux | | Dept. of Electrical Engineering | ARPA: dente%s2%man.cs.ux@ukacrl.BITNET | | University of Manchester | UUCP: ...!mcvax!ukc!man.cs.ux!s2!dente | | England | | |-----------------------------------------------------------------------------| | ======================================================================= | =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
markl@neptune.AMD.COM (Mark Luedtke) (04/22/89)
In article <15461@elroy.Jpl.Nasa.Gov> dave%jplopto@jpl-mil.jpl.nasa.gov writes: >- unable to unwind stack bacause of invalid stack frame I ran into this while debugging some code the other day. It turns out that I was attempting to delete a node that was pointed to by a pointer which was off in never-never land. The OS error made it impossible to execute a trace-back and made the bug very hard to find. Of course, fixing my buggy code made it go away :-) I'm also interresting in what happened here. Mark Luedtke markl@neptune.amd.com (512)462-5278
r_miller@apollo.COM (Roger Miller) (04/25/89)
David Krowitz has correctly explained the "unwable to unwind stack" status as resulting from a trashed stack. This usually results from writing outside array bounds or through bogus pointers. Sooner or later this will generally cause a program fault, and it is during the processing of this fault that the trashed stack is noticed and reported. Unfortunately the unwinding problem hides the first fault, which is the one you are really interested in. But all is not lost. Try "tb -full": $ tb -f Process 672 (parent 450, group 0) Time 89/04/24.17:38(EDT) Program /r_miller/tb/test.bin Status 0304000B: unable to unwind stack because of invalid stack frame (process manager/process fault manager) No traceback available Proc2 Uid 42D4A42B.B0008B6C Parent Process 450 (42D2BF40.80008B6C) Process Group 0 (42D2BF40.80008B6C) Fault Status 00120011: access violation (OS/fault handler) <----- Access Addr 00000000 <----- IR C001 Acc. Info B041 User Fault PC 000081EA <----- D0-D3: FFFFFFFF 00004000 00000000 00000000 D4-D7: 031CFE00 FFFF0000 00000000 035012CC A0-A3: 000081EC 0320001C 031D8C04 03180094 A4-A7: 031CFDF4 00010000 00000000 031CF600 Supervisor ECB 00000000 Supervisor SR 0000 Supervisor PC 00000000 and, lo and behold, there is the original fault. It was an access violation at address 81EA, attempting to access address 0. Where is 81EA? Ask dde: dde> des -loc `va(81ea) statement "\\`image(1)\test\trash\15" start address: 000081E8 end address: 000081EB file: "//divali/r_miller/tb/test.pas" (89/04/24.16:59) Now you're well on your way to debugging the problem.
dave@jplopto.uucp (Dave Hayes) (04/26/89)
This is a multiple response. Thanks to all who have responded...I'm really glad to know that i'm not alone! David Krowitz: > yup! "unable to unwind stack" means that somehow your program has > trashed its stack, gotten a fatal error, and then been unable to > decipher the stack (because you trashed it) That would explain the errors that my users keep getting, tho. We are using Mentor Graphics tools, and their EXPAND tool is the only one that they see this error in. I guess I should take that up with them. But how does this explain that I've gotten this error from shell commands like ARGS and WHILE? Sometimes some of the commands in /com give me this error. Aren't they debugged sufficiently to not overrun their stacks? Roger Miller: > and, lo and behold, there is the original fault. It was an access violation > at address 81EA, attempting to access address 0. Where is 81EA? Ask dde: > [...bunch of stuff...] > Now you're well on your way to debugging the problem. Am I? What is DDE? What does all that garbage mean? ============================================================================ Opinions expressed here are my own and not necessarily those of my employer. =-=-=-=-=-=-=-=-=-=-=-=-=<<<<<([Dave Hayes])>>>>>=-=-=-=-=-=-=-=-=-=-=-=-=-= dave%jplopto@jpl-mil.jpl.nasa.gov | Jet Propulsion Laboratory M/S 300-329 {cit-vax,ames}!elroy!jplopto!dave | 4800 Oak Grove Drive Pasadena, CA 91109 BIX: dhayes | (818) 354-1910 "What can the tiger catch in the dark recesses of his own lair?" ============================================================================
markl@neptune.AMD.COM (Mark Luedtke) (04/26/89)
In article <15461@elroy.Jpl.Nasa.Gov> dave%jplopto@jpl-mil.jpl.nasa.gov writes: >- unable to unwind stack bacause of invalid stack frame I got this message (same OS) while trying to debug some buggy code the other day. I was trying to delete an object but the pointer to it was in never- never land. Caused tracebace to not operate, and the bug was very hard to find because of it. A pretty ungraceful handling of whatever situation caused it, i'ld say. Mark Luedtke markl@neptune.amd.com (512)462-5278
wescott@LNIC1.HPRC.UH.EDU (Andrew M. Wescott) (04/26/89)
I'll bet DDE is the Distributed Debugging Environment. Do a man on dde and get "Domain Distributed Debugging Environment Reference" (011024-A00). Andrew Wescott University of Houston Department of Chemical Engineering
r_miller@apollo.COM (Roger Miller) (04/27/89)
In reply to Dave Hayes (and of interest to other SR9.7 users): > > Now you're well on your way to debugging the problem. > Am I? What is DDE? What does all that garbage mean? Sorry, I should have pointed out that this was all based on SR10. DDE (Distributed Debugging Environment) is the SR10 debugger; it replaces debug. The "-full" option to tb is also new at SR10. Under SR9.7 the tools aren't as sharp. You can run the program under debug to catch the original fault before the stack unwinder is invoked. But if the stack has been corrupted the debugger won't be able to figure out where it is either. It will probably say something like (process_5) access violation (OS/fault handler) *** Error: Stack pointers invalid; cannot establish valid environment. Your best bet is probably to use db to find the address where the fault occurs: $ db -g test.bin ... startup stuff deleted ... access violation (OS/fault handler) F 120011 8076: 4E75 <---- This tells you that the access violation occurred at address 8076. (The contents of this address, 4E75, is an RTS instruction. A routine probably overwrote its return address on the stack, then faulted trying to return.) To locate address 8076 go back to debug. With the VA command and some trial and error you should be able to find the routine containing it. Then "BREAK -VA 16#8076" and look in the source display for the line marked with a breakpoint symbol. > But how does this explain that I've gotten this error from shell commands > like ARGS and WHILE? This is just a guess, but since programs run in the same process as the shell at SR9.7 it is possible for a stray write to clobber some of the shell's part of the stack. The offending program may be long gone by the time the problem shows up. (SR10 adopts the Unix model of invoking each program in a separate process, so this is no longer a problem.)
danny@idacom.UUCP (Danny Wilson) (04/27/89)
In article <15546@elroy.Jpl.Nasa.Gov>, dave@jplopto.uucp (Dave Hayes) writes: > > That would explain the errors that my users keep getting, tho. We are > using Mentor Graphics tools, and their EXPAND tool is the only one that they > see this error in. I guess I should take that up with them. A stack frame error *may* happen if the sheets that are being compiled into the design file have buggy BLM's (Behavior language models). That is, if your people are writing their own BLM code and it is not quite up to snuff. Another possibility is if you use third-party models (like Logic Automation or Quadtree) and there are bugs in them. I have *never* seen a stack frame error during Expand in over 5 years of using expand... I would expect it more during a simulation run with QuickSim. > > But how does this explain that I've gotten this error from shell commands > like ARGS and WHILE? Sometimes some of the commands in /com give me this > error. Aren't they debugged sufficiently to not overrun their stacks? > This sounds like something is seriously flaky in your system - not the application programs you are using. I'd give the hotline a call. > Am I? What is DDE? What does all that garbage mean? DDE is a debugger that is very useful for writing programs. Of course, the gentleman was assuming you were developing software rather than hardware. (if you really get into custom BLM development the debugger is *quite* useful [Mentor Graphics sites before version 7.0 must use the Apollo SR9.7 - 'debug' program rather than DDE]) -- Danny Wilson IDACOM Electronics danny@idacom.uucp Edmonton, Alberta alberta!idacom!danny C A N A D A
rtw@lzfmd.att.com (R. T. Wurth) (04/27/89)
In article <633@idacom.UUCP>, danny@idacom.UUCP (Danny Wilson) writes: > In article <15546@elroy.Jpl.Nasa.Gov>, dave@jplopto.uucp (Dave Hayes) writes: > > > > That would explain the errors that my users keep getting, tho. We are > > using Mentor Graphics tools, and their EXPAND tool is the only one that they > > see this error in. I guess I should take that up with them. > > A stack frame error *may* happen if the sheets that are being compiled > into the design file have buggy BLM's (Behavior language models). > That is, if your people are writing their own BLM code and it is > not quite up to snuff. Another possibility is if you use third-party > models (like Logic Automation or Quadtree) and there are bugs in them. > > I have *never* seen a stack frame error during Expand in over > 5 years of using expand... I would expect it more during a simulation > run with QuickSim. > Danny Wilson ... > IDACOM Electronics danny@idacom.uucp > Edmonton, Alberta alberta!idacom!danny > C A N A D A One particularly bad problem that burns us every time is that the supplied models for programmable logic (PALs, (PAL is a trademark of MMI/AMD) in particular) blow up when confronted with "skinny" JEDEC formatted fuse information. "Fat JEDEC" explicitly specifies every fuse. "Skinny JEDEC" uses a special operator at the start of the JEDEC file to specify a default state and then includes only those lines of a "fat JEDEC" file containing at least one fuse not in the default state. The versions of PALASM (PALASM is a trademark of MMI/AMD) that we use (on a PC and on our VAX host running the UNIX\*(Rg System V operating system (UNIX is a trademark of AT&T)) all produce skinny JEDEC. Someone hacked up the PALASM source code from a very old PALASM release to produce "fat" JEDEC on our workstations, but it is a very old version that doesn't support all devices. Our current workaround is to download the "skinny" JEDEC to our programmer, and then upload it from the programmer, since the programmer (DATA I/O) alwasy puts out "fat" JEDEC. I don't recall which tool blows up, but when it does, it produce no error message, it just faults back to the shell. Rich Wurth / lzfmd!rtw OR rtw@lzfmd.ATT.COM AT&T-Bell Labs / LZ 1H-303 / 201 576 6332 307 Middletown-Lincroft Rd. / Lincroft, NJ 07738
krowitz@RICHTER.MIT.EDU (David Krowitz) (04/27/89)
Ok, guys, here's the scoop in full public view ... If your program trashes its stack, or if another program running in the same process (ala SR9 or SR10 with the inprocess switch set) trashes the stack, /com/tb will fail. There is no way around it. The information that tb prints out is stored on the stack, and once you've trashed the stack there ain't noth'un to print. Debugging programs which have messed up the stack is intrinsically difficult because of the nature of the error -- it erases the debugging info. The common causes of a bad stack are: 1) referencing a variable via a pointer which has not been set up correctly. Both reading *and* writing to variables via a bad pointer can result in trashing the stack (because not only can a bad pointer result in your program writing over existing info on the stack, but your program will read values that re not legitimate, and will use those values (pointers, loop indices, etc) in further computation which can go out of control). 2) mismatched argument lists in calls to external subroutines and functions. If you pass an 16-bit integer to a subroutine which thinks it is a 32-bit integer and the subroutine stores a value into that variable, then you will have trashed your stack. Subroutine and function arguments are passed on the stack. In C, they are passed by value (ie. the actual value of the variable is on the stack), so if you write a 32-bit value into a 16-bit space, you wipe out whatever followed it on the stack. Since the return address for the subroutine is also stored on the stack, you can destroy it and your subroutine will return to never-never land rather than to the calling program. With Pascal and Fortran, subroutine arguments are passed by reference (ie. the address of the variable is on the stack), so if you write a 32-bit value to a 16-bit space, you don't trash the stack immediately. BUT! Subroutines tore their local variables on the stack in addition to the arguments that were passed into them. If a subroutine calls a second subroutine passing some local variables to that 2nd subroutine, and if that second subroutine then writes a 32-bit value to one of those local variables, and if that local variable was actually a 16-bit value, THEN the 2nd subroutine will trash the stack of the FIRST subroutine! (and the first subroutine will return to never-never land as a result of an action by the second subroutine ... niffty error, eh?) 3) of course, there's always the good old standby of allocating an array which is smaller that what you actually use (or of messing up the calculation of an loop index and simply writing outside of the allocated array). If the array is a local variable in a subroutine, then it was allocated on the stack, and writing outside of the bounds of the array will overwrite the stack (and of course reading outside of the bounds will give you garbage values which can cause something else to go haywire). If the array is statically allocated (a global variable, a Fortran COMMON block, etc), you can still trash the stack if you go far enough outside the bounds of the array. How do you debug one of these monsters? First, you recognize the symptoms (can't unwind stack ...). Then you start putting print statements (or breakpoints with the debugger) scattered about your program so you can see how far the program got before it blew up. Use the -dba or -dbs switches along with the -subchk switch to look for array overflows. Note that turning debugging on/off can change the error because the addition of the debugging info to the executable program will change where the various variables are located in memory. You can cause the error to go away altogether by turning on the debugging info (because the program winds up trashing some debugging info rather than one of its own variables or the stack). Once you have the problem narrowed down to a subroutine, check everything listed above. PRINT OUT the values of ALL of your critical variables (pointer vales, loop indices, etc). Check them just prior to their use. They may be ok when you enter the routine, but get clobbered by some other code before they are used. Adding print statments to your code to track its progress and to check you variables and pointers is crude, but effective. Some of the nastier error sequences will even confuse the debugger into thinking that the error is elsewhere. -- David Krowitz krowitz@richter.mit.edu (18.83.0.109) krowitz%richter@eddie.mit.edu krowitz%richter@athena.mit.edu krowitz%richter.mit.edu@mitvma.bitnet (in order of decreasing preference) Maybe this should be a talk at the next ADUS conference?
markl%neptune.AMD.COM%amdcad%ames@mailrus.UUCP (Mark Luedtke) (04/27/89)
Thanks for the clear explanation. The thing that makes this a 'mystery error' is that I have done these things before without that problem. It only occurs when the specific operation runs into the stack, which is not common, but when it occurs, everything kills the stack (the old theory of localization). My program bug, found just as you mentioned, didn't take long to find, it was just ugly, and repetative. I'm sure there is some way Apollo could handle this more elegantly. Mark Luedtke markl@neptune.amd.com (512)462-5278
dmuntz@caen.engin.umich.edu (Dan Muntz) (04/29/89)
To generate your own 'Mystery Error' compile and run this program:
main ()
{
main ();
}
The resulting error: "unable to unwind..."
is a bit cryptic and difficult to track down due to reasons
mentioned by others.
-Dan M.
dmuntz@caen.engin.umich.edu
krowitz@RICHTER.MIT.EDU (David Krowitz) (05/03/89)
If I understand your C code (and I am not much of a C programmer), you are calling the main program recursively with no ending condition. Each recursive call to the program will "push" the return address of the call onto the stack. Since there is no ending condition to the recursive calls (ie. the subroutine calls never return), the stack eventually gets completely filled with these return addresses, and the stack overflows and trashes something else (like maybe the program itself) for a change. It's the same basic problem ... a program which writes over itself destroys the debugging info along with everything else. -- David Krowitz krowitz@richter.mit.edu (18.83.0.109) krowitz%richter@eddie.mit.edu krowitz%richter@athena.mit.edu krowitz%richter.mit.edu@mitvma.bitnet (in order of decreasing preference)
squires@hpcvlx.HP.COM (Matt Squires) (05/04/89)
/ comp.sys.apollo / krowitz@RICHTER.MIT.EDU (David Krowitz) / > If I understand your C code (and I am not much of a C > programmer) you are calling the main program recursively > with no ending condition. Each recursive call to the > program will "push" the return address of the call onto > the stack. Since there is no ending condition to the > recursive calls (ie. the subroutine calls never return), > the stack eventually gets completely filled with these > return addresses, and the stack overflows and trashes > something else (like maybe the program itself) for a > change. If I understand machine acrhitectures (and I am not much of a machine archeologist) then ``sufficiently powerful (1)'' processors cause a hardware trap on stack overflow (2). One should hope that processes would not be able overflow their stack and then scribble on data, bss, or text segments. mcs (1) The Intel 8088 is not sufficiently powerful, while I assume Hpollo workstation processors are. (2) Tanenbaum, Andrew S., _Operating Systems_ Copyright 1987 Prentice-Hall, Inc. pp. 229
GBOPOLY1@NUSVM.BITNET (fclim) (05/09/89)
Hi, Has anybody successfully port GCC to Apollo boxes? If so, what do you get (how corny) if you compile and run the code main() { main(); } This piece of code is tail-recursive and from what I heard, GCC is a tail-recursive compiler. Will the stack still overflow? fclim --- gbopoly1 % nusvm.bitnet @ cunyvm.cuny.edu computer centre singapore polytechnic dover road singapore 0513.