[comp.lang.c] Longjmping back, and back again; Coroutines in C

jerker@enea.se (Jerker W}gberg) (11/18/89)

I am trying to implement coroutines using plain C. My
application is not time critical, so there is no need for speed,
being portable is far more important. It would be great if there
was a way to switch stacks of the processes using just C.

I have figured out a way to implement this that works fine on a PC
with MSC 5.1, but wreaks havoc when run on a SUN4. The idea is that
instead of actually switching stacks, I use the "real" stack but
swap it in and out of malloced memory.

1. Can somebody explain why this fails on a SUN4. This is the first
   program that I ever have tried on a SUN4 so I don't have any
   idea what happens inside the CPU. Could it be that the SUN4 have
   some peculiar registers that must be restored but isn't below?

2. Are there more machines out there that probably will throw up
   when executing this code, apart from CPU's where the stack is
   growing upward instead of downward. This can be taken care of,
   but I did not want to clutter the example below.

3. Does anyone know of a better way to do coroutines in C ?

Like to test it yourself ? Here is a barebones version:

------------------------------ cut here -------------------------------------
/*
** Test of coroutines.
** Outputs :
**	func1
**	func2
**	func1
**	.
**	.
**      ad inf
*/

#include <stdio.h>
#include <setjmp.h>
#include <malloc.h>

char *stack_bottom;
jmp_buf main_buf;
int cc;			/* Current coroutine */

func_1()
    {
    int foo[50];   /* To get different stack depths at dispatch. If the	*/
    		   /* stack depths are equal at dispatch the program	*/
		   /* just continues with func2 on my SUN4. If stacks 	*/
		   /* are unequal, I get segmentation fault.		*/
    for (;;)
        {
	printf("func1\n");
	dispatch();
	}
    }

func_2()
    {
    for (;;)
        {
	printf("func2\n");
	dispatch();
	}
    }

struct sav		/* Control block for each coroutine		*/
    {
    char *stack;	/* Pointer to saved stack			*/
    int stack_len;	/* Lenght of stack				*/
    int (*func)();	/* Pointer to each coroutines start address	*/
    jmp_buf retbuf;	/* SP and PC of each coroutine			*/
    } sav[2] = 
    	{
    	  {NULL, 0, func_1}
	, {NULL, 0, func_2}
	};

main()
    {
    char stack_mark;

    stack_bottom = &stack_mark;

    /*
    ** Here we go. Run a coroutine until it dispatches then
    ** run the other.
    */

    for (;;cc ^= 1)
	{
	if (setjmp(main_buf) == 0)
	    if (sav[cc].stack == NULL)
		sav[cc].func();			/* First time	*/
	    else
		longjmp(sav[cc].retbuf, 1);	/* Continuation	*/
        }
    }

dispatch()
    {
    char stack_mark;
    jmp_buf disp_buf;

    if (setjmp(disp_buf) == 0)
	{

	/*
	** Save the jmp_buf so that we can return.
	** Save the stack into mallocated memory and
	** longjmp back to main
	*/

	memcpy(sav[cc].retbuf, disp_buf, sizeof(jmp_buf));
	sav[cc].stack_len = stack_bottom - &stack_mark;
	sav[cc].stack = malloc(sav[cc].stack_len);
	memcpy(sav[cc].stack, &stack_mark, sav[cc].stack_len);
	longjmp(main_buf, 1);
	}
    else
	{

	/*
	** The longjmp from main set SP ok but there is just
	** garbage from the previous coroutine on the stack. Copy
	** the stack from previous run and return to caller of
	** dispatch.
	*/

	memcpy(&stack_mark, sav[cc].stack, sav[cc].stack_len);
	free(sav[cc].stack);
	}
    }

chris@mimsy.umd.edu (Chris Torek) (11/21/89)

In article <457@enea.se> jerker@enea.se (Jerker W}gberg) writes:
>instead of actually switching stacks, I use the "real" stack but
>swap it in and out of malloced memory.
>
>1. Can somebody explain why this fails on a SUN4.

I suspect it defers writing into the stack until absolutely necessary.
You thus save the data before it exists.

Your technique will also fail on machines that have several simultaneously
active stacks (e.g., Pyramid) or no stack (probably no such machines run
C).
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

flee@shire.cs.psu.edu (Felix Lee) (11/21/89)

Jerker W}gberg <jerker@helios.se> wrote:
> The idea is that instead of actually switching stacks, I use the
> "real" stack but swap it in and out of malloced memory.
> 1. Can somebody explain why this fails on a SUN4.

The Sun-4 has a curious hack known as register windows.  Rather than
putting things on the stack, it keeps everything in registers.  Each
subroutine gets its own set of 24 registers called a register window.
For passing arguments, register windows overlap: 8 output registers in
your window become input registers when you call another subroutine.

Space is reserved on the stack for saving all the register windows,
but they aren't written out until necessary.  A Sun-4 might have, say,
half a dozen register windows, which comes to 96 registers.  Flushing
all the registers to memory is normally done by the kernel, whenever a
context switch occurs, or when you run out of register windows.

There's (undocumented?) kernel traps to save/restore your register
windows that I can't seem to find right now.
--
Felix Lee	flee@shire.cs.psu.edu	*!psuvax1!flee

tim@proton.amd.com (Tim Olson) (11/21/89)

In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes:
| 
| I am trying to implement coroutines using plain C. My
| application is not time critical, so there is no need for speed,
| being portable is far more important. It would be great if there
| was a way to switch stacks of the processes using just C.

Sorry, but there just isn't a way that is portable to all
architectures; there is almost always some assembly-language involved.

| 2. Are there more machines out there that probably will throw up
|    when executing this code, apart from CPU's where the stack is
|    growing upward instead of downward. This can be taken care of,
|    but I did not want to clutter the example below.

Here's an example of one:  The Am29000 (32-bit RISC processor) uses a
large number of registers as a "stack cache", and spills and fills
these to/from an area in memory when it overflows.  Another (separate)
memory stack holds non-scalar data, such as arrays and large
structures.  Your routine would only attempt to swap the memory stack,
leaving the register stack cache alone.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

richard@aiai.ed.ac.uk (Richard Tobin) (11/22/89)

In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes:
>I am trying to implement coroutines using plain C. 

Tricky!

>instead of actually switching stacks, I use the "real" stack but
>swap it in and out of malloced memory.

Aaargh!  What a hack!

>1. Can somebody explain why this fails on a SUN4. 

Sun 4s have "register windows" which means that the variables that
would be near the top of the stack are instead in registers.

As a quick hack with a remote chance of working, you could try burying
your stack copying code at the bottom of about 20 trivial procedures,
so that the relevant register windows will have been flushed.  This is
just a guess, and may well not work.

>3. Does anyone know of a better way to do coroutines in C ?

Sun OS 4 has a library for coroutines (which they refer to as threads).

One of my more inventive friends produced some code for coroutines
that worked under Berkeley-like Unixes; it worked by using sigstack(),
signal() and kill() to start executing on a malloc()ed stack.

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed             
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin

djones@megatest.UUCP (Dave Jones) (11/22/89)

From article <457@enea.se>, by jerker@enea.se (Jerker W}gberg):

> I am trying to implement coroutines using plain C. My
> application is not time critical, so there is no need for speed,
> being portable is far more important. It would be great if there
> was a way to switch stacks of the processes using just C.

True. But there ain't. Any way you go will require some machine-dependent
code, even if it's all written in C. (I've got an all-C coroutine
package that works on Sun3's, but I know it would fail on a Sun4.)

> I have figured out a way to implement this that works fine on a PC
> with MSC 5.1, but wreaks havoc when run on a SUN4. The idea is that
> instead of actually switching stacks, I use the "real" stack but
> swap it in and out of malloced memory.

(That's spelled "mallocked".) Anyway, on a SUN4, not all of the stack
is located in contiguous main memory.  The stack is implemented partially
in "register-windows". Read the SPARC Architecture Manual if that
sounds interesting.

It's probably not trivial to implement coroutines on a SPARC. Not
having tried it, I don't know. (There is a move afoot to put one
on my desk, but I have to clean the desk first, so that's a big
hurdle.)  I think there is some kind of coroutine package that comes
with the machine. Check it out.

desnoyer@apple.com (Peter Desnoyers) (11/23/89)

In article <28112@amdcad.AMD.COM> tim@proton.amd.com (Tim Olson) writes:
> In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes:
> | 
> | I am trying to implement coroutines using plain C. My
> | application is not time critical, so there is no need for speed,
> | being portable is far more important. It would be great if there
> | was a way to switch stacks of the processes using just C.
> 
> Sorry, but there just isn't a way that is portable to all
> architectures; there is almost always some assembly-language involved.

You can swap the stack without using assembler by either:

  + using pseudo-variables for registers (I've only seen this feature
    in Turbo C for the peecee)

  + /* KLUDGE ALERT */ swapping the stack pointer (and usually base
    ptr as well) field in a jmpbuf. E.g.:

      if (setjmp( buf) != 0)
        task_to_spawn()
      buf[x] = malloc(stacksize);
      buf[y] = buf[x] {+-} zzz;
      longjmp(buf);

This is even more non-portable than assembler, as it is almost guaranteed 
not to be portable between compatible compilers on the same machine, or 
even between upgrades of the same compiler. If you don't have an 
assembler, however, it may be the only way. 

>>[CPUs that will give problems with this?]
> Here's an example of one:  The Am29000 (32-bit RISC processor) uses a
> large number of registers as a "stack cache", and spills and fills
> these to/from an area in memory when it overflows.  

Problems like this (and register arguments) can (usually) be handled in a 
non-portable way with dummy arguments and dummy local variables, forcing 
the values you care about to be spilled to memory.

                                      Peter Desnoyers
                                      Apple ATG
                                      (408) 974-4469

karl@haddock.ima.isc.com (Karl Heuer) (11/24/89)

In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes:
>I am trying to implement coroutines using plain C.  My application is not
>time critical, so there is no need for speed, being portable is far more
>important.

The portable way to implement coroutines in C is analogous to the portable way
to implement while-loops in Fortran.  You emulate them with goto statements.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

kenmoore@unix.cis.pitt.edu (Kenneth L Moore) (11/28/89)

In article <1989Nov21.120938.9200@psuvax1.cs.psu.edu> flee@shire.cs.psu.edu (Felix Lee) writes:
==>Jerker W}gberg <jerker@helios.se==> wrote:
==>==> The idea is that instead of actually switching stacks, I use the
==>==> "real" stack but swap it in and out of malloced memory.
==>==> 1. Can somebody explain why this fails on a SUN4.
==>
==>The Sun-4 has a curious hack known as register windows.  Rather than
==>putting things on the stack, it keeps everything in registers.  Each
==>subroutine gets its own set of 24 registers called a register window.
==>For passing arguments, register windows overlap: 8 output registers in
==>your window become input registers when you call another subroutine.
==>
==>--
==>Felix Lee	flee@shire.cs.psu.edu	*!psuvax1!flee

Yup. This is state of the art computater (sic) architecture. This idea
arose along with RISC (Reduced Instruction Set Computer) but can be used
on RISC or CISC (Complex Instruction Set Computer) machines.

Remember that RISC is the result of a statistical study that showed that
99% of the instructions used on a CISC machine were a sub-set of some 30
(out of maybe 256? on an IBM 360) commands. Also, 10 instructions
accounted for 80% and 21 accounted for 95%.

Another aspect that became apparent during these studies was that much
of the overhead in a processor was consumed in keeping track of
subroutine calls and returns.

The original RISC-I guys had a lot of left over chip area and decided
that the thing to do with the extra area was to make extra registers.
And they shrewdly decided to use the registers to facilitate subroutine
handling.

C programmers, keep these facts in mind as they will dominate computer
architecture in the near term.

Sign me:

KK    KK  EEEEEEEE  NN     NN  MM       MM   OOOOO    OOOOO   RRRRRR   EEEEEEEE
KK   KK   EEEEEEEE  NNN    NN  MMM     MMM  OOOOOOO  OOOOOOO  RR   RR  EEEEEEEE
KK  KK    EE        NNNN   NN  MMMM   MMMM  OO   OO  OO   OO  RR  RR   EE
KKKKK     EEEEE     NN NN  NN  MM MM MM MM  OO   OO  OO   OO  RRRRR    EEEEE
KKKKK     EEEEE     NN  NN NN  MM  MMM  MM  OO   OO  OO   OO  RR RR    EEEEE
KK  KK    EE        NN   NNNN  MM   M   MM  OO   OO  OO   OO  RR  RR   EE
KK   KK   EEEEEEEE  NN    NNN  MM       MM  OOOOOOO  OOOOOOO  RR   RR  EEEEEEEE
KK    KK  EEEEEEEE  NN     NN  MM       MM   OOOOO    OOOOO   RR    RR EEEEEEEE

kenmoore@unix.cis.pitt.edu (Kenneth L. Moore)

(Sorry about the long signature but inews won't recognise my .signature...
even after "chmod 777 .signature". Tsk. Tsk.)

meissner@dg-rtp.dg.com (Michael Meissner) (11/29/89)

In article <28112@amdcad.AMD.COM> tim@proton.amd.com (Tim Olson) writes:

|  In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes:
|  | 
|  | I am trying to implement coroutines using plain C. My
|  | application is not time critical, so there is no need for speed,
|  | being portable is far more important. It would be great if there
|  | was a way to switch stacks of the processes using just C.
|  
|  Sorry, but there just isn't a way that is portable to all
|  architectures; there is almost always some assembly-language involved.
|  
|  | 2. Are there more machines out there that probably will throw up
|  |    when executing this code, apart from CPU's where the stack is
|  |    growing upward instead of downward. This can be taken care of,
|  |    but I did not want to clutter the example below.
|  
|  Here's an example of one:  The Am29000 (32-bit RISC processor) uses a
|  large number of registers as a "stack cache", and spills and fills
|  these to/from an area in memory when it overflows.  Another (separate)
|  memory stack holds non-scalar data, such as arrays and large
|  structures.  Your routine would only attempt to swap the memory stack,
|  leaving the register stack cache alone.

Another machine is the Data General MV running either the propriatary
operating systems (AOS/VS and AOS/VS-II) or the native UNIX (DG/UX).
The machine has a hardware stack which grows upwards instead of
downwards.  In addition to a hardware stack pointer and frame pointer,
it has a stack base and stack limit pointer.  The high level languages
all use the stack base pointer as an index into a per-task storage
area.  The size of per-task storage area varies depending on how many
different languages are used, and what revisions of the libraries are
used.  If the program only has one main task (or thread in
Mach-speak), and doesn't call sbrk directly, malloc'ed memory is
retreived by decrementing the stack limit pointer.
--

Michael Meissner, Data General.
Until 12/15:	meissner@dg-rtp.DG.COM
After 12/15:	meissner@osf.org

ckd@bu-pub.bu.edu (Christopher Davis) (12/03/89)

Kenneth> (Sorry about the long signature but inews won't recognise my
Kenneth> .signature...  even after "chmod 777 .signature". Tsk. Tsk.)

Awww... po' baby... couldn't get his big penis-substitute .sig added in for
him automatically...
-- 
 Christopher Davis, BU SMG '90  <ckd@bu-pub.bu.edu> <smghy6c@buacca.bitnet>
"Many verbal attacks are part of someone's aim to establish their rank in a
 dominance hierarchy, the same sort of behavior common among nesting fowl."
                                     --Daniel Mocsny <dmocsny@uceng.UC.EDU>

ge@kunivv1.sci.kun.nl (Ge' Weijers) (12/08/89)

jerker@enea.se (Jerker W}gberg) writes:

>I am trying to implement coroutines using plain C. My
>application is not time critical, so there is no need for speed,
>being portable is far more important. It would be great if there
>was a way to switch stacks of the processes using just C.

>I have figured out a way to implement this that works fine on a PC
>with MSC 5.1, but wreaks havoc when run on a SUN4. The idea is that
>instead of actually switching stacks, I use the "real" stack but
>swap it in and out of malloced memory.

>1. Can somebody explain why this fails on a SUN4. This is the first
>   program that I ever have tried on a SUN4 so I don't have any
>   idea what happens inside the CPU. Could it be that the SUN4 have
>   some peculiar registers that must be restored but isn't below?

The right question is: can you explain why you expect this to work
on any machine?
This will fail on the Sun4 for more than one reason. It can't be made to
work because of a feature called register windowing. The processor takes
care of saving and restoring registers when a procedure is called. It
uses internal memory for this. Only when it runs out of internal memory
the registers are actually saved on the stack (you have to reserve space
for it.)

>2. Are there more machines out there that probably will throw up
>   when executing this code, apart from CPU's where the stack is
>   growing upward instead of downward. This can be taken care of,
>   but I did not want to clutter the example below.

There is no way I know of to implement coroutines in a portable way.
The peculiarities of each machine force you to change a small part
of your program. For instance: stacks don't always grow from high
to low addresses. This is not the way to do it.

>3. Does anyone know of a better way to do coroutines in C ?

Carefully isolate the machine dependencies, and write a small part in
assembler. Useful knowledge for the Sun4:

	trap #3

flushes the register window. Don't expect a very efficient implementation
of context switching on a SPARC though. There is a lot of context on these
machines, 8 * 16 registers in the worst case. I'm working on this one
(coroutines for the SUN4) but I've not completely figured out how everything
is done. (I'm porting some stuff from a Sun3)

Ge' Weijers

Ge' Weijers                                    Internet/UUCP: ge@cs.kun.nl
Faculty of Mathematics and Computer Science,   (uunet.uu.net!cs.kun.nl!ge)
University of Nijmegen, Toernooiveld 1         
6525 ED Nijmegen, the Netherlands              tel. +3180612483 (UTC-2)

ge@kunivv1.sci.kun.nl (Ge' Weijers) (12/08/89)

kenmoore@unix.cis.pitt.edu (Kenneth L Moore) writes:

>Yup. This is state of the art computater (sic) architecture. This idea
>arose along with RISC (Reduced Instruction Set Computer) but can be used
>on RISC or CISC (Complex Instruction Set Computer) machines.

And a sorry state it is. I'd rather have twice the number of registers,
a store-multiple-registers instruction, and NO register windows.
It's just as fast, because you mostly save registers that contain
garbage. The SPARC was designed with the assumption in mind that nobody
uses recursion anyway. This might be true for C programmers.

>Remember that RISC is the result of a statistical study that showed that
>99% of the instructions used on a CISC machine were a sub-set of some 30
>(out of maybe 256? on an IBM 360) commands. Also, 10 instructions
>accounted for 80% and 21 accounted for 95%.

This has nothing to to with register windows.

>Another aspect that became apparent during these studies was that much
>of the overhead in a processor was consumed in keeping track of
>subroutine calls and returns.

Recent studies have also shown that you can do the same in software,
using a simple analysis.

>The original RISC-I guys had a lot of left over chip area and decided
>that the thing to do with the extra area was to make extra registers.
>And they shrewdly decided to use the registers to facilitate subroutine
>handling.

It was a good idea as an experiment. It's a pain if your recursion goes
500 deep. But because the SPARC has no other way to save a lot of
registers in reasonable time, you're stuck with it.
It also introduces quite a bit of overhead when the OS has to switch
contexts.

>C programmers, keep these facts in mind as they will dominate computer
>architecture in the near term.

Prolog/ML/other implementors keep these facts in mind, as they will 
limit the performance of your programs.
I don't think the SPARC is bad, it's just not as good as other processors
for some uses.

Ge' Weijers

Ge' Weijers                                    Internet/UUCP: ge@cs.kun.nl
Faculty of Mathematics and Computer Science,   (uunet.uu.net!cs.kun.nl!ge)
University of Nijmegen, Toernooiveld 1         
6525 ED Nijmegen, the Netherlands              tel. +3180612483 (UTC-2)

tim@nucleus.amd.com (Tim Olson) (12/09/89)

In article <576@kunivv1.sci.kun.nl> ge@kunivv1.sci.kun.nl (Ge' Weijers) writes:
| kenmoore@unix.cis.pitt.edu (Kenneth L Moore) writes:
| 
| >Yup. This is state of the art computater (sic) architecture. This idea
| >arose along with RISC (Reduced Instruction Set Computer) but can be used
| >on RISC or CISC (Complex Instruction Set Computer) machines.
| 
| And a sorry state it is. I'd rather have twice the number of registers,
| a store-multiple-registers instruction, and NO register windows.
| It's just as fast, because you mostly save registers that contain
| garbage. The SPARC was designed with the assumption in mind that nobody
| uses recursion anyway. This might be true for C programmers.

You should check out the Am29000.  It has 192 visible registers,
load-multiple and store-multiple instructions, and the register
windowing scheme is implemented in software, so that the allocated
register window size is exactly tailored to the frame size required by
the function.

| >Remember that RISC is the result of a statistical study that showed that
| >99% of the instructions used on a CISC machine were a sub-set of some 30
| >(out of maybe 256? on an IBM 360) commands. Also, 10 instructions
| >accounted for 80% and 21 accounted for 95%.
| 
| This has nothing to to with register windows.

True.

| >Another aspect that became apparent during these studies was that much
| >of the overhead in a processor was consumed in keeping track of
| >subroutine calls and returns.
| 
| Recent studies have also shown that you can do the same in software,
| using a simple analysis.

It isn't simple.  You need inter-procedural register allocation
("universal"), and even that is a static analysis -- it doesn't take
into account the dynamic nature of the call tree at runtime, like
hardware register windows do.

| >The original RISC-I guys had a lot of left over chip area and decided
| >that the thing to do with the extra area was to make extra registers.
| >And they shrewdly decided to use the registers to facilitate subroutine
| >handling.
| 
| It was a good idea as an experiment. It's a pain if your recursion goes
| 500 deep. But because the SPARC has no other way to save a lot of
| registers in reasonable time, you're stuck with it.
| It also introduces quite a bit of overhead when the OS has to switch
| contexts.

Actually, recursion isn't the problem -- it's large, quick changes in
the stack depth.  In fact, many times recursive routines are better on
register-windowed machines, because they only have to save some cached
portion of the stack if they overflow the cache.  Consider a dynamic call
chain (of a recursive routine) which looks like:

      1
      	2			2
	  3   3   3   3   3   3   3
	    4   4   4   4   4       4

a non-register-windowed machine must save and restore registers on
each function call and return, while the register-windowed machine
could run without any register state saved or restored to/from memory.



	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

ge@kunivv1.sci.kun.nl (Ge' Weijers) (12/11/89)

tim@nucleus.amd.com (Tim Olson) writes:
>You should check out the Am29000.  It has 192 visible registers,
>load-multiple and store-multiple instructions, and the register
>windowing scheme is implemented in software, so that the allocated
>register window size is exactly tailored to the frame size required by
>the function.

This sounds a lot better. One of the problems I have with the SPARC is that
it saves 16 registers, even in the simplest routines. This gives 
pedestrian performance compared to routines that don't overflow the register
window set. Especially on useless routines calculating Fibonacci numbers,
or (worse) functional programs doing spine evaluation on a list.
P.S. how long does context switching take, with a full window?

>It isn't simple.  You need inter-procedural register allocation
>("universal"), and even that is a static analysis -- it doesn't take
>into account the dynamic nature of the call tree at runtime, like
>hardware register windows do.

If you have register windows that only save the minimum amount of registers
necessary then in the worst case you do the same amount of work as
if you use a store-multiple instruction to save your registers.
I think my opinion was based on the SPARC only, not a bad design, but
suboptimal for certain uses. A shame you don't see many 29000 systems around,
except in the controller market. I recently saw a FDDI controller board for
the PC, featuring a 29000 and quite a bit of memory. Looks like the
equivalent of a rubber dingy with a Rolls Royce Merlin engine :-)
(the PC being the dingy, of course)
Interprocedural analysis is not too hard for the kind on (non-typical)
programming languages I have to deal with. It's positively easy, when compared
to C. You don't have to invalidate any register contents after a procedure
call, as there are no global variables.

>Actually, recursion isn't the problem -- it's large, quick changes in
>the stack depth.  In fact, many times recursive routines are better on
>register-windowed machines, because they only have to save some cached
>portion of the stack if they overflow the cache.  Consider a dynamic call
>chain (of a recursive routine) which looks like:

>      1
>      	2			2
>	  3   3   3   3   3   3   3
>	    4   4   4   4   4       4

>a non-register-windowed machine must save and restore registers on
>each function call and return, while the register-windowed machine
>could run without any register state saved or restored to/from memory.

4-deep recursion is hardly any recursion. I'm talking about 30-300,000
deep. If you store 6 unnecessary longwords/call, you use to much stack,
and too many cycles. The performance of the SPARC drops a factor of 2 if
you have a lot of recursion.

I think we agree.

>	-- Tim Olson
>	Advanced Micro Devices
>	(tim@amd.com)

Ge' Weijers

Ge' Weijers                                    Internet/UUCP: ge@cs.kun.nl
Faculty of Mathematics and Computer Science,   (uunet.uu.net!cs.kun.nl!ge)
University of Nijmegen, Toernooiveld 1         
6525 ED Nijmegen, the Netherlands              tel. +3180612483 (UTC-2)