jerker@enea.se (Jerker W}gberg) (11/18/89)
I am trying to implement coroutines using plain C. My application is not time critical, so there is no need for speed, being portable is far more important. It would be great if there was a way to switch stacks of the processes using just C. I have figured out a way to implement this that works fine on a PC with MSC 5.1, but wreaks havoc when run on a SUN4. The idea is that instead of actually switching stacks, I use the "real" stack but swap it in and out of malloced memory. 1. Can somebody explain why this fails on a SUN4. This is the first program that I ever have tried on a SUN4 so I don't have any idea what happens inside the CPU. Could it be that the SUN4 have some peculiar registers that must be restored but isn't below? 2. Are there more machines out there that probably will throw up when executing this code, apart from CPU's where the stack is growing upward instead of downward. This can be taken care of, but I did not want to clutter the example below. 3. Does anyone know of a better way to do coroutines in C ? Like to test it yourself ? Here is a barebones version: ------------------------------ cut here ------------------------------------- /* ** Test of coroutines. ** Outputs : ** func1 ** func2 ** func1 ** . ** . ** ad inf */ #include <stdio.h> #include <setjmp.h> #include <malloc.h> char *stack_bottom; jmp_buf main_buf; int cc; /* Current coroutine */ func_1() { int foo[50]; /* To get different stack depths at dispatch. If the */ /* stack depths are equal at dispatch the program */ /* just continues with func2 on my SUN4. If stacks */ /* are unequal, I get segmentation fault. */ for (;;) { printf("func1\n"); dispatch(); } } func_2() { for (;;) { printf("func2\n"); dispatch(); } } struct sav /* Control block for each coroutine */ { char *stack; /* Pointer to saved stack */ int stack_len; /* Lenght of stack */ int (*func)(); /* Pointer to each coroutines start address */ jmp_buf retbuf; /* SP and PC of each coroutine */ } sav[2] = { {NULL, 0, func_1} , {NULL, 0, func_2} }; main() { char stack_mark; stack_bottom = &stack_mark; /* ** Here we go. Run a coroutine until it dispatches then ** run the other. */ for (;;cc ^= 1) { if (setjmp(main_buf) == 0) if (sav[cc].stack == NULL) sav[cc].func(); /* First time */ else longjmp(sav[cc].retbuf, 1); /* Continuation */ } } dispatch() { char stack_mark; jmp_buf disp_buf; if (setjmp(disp_buf) == 0) { /* ** Save the jmp_buf so that we can return. ** Save the stack into mallocated memory and ** longjmp back to main */ memcpy(sav[cc].retbuf, disp_buf, sizeof(jmp_buf)); sav[cc].stack_len = stack_bottom - &stack_mark; sav[cc].stack = malloc(sav[cc].stack_len); memcpy(sav[cc].stack, &stack_mark, sav[cc].stack_len); longjmp(main_buf, 1); } else { /* ** The longjmp from main set SP ok but there is just ** garbage from the previous coroutine on the stack. Copy ** the stack from previous run and return to caller of ** dispatch. */ memcpy(&stack_mark, sav[cc].stack, sav[cc].stack_len); free(sav[cc].stack); } }
chris@mimsy.umd.edu (Chris Torek) (11/21/89)
In article <457@enea.se> jerker@enea.se (Jerker W}gberg) writes: >instead of actually switching stacks, I use the "real" stack but >swap it in and out of malloced memory. > >1. Can somebody explain why this fails on a SUN4. I suspect it defers writing into the stack until absolutely necessary. You thus save the data before it exists. Your technique will also fail on machines that have several simultaneously active stacks (e.g., Pyramid) or no stack (probably no such machines run C). -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163) Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
flee@shire.cs.psu.edu (Felix Lee) (11/21/89)
Jerker W}gberg <jerker@helios.se> wrote: > The idea is that instead of actually switching stacks, I use the > "real" stack but swap it in and out of malloced memory. > 1. Can somebody explain why this fails on a SUN4. The Sun-4 has a curious hack known as register windows. Rather than putting things on the stack, it keeps everything in registers. Each subroutine gets its own set of 24 registers called a register window. For passing arguments, register windows overlap: 8 output registers in your window become input registers when you call another subroutine. Space is reserved on the stack for saving all the register windows, but they aren't written out until necessary. A Sun-4 might have, say, half a dozen register windows, which comes to 96 registers. Flushing all the registers to memory is normally done by the kernel, whenever a context switch occurs, or when you run out of register windows. There's (undocumented?) kernel traps to save/restore your register windows that I can't seem to find right now. -- Felix Lee flee@shire.cs.psu.edu *!psuvax1!flee
tim@proton.amd.com (Tim Olson) (11/21/89)
In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes: | | I am trying to implement coroutines using plain C. My | application is not time critical, so there is no need for speed, | being portable is far more important. It would be great if there | was a way to switch stacks of the processes using just C. Sorry, but there just isn't a way that is portable to all architectures; there is almost always some assembly-language involved. | 2. Are there more machines out there that probably will throw up | when executing this code, apart from CPU's where the stack is | growing upward instead of downward. This can be taken care of, | but I did not want to clutter the example below. Here's an example of one: The Am29000 (32-bit RISC processor) uses a large number of registers as a "stack cache", and spills and fills these to/from an area in memory when it overflows. Another (separate) memory stack holds non-scalar data, such as arrays and large structures. Your routine would only attempt to swap the memory stack, leaving the register stack cache alone. -- Tim Olson Advanced Micro Devices (tim@amd.com)
richard@aiai.ed.ac.uk (Richard Tobin) (11/22/89)
In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes: >I am trying to implement coroutines using plain C. Tricky! >instead of actually switching stacks, I use the "real" stack but >swap it in and out of malloced memory. Aaargh! What a hack! >1. Can somebody explain why this fails on a SUN4. Sun 4s have "register windows" which means that the variables that would be near the top of the stack are instead in registers. As a quick hack with a remote chance of working, you could try burying your stack copying code at the bottom of about 20 trivial procedures, so that the relevant register windows will have been flushed. This is just a guess, and may well not work. >3. Does anyone know of a better way to do coroutines in C ? Sun OS 4 has a library for coroutines (which they refer to as threads). One of my more inventive friends produced some code for coroutines that worked under Berkeley-like Unixes; it worked by using sigstack(), signal() and kill() to start executing on a malloc()ed stack. -- Richard -- Richard Tobin, JANET: R.Tobin@uk.ac.ed AI Applications Institute, ARPA: R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk Edinburgh University. UUCP: ...!ukc!ed.ac.uk!R.Tobin
djones@megatest.UUCP (Dave Jones) (11/22/89)
From article <457@enea.se>, by jerker@enea.se (Jerker W}gberg): > I am trying to implement coroutines using plain C. My > application is not time critical, so there is no need for speed, > being portable is far more important. It would be great if there > was a way to switch stacks of the processes using just C. True. But there ain't. Any way you go will require some machine-dependent code, even if it's all written in C. (I've got an all-C coroutine package that works on Sun3's, but I know it would fail on a Sun4.) > I have figured out a way to implement this that works fine on a PC > with MSC 5.1, but wreaks havoc when run on a SUN4. The idea is that > instead of actually switching stacks, I use the "real" stack but > swap it in and out of malloced memory. (That's spelled "mallocked".) Anyway, on a SUN4, not all of the stack is located in contiguous main memory. The stack is implemented partially in "register-windows". Read the SPARC Architecture Manual if that sounds interesting. It's probably not trivial to implement coroutines on a SPARC. Not having tried it, I don't know. (There is a move afoot to put one on my desk, but I have to clean the desk first, so that's a big hurdle.) I think there is some kind of coroutine package that comes with the machine. Check it out.
desnoyer@apple.com (Peter Desnoyers) (11/23/89)
In article <28112@amdcad.AMD.COM> tim@proton.amd.com (Tim Olson) writes: > In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes: > | > | I am trying to implement coroutines using plain C. My > | application is not time critical, so there is no need for speed, > | being portable is far more important. It would be great if there > | was a way to switch stacks of the processes using just C. > > Sorry, but there just isn't a way that is portable to all > architectures; there is almost always some assembly-language involved. You can swap the stack without using assembler by either: + using pseudo-variables for registers (I've only seen this feature in Turbo C for the peecee) + /* KLUDGE ALERT */ swapping the stack pointer (and usually base ptr as well) field in a jmpbuf. E.g.: if (setjmp( buf) != 0) task_to_spawn() buf[x] = malloc(stacksize); buf[y] = buf[x] {+-} zzz; longjmp(buf); This is even more non-portable than assembler, as it is almost guaranteed not to be portable between compatible compilers on the same machine, or even between upgrades of the same compiler. If you don't have an assembler, however, it may be the only way. >>[CPUs that will give problems with this?] > Here's an example of one: The Am29000 (32-bit RISC processor) uses a > large number of registers as a "stack cache", and spills and fills > these to/from an area in memory when it overflows. Problems like this (and register arguments) can (usually) be handled in a non-portable way with dummy arguments and dummy local variables, forcing the values you care about to be spilled to memory. Peter Desnoyers Apple ATG (408) 974-4469
karl@haddock.ima.isc.com (Karl Heuer) (11/24/89)
In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes: >I am trying to implement coroutines using plain C. My application is not >time critical, so there is no need for speed, being portable is far more >important. The portable way to implement coroutines in C is analogous to the portable way to implement while-loops in Fortran. You emulate them with goto statements. Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
kenmoore@unix.cis.pitt.edu (Kenneth L Moore) (11/28/89)
In article <1989Nov21.120938.9200@psuvax1.cs.psu.edu> flee@shire.cs.psu.edu (Felix Lee) writes:
==>Jerker W}gberg <jerker@helios.se==> wrote:
==>==> The idea is that instead of actually switching stacks, I use the
==>==> "real" stack but swap it in and out of malloced memory.
==>==> 1. Can somebody explain why this fails on a SUN4.
==>
==>The Sun-4 has a curious hack known as register windows. Rather than
==>putting things on the stack, it keeps everything in registers. Each
==>subroutine gets its own set of 24 registers called a register window.
==>For passing arguments, register windows overlap: 8 output registers in
==>your window become input registers when you call another subroutine.
==>
==>--
==>Felix Lee flee@shire.cs.psu.edu *!psuvax1!flee
Yup. This is state of the art computater (sic) architecture. This idea
arose along with RISC (Reduced Instruction Set Computer) but can be used
on RISC or CISC (Complex Instruction Set Computer) machines.
Remember that RISC is the result of a statistical study that showed that
99% of the instructions used on a CISC machine were a sub-set of some 30
(out of maybe 256? on an IBM 360) commands. Also, 10 instructions
accounted for 80% and 21 accounted for 95%.
Another aspect that became apparent during these studies was that much
of the overhead in a processor was consumed in keeping track of
subroutine calls and returns.
The original RISC-I guys had a lot of left over chip area and decided
that the thing to do with the extra area was to make extra registers.
And they shrewdly decided to use the registers to facilitate subroutine
handling.
C programmers, keep these facts in mind as they will dominate computer
architecture in the near term.
Sign me:
KK KK EEEEEEEE NN NN MM MM OOOOO OOOOO RRRRRR EEEEEEEE
KK KK EEEEEEEE NNN NN MMM MMM OOOOOOO OOOOOOO RR RR EEEEEEEE
KK KK EE NNNN NN MMMM MMMM OO OO OO OO RR RR EE
KKKKK EEEEE NN NN NN MM MM MM MM OO OO OO OO RRRRR EEEEE
KKKKK EEEEE NN NN NN MM MMM MM OO OO OO OO RR RR EEEEE
KK KK EE NN NNNN MM M MM OO OO OO OO RR RR EE
KK KK EEEEEEEE NN NNN MM MM OOOOOOO OOOOOOO RR RR EEEEEEEE
KK KK EEEEEEEE NN NN MM MM OOOOO OOOOO RR RR EEEEEEEE
kenmoore@unix.cis.pitt.edu (Kenneth L. Moore)
(Sorry about the long signature but inews won't recognise my .signature...
even after "chmod 777 .signature". Tsk. Tsk.)
meissner@dg-rtp.dg.com (Michael Meissner) (11/29/89)
In article <28112@amdcad.AMD.COM> tim@proton.amd.com (Tim Olson) writes: | In article <457@enea.se> jerker@helios.se (Jerker W}gberg) writes: | | | | I am trying to implement coroutines using plain C. My | | application is not time critical, so there is no need for speed, | | being portable is far more important. It would be great if there | | was a way to switch stacks of the processes using just C. | | Sorry, but there just isn't a way that is portable to all | architectures; there is almost always some assembly-language involved. | | | 2. Are there more machines out there that probably will throw up | | when executing this code, apart from CPU's where the stack is | | growing upward instead of downward. This can be taken care of, | | but I did not want to clutter the example below. | | Here's an example of one: The Am29000 (32-bit RISC processor) uses a | large number of registers as a "stack cache", and spills and fills | these to/from an area in memory when it overflows. Another (separate) | memory stack holds non-scalar data, such as arrays and large | structures. Your routine would only attempt to swap the memory stack, | leaving the register stack cache alone. Another machine is the Data General MV running either the propriatary operating systems (AOS/VS and AOS/VS-II) or the native UNIX (DG/UX). The machine has a hardware stack which grows upwards instead of downwards. In addition to a hardware stack pointer and frame pointer, it has a stack base and stack limit pointer. The high level languages all use the stack base pointer as an index into a per-task storage area. The size of per-task storage area varies depending on how many different languages are used, and what revisions of the libraries are used. If the program only has one main task (or thread in Mach-speak), and doesn't call sbrk directly, malloc'ed memory is retreived by decrementing the stack limit pointer. -- Michael Meissner, Data General. Until 12/15: meissner@dg-rtp.DG.COM After 12/15: meissner@osf.org
ckd@bu-pub.bu.edu (Christopher Davis) (12/03/89)
Kenneth> (Sorry about the long signature but inews won't recognise my Kenneth> .signature... even after "chmod 777 .signature". Tsk. Tsk.) Awww... po' baby... couldn't get his big penis-substitute .sig added in for him automatically... -- Christopher Davis, BU SMG '90 <ckd@bu-pub.bu.edu> <smghy6c@buacca.bitnet> "Many verbal attacks are part of someone's aim to establish their rank in a dominance hierarchy, the same sort of behavior common among nesting fowl." --Daniel Mocsny <dmocsny@uceng.UC.EDU>
ge@kunivv1.sci.kun.nl (Ge' Weijers) (12/08/89)
jerker@enea.se (Jerker W}gberg) writes: >I am trying to implement coroutines using plain C. My >application is not time critical, so there is no need for speed, >being portable is far more important. It would be great if there >was a way to switch stacks of the processes using just C. >I have figured out a way to implement this that works fine on a PC >with MSC 5.1, but wreaks havoc when run on a SUN4. The idea is that >instead of actually switching stacks, I use the "real" stack but >swap it in and out of malloced memory. >1. Can somebody explain why this fails on a SUN4. This is the first > program that I ever have tried on a SUN4 so I don't have any > idea what happens inside the CPU. Could it be that the SUN4 have > some peculiar registers that must be restored but isn't below? The right question is: can you explain why you expect this to work on any machine? This will fail on the Sun4 for more than one reason. It can't be made to work because of a feature called register windowing. The processor takes care of saving and restoring registers when a procedure is called. It uses internal memory for this. Only when it runs out of internal memory the registers are actually saved on the stack (you have to reserve space for it.) >2. Are there more machines out there that probably will throw up > when executing this code, apart from CPU's where the stack is > growing upward instead of downward. This can be taken care of, > but I did not want to clutter the example below. There is no way I know of to implement coroutines in a portable way. The peculiarities of each machine force you to change a small part of your program. For instance: stacks don't always grow from high to low addresses. This is not the way to do it. >3. Does anyone know of a better way to do coroutines in C ? Carefully isolate the machine dependencies, and write a small part in assembler. Useful knowledge for the Sun4: trap #3 flushes the register window. Don't expect a very efficient implementation of context switching on a SPARC though. There is a lot of context on these machines, 8 * 16 registers in the worst case. I'm working on this one (coroutines for the SUN4) but I've not completely figured out how everything is done. (I'm porting some stuff from a Sun3) Ge' Weijers Ge' Weijers Internet/UUCP: ge@cs.kun.nl Faculty of Mathematics and Computer Science, (uunet.uu.net!cs.kun.nl!ge) University of Nijmegen, Toernooiveld 1 6525 ED Nijmegen, the Netherlands tel. +3180612483 (UTC-2)
ge@kunivv1.sci.kun.nl (Ge' Weijers) (12/08/89)
kenmoore@unix.cis.pitt.edu (Kenneth L Moore) writes: >Yup. This is state of the art computater (sic) architecture. This idea >arose along with RISC (Reduced Instruction Set Computer) but can be used >on RISC or CISC (Complex Instruction Set Computer) machines. And a sorry state it is. I'd rather have twice the number of registers, a store-multiple-registers instruction, and NO register windows. It's just as fast, because you mostly save registers that contain garbage. The SPARC was designed with the assumption in mind that nobody uses recursion anyway. This might be true for C programmers. >Remember that RISC is the result of a statistical study that showed that >99% of the instructions used on a CISC machine were a sub-set of some 30 >(out of maybe 256? on an IBM 360) commands. Also, 10 instructions >accounted for 80% and 21 accounted for 95%. This has nothing to to with register windows. >Another aspect that became apparent during these studies was that much >of the overhead in a processor was consumed in keeping track of >subroutine calls and returns. Recent studies have also shown that you can do the same in software, using a simple analysis. >The original RISC-I guys had a lot of left over chip area and decided >that the thing to do with the extra area was to make extra registers. >And they shrewdly decided to use the registers to facilitate subroutine >handling. It was a good idea as an experiment. It's a pain if your recursion goes 500 deep. But because the SPARC has no other way to save a lot of registers in reasonable time, you're stuck with it. It also introduces quite a bit of overhead when the OS has to switch contexts. >C programmers, keep these facts in mind as they will dominate computer >architecture in the near term. Prolog/ML/other implementors keep these facts in mind, as they will limit the performance of your programs. I don't think the SPARC is bad, it's just not as good as other processors for some uses. Ge' Weijers Ge' Weijers Internet/UUCP: ge@cs.kun.nl Faculty of Mathematics and Computer Science, (uunet.uu.net!cs.kun.nl!ge) University of Nijmegen, Toernooiveld 1 6525 ED Nijmegen, the Netherlands tel. +3180612483 (UTC-2)
tim@nucleus.amd.com (Tim Olson) (12/09/89)
In article <576@kunivv1.sci.kun.nl> ge@kunivv1.sci.kun.nl (Ge' Weijers) writes: | kenmoore@unix.cis.pitt.edu (Kenneth L Moore) writes: | | >Yup. This is state of the art computater (sic) architecture. This idea | >arose along with RISC (Reduced Instruction Set Computer) but can be used | >on RISC or CISC (Complex Instruction Set Computer) machines. | | And a sorry state it is. I'd rather have twice the number of registers, | a store-multiple-registers instruction, and NO register windows. | It's just as fast, because you mostly save registers that contain | garbage. The SPARC was designed with the assumption in mind that nobody | uses recursion anyway. This might be true for C programmers. You should check out the Am29000. It has 192 visible registers, load-multiple and store-multiple instructions, and the register windowing scheme is implemented in software, so that the allocated register window size is exactly tailored to the frame size required by the function. | >Remember that RISC is the result of a statistical study that showed that | >99% of the instructions used on a CISC machine were a sub-set of some 30 | >(out of maybe 256? on an IBM 360) commands. Also, 10 instructions | >accounted for 80% and 21 accounted for 95%. | | This has nothing to to with register windows. True. | >Another aspect that became apparent during these studies was that much | >of the overhead in a processor was consumed in keeping track of | >subroutine calls and returns. | | Recent studies have also shown that you can do the same in software, | using a simple analysis. It isn't simple. You need inter-procedural register allocation ("universal"), and even that is a static analysis -- it doesn't take into account the dynamic nature of the call tree at runtime, like hardware register windows do. | >The original RISC-I guys had a lot of left over chip area and decided | >that the thing to do with the extra area was to make extra registers. | >And they shrewdly decided to use the registers to facilitate subroutine | >handling. | | It was a good idea as an experiment. It's a pain if your recursion goes | 500 deep. But because the SPARC has no other way to save a lot of | registers in reasonable time, you're stuck with it. | It also introduces quite a bit of overhead when the OS has to switch | contexts. Actually, recursion isn't the problem -- it's large, quick changes in the stack depth. In fact, many times recursive routines are better on register-windowed machines, because they only have to save some cached portion of the stack if they overflow the cache. Consider a dynamic call chain (of a recursive routine) which looks like: 1 2 2 3 3 3 3 3 3 3 4 4 4 4 4 4 a non-register-windowed machine must save and restore registers on each function call and return, while the register-windowed machine could run without any register state saved or restored to/from memory. -- Tim Olson Advanced Micro Devices (tim@amd.com)
ge@kunivv1.sci.kun.nl (Ge' Weijers) (12/11/89)
tim@nucleus.amd.com (Tim Olson) writes: >You should check out the Am29000. It has 192 visible registers, >load-multiple and store-multiple instructions, and the register >windowing scheme is implemented in software, so that the allocated >register window size is exactly tailored to the frame size required by >the function. This sounds a lot better. One of the problems I have with the SPARC is that it saves 16 registers, even in the simplest routines. This gives pedestrian performance compared to routines that don't overflow the register window set. Especially on useless routines calculating Fibonacci numbers, or (worse) functional programs doing spine evaluation on a list. P.S. how long does context switching take, with a full window? >It isn't simple. You need inter-procedural register allocation >("universal"), and even that is a static analysis -- it doesn't take >into account the dynamic nature of the call tree at runtime, like >hardware register windows do. If you have register windows that only save the minimum amount of registers necessary then in the worst case you do the same amount of work as if you use a store-multiple instruction to save your registers. I think my opinion was based on the SPARC only, not a bad design, but suboptimal for certain uses. A shame you don't see many 29000 systems around, except in the controller market. I recently saw a FDDI controller board for the PC, featuring a 29000 and quite a bit of memory. Looks like the equivalent of a rubber dingy with a Rolls Royce Merlin engine :-) (the PC being the dingy, of course) Interprocedural analysis is not too hard for the kind on (non-typical) programming languages I have to deal with. It's positively easy, when compared to C. You don't have to invalidate any register contents after a procedure call, as there are no global variables. >Actually, recursion isn't the problem -- it's large, quick changes in >the stack depth. In fact, many times recursive routines are better on >register-windowed machines, because they only have to save some cached >portion of the stack if they overflow the cache. Consider a dynamic call >chain (of a recursive routine) which looks like: > 1 > 2 2 > 3 3 3 3 3 3 3 > 4 4 4 4 4 4 >a non-register-windowed machine must save and restore registers on >each function call and return, while the register-windowed machine >could run without any register state saved or restored to/from memory. 4-deep recursion is hardly any recursion. I'm talking about 30-300,000 deep. If you store 6 unnecessary longwords/call, you use to much stack, and too many cycles. The performance of the SPARC drops a factor of 2 if you have a lot of recursion. I think we agree. > -- Tim Olson > Advanced Micro Devices > (tim@amd.com) Ge' Weijers Ge' Weijers Internet/UUCP: ge@cs.kun.nl Faculty of Mathematics and Computer Science, (uunet.uu.net!cs.kun.nl!ge) University of Nijmegen, Toernooiveld 1 6525 ED Nijmegen, the Netherlands tel. +3180612483 (UTC-2)