pardo@june.cs.washington.edu (David Keppel) (08/03/88)
>[ When is it hard to do alloca? ]
The fast (compiler-supported) alloca may turn off other optimizations.
It need only turn them off in the function using alloca and (given
that memory management is often expensive) you aren't likely to lose a
whole lot by using alloca.
The slower (library-supported) alloca will break whenever you have a
multi-thread (e.g., lightweight process) program (at least in every
"reasonable" implementation that I've thought of). This may be a
problem for shared-memory multiprocessors.
During the last alloca war somebody suggested having alloca require
explicit "free" semantics, so that malloc-based (library) allocations
would be reasonable to write and work on shared-memory multiprocessors.
This would probably also hose fewer compilers/architectures. If I
understand, code would look like:
:
foo = alloca( size );
:
afree( size );
return( value );
This complicates the code slightly, but (I believe) would give
everybody "the best of both worlds". Does anybody object to these
semantics?
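For concreteness, the explicit-free semantics would let a pure library
version be little more than a wrapper around malloc()/free().  A minimal
sketch (the LIFO block list and everything else here are illustrative
assumptions, not an existing implementation; a real multi-threaded version
would keep one list per thread):

    #include <stdlib.h>

    struct ablock { struct ablock *prev; };     /* one header per allocation */
    static struct ablock *a_top = 0;            /* per-thread in an MP version */

    void *alloca(size_t size)                   /* library version, proposed name */
    {
        struct ablock *p = malloc(sizeof *p + size);
        if (p == 0)
            return 0;
        p->prev = a_top;
        a_top = p;
        return p + 1;                           /* user memory follows the header */
    }

    void afree(size_t size)                     /* size unused here: frees the
                                                   most recent block (LIFO) */
    {
        struct ablock *p = a_top;
        (void) size;
        a_top = p->prev;
        free(p);
    }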
Note that GNU products are *not* designed to work on every machine.
The documentation for gcc (as I remember) says that it is designed to
work with 32-bit machines with a flat (non-segment-register) address
space. Alloca could be seen as a constraint on this. Finally, gcc
makes some extensions that can destroy alloca semantics; garbage
collection can (silently) take place before the end of the function,
so even their use is broken (and is documented as such).
[ Please, he said following up, try to keep down your
followup-rate. We just got done with this war a few weeks ago ]
;-D on ( What, me deallocate? ) Pardo
pardo@cs.washington.edu
{rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo
gwyn@smoke.ARPA (Doug Gwyn) (08/03/88)
In article <5422@june.cs.washington.edu> pardo@june.cs.washington.edu (David Keppel) writes:
- foo = alloca( size );
- :
- afree( size );
- return( value );
-Does anybody object to these semantics?
Yes, I object very much. If you're going to do that, just use
malloc()/free(), which are universally available.
jeff@amsdsg.UUCP (Jeff Barr) (08/04/88)
Correct me if I'm wrong. The benefits of alloca() over malloc() are:
(1) There is no need for the program to explicitly free the storage
allocated via alloca(); the free is implicit when the function that
called alloca() returns.  (In a true alloca implementation this happens
automatically, because the storage lives in the stack frame and is
discarded when the stack pointer is restored on return.)
(2) The memory arena used by malloc() is not touched by alloca(),
bypassing any possible fragmentation problems.  (A small example is
sketched below.)
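For example, a function that needs a scratch copy of a string (a sketch
only; error handling is omitted, and the <alloca.h> header is an assumption,
since on many systems alloca is declared elsewhere or built into the
compiler):

    #include <alloca.h>
    #include <stdlib.h>
    #include <string.h>

    /* With alloca(): the buffer vanishes automatically when f() returns. */
    void f(const char *s)
    {
        char *tmp = alloca(strlen(s) + 1);      /* no matching free needed */
        strcpy(tmp, s);
        /* ... use tmp ... */
    }

    /* With malloc(): the same buffer must be released explicitly,
       and it comes out of (and may fragment) the malloc arena. */
    void g(const char *s)
    {
        char *tmp = malloc(strlen(s) + 1);
        if (tmp == 0)
            return;
        strcpy(tmp, s);
        /* ... use tmp ... */
        free(tmp);
    }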
--
Jeff
+---------------------------------------------------------+
| Jeff Barr AMS-DSG uunet!amsdsg!jeff 800-832-8668 |
+---------------------------------------------------------+
ddb@ns.UUCP (David Dyer-Bennet) (08/04/88)
In article <5422@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
$ During the last alloca war somebody suggested having alloca require
$ explicit "free" semantics, so that malloc-based (library) allocations
$ would be reasonable to write and work on shared-memory multiprocessors.
$ This would probably also hose fewer compilers/architectures. If I
$ understand, code would look like:
$ :
$ foo = alloca( size );
$ :
$ afree( size );
$ return( value );
$ This complicates the code slightly, but (I believe) would give
$ everybody "the best of both worlds". Does anybody object to these
$ semantics?
Well, the reasons why alloca is difficult to implement seem convincing.
However, the only point I ever saw to alloca was the automatic deallocation
on exit. If you don't have that, what advantage is there over malloc or
any of the standard allocation routines?
--
-- David Dyer-Bennet
...!{rutgers!dayton | amdahl!ems | uunet!rosevax}!umn-cs!ns!ddb
ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
Fidonet 1:282/341.0, (612) 721-8967 hst/2400/1200/300
pardo@june.cs.washington.edu (David Keppel) (08/04/88)
I wrote:
> foo = alloca( size ); ... afree( size );
Karl Haddock pointed out that the afree()s will never be called for any
frames that are abandoned when some function deeper in the call chain does
a longjmp() past them. Oh well, it seemed like a good idea at the time.
Sorry for the bandwidth.
;-D on ( Memory obfuscation/deallafreeing ) Pardo
swilson%thetone@Sun.COM (Scott Wilson) (08/04/88)
Oh gosh, I'm sorry I ever brought up alloca.  The original post was in
response to the portability issue, so let me see if we agree on this:

1.) Code that, in general, makes use of features that are less available
    than others is, in general, less portable.
2.) Alloca is less available than malloc/free.
3.) Therefore, all other things equal, code that uses alloca is less
    portable than code that doesn't.

Does this make sense?
-- 
Scott Wilson		arpa: swilson@sun.com
Sun Microsystems	uucp: ...!sun!swilson
Mt. View, CA
"Why do dogs lick their balls?  Because they can!"
djones@megatest.UUCP (Dave Jones) (08/05/88)
From article <696@ns.UUCP>, by ddb@ns.UUCP (David Dyer-Bennet):
>
> Well, the reasons why alloca is difficult to implement seem convincing.
> However, the only point I ever saw to alloca was the automatic deallocation
> on exit.  If you don't have that, what advantage is there over malloc or
> any of the standard allocation routines?
>

Speed.  The malloc() in Sun3 OS, for example, is incredibly slow after a
few thousand blocks have been allocated.  As has been pointed out, a memory
allocation package built on a separate stack could be used (a minimal
sketch appears below).  (But then you miss the automatic deallocation,
of course.)

		-- djones
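A minimal sketch of such a separate-stack ("mark/release") allocator; the
fixed-size arena and the stk_* names are illustrative assumptions, not an
existing package:

    #include <stddef.h>

    #define ARENA_SIZE 65536                    /* arbitrary for the sketch */

    static char   arena[ARENA_SIZE];            /* the private stack */
    static size_t arena_top = 0;

    /* Grab size bytes from the private stack; a couple of additions,
       no free list to search. */
    void *stk_alloc(size_t size)
    {
        void *p;
        size = (size + 7) & ~(size_t)7;         /* keep alignment */
        if (arena_top + size > ARENA_SIZE)
            return 0;                           /* out of arena space */
        p = arena + arena_top;
        arena_top += size;
        return p;
    }

    /* Remember the current level ... */
    size_t stk_mark(void) { return arena_top; }

    /* ... and throw away everything allocated since that mark in one shot.
       This is the explicit step that a true alloca does for you. */
    void stk_release(size_t mark) { arena_top = mark; }

The caller has to pair stk_mark()/stk_release() by hand, which is exactly
the automatic deallocation you give up.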
exodus@mfgfoc.UUCP (Greg Onufer) (08/05/88)
From article <62721@sun.uucp>, by swilson%thetone@Sun.COM (Scott Wilson):
> 1.) Code that, in general, makes use of features that are less available
> than others is, in general, less portable.

I don't feel alloca is such a big issue in portability... consider the GNU
code.  It makes extensive use of alloca.  Works on quite a few machines too,
I'd say.

> 2.) Alloca is less available than malloc/free.

On an M68k with a decent OS, alloca is not more than a few lines of assembly
code, correct?  (Judging by the size of the GNU assembly alloca)...  If one
needs alloca and it is not available, why not write a quick alloca?  If the
machine architecture or OS is braindamaged, then one would have to program
around it.  GNU does that also.

> 3.) Therefore, all other things equal, code that uses alloca is less
> portable than code that doesn't.

Not if the programmer cares that extra little bit.

> Does this make sense?

Does it?

				-Greg
-- 
Greg Onufer                    GEnie: G.ONUFER
University of the Pacific      UUCP: ...!sun!daver!mfgfoc!exodus
-= Focus Semiconductor =-      (and postmaster/exodus@uop.edu)
AT&T: 415-965-0604             USMAIL: #901 1929 Crisanto Ave, Mtn View, CA 94040
swilson%thetone@Sun.COM (Scott Wilson) (08/06/88)
In article <394@mfgfoc.UUCP> exodus@mfgfoc.UUCP (Greg Onufer) writes:
>On an M68k with a decent OS, alloca is not more than a few lines of assembly
>code, correct?  (Judging by the size of the GNU assembly alloca)...
>If one needs alloca and it is not available, why not write a quick alloca?

This makes about as much sense as saying: if you're in Tokyo and need to get
to the airport, just ask directions.  It's only one sentence, how hard can
that be?  Well, if you don't know the language it's probably pretty damn
hard.  I would guess the same is true about writing m68k code: if you don't
know how, it doesn't matter that it is just a few lines.

Ok, so I need to write alloca.  Let's see: I brush up on m68k, buy the book
if necessary, understand the calling convention and stack frame stuff for my
compiler, figure out if my programming environment has an easy way to get
assembler in, understand the particular environment's assembler syntax, etc.
Gee, that should only take 10-15 minutes.

We're talking about C here; I think it's unreasonable to expect a programmer
to have any knowledge of assembly code to port a program.  What may seem
like a simple exercise to you and others may be a major chore for me and
others.  And after all the debate on here about alloca implementation
difficulties, I would be hesitant to try it on a cpu that it hasn't already
been done for, for fear that it might be impossible.

And what if I'm not running on a 68k?  C runs under many OS's and many
cpu's, some you've never heard of.  I've written C code for a TP1 running
MOST; how do I do alloca there?  The best assembler manual for it is in
Japanese.  It took me quite a while to write setjmp/longjmp for this beast;
it would be just as much of a hassle to write alloca.  Isn't that what
languages like C are all about, so I don't have to know assembly language?

Again let me say that this whole discussion started with regard to
portability.  Your solution is to add machine-dependent code to a program to
make it work.  So how does that help future portability?  It doesn't, of
course.
-- 
Scott Wilson		arpa: swilson@sun.com
Sun Microsystems	uucp: ...!sun!swilson
Mt. View, CA
gordon@sneaky.TANDY.COM (Gordon Burditt) (08/07/88)
> > 2.) Alloca is less available than malloc/free.
> On an M68k with a decent OS, alloca is not more than a few lines of assembly
> code, correct?  (Judging by the size of the GNU assembly alloca)...
> If one needs alloca and it is not available, why not write a quick alloca?
> If the machine architecture or OS is braindamaged, then one would have
> to program around it.  GNU does that also.

If the compiler knows about alloca [read: recognizes the name as special and
refrains from doing certain things because of it, OR recognizes the name as
special and implements it as a built-in], it could be made to work nicely.
If the compiler doesn't know about alloca you can have a real headache
trying to implement it as it was intended (memory on the stack frame), even
with a "nice" CPU.  Further, even with the ultimate alloca implementation,
GNU GCC with alloca built into the compiler, it still gets it wrong.

Take, for example, the following 68000 linkage conventions:

	d0-d1, a0-a1 are clobbered across calls.
	d2-d7, a2-a5 are preserved across calls.
	a6 is the frame pointer.  a7 is the stack pointer.
	Args are on the stack, first arg at the lowest address.
	CALLER POPS ARGS FROM THE STACK.
	Function return value is in d0 or d0/d1 (for doubles).

These are not negotiable, unless I scrap all my libraries and the compiler
that came with the system, and actually seem fairly reasonable.  These are
used by Tandy's 68k Xenix compiler, and some other systems, like the Sun
(2 and/or 3) version of the GNU compiler, which I have adapted to also work
on a Tandy 6000.

Ok, so alloca should be simple, right?

	- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
	- Pop the return address into a register (a0 probably).
	  [1 instruction]
	- Pick up the requested amount of memory and round it up to a
	  multiple of 2.  [2-3 instructions]
	- Subtract the rounded amount of memory from the stack pointer.
	  [1 instruction]
	- Put the stack pointer + the size of the argument ( = 4) in d0
	  (because the caller is going to adjust the stack to remove that
	  argument).  [1-2 instructions]
	- Return to the saved return address.  [1 instruction]

[Nitpickers: I really don't care if I'm one or two instructions off in the
above estimates.]

Simple, huh?  So how come bison infinite-loops?

Typical function entry/exit code for Tandy's compiler:

	.globl	_foo
_foo:	link	a6,#-<stack frame size>
	moveml	#<<bunch of registers>>,-(a7)
	...
	moveml	(a7)+,#<<bunch of registers>>	| AARRRGGGGHHHH!!!!
	unlk	a6
	rts

The "bunch of registers" is determined from what is actually used.  There
are variations of this code if 0 or 1 registers need to be saved.  Notice
the instruction marked "AARRRGGGGHHHH".  If alloca moves the stack pointer,
then it has to allocate some extra memory and leave a copy of the saved
registers there, so this instruction can pop them off.  (Why didn't the
compiler writer use "moveml <offset>(a6),#<bunch of registers>" instead?  I
don't care, really, but it might be because the code it actually generates
is 2 bytes shorter (and 1 clock cycle faster on a 68020).)

So, my first shot at alloca trashes all of the registers in the caller of
the caller of alloca.  Ok, how much memory?  40 bytes (d2-d7 and a2-a5),
regardless of whether the function actually saves them or not.  Why not
find out what's actually saved?  To do this, you'd need to find the entry
point of the function and disassemble a few instructions.  The frame pointer
points to the return point in the caller of the caller of alloca.
However, in 68000 code, given the address of the end of an instruction and
the contents of memory, you cannot always uniquely find the address of the
beginning of it.  (Ok, maybe 99% of the time you can, but having any need
for a "panic: alloca can't find its caller - program aborted" message is
unacceptable.)  Besides, if the caller of alloca was called through a
function pointer (I did NOT say alloca was called through a function
pointer!  although that wouldn't matter much anyway), the address might be,
and usually is, in a0, and has been destroyed before alloca is called.

Now, on to another issue: function calls and arguments.  Assume that the
compiler has a maximum of N temporary locations in registers to store
computed arguments that haven't been pushed on the stack yet.  (Suppose for
this example that N = 2.)  Assume that the compiler pushes its arguments
onto the stack in an endian order (I did NOT say which end!).  The arguments
and return types of foo and bar are (char *), and bar returns its first
argument.  How does the compiler evaluate this (foo is presumed to have
2N+3 arguments):

	foo(bar(alloca(20)), bar(alloca(20)), bar(alloca(20)),
	    alloca(40),
	    bar(alloca(20)), bar(alloca(20)), bar(alloca(20)) );

Can anyone find ANY compiler on a system with caller-pops-args linkage
conventions that gets this right?  Or for that matter, any compiler on ANY
system where alloca really allocates memory on the stack?

To correctly evaluate this, the compiler would have to evaluate
"bar(alloca(20))" 3 times, store the results somewhere other than pushing
them on the stack, and since it has only 2 registers, one of them has to go
elsewhere (a stack-frame-relative temporary seems to be the only place
left), then evaluate alloca(40), at which point it had better not have
pushed anything on the stack, and then evaluate "bar(alloca(20))" 3 more
times.  THEN it fetches the 7 values from assorted storage and pushes them
on the stack.

Now, consider a compiler that doesn't know alloca is special.  Can you
imagine the laughter if it always generated the above sequence for every
function, alloca or not?  Storing stuff on the stack frame, then copying it
with a push-contents-of-memory-on-stack instruction?  Inefficient code in
the extreme.  What it's really going to do is evaluate "bar(alloca(20))",
push the result on the stack, repeat 2 more times, evaluate "alloca(40)",
push the result on the stack, and repeat evaluating and pushing
"bar(alloca(20))" 3 times.  foo() will then be called with all but the
last-pushed argument in the wrong place.  This is, in fact, how gcc 1.24
evaluates it, WITH AND WITHOUT using the builtin alloca.  With the builtin
and without the optimizer, the alloca() becomes 2 instructions (fix stack
and move return value to d0).  The round-up-to-multiple-of-2 gets done at
compile time since the arguments to the alloca calls were constant.

Consider the GNU compiler feature "defer-pop".  If a compiler defers taking
arguments off the stack after a function call, and doesn't recognize alloca
as a clue to turn off this behavior, and there is no flag to control it,
then I have to throw up my hands and abandon any attempt to write alloca.
Supposedly GCC does recognize alloca as special for this purpose, and in
any case it has flags to control it.

Back to my Tandy 68k machine.  To accommodate the args-on-the-stack problem,
I have to make alloca copy more of the stack frame, to handle the worst case
of args pushed on the stack when it's called.  There is no worst case, but
5 args ( = 20 more bytes) might be good enough.
I got lucky.  Tandy's compiler references auto variables relative to the
stack frame pointer.  If it referenced them relative to the stack pointer,
I'd be out of luck.  So now it works like this:

	- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
	- Pop the return address into a register (a0 probably).
	  [1 instruction]
	- Pick up the requested amount of memory, round it up to a multiple
	  of 2, and add 60.  [2-3 instructions]
	- Subtract the rounded amount of memory from the stack pointer.
	  [1 instruction]
	- Copy 60 bytes from the old stack pointer + 4 to the new stack
	  pointer + 4.  [15 instructions, straightline.  Could be less, but
	  slower, with a loop]
	- Put the stack pointer + the size of the argument plus the copied
	  stack ( = 64) in d0 (because the caller is going to adjust the
	  stack to remove that argument).  [1-2 instructions]
	- Return to the saved return address.  [1 instruction]

So I have a version of alloca that allocates its memory on the stack frame
and wastes 60 bytes per call.  If a compiler uses defer-pop, it's going to
screw up.  If there are more than 5 arguments on the stack when it's called,
it's going to screw up.  But bison doesn't infinite-loop any more.

Why don't I post it?  Several reasons:

	- It's too compiler- and system-specific to be of much use to many
	  people.
	- The waste of memory per call is embarrassing.
	- It was debugged using bison, and according to discussions going on
	  in gnu.gcc, it is probably subject to the GNU Public License, and
	  I'd have to post the source to everything I used it in.  I don't
	  have that much disk space, and neither does the net.
	- It still doesn't work right, in ways I indicated.

					Gordon L. Burditt
					killer!ninja!sneaky!gordon
henry@utzoo.uucp (Henry Spencer) (08/07/88)
In article <697@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>Speed.  The malloc() in Sun3 OS, for example, is incredibly slow
>after a few thousand blocks have been allocated...

This just says that whoever wrote Sun's malloc was incompetent.  There is
nothing inherently hard about writing a fast malloc, although it's not
simple, especially for widely-diversified needs.
-- 
MSDOS is not dead, it just     |  Henry Spencer at U of Toronto Zoology
smells that way.               |  uunet!attcan!utzoo!henry henry@zoo.toronto.edu
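For what it's worth, the usual trick behind a fast malloc is per-size free
lists, so that the common case is a couple of pointer operations.  A toy
sketch of the idea (power-of-two size classes; the fast_* names are made up
and this is obviously nowhere near a complete malloc):

    #include <stdlib.h>

    #define NCLASS 16

    struct hdr { struct hdr *next; int cls; };
    static struct hdr *freelist[NCLASS];    /* one list per size class */

    static int class_of(size_t n)           /* smallest class holding n bytes */
    {
        int c = 0;
        size_t sz = 16;
        while (sz < n && c < NCLASS - 1) { sz <<= 1; c++; }
        return c;
    }

    void *fast_alloc(size_t n)
    {
        int c = class_of(n);
        struct hdr *h = freelist[c];
        if (h != 0)                          /* common case: pop a cached block */
            freelist[c] = h->next;
        else {                               /* otherwise get a fresh one */
            h = malloc(sizeof *h + ((size_t)16 << c));
            if (h == 0) return 0;
        }
        h->cls = c;
        return h + 1;                        /* user data follows the header */
    }

    void fast_free(void *p)
    {
        struct hdr *h = (struct hdr *)p - 1; /* push back onto its class list */
        h->next = freelist[h->cls];
        freelist[h->cls] = h;
    }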
sarima@gryphon.CTS.COM (Stan Friesen) (08/08/88)
In article <697@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>From article <696@ns.UUCP>, by ddb@ns.UUCP (David Dyer-Bennet):
>> If you don't have that, what advantage is there over malloc or
>> any of the standard allocation routines?
>
>Speed.  The malloc() in Sun3 OS, for example, is incredibly slow
>after a few thousand blocks have been allocated.

Indeed, and not just on Suns either.  A similar thing happened to me on a
Convergent.  In fact it eventually got sooo slow that our customers
complained!  (Luckily the large number of allocations was unnecessary; the
code was not freeing a message block, so it was easy to fix by just adding
a free() after the message was displayed.  But still, a 10-20 fold increase
in speed *just* by adding one free!!)
-- 
Sarima Cardolandion			sarima@gryphon.CTS.COM
aka Stanley Friesen			rutgers!marque!gryphon!sarima
Sherman Oaks, CA
exodus@mfgfoc.UUCP (Greg Onufer) (08/08/88)
From article <63153@sun.uucp>, by swilson%thetone@Sun.COM (Scott Wilson):
> In article <394@mfgfoc.UUCP> exodus@mfgfoc.UUCP I write:
>>On an M68k with a decent OS, alloca is not more than a few lines of assembly
>>code, correct?  (Judging by the size of the GNU assembly alloca)...
>>If one needs alloca and it is not available, why not write a quick alloca?
> ...
> Again let me say that this whole discussion started with regard to
> portability.  Your solution is to add machine-dependent code to a
> program to make it work.  So how does that help future portability?
> It doesn't, of course.

Repeat this ten times:  Good C programmers should always know the output of
their compiler!!!

And I tried to emphasize the GNU code...  it does it, portably too.  If you
look at the GNU code (and I think _everybody_ should have copies of some of
the important C files in both Emacs and GNU C), there _IS_ a portable
implementation that works on machines which could not reasonably be asked to
support proper alloca functionality.  It does require one to call alloca
with an argument of zero to free up the allocated memory (making it no
better than malloc, essentially).  This is an example where one could learn
from someone else's code.  I'd say GnuEmacs and GNU C are excellent sources
of education on portable program writing.

> Scott Wilson		arpa: swilson@sun.com
-- 
Greg Onufer                    GEnie: G.ONUFER
University of the Pacific      UUCP: ...!sun!daver!mfgfoc!exodus
-= Focus Semiconductor =-      (and postmaster/exodus@uop.edu)
AT&T: 415-965-0604             USMAIL: #901 1929 Crisanto Ave, Mtn View, CA 94040
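A much-simplified sketch of that style of malloc-backed alloca emulation:
each block is tagged with the stack depth at which it was allocated, blocks
belonging to frames that have since returned are reclaimed lazily on the
next call, and alloca(0) just does the reclamation.  Everything here (the
names, the downward-growing-stack assumption, the pointer-comparison trick)
is illustrative, not the actual GNU source:

    #include <stdlib.h>

    struct blk { struct blk *next; char *depth; };
    static struct blk *chain = 0;

    void *alloca_emul(size_t size)
    {
        char probe;                     /* its address marks the current depth */
        struct blk *b;

        /* Reclaim blocks recorded at depths below us: their frames have
           already returned.  (Assumes a stack that grows downward.) */
        while (chain != 0 && chain->depth < &probe) {
            struct blk *old = chain;
            chain = old->next;
            free(old);
        }

        if (size == 0)                  /* alloca(0): free only */
            return 0;

        b = malloc(sizeof *b + size);
        if (b == 0)
            return 0;
        b->depth = &probe;
        b->next = chain;
        chain = b;
        return b + 1;                   /* user memory follows the header */
    }

Since nothing is freed until some later call from a shallower frame, this
really is no better than malloc, as noted above; it just keeps alloca-using
code compiling and running.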
daveb@llama.rtech.UUCP (Dave Brower) (08/12/88)
In <1988Aug7.002334.6987@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In <697@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>>Speed.  The malloc() in Sun3 OS, for example, is incredibly slow
>>after a few thousand blocks have been allocated...
>
>This just says that whoever wrote Sun's malloc was incompetent.  There is
>nothing inherently hard about writing a fast malloc, although it's not
>simple, especially for widely-diversified needs.

Don't you just love the way electronic communication helps make generally
reasonable people offensive?  It's more than a bit out of line to say the
Sun programmer is incompetent.

It isn't simple making the tradeoffs inherent in an allocator.  I've never
seen one that would satisfy all desirable criteria for all clients, because
the requirements are completely program-specific.  There are at least four
dimensions to the allocator problem:

	(1) Time to get a block.
	(2) Run size of the program (swap space).
	(3) Fragmentation, and what it does to (1) and (2).
	(4) Time to release a block, and what that does to (2) and (3).

No one algorithm is ever going to be optimal in all cases.  The stack-based
alloca is nearly ideal for allocating large numbers of small, easily
disposed objects.  Malloc and friends are better for smaller numbers of
probably larger objects.  Sbrk with +/- increments is more for getting and
releasing small numbers of large objects.

There is no one correct answer.  "Incompetence" is an unfairly glib way of
saying that one perfectly valid implementation isn't the one that would be
best for a particular application.

-dB
"Ready when you are Raoul!"
{amdahl, cpsc6a, mtxinu, sun, hoptoad}!rtech!daveb daveb@rtech.com <- FINALLY!
henry@utzoo.uucp (Henry Spencer) (08/16/88)
In article <2375@rtech.rtech.com> daveb@llama.UUCP (Dave Brower) writes:
>Don't you just love the way electronic communication helps make
>generally reasonable people offensive?  It's more than a bit out of line
>to say the Sun programmer is incompetent...

Who, me, offensive?  Nah. :-)

On the contrary, it is quite in line to say that some Sun programmers are
incompetent; this is an established fact.  (Some others are highly
competent, but unfortunately they're not the whole story.)

>There is no one correct answer.  "Incompetence" is an unfairly glib way
>of saying that one perfectly valid implementation isn't the one that
>would be best for a particular application.

Well, no, actually it was a shorthand way of saying that while this *might*
be a case of implementation-application mismatch, it is much more likely
that it's a case of programmer incompetence, or at least deliberate
disregard of performance issues (which appears to be a widespread disease
at Sun).
-- 
Intel CPUs are not defective,  |  Henry Spencer at U of Toronto Zoology
they just act that way.        |  uunet!attcan!utzoo!henry henry@zoo.toronto.edu