[comp.lang.c] alloca wars

pardo@june.cs.washington.edu (David Keppel) (08/03/88)

>[ When is it hard to do alloca? ]

The fast (compiler-supported) alloca may turn off other optimizations.
It need only turn them off in the function using alloca and (given
that memory management is often expensive) you aren't likely to lose a
whole lot by using alloca.

The slower (library-supported) alloca will break whenever you have a
multi-thread (e.g., lightweight process) program (at least in every
"reasonable" implementation that I've thought of).  This may be a
problem for shared-memory multiprocessors.

During the last alloca war somebody suggested having alloca require
explicit "free" semantics, so that malloc-based (library) allocations
would be reasonable to write and work on shared-memory multiprocessors.
This would probably also hose fewer compilers/architectures.  If I
understand, code would look like:

	:
	foo = alloca( size );
	:
	afree( size );
	return( value );

This complicates the code slightly, but (I believe) would give
everybody "the best of both worlds".  Does anybody object to these
semantics?
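
For concreteness, here is one hedged sketch of what the library
(malloc-based) half of those semantics might look like.  The name
alloca_lib() and the little LIFO table are purely illustrative -- this
is a sketch of the idea, not a proposed implementation:

	#include <stdlib.h>

	#define AMAX 64

	static void *ablocks[AMAX];	/* live allocations, most recent last */
	static int   atop = 0;

	/* allocate from the heap, remembering the block in LIFO order */
	void *alloca_lib(size_t size)
	{
		void *p;

		if (atop >= AMAX || (p = malloc(size)) == NULL)
			return NULL;
		ablocks[atop++] = p;
		return p;
	}

	/* the explicit free: release the most recent allocation.  The
	 * size is unused here; a stack-based alloca would simply add
	 * it back to the stack pointer. */
	void afree(size_t size)
	{
		(void)size;
		if (atop > 0)
			free(ablocks[--atop]);
	}

On a shared-memory multiprocessor each thread would need its own table,
but at least the explicit afree() tells the library *when* the storage
is dead, which a free-it-on-some-later-call scheme cannot know.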

Note that GNU products are *not* designed to work on every machine.
The documentation for gcc (as I remember) says that it is designed to
work with 32-bit machines with a flat (non-segment-register) address
space.  Alloca could be seen as another such constraint.  Finally, gcc
makes some extensions that can destroy alloca semantics; garbage
collection can (silently) take place before the end of the function,
so even their use is broken (and is documented as such).

[ Please, he said following up, try to keep down your
  followup-rate.  We just got done with this war a few weeks ago ]

	    ;-D on  ( What, me deallocate? )  Pardo

		    pardo@cs.washington.edu
    {rutgers,cornell,ucsd,ubc-cs,tektronix}!uw-beaver!june!pardo

gwyn@smoke.ARPA (Doug Gwyn ) (08/03/88)

In article <5422@june.cs.washington.edu> pardo@june.cs.washington.edu (David Keppel) writes:
-	foo = alloca( size );
-	:
-	afree( size );
-	return( value );
-Does anybody object to these semantics?

Yes, I object very much.  If you're going to do that, just use
malloc()/free(), which are universally available.

jeff@amsdsg.UUCP (Jeff Barr) (08/04/88)

Correct me if I'm wrong.  The benefits of alloca() over malloc() are:

(1)  There is no need for the program to explicitly free the storage 
     allocated via alloca(); the free() is implicit when the function
     that called alloca() returns.  (In a true alloca implementation
     this happens automatically when the stack frame is popped.)

(2)  The memory arena used by malloc() is not referenced by alloca(), 
     bypassing any possible fragmentation problems.
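
A small illustrative example of point (1) -- the declaration of alloca()
varies from system to system, so take the "char *alloca();" line as an
assumption rather than gospel:

	#include <stdio.h>
	#include <string.h>

	char *alloca();		/* however your system declares it */

	void greet(char *name)
	{
		/* released automatically when greet() returns */
		char *buf = alloca((unsigned) (strlen(name) + 7));

		strcpy(buf, "Hello ");
		strcat(buf, name);
		puts(buf);
	}	/* no free() needed on any return path */

The malloc() version would need a free(buf) before every return, and it
would also touch the arena that point (2) is trying to stay out of.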
 
-- 
						Jeff
	+---------------------------------------------------------+
	|  Jeff Barr   AMS-DSG   uunet!amsdsg!jeff   800-832-8668 |
	+---------------------------------------------------------+

ddb@ns.UUCP (David Dyer-Bennet) (08/04/88)

In article <5422@june.cs.washington.edu>, pardo@june.cs.washington.edu (David Keppel) writes:
$ During the last alloca war somebody suggested having alloca require
$ explicit "free" semantics, so that malloc-based (library) allocations
$ would be reasonable to write and work on shared-memory multiprocessors.
$ This would probably also hose fewer compilers/architectures.  If I
$ understand, code would look like:

$ 	:
$ 	foo = alloca( size );
$ 	:
$ 	afree( size );
$ 	return( value );

$ This complicates the code slightly, but (I believe) would give
$ everybody "the best of both worlds".  Does anybody object to these
$ semantics?

   Well, the reasons why alloca is difficult to implement seem convincing.
However, the only point I ever saw to alloca was the automatic deallocation
on exit.  If you don't have that, what advantage is there over malloc or
any of the standard allocation routines?
-- 
	-- David Dyer-Bennet
	...!{rutgers!dayton | amdahl!ems | uunet!rosevax}!umn-cs!ns!ddb
	ddb@Lynx.MN.Org, ...{amdahl,hpda}!bungia!viper!ddb
	Fidonet 1:282/341.0, (612) 721-8967 hst/2400/1200/300

pardo@june.cs.washington.edu (David Keppel) (08/04/88)

I wrote:
>  foo = alloca( size ); ... afree( foo );

Karl Haddock pointed out that the afree()s will not be called for any
frames whose scope is exited via longjmp() from a dynamically nested
(child) function.  Oh well, it seemed like a good idea at the time.
Sorry for the bandwidth.
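
A bare-bones illustration of the failure mode Karl describes, with
alloca_lib() and the pointer-taking afree() as hypothetical stand-ins
for whatever the library routines would really be called:

	#include <setjmp.h>

	extern char *alloca_lib();	/* hypothetical library alloca */
	extern void afree();		/* and its explicit free */

	static jmp_buf env;

	void child()
	{
		longjmp(env, 1);	/* unwinds middle() on the way out */
	}

	void middle()
	{
		char *buf = alloca_lib(100);

		child();
		afree(buf);		/* never reached: the block leaks */
	}

	int main()
	{
		if (setjmp(env) == 0)
			middle();	/* control returns here via longjmp */
		return 0;
	}

A true stack-based alloca gets away with this, since the longjmp()
restores the stack pointer anyway; the malloc-based version just leaks.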

	;-D on  ( Memory obfuscation/deallafreeing )  Pardo

swilson%thetone@Sun.COM (Scott Wilson) (08/04/88)

Oh gosh, I'm sorry I ever brought up alloca.  The original post
was in response to the portability issue, so let me see if we
agree on this:

1.)  Code that makes use of features that are less widely available is,
     in general, less portable.

2.)  Alloca is less available than malloc/free.

3.)  Therefore, all other things equal, code that uses alloca is less
     portable than code that doesn't.

Does this make sense?

--
Scott Wilson		arpa: swilson@sun.com		"Why do dogs lick
Sun Microsystems	uucp: ...!sun!swilson		their balls?  Because
Mt. View, CA						they can!"

djones@megatest.UUCP (Dave Jones) (08/05/88)

From article <696@ns.UUCP>, by ddb@ns.UUCP (David Dyer-Bennet):
> 
>    Well, the reasons why alloca is difficult to implement seem convincing.
> However, the only point I ever saw to alloca was the automatic deallocation
> on exit.  If you don't have that, what advantage is there over malloc or
> any of the standard allocation routines?
>

Speed.  The malloc() in Sun3 OS, for example, is incredibly slow
after a few thousand blocks have been allocated.  As has been pointed
out, a memory allocation package built on a separate stack could
be used.  (But then you miss the automatic deallocation, of course.)
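
For the curious, a rough sketch of what such a separate-stack package
boils down to -- the names and the fixed-size arena are illustrative
assumptions, not any particular existing library:

	#include <stddef.h>

	#define ASTK_SIZE 65536

	static char   astk[ASTK_SIZE];	/* the separate stack */
	static size_t astk_top = 0;	/* next free byte */

	/* remember the current level so it can be restored later */
	size_t astk_mark(void)
	{
		return astk_top;
	}

	/* carve size bytes off the separate stack; NULL when exhausted */
	char *astk_alloc(size_t size)
	{
		char *p;

		size = (size + 7) & ~(size_t)7;	/* keep things aligned */
		if (astk_top + size > ASTK_SIZE)
			return NULL;
		p = astk + astk_top;
		astk_top += size;
		return p;
	}

	/* pop everything allocated since the mark -- the by-hand part */
	void astk_release(size_t mark)
	{
		astk_top = mark;
	}

A function calls astk_mark() on entry and astk_release() before every
return, which is exactly the bookkeeping a true alloca does for free.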


		-- djones

exodus@mfgfoc.UUCP (Greg Onufer) (08/05/88)

From article <62721@sun.uucp>, by swilson%thetone@Sun.COM (Scott Wilson):
> 1.)  Code that, in general, makes use of features that are less available
>      than others is, in general, less portable.
I don't feel alloca is such a big issue in portability... consider the GNU 
code.  Makes extensive use of alloca.  Works on quite a few machines too,
I'd say.

> 2.)  Alloca is less available than malloc/free.
On an M68k with a decent OS, alloca is not more than a few lines of assembly
code, correct?  (Judging by the size of the GNU assembly alloca)...
If one needs alloca and it is not available, why not write a quick alloca?
If the machine architecture or OS is braindamaged, then one would have
to program around it.  GNU does that also.

> 3.)  Therefore, all other things equal, code that uses alloca is less
>      portable than code that doesn't.
Not if the programmer cares that extra little bit.

> Does this make sense?
Does it?

-Greg
-- 
Greg Onufer   		GEnie: G.ONUFER		University of the Pacific
UUCP:						-= Focus Semiconductor =-
exodus@mfgfoc ...!sun!daver!mfgfoc!exodus  (and postmaster/exodus@uop.edu)
AT&T: 415-965-0604	USMAIL: #901 1929 Crisanto Ave, Mtn View, CA 94040 

swilson%thetone@Sun.COM (Scott Wilson) (08/06/88)

In article <394@mfgfoc.UUCP> exodus@mfgfoc.UUCP (Greg Onufer) writes:
>On an M68k with a decent OS, alloca is not more than a few lines of assembly
>code, correct?  (Judging by the size of the GNU assembly alloca)...
>If one needs alloca and it is not available, why not write a quick alloca?

This makes about as much sense as saying:  If you're in Tokyo and need
to get to the airport, just ask directions.  It's only one sentence, how
hard can that be?  Well, if you don't know the language it's probably
pretty damn hard.  I would guess the same is true about writing m68k
code: if you don't know how, it doesn't matter that it is just a few lines.

Ok, so I need to write alloca, so let's see I brush up on m68k, buy the
book if necessary, understand the calling convention and stack frame
stuff for my compiler, figure out if my programming environment has
an easy way to get assembler in, understand the particular environment's
assembler syntax, etc.  Gee that should only take 10-15 minutes.  We're
talking about C here, I think it's unreasonable to expect a programmer
to have any knowledge of assembly code to port a program.  What may
seem like a simple exercise to you and others may be a major chore
for me and others.  And after all the debate on here about alloca
implementation difficulties, I would be hesitant to try it on a CPU that
it hasn't already been done for, for fear that it might prove impossible.
And what if I'm not running on a 68k?  C runs under many OSes on many
CPUs, some of which you've never heard of.  I've written C code for a TP1
running MOST; how do I do alloca there?  The best assembler manual for it
is in Japanese.
It took me quite a while to write setjmp/longjmp for this beast, it would
be just as much of a hassle to write alloca.  Isn't that what languages like
C are all about, so I don't have to know assembly language?

Again let me say that this whole discussion started with regard to
portability.  Your solution is to add machine-dependent code to a
program to make it work.  So how does that help future portability?
It doesn't of course.

--
Scott Wilson		arpa: swilson@sun.com
Sun Microsystems	uucp: ...!sun!swilson
Mt. View, CA

gordon@sneaky.TANDY.COM (Gordon Burditt) (08/07/88)

> > 2.)  Alloca is less available than malloc/free.
> On an M68k with a decent OS, alloca is not more than a few lines of assembly
> code, correct?  (Judging by the size of the GNU assembly alloca)...
> If one needs alloca and it is not available, why not write a quick alloca?
> If the machine architecture or OS is braindamaged, then one would have
> to program around it.  GNU does that also.

If the compiler knows about alloca [Read:  recognizes the name as special 
and refrains from doing certain things because of it, OR recognizes the name 
as special and implements it as a built-in], it could be made to work 
nicely.  If the compiler doesn't know about alloca, you can have a real 
headache trying to implement it as it was intended (memory on the stack 
frame), even with a "nice" CPU.  

Further, even with the ultimate alloca implementation, GNU GCC with
alloca built into the compiler, it still gets it wrong.

Take, for example, the following 68000 linkage conventions:

d0-d1, a0-a1 are clobbered across calls.  d2-d7, a2-a5 are preserved across 
calls.  a6 is the frame pointer.  a7 is the stack pointer.  Args are
on the stack, first arg at the lowest address.  CALLER POPS ARGS FROM THE 
STACK.  Function return value is in d0 or d0/d1 (for doubles).

These are not negotiable, unless I scrap all my libraries and the compiler
that came with the system, and actually seem fairly reasonable.  These are 
used on Tandy's 68k Xenix compiler, and some other systems, like the sun 
(2 and/or 3) version of the GNU compiler, which I have adapted to also
work on a Tandy 6000.  

Ok, so alloca should be simple, right?  
	- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
	- Pop the return address into a register (a0 probably). [1 instruction]
	- Pick up the requested amount of memory and round it up to
	  a multiple of 2. [2-3 instructions]
	- Subtract the rounded amount of memory from the stack pointer.
	  [1 instruction]
	- Put the stack pointer + the size of the argument ( = 4) in d0
	  (because the caller is going to adjust the stack to remove
	  that argument). [1-2 instructions]
	- Return to the saved return address. [1 instruction]

[Nitpickers:  I really don't care if I'm one or two instructions off in the
above estimates]

Simple, huh?  So how come bison infinite loops?

Typical function entry/exit code for Tandy's compiler:
	.globl _foo
_foo:
	link	a6,#-<stack frame size>
	moveml	#<<bunch of registers>>,-(a7)
	...
	moveml	(a7)+,#<<bunch of registers>>		| AARRRGGGGHHHH!!!!
	unlk	a6
	rts
The "bunch of registers" is determined from what is actually used.  There
are variations of this code if 0 or 1 registers need to be saved.

Notice the instruction marked "AARRRGGGGHHHH".  If alloca moves the stack 
pointer, then it has to allocate some extra memory and leave a copy of the 
saved registers in it, so this instruction can pop them off.  (Why didn't 
the compiler writer use "moveml <offset>(a6),#<bunch of registers>" instead?  
I don't care, really, but it might be because the code it actually generates 
is 2 bytes shorter (and 1 clock cycle faster on a 68020).)  So, my first shot 
at alloca trashes all of the registers in the caller of the caller of alloca.

Ok, how much memory?  40 bytes (d2-d7 and a2-a5), regardless of whether the 
function actually saves them or not.  Why not find out what's actually saved?  
To do this, you'd need to find the entry point of the function and disassemble 
a few instructions.  The frame pointer points to the return point in the 
caller of the caller of alloca.  However, in 68000 code, given the address 
of the end of an instruction and the contents of memory, you cannot always
uniquely find the address of the beginning of it.  (Ok, maybe 99% of the time 
you can, but having any need for a "panic:  alloca can't find its caller - 
program aborted" message is unacceptable).  Besides, if the caller of alloca 
was called through a function pointer (I DID NOT say alloca was called through 
a function pointer!  although that wouldn't matter much anyway), the address 
might be and usually is in a0, and has been destroyed before alloca is called.  

Now, on to another issue: function calls and arguments.  Assume that the 
compiler has a maximum of N temporary locations in registers to store 
computed arguments that haven't been pushed on the stack yet.  (Suppose for 
this example that N = 2).  Assume that the compiler pushes its arguments onto 
the stack in an endian order (I did NOT say which end!).  The arguments and 
return types of foo and bar are (char *), and bar returns its first argument.  
How does the compiler evaluate this (foo is presumed to have 2N+3 arguments):

	foo(bar(alloca(20)), bar(alloca(20)), bar(alloca(20)), alloca(40),
		bar(alloca(20)), bar(alloca(20)), bar(alloca(20)) );

Can anyone find ANY compiler on a system with caller-pops-args linkage
conventions that gets this right?  Or for that matter, any compiler on
ANY system where alloca really allocates memory on the stack?

To correctly evaluate this, the compiler would have to evaluate 
"bar(alloca(20))" 3 times, store the results somewhere other than pushing
on the stack, and since it has only 2 registers, one of them has to be 
elsewhere (stack frame relative temporary seems to be the only place left), 
evaluate alloca(40), at which point it had better not have pushed anything on 
the stack, and then evaluate "bar(alloca(20))" 3 more times.  THEN it fetches 
the 7 values from assorted storage and pushes them on the stack.

Now, consider a compiler that doesn't know alloca is special.  Can you
imagine the laughter if it always generated the above sequence for
every function, alloca or not?  Storing stuff on the stack frame, then
copying it with a push-contents-of-memory-on-stack instruction?  Inefficient
code in the extreme.

What it's really going to do is evaluate "bar(alloca(20))", push the result on 
the stack, repeat 2 more times, evaluate "alloca(40)", push the result on 
the stack, and repeat evaluating and pushing "bar(alloca(20))" 3 times.  
foo() will then be called with all but the last-pushed argument in the wrong 
place.  This is, in fact, how gcc 1.24 evaluates it, WITH AND WITHOUT 
using the builtin alloca.  With the builtin and without the optimizer, the 
alloca() becomes 2 instructions (fix stack and move return value to d0).
The round-up-to-multiple-of-2 gets done at compile time since the arguments
to the alloca calls were constant.
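
For what it's worth, the source-level workaround is to keep alloca()
out of argument lists entirely: hoist the calls into separate statements
so that no alloca moves the stack pointer while the arguments of foo()
are being pushed.  A sketch, reusing the foo/bar of the example above
(it sidesteps the argument-pushing problem only; it does nothing about
defer-pop, discussed next):

	extern char *foo(), *bar(), *alloca();	/* as in the example above */

	void safer()
	{
		char *a1, *a2, *a3, *a4, *a5, *a6, *a7;

		/* each alloca() is a complete statement, so the stack
		 * level is settled before any argument of foo() is pushed */
		a1 = bar(alloca(20));
		a2 = bar(alloca(20));
		a3 = bar(alloca(20));
		a4 = alloca(40);
		a5 = bar(alloca(20));
		a6 = bar(alloca(20));
		a7 = bar(alloca(20));

		foo(a1, a2, a3, a4, a5, a6, a7);
	}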

Consider the GNU compiler feature "defer-pop".  If a compiler defers
taking arguments off the stack after a function call, and doesn't recognize 
alloca as a clue to turn off this behavior, and there is no flag to
control it, then I have to throw up my hands and abandon any attempt to
write alloca.  Supposedly GCC does recognize alloca as special for this
purpose, and in any case, it has flags to control it.

Back to my Tandy 68k machine.  To accommodate the args-on-the-stack problem, 
I have to make alloca copy more of the stack frame to handle the worst case 
of args pushed on the stack when it's called.  There is no bounded worst 
case, but 5 args ( = 20 more bytes) might be good enough.

I got lucky.  Tandy's compiler references auto variables relative to the
stack frame pointer.  If it referenced them relative to the stack pointer,
I'd be out of luck.

So now it works like this:
	- Leave a2-a6 and d2-d7 alone, so you don't have to save them.
	- Pop the return address into a register (a0 probably). [1 instruction]
	- Pick up the requested amount of memory and round it up to
	  a multiple of 2, and add 60. [2-3 instructions]
	- Subtract the rounded amount of memory from the stack pointer.
	  [1 instruction]
	- Copy 60 bytes from the old stack pointer + 4 to the new stack
	  pointer + 4.  [15 instructions, straightline.
	  Could be fewer but slower with a loop]
	- Put the stack pointer + the size of the argument plus the
	  copied stack ( = 64) in d0 (because the caller is going to adjust 
	  the stack to remove that argument). [1-2 instructions]
	- Return to the saved return address. [1 instruction]

So I have a version of alloca that allocates its memory on the stack frame
and wastes 60 bytes per call.  If a compiler uses defer-pop, it's 
going to screw up.  If there are more than 5 arguments on the stack when
it's called, it's going to screw up.  But bison doesn't infinite loop
any more.

Why don't I post it?  Several reasons:  
- It's too compiler- and system- specific to be of much use to many people.
- The waste of memory per call is embarrassing.
- It was debugged using bison, and according to discussions going on in 
  gnu.gcc, it is probably subject to the GNU Public License, and I'd have to 
  post the source to everything I used it in.  I don't have that much disk 
  space, and neither does the net.
- It still doesn't work right, in ways I indicated.

				Gordon L. Burditt
				killer!ninja!sneaky!gordon

henry@utzoo.uucp (Henry Spencer) (08/07/88)

In article <697@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>Speed.  The malloc() in Sun3 OS, for example, is incredibly slow
>after a few thousand blocks have been allocated...

This just says that whoever wrote Sun's malloc was incompetent.  There is
nothing inherently hard about writing a fast malloc, although it's not
simple, especially for widely-diversified needs.
-- 
MSDOS is not dead, it just     |     Henry Spencer at U of Toronto Zoology
smells that way.               | uunet!attcan!utzoo!henry henry@zoo.toronto.edu

sarima@gryphon.CTS.COM (Stan Friesen) (08/08/88)

In article <697@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>From article <696@ns.UUCP>, by ddb@ns.UUCP (David Dyer-Bennet):
>> If you don't have that, what advantage is there over malloc or
>> any of the standard allocation routines?
>
>Speed.  The malloc() in Sun3 OS, for example, is incredibly slow
>after a few thousand blocks have been allocated.

	Indeed, and not just on Suns either. A similar thing happened to
me on a Convergent. In fact it eventually got sooo slow that our customers
complained!  (Luckily the large number of allocations was unnecessary: the
code was not freeing a message block, so it was easy to fix by just adding
a free() after the message was displayed.  But still, a 10-20 fold increase
in speed *just* by adding one free!!)
-- 
Sarima Cardolandion			sarima@gryphon.CTS.COM
aka Stanley Friesen			rutgers!marque!gryphon!sarima
					Sherman Oaks, CA

exodus@mfgfoc.UUCP (Greg Onufer) (08/08/88)

From article <63153@sun.uucp>, by swilson%thetone@Sun.COM (Scott Wilson):
> In article <394@mfgfoc.UUCP> exodus@mfgfoc.UUCP I write:
>>On an M68k with a decent OS, alloca is not more than a few lines of assembly
>>code, correct?  (Judging by the size of the GNU assembly alloca)...
>>If one needs alloca and it is not available, why not write a quick alloca?
> ...
> Again let me say that this whole discussion started with regard to
> portability.  Your solution is to add machine dependent code to a
> program to make it work.  So how does that help future portability?
> It doesn't of course.

Repeat this ten times:
Good C programmers should always know the output of their compiler!!!

And I tried to emphasize the GNU code....  It does it--- portably too.
And if you look at the GNU code (and I think _everybody_ should have copies
of some of the important C files in both Emacs and Gnu C), there _IS_
a portable implementation that works on machines which could not
reasonably be asked to support proper alloca functionality.  It does
require one to call alloca with an argument of zero to free up the
allocated memory (making it no better than malloc, essentially).
This is an example where one could learn from someone else's code.
I'd say GnuEmacs and Gnu C are excellent sources of education on
writing portable programs.
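
Roughly, the trick behind that portable emulation looks like this.  This
is a from-memory sketch, NOT the actual GNU alloca.c, and it assumes a
stack that grows downward:

	#include <stdlib.h>

	struct hdr {
		struct hdr *next;
		char       *depth;	/* stack level of the allocating frame */
	};

	static struct hdr *last = NULL;

	char *alloca_emul(size_t size)
	{
		char probe;		/* its address marks the current stack level */
		struct hdr *h;

		/* reclaim blocks made by frames that have since returned:
		 * their recorded level is deeper (lower) than where we are now */
		while (last != NULL && last->depth < &probe) {
			h = last;
			last = h->next;
			free(h);
		}

		if (size == 0)
			return NULL;	/* alloca(0) just triggers the cleanup */

		h = malloc(sizeof(struct hdr) + size);
		if (h == NULL)
			return NULL;
		h->next = last;
		h->depth = &probe;
		last = h;
		return (char *)(h + 1);
	}

The storage hangs around until some later call from a shallower frame
finally sweeps it up, which is why it is "no better than malloc,
essentially" -- and a single shared list like this is also exactly what
breaks on a multi-threaded, shared-memory machine.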

> Scott Wilson		arpa: swilson@sun.com


-- 
Greg Onufer   		GEnie: G.ONUFER		University of the Pacific
UUCP:						-= Focus Semiconductor =-
exodus@mfgfoc ...!sun!daver!mfgfoc!exodus  (and postmaster/exodus@uop.edu)
AT&T: 415-965-0604	USMAIL: #901 1929 Crisanto Ave, Mtn View, CA 94040 

daveb@llama.rtech.UUCP (Dave Brower) (08/12/88)

In <1988Aug7.002334.6987@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>In <697@goofy.megatest.UUCP> djones@megatest.UUCP (Dave Jones) writes:
>>Speed.  The malloc() in Sun3 OS, for example, is incredibly slow
>>after a few thousand blocks have been allocated...
>
>This just says that whoever wrote Sun's malloc was incompetent.  There is
>nothing inherently hard about writing a fast malloc, although it's not
>simple, especially for widely-diversified needs.

Don't you just love the way electronic communication helps make
generally reasonable people offensive?  It's more than a bit out of line
to say the Sun programmer is incompetent.  It isn't simple making the
tradeoffs inherent in an allocator.  I've never seen one that would
satisfy all desirable criteria for all clients, because the requirements
are completely program-specific.

There are at least four dimensions to the allocator problem:

	(1)	Time to get a block.
	(2)	Run size of the program (swap space).
	(3)	Fragmentation, and what it does to (1) and (2).
	(4)	Time to release a block, and what that does to (2) and (3).

No one algorithm is ever going to be optimal in all cases.  The
stack-based alloca is nearly ideal for allocating large numbers of
small, easily disposed objects.  Malloc and friends are better for
smaller numbers of probably larger objects.  Sbrk with +/- increments
is more suited to getting and releasing small numbers of large objects.

There is no one correct answer.  "Incompetence" is an unfairly glib way
of saying that one perfectly valid implementation isn't the one that
would be best for a particular application.

-dB
"Ready when you are Raoul!"
{amdahl, cpsc6a, mtxinu, sun, hoptoad}!rtech!daveb daveb@rtech.com <- FINALLY!

henry@utzoo.uucp (Henry Spencer) (08/16/88)

In article <2375@rtech.rtech.com> daveb@llama.UUCP (Dave Brower) writes:
>Don't you just love the way electronic communication helps make
>generally reasonable people offensive?  It's more than a bit out of line
>to say the Sun programmer is incompetent...

Who, me, offensive?  Nah. :-)

On the contrary, it is quite in line to say that some Sun programmers are
incompetent; this is an established fact.  (Some others are highly competent,
but unfortunately they're not the whole story.)

>There is no one correct answer.  "Incompetence" is an unfairly glib way
>of saying that one perfectly valid implementation isn't the one that
>would be best for a particular application.

Well, no, actually it was a shorthand way of saying that while this *might*
be a case of implementation-application mismatch, it is much more likely
that it's a case of programmer incompetence, or at least deliberate disregard
of performance issues (which appears to be a widespread disease at Sun).
-- 
Intel CPUs are not defective,  |     Henry Spencer at U of Toronto Zoology
they just act that way.        | uunet!attcan!utzoo!henry henry@zoo.toronto.edu