[comp.lang.c] dynamic arrays

scs@hstbme.mit.edu (Steve Summit) (09/05/89)

[Crossposted to comp.lang.c because it's now a programming topic.]

In article <282@6sigma.UUCP> blm@6sigma.UUCP (Brian Matthews) writes:
>...I agree
>with your comment that they shouldn't have hard-coded limits on anything.
>...you may be able to get your vendor to
>recompile with larger limits, but you probably won't be able to get
>them to do the job right and replace the arrays with mallocs.

Brian is probably right (many people think that dynamic
allocation is hard), but I'd like to show how easy it can be.
(It's beyond me, though, why a production compiler would be using
arrays for symbol tables in the first place, rather than trees or
hash tables.)

Suppose you have a fixed-size array

	#define NELEM 100
	char *array[NELEM];
	int nents = 0;

into which elements are inserted as follows:

	extern char *strdup();

	addentry(newent)
	char *newent;
	{
	if(nents >= NELEM)
		fatalerror("table overflow");

	array[nents++] = strdup(newent);
	}

This is simple, straightforward, and correct; and drives
people nuts because no matter how big you #define NELEM,
one day somebody tries to exceed it.

Look how easy it is to grow the array dynamically:

	int nalloc = 0;
	char **array = NULL;	/* simulates char *array[nalloc] */
	int nents = 0;

	extern char *realloc();					/* note 3 */

	addentry(newent)
	char *newent;
	{
	if(nents >= nalloc) {
		nalloc += 10;					/* note 1 */
		array = (char **)realloc((char *)array,		/* note 2 */
						nalloc * sizeof(char *));
		if(array == NULL)
			fatalerror("out of memory");
		}
	array[nents++] = strdup(newent);
	}

Also, in any files where the array is referenced, you have to
change

	extern char *array[];
to
	extern char **array;

and any other references to NELEM (there shouldn't be any)
must be changed to nalloc.

The nice thing is that no actual references to the array need to
be changed, because the reference array[i] acts the same [4]
whether array is a char *[] or a char **.

Changing fixed-sized arrays to dynamic can therefore actually be
quite simple.  Merely remove the last level of brackets in the
array declaration, add a leading asterisk, and initialize the
resulting pointer to NULL.  Add another variable which keeps
track of how many entries have been allocated (as opposed to how
many are being used).  When the number used would exceed the
number allocated, grow the array using realloc.  (If code and
data space are more important than speed, the auxiliary variable
and the chunking behavior can be omitted and the array always
kept at the exact size required.)

                                                Steve Summit

Footnotes:

1. Many people feel that the allocation of the array should be
   grown by multiplicative factors, not additive:

	nalloc *= 2;

   The drawback here is that you may allocate much more than you
   need, and an intermediate allocation may fail unnecessarily,
   so the code should probably be changed so it "backs off" and
   tries again.  The additive approach, although it could be
   O(n**2) in data movement for some realloc implementations, has
   proved to be efficient enough for my uses.  (The additive
   approach is not expensive in data movement if the underlying
   malloc implementation is already doing power-of-2 chunking, as
   the 4bsd malloc does.)
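
   For instance, a doubling version with back-off might look
   something like this (a sketch only: it assumes the ANSI realloc
   of note 2, and uses a temporary so the old array survives a
   failed attempt):

	if(nents >= nalloc) {
		int newalloc = (nalloc == 0) ? 10 : nalloc * 2;
		char **newarray;

		while((newarray = (char **)realloc((char *)array,
				newalloc * sizeof(char *))) == NULL) {
			newalloc = nents + (newalloc - nents) / 2;
			if(newalloc <= nents)	/* cannot back off further */
				fatalerror("out of memory");
			}
		array = newarray;
		nalloc = newalloc;
		}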

2. For pre-ANSI realloc implementations, the code should be
   changed to

	if(array == NULL)
		array = (char **)malloc(nalloc * sizeof(char *));
	else	array = (char **)realloc((char *)array,
						nalloc * sizeof(char *));

   since realloc was not formerly guaranteed to handle NULL
   pointers.  (This example illustrates perfectly why the new
   behavior is useful.)

3. Under ANSI C, realloc is more correctly declared

	extern void *realloc();

4. The statement that "the reference array[i] acts the same
   whether array is a char *[] or a char **" should NOT be
   construed as meaning that pointers are the same as arrays.
   They are not, and the compiler will generate different code
   for array[i] depending on how array is declared; but the
   effect from the user's (the code author's) point of view is
   the same.
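
   To see why the extern declarations above must be changed, consider
   this hypothetical two-file sketch of the mismatch (deliberately
   wrong code):

	/* def.c */
	char **array;		/* definition: array is a pointer object */

	/* use.c */
	extern char *array[];	/* WRONG: claims the elements live at
				   array itself; array[i] here will
				   misinterpret the pointer's own bytes */

   Given the pointer definition, array[i] must fetch the pointer's
   value and then index through it; given the array declaration, it
   indexes the storage at array directly.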

5. The thrust of the examples above may be slightly obscured by
   the fact that the array being simulated is already an array of
   pointers.  Some may be unnecessarily bewildered by the double
   indirection in

	char **array;

   If each occurrence of the sequence "char *" is replaced by
   "int ", "double ", or "struct something " (with the exception
   of the declaration of realloc() and the cast of its first
   argument), and the new array element filled in with something
   other than the return from strdup, the intent may be clearer.
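
   For instance, with a hypothetical struct symbol (realloc declared
   as in note 3, fatalerror as before), the same pattern reads:

	struct symbol { char *name; int value; };	/* hypothetical */

	int nalloc = 0;
	struct symbol *array = NULL;	/* simulates struct symbol array[nalloc] */
	int nents = 0;

	addentry(newent)
	struct symbol *newent;
	{
	if(nents >= nalloc) {
		nalloc += 10;
		array = (struct symbol *)realloc((char *)array,
					nalloc * sizeof(struct symbol));
		if(array == NULL)
			fatalerror("out of memory");
		}
	array[nents++] = *newent;	/* structure assignment */
	}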

6. The above code is for illustrative purposes and may contain
   typos.  (It also doesn't use function prototypes.)  Code just
   like it is in practically every program I write and works fine.

chris@mimsy.UUCP (Chris Torek) (09/05/89)

In article <14064@bloom-beacon.MIT.EDU> scs@hstbme.mit.edu (Steve Summit)
writes:
>(It's beyond me, though, why a production compiler would be using
>arrays for symbol tables in the first place, rather than trees or
>hash tables.)

PCC does better: it uses arrays that are organised into trees and
hash tables. :-)

There are (sometimes) good reasons.  But anyway:

>	#define NELEM 100
>	char *array[NELEM];

becomes

>	int nalloc = 0;
>	char **array = NULL;	/* simulates char *array[nalloc] */
	...
>	if(nents >= nalloc) {
>		nalloc += 10;					/* note 1 */
>		array = (char **)realloc((char *)array,		/* note 2 */
>						nalloc * sizeof(char *));

There is (at least) one case in which this will cause subtle breakage.

>The nice thing is that no actual references to the array need to
>be changed, because the reference array[i] acts the same [4]
>whether array is a char *[] or a char **.

[and I have to include footnote 4 here]

>4. The statement that "the reference array[i] acts the same
>   whether array is a char *[] or a char **" should NOT be
>   construed as meaning that pointers are the same as arrays.
>   They are not, and the compiler will generate different code
>   for array[i] depending on how array is declared; but the
>   effect from the user's (the code author's) point of view is
>   the same.

There is at least one time that the dynamic allocation will cause
code to break.  Consider this, for instance:

	int tab[TSIZE];
	int ntab = 0;
	...
		int *p = &tab[k];
		... insert_into_tab(); ...
		... use *p ...

	insert_into_tab() {
		if (ntab >= TSIZE) error("table full");
		tab[ntab++] = next();
	}

Now, when we change `tab[TSIZE]' as recommended, all of a sudden the
use of `*p' becomes incorrect.  This is not a particularly unusual
situation (and indeed, PCC is full of such references, which is why I
never made its tables dynamic myself).  This problem can be fixed,
but it does take more work.
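
One common fix is to save an index rather than a pointer, since an
index stays valid when realloc moves the table.  A minimal sketch
against the tab/ntab example above, with tab made dynamic:

	int *tab = NULL;
	int ntab = 0, tsize = 0;	/* tsize grown as in Steve's code */
	...
		int pk = k;			/* save the index, not &tab[k] */
		... insert_into_tab(); ...	/* may realloc and move tab */
		... use tab[pk] ...		/* still valid afterward */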
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

blm@6sigma.UUCP (Brian Matthews) (09/06/89)

In article <14064@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes:
|[Crossposted to comp.lang.c because it's now a programming topic.]
|In article <282@6sigma.UUCP> blm@6sigma.UUCP (Brian Matthews) writes:
|>...I agree
|>with your comment that they shouldn't have hard-coded limits on anything.
|>...you may be able to get your vendor to
|>recompile with larger limits, but you probably won't be able to get
|>them to do the job right and replace the arrays with mallocs.
|Brian is probably right (many people think that dynamic
|allocation is hard), but I'd like to show how easy it can be.
|(It's beyond me, though, why a production compiler would be using
|arrays for symbol tables in the first place, rather than trees or
|hash tables.)

Actually I was a little misleading.  The symbol table does use a hash
table, but the structures for the individual symbols are allocated out
of a big array with code basically like the first chunk in Steve's message.
So the compiler isn't doing anything horrible like a linear search, but
it is working with a pool of structures with a fixed upper limit on the
number that can be allocated, which is the cause of the "too many names"
message from lint that started this discussion.

|[code to allocate out of a fixed array omitted]
|This is simple, straightforward, and correct; and drives
|people nuts because no matter how big you #define NELEM,
|one day somebody tries to exceed it.

Yep.  If there's one "universal computing message", it's that no matter
how big you make a limit, it isn't big enough.  They're arguing 16 bit
vs. 32 bit integers in comp.sys.mac.programmer, and there has been a lot
of discussion in the past about 16 bit inode numbers and 32 bit file and
file system sizes.

|[code to dynamically grow an array omitted]
|The nice thing is that no actual references to the array need to
|be changed, because the reference array[i] acts the same [4]
|whether array is a char *[] or a char **.
|Changing fixed-sized arrays to dynamic can therefore actually be
|quite simple.

Yep again.  It's also not much harder to convert to using malloc
for each structure, or to put the malloced arrays into a linked list
instead of copying a bunch of bytes when you grow the array (of course
then you have to touch all of the places that use the array, which
might be prohibitive in something like a compiler.)
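
For concreteness, a minimal sketch of that chunked approach (all the
names, including the struct symbol entry type, are hypothetical;
fatalerror as in Steve's code): each chunk holds a fixed number of
entries, so existing entries never move and pointers to them stay
valid, at the cost of giving up whole-table array[i] indexing:

	#define CHUNKSIZE 64

	struct symbol { char *name; int value; };	/* hypothetical */

	struct chunk {
		struct symbol syms[CHUNKSIZE];
		int used;
		struct chunk *next;
	};
	struct chunk *chunks = NULL;

	extern char *malloc();

	struct symbol *
	newsym()
	{
		struct chunk *c = chunks;
		if (c == NULL || c->used >= CHUNKSIZE) {
			c = (struct chunk *)malloc(sizeof(struct chunk));
			if (c == NULL)
				fatalerror("out of memory");
			c->used = 0;
			c->next = chunks;
			chunks = c;
		}
		return &c->syms[c->used++];
	}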

The lesson, though, is the same.  Doing dynamic allocation is only
marginally more difficult than having a fixed size array (one could
argue that using a fixed size array is more difficult because you'll
have to do the fixed size array implementation, and then when your
customers complain, you'll have to go back and do the dynamic
implementation :-)
-- 
Brian L. Matthews			blm@6sigma.UUCP
Six Sigma CASE, Inc.			+1 206 854 6578
PO Box 40316, Bellevue, WA  98004

exspes@gdr.bath.ac.uk (P E Smee) (09/06/89)

In article <14064@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu (Steve Summit) writes:
>1. Many people feel that the allocation of the array should be
>   grown by multiplicative factors, not additive:
>
>	nalloc *= 2;
>
>   The drawback here is that you may allocate much more than you
>   need, and an intermediate allocation may fail unnecessarily,
>   so the code should probably be changed so it "backs off" and
>   tries again.  The additive approach, although it could be
>   O(n**2) in data movement for some realloc implementations, has
>   proved to be efficient enough for my uses.

It was suggested some years ago (about 15-20, I think) that growing
things like this based on the Fibonacci series provided, on average, a
better balance between the two problems of allocating too much and so
wasting space, and on the other hand allocating too little and so having
to do lots of reallocs.  Don't know if it has been seriously looked at
since.  (The Fibonacci series, if you've never heard of it, is the one
where each successive number is the sum of the two preceding numbers,
with the 0th and 1st members defined as being 1.  So, 1, 1, 2, 3, 5, 8,
13, 21, ...)

The idea being that if your first block was 1X units long, and you needed
more, you'd realloc to 2X units, then to 3X units, then to 5X units, and
so on.  (Or looked at the other way, at the first growth you'd ADD 1X
more units, at the second growth ADD 1X more units, then ADD 2X more
units, etc.)
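
A minimal sketch of the growth rule grafted onto Steve's addentry
(X, the seeding, and the variable names are my assumptions; it also
assumes the ANSI realloc of his note 2):

	#define X 10	/* base block size, in elements */
	int prev = 0;	/* previous Fibonacci term */
	int fib = 0;	/* current term; nalloc == fib * X */

	...
	if(nents >= nalloc) {
		int next = (fib == 0) ? 1 : prev + fib;	/* 1, 2, 3, 5, 8, ... */
		prev = (fib == 0) ? 1 : fib;
		fib = next;
		nalloc = fib * X;
		array = (char **)realloc((char *)array,
					nalloc * sizeof(char *));
		if(array == NULL)
			fatalerror("out of memory");
		}

Since the ratio of successive Fibonacci terms approaches about 1.618,
this still grows geometrically (so the number of reallocs stays
logarithmic in the table size), just more gently than doubling.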

-- 
 Paul Smee               |    JANET: Smee@uk.ac.bristol
 Computer Centre         |   BITNET: Smee%uk.ac.bristol@ukacrl.bitnet
 University of Bristol   | Internet: Smee%uk.ac.bristol@nsfnet-relay.ac.uk
 (Phone: +44 272 303132) |     UUCP: ...!mcvax!ukc!gdr.bath.ac.uk!exspes