scs@hstbme.mit.edu (Steve Summit) (09/05/89)
[Crossposted to comp.lang.c because it's now a programming topic.]

In article <282@6sigma.UUCP> blm@6sigma.UUCP (Brian Matthews) writes:
>...I agree
>with your comment that they shouldn't have hard-coded limits on anything.
>...you may be able to get your vendor to
>recompile with larger limits, but you probably won't be able to get
>them to do the job right and replace the arrays with mallocs.

Brian is probably right (many people think that dynamic
allocation is hard), but I'd like to show how easy it can be.
(It's beyond me, though, why a production compiler would be using
arrays for symbol tables in the first place, rather than trees or
hash tables.)

Suppose you have a fixed-size array

	#define NELEM 100
	char *array[NELEM];
	int nents = 0;

into which elements are inserted as follows:

	extern char *strdup();

	addentry(newent)
	char *newent;
	{
		if(nents >= NELEM)
			fatalerror("table overflow");
		array[nents++] = strdup(newent);
	}

This is simple, straightforward, and correct; and it drives
people nuts, because no matter how big you #define NELEM, one day
somebody tries to exceed it.  Look how easy it is to grow the
array dynamically:

	int nalloc = 0;
	char **array = NULL;	/* simulates char *array[nalloc] */
	int nents = 0;

	extern char *realloc();			/* note 3 */

	addentry(newent)
	char *newent;
	{
		if(nents >= nalloc) {
			nalloc += 10;				/* note 1 */
			array = (char **)realloc((char *)array,	/* note 2 */
					nalloc * sizeof(char *));
			if(array == NULL)
				fatalerror("out of memory");
		}
		array[nents++] = strdup(newent);
	}

Also, in any files where the array is referenced, you have to change

	extern char *array[];

to

	extern char **array;

and any other references to NELEM (there shouldn't be any) must
be changed to nalloc.  The nice thing is that no actual
references to the array need to be changed, because the reference
array[i] acts the same [4] whether array is a char *[] or a
char **.

Changing fixed-size arrays to dynamic ones can therefore be quite
simple.  Merely remove the last level of brackets in the array
declaration, add a leading asterisk, and initialize the resulting
pointer to NULL.  Add another variable which keeps track of how
many entries have been allocated (as opposed to how many are
being used).  When the number used would exceed the number
allocated, grow the array using realloc.  (If code and data space
are more important than speed, the auxiliary variable and the
chunking behavior can be omitted and the array always kept at the
exact size required.)

						Steve Summit

Footnotes:

1.  Many people feel that the array's allocation should be grown
    by multiplicative factors, not additive ones:

	nalloc *= 2;

    The drawback here is that you may allocate much more than you
    need, and an intermediate allocation may fail unnecessarily,
    so the code should probably be changed so that it "backs off"
    and tries again.  The additive approach, although it could be
    O(n**2) in data movement for some realloc implementations,
    has proved efficient enough for my uses.  (The additive
    approach is not expensive in data movement if the underlying
    malloc implementation is already doing power-of-2 chunking,
    as the 4bsd malloc does.)

2.  For pre-ANSI realloc implementations, the code should be
    changed to

	if(array == NULL)
		array = (char **)malloc(nalloc * sizeof(char *));
	else	array = (char **)realloc((char *)array,
				nalloc * sizeof(char *));

    since realloc was not formerly guaranteed to handle NULL
    pointers.  (This example illustrates perfectly why the new
    behavior is useful.)

3.  Under ANSI C, realloc is more correctly declared

	extern void *realloc();

4.  The statement that "the reference array[i] acts the same
    whether array is a char *[] or a char **" should NOT be
    construed as meaning that pointers are the same as arrays.
    They are not, and the compiler will generate different code
    for array[i] depending on how array is declared; but the
    effect, from the point of view of the code's author, is the
    same.

5.  The thrust of the examples above may be slightly obscured by
    the fact that the array being simulated is already an array
    of pointers.  Some may be unnecessarily bewildered by the
    double indirection in

	char **array;

    If each occurrence of the sequence "char *" is replaced by
    "int ", "double ", or "struct something " (with the exception
    of the declaration of realloc() and the cast of its first
    argument), and the new array element is filled in with
    something other than the return from strdup, the intent may
    be clearer.

6.  The above code is for illustrative purposes and may contain
    typos.  (It also doesn't use function prototypes.)  Code just
    like it is in practically every program I write, and it works
    fine.
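[For illustration: a minimal ANSI C sketch of the same growable-array
idiom, with prototypes and the multiplicative "back off and try again"
strategy from footnote 1.  The helper names and chunk sizes are made
up for this sketch, and strdup is replaced by malloc/strcpy since
strdup is not in ANSI C.]

	/* A minimal ANSI C sketch of the growable-array idiom.
	 * Doubles the allocation when full; on failure, backs off
	 * and retries with the minimal enlargement.
	 */
	#include <stdio.h>
	#include <stdlib.h>
	#include <string.h>

	static char **array = NULL;	/* simulates char *array[nalloc] */
	static size_t nalloc = 0;	/* entries allocated */
	static size_t nents = 0;	/* entries in use */

	static void fatalerror(const char *msg)
	{
		fprintf(stderr, "%s\n", msg);
		exit(EXIT_FAILURE);
	}

	void addentry(const char *newent)
	{
		if (nents >= nalloc) {
			size_t newsize = nalloc ? nalloc * 2 : 10;
			char **newarray = realloc(array,
					newsize * sizeof *newarray);

			if (newarray == NULL) {
				/* back off: try the minimal growth */
				newsize = nalloc + 1;
				newarray = realloc(array,
						newsize * sizeof *newarray);
			}
			if (newarray == NULL)
				fatalerror("out of memory");
			array = newarray;	/* assign only on success,
						 * so nothing leaks */
			nalloc = newsize;
		}
		/* strdup by hand, since strdup isn't ANSI */
		array[nents] = malloc(strlen(newent) + 1);
		if (array[nents] == NULL)
			fatalerror("out of memory");
		strcpy(array[nents], newent);
		nents++;
	}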
chris@mimsy.UUCP (Chris Torek) (09/05/89)
In article <14064@bloom-beacon.MIT.EDU> scs@hstbme.mit.edu
(Steve Summit) writes:
>(It's beyond me, though, why a production compiler would be using
>arrays for symbol tables in the first place, rather than trees or
>hash tables.)

PCC does better: it uses arrays that are organised into trees and
hash tables. :-)  There are (sometimes) good reasons.  But anyway:

>	#define NELEM 100
>	char *array[NELEM];

becomes

>	int nalloc = 0;
>	char **array = NULL;	/* simulates char *array[nalloc] */
	...
>	if(nents >= nalloc) {
>		nalloc += 10;				/* note 1 */
>		array = (char **)realloc((char *)array,	/* note 2 */
>				nalloc * sizeof(char *));

There is (at least) one case in which this will cause subtle
breakage.

>The nice thing is that no actual references to the array need to
>be changed, because the reference array[i] acts the same [4]
>whether array is a char *[] or a char **.

[and I have to include footnote 4 here]

>4.  The statement that "the reference array[i] acts the same
>    whether array is a char *[] or a char **" should NOT be
>    construed as meaning that pointers are the same as arrays.
>    They are not, and the compiler will generate different code
>    for array[i] depending on how array is declared; but the
>    effect, from the point of view of the code's author, is the
>    same.

There is at least one situation in which the dynamic allocation
will cause code to break.  Consider this, for instance:

	int tab[TSIZE];
	...
	int *p = &tab[k];
	...
	insert_into_tab();
	...
	... use *p ...

	insert_into_tab()
	{
		if (ntab >= TSIZE)
			error("table full");
		tab[ntab++] = next();
	}

Now, when we change `tab[TSIZE]' to a dynamically grown array as
recommended, the use of `*p' suddenly becomes incorrect: if the
insertion grew the table, realloc may have moved it, and p still
points into the old storage.  This is not a particularly unusual
situation (and indeed, PCC is full of such references, which is
why I never made its tables dynamic myself).  This problem can be
fixed, but it does take more work.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris
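[One conventional fix, sketched below: keep indices into the table
rather than raw pointers, so references survive a realloc that moves
the storage.  The names tab, ntab, nalloc, and insert_into_tab echo
Chris's example; nothing here is quoted from PCC.]

	/* Sketch: references survive growth because they are
	 * indices, not pointers into possibly moved storage.
	 */
	#include <stdlib.h>

	static int *tab = NULL;
	static int ntab = 0;	/* entries in use */
	static int nalloc = 0;	/* entries allocated */

	extern int next(void);
	extern void error(const char *);

	void insert_into_tab(void)
	{
		if (ntab >= nalloc) {
			int *newtab = realloc(tab,
					(nalloc + 10) * sizeof *newtab);
			if (newtab == NULL)
				error("out of memory");
			tab = newtab;
			nalloc += 10;
		}
		tab[ntab++] = next();
	}

	int example(int k)
	{
		int pi = k;	/* an index, where the old code
				 * kept int *p = &tab[k] */

		insert_into_tab();	/* may realloc and move tab;
					 * pi remains valid */
		return tab[pi];		/* the old `use *p' becomes
					 * tab[pi] */
	}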
blm@6sigma.UUCP (Brian Matthews) (09/06/89)
In article <14064@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu
(Steve Summit) writes:
|[Crossposted to comp.lang.c because it's now a programming topic.]
|In article <282@6sigma.UUCP> blm@6sigma.UUCP (Brian Matthews) writes:
|>...I agree
|>with your comment that they shouldn't have hard-coded limits on anything.
|>...you may be able to get your vendor to
|>recompile with larger limits, but you probably won't be able to get
|>them to do the job right and replace the arrays with mallocs.
|Brian is probably right (many people think that dynamic
|allocation is hard), but I'd like to show how easy it can be.
|(It's beyond me, though, why a production compiler would be using
|arrays for symbol tables in the first place, rather than trees or
|hash tables.)

Actually, I was a little misleading.  The symbol table does use a
hash table, but the structures for the individual symbols are
allocated out of a big array, with code basically like the first
chunk in Steve's message.  So the compiler isn't doing anything
horrible like a linear search, but it is working with a pool of
structures with a fixed upper limit on the number that can be
allocated, which is the cause of the "too many names" message
from lint that started this discussion.

|[code to allocate out of a fixed array omitted]
|This is simple, straightforward, and correct; and it drives
|people nuts, because no matter how big you #define NELEM,
|one day somebody tries to exceed it.

Yep.  If there's one "universal computing message", it's that no
matter how big you make a limit, it isn't big enough.  They're
arguing 16 bit vs. 32 bit integers in comp.sys.mac.programmer,
and there has been a lot of discussion in the past about 16 bit
inode numbers and 32 bit file and file system sizes.

|[code to dynamically grow an array omitted]
|The nice thing is that no actual references to the array need to
|be changed, because the reference array[i] acts the same [4]
|whether array is a char *[] or a char **.
|Changing fixed-size arrays to dynamic ones can therefore be
|quite simple.

Yep again.  It's also not much harder to convert to using malloc
for each structure, or to put the malloced arrays into a linked
list instead of copying a bunch of bytes when you grow the array
(of course, then you have to touch all of the places that use the
array, which might be prohibitive in something like a compiler).
The lesson, though, is the same: doing dynamic allocation is only
marginally more difficult than using a fixed-size array.  (One
could argue that the fixed-size array is more difficult, because
you'll have to do the fixed-size implementation, and then, when
your customers complain, you'll have to go back and do the
dynamic implementation. :-)
-- 
Brian L. Matthews	blm@6sigma.UUCP
Six Sigma CASE, Inc.	+1 206 854 6578
PO Box 40316, Bellevue, WA 98004
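[A rough sketch of the linked-list-of-blocks pool Brian mentions,
under assumed names: symbols are handed out a block at a time and the
blocks are chained, so existing structures never move when the pool
grows and pointers into the table stay valid (which also sidesteps
Chris's objection).  struct symbol, CHUNK, and newsymbol are
illustrative; no real compiler's code is being quoted.]

	/* Sketch: a growable pool that never moves what it has
	 * already handed out.
	 */
	#include <stdlib.h>

	#define CHUNK 64		/* symbols per block */

	struct symbol {
		char *name;
		/* ... other per-symbol fields ... */
	};

	struct symblock {
		struct symblock *next;	/* chain of blocks */
		int used;		/* entries used in this block */
		struct symbol syms[CHUNK];
	};

	static struct symblock *blocks = NULL;	/* head of the chain */

	extern void fatalerror(const char *);

	struct symbol *newsymbol(void)
	{
		if (blocks == NULL || blocks->used >= CHUNK) {
			struct symblock *b = malloc(sizeof *b);
			if (b == NULL)
				fatalerror("out of memory");
			b->next = blocks;  /* push new block on chain */
			b->used = 0;
			blocks = b;
		}
		return &blocks->syms[blocks->used++];
	}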
exspes@gdr.bath.ac.uk (P E Smee) (09/06/89)
In article <14064@bloom-beacon.MIT.EDU> scs@adam.pika.mit.edu
(Steve Summit) writes:
>1.  Many people feel that the array's allocation should be grown
>    by multiplicative factors, not additive ones:
>
>	nalloc *= 2;
>
>    The drawback here is that you may allocate much more than you
>    need, and an intermediate allocation may fail unnecessarily,
>    so the code should probably be changed so that it "backs off"
>    and tries again.  The additive approach, although it could be
>    O(n**2) in data movement for some realloc implementations,
>    has proved efficient enough for my uses.

It was suggested some years ago (about 15-20, I think) that
growing things like this based on the Fibonacci series provided,
on average, a better balance between the two problems of
allocating too much and so wasting space, and of allocating too
little and so having to do lots of reallocs.  I don't know if it
has been seriously looked at since.

(The Fibonacci series, if you've never heard of it, is the one
where each successive number is the sum of the two preceding
numbers, with the 0th and 1st members defined as being 1.  So:
1, 1, 2, 3, 5, 8, 13, 21, ...)

The idea is that if your first block was 1X units long and you
needed more, you'd realloc to 2X units, then to 3X units, then to
5X units, and so on.  (Or, looked at the other way, at the first
growth you'd ADD 1X more units, at the second growth ADD 1X more
units, then ADD 2X more units, etc.)
-- 
Paul Smee		| JANET: Smee@uk.ac.bristol
Computer Centre		| BITNET: Smee%uk.ac.bristol@ukacrl.bitnet
University of Bristol	| Internet: Smee%uk.ac.bristol@nsfnet-relay.ac.uk
(Phone: +44 272 303132)	| UUCP: ...!mcvax!ukc!gdr.bath.ac.uk!exspes
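[A small sketch of the Fibonacci growth Paul describes, reusing the
array/nalloc/nents variables from Steve's example plus an assumed
prevalloc and base size X; each new allocation is the sum of the
previous two, giving sizes 1X, 2X, 3X, 5X, 8X, ...  The names are
illustrative.]

	/* Sketch: grow the allocation along Fibonacci multiples
	 * of a base block size X.
	 */
	#include <stdlib.h>

	#define X 10			/* base block size, entries */

	static char **array = NULL;
	static size_t nalloc = 0;	/* current allocation */
	static size_t prevalloc = 0;	/* previous allocation */
	static size_t nents = 0;	/* entries in use */

	extern void fatalerror(const char *);

	static void grow_if_needed(void)
	{
		if (nents >= nalloc) {
			/* next size is the sum of the last two:
			 * 1X, 2X, 3X, 5X, 8X, ... */
			size_t newsize = nalloc +
					(prevalloc ? prevalloc : X);
			char **newarray = realloc(array,
					newsize * sizeof *newarray);

			if (newarray == NULL)
				fatalerror("out of memory");
			prevalloc = nalloc;
			array = newarray;
			nalloc = newsize;
		}
	}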