vanaards%t7@uk.ac.man.cs (12/14/90)
What is the quickest way of determining the size of a file using
Standard ANSI C functions? Currently I'm using fseek to locate the end
of file, followed by ftell to interrogate the position index.

Also, I have to read strings from a file: full file path-names, which
in a hierarchically structured system vary in size according to the
location of the file. The format of the file is defined as:

    full-filename number full-filename number number \n

When reading strings it seems you have to allocate an array (buffer)
whose size will cater for the largest string possible - but in my case
that isn't exactly possible. Apart from reading chars using fgetc
either on a two pass basis - first pass to determine length, and second
to fill a dynamically allocated buffer, or reading into a linked list -
how would you recommend that I get the best performance?

Steven Van Aardt
JANET E-mail: vanaards@uk.ac.man.cs.p4

  Are you sure it isn't time for another colorful metaphor?
  -- Spock, "The Voyage Home," stardate 8390.
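For reference, here is a minimal sketch of the fseek/ftell approach the question describes, next to a plain character-counting loop for comparison. The names size_by_seek and size_by_count are made up for illustration. The sketch assumes a stream on which fseek to SEEK_END is supported and on which ftell returns a byte offset; the C standard guarantees neither, so the first function should be read as "works on many systems", not as a portable answer.

```c
#include <stdio.h>

/* Hypothetical helper: size as reported by fseek/ftell.
   Restores the original position; returns -1L on failure.
   Assumption: ftell yields a byte offset on this stream, which
   is typical for binary streams on Unix-like systems only. */
long size_by_seek(FILE *fp)
{
    long saved, end;

    saved = ftell(fp);
    if (saved == -1L)
        return -1L;
    if (fseek(fp, 0L, SEEK_END) != 0)
        return -1L;
    end = ftell(fp);
    if (fseek(fp, saved, SEEK_SET) != 0)
        return -1L;
    return end;
}

/* Hypothetical helper: size as the number of characters getc
   returns when reading from the start. Rewinds the stream and
   leaves it positioned at end of file; returns -1L on a read
   error. */
long size_by_count(FILE *fp)
{
    long n = 0;

    rewind(fp);
    while (getc(fp) != EOF)
        n++;
    return ferror(fp) ? -1L : n;
}
```

On a binary stream under Unix the two functions should agree. On a text stream under a system with a different line structure they may not, and only the counting loop reflects what a character-at-a-time read will deliver.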
wollman@emily.uvm.edu (Garrett Wollman) (12/18/90)
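Hmmm...

-----begin frgets.c----
/* WARNING: typed from memory, untested on this machine */
#include <stdio.h>
#include <stdlib.h>

/* Recurse once per character. The deepest call knows the line length
   and allocates the buffer; each caller then stores its own character
   on the way back out. Returns NULL if the allocation fails. */
static char *frgets_help(FILE *fp, int len)
{
    int c = fgetc(fp);
    char *res;

    if ((EOF == c) || ('\n' == c)) {
        res = malloc(len + 1);          /* room for the line and '\0' */
        if (res)
            res[len] = '\0';
        return res;
    }
    if ((res = frgets_help(fp, len + 1)) != NULL)
        res[len] = (char)c;
    return res;
}

/* Read one line of any length; the caller frees the result. */
char *frgets(FILE *fp)
{
    return frgets_help(fp, 0);
}
------end frgets.c-----

-GAWollman

Garrett A. Wollman - wollman@emily.uvm.edu
Disclaimer: I'm not even sure this represents *my* opinion, never mind
UVM's, EMBA's, EMBA-CF's, or indeed anyone else's.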
scs@adam.mit.edu (Steve Summit) (12/18/90)
In article <1990Dec14.133848.14932@cns.umist.ac.uk> vanaards%t7@uk.ac.man.cs writes:
> What is the quickest way of determining the size of a file using Standard
> ANSI C functions ?

If by "size of a file" you mean the number of characters that the C
function getc would return, the only precise, portable way is to read
the file, a character at a time, counting them.

> Currently I'm using fseek to locate the end of file,
> followed by ftell to interrogate the position index.

ftell is _not_ guaranteed to return a byte offset, and on many non-Unix
systems, it does not.

If you have a Unixy system, with a stat() call, you may be tempted to
use the st_size field. This will work on real Unix, but will give you
incorrect results on systems such as MS-DOS and VMS which have a
line/record structure in text files which is not implemented in terms
of single newline characters. (MS-DOS uses the two-character sequence
CRLF. VMS stores no explicit end-of-line character, but precedes each
record, or line, with a two-byte length. On both of these systems, the
operating system will report that the file size is bigger, by a number
equal to the number of lines, than it appears to be based on a C
character-at-a-time read. The fact that, for these systems, the format
of a text file does not correspond to a C byte stream also explains why
ftell may not return a byte offset.)

Assumptions about byte-offset seeks and/or exact byte-size st_size
fields can lead to real portability problems for programs originally
written on Unix systems. For instance, the tar program (which was
originally intended, based on its name, to be a tape archiver, but is
widely used for interchange purposes) is painful to implement on
non-Unix systems. The tar format consists of a file header, including
the size of the file in bytes, followed by the text of the file (padded
to a block boundary with 0's), followed by the next header, followed by
the next file, etc.
The file size in the header must be an exact byte count, both so that
the next header can be located, and so that the correct number of
trailing 0's can be stripped off of files being extracted. Tar
implementations on non-Unix systems typically have to read each file
being archived twice -- once to determine the exact size, and a second
time to copy the file to the output medium. This is a "specification
bug" -- there's nothing a non-Unix implementation can do about it. It
arose because the original tar implementation, on Unix, could make use
of the stat() call which, on Unix, does return an exact byte count.
(I use the pejorative term "bug" because the format also makes recovery
from tape errors difficult.)

The best thing to do is to avoid setting things up so you need to know
file sizes in advance.

> When reading strings it seems you have to allocate an array (buffer) whose
> size will cater for the largest string possible - but in my case that isn't
> exactly possible. Apart from reading chars using fgetc either on a two pass
> basis - first pass to determine length, and second to fill a dynamically
> allocated buffer,

Well, here you can get away without a two-pass read. I use routines
like the following to read lines of arbitrary length, returning
pointers to the malloc'ed memory. (The caller is responsible for
freeing it.)

    #include <stdio.h>
    #include <stdlib.h>     /* realloc */

    #define CHUNK 16        /* allocation grows by this many bytes */

    char *
    agetline(FILE *fd)
    {
        char *base = NULL;  /* start of the line buffer */
        char *p = base;     /* next free slot in the buffer */
        int len = 0;        /* bytes used so far, counting the '\0' */
        int allocated = 0;  /* bytes currently allocated */
        int c;

        while(1) {
            c = getc(fd);

            if(c == EOF && len == 0)
                return NULL;            /* nothing left to read */

            len++;

            if(len > allocated) {
                allocated += CHUNK;
                if((base = realloc(base, allocated)) == NULL) {
                    /* error */
                    return NULL;
                }
                p = &base[len - 1];     /* realloc may move the block */
            }

            if(c == '\n' || c == EOF) {
                *p++ = '\0';
                break;
            }

            *p++ = c;
        }

        return realloc(base, len);      /* trim to the exact size */
    }

(This routine could be improved in several ways; see the postscript.)

The heart of the routine is the realloc function; this is a textbook
usage.
(It also illustrates the utility of the ANSI guarantee that
realloc(NULL, size) behaves like malloc.)

                                            Steve Summit
                                            scs@adam.mit.edu

P.S. It is usually considered best to implement an incremental
allocator with a multiplicative rather than additive growth process,
that is, the allocation would be increased using something like

    allocated *= 2;

I generally use a factor of 1.5, if anything, but in either case the
multiplicative approach is a bit trickier, because it isn't
self-starting without an extra wrinkle (zero times any factor is still
zero), and because it is more vulnerable to running out of (contiguous)
memory as the space needed becomes large.

The error checking could be improved: for the simple routine
illustrated, an out-of-memory error is indistinguishable from EOF. It
might also be appropriate to free the so-far-accumulated partial hunk
when realloc fails.

Under a pre-ANSI library, it may be necessary to call malloc rather
than realloc to allocate the first chunk.

If the calling program only needs access to one line at a time, the
base and allocated variables can be made static, and the final realloc
deleted, so that the routine returns the canonical "pointer to static
memory which is overwritten with each call." The difference is that the
"static memory" is always big enough. (In this case, of course, the
caller must _not_ pass the returned pointer to free().)
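
The first two suggestions in the postscript can be sketched roughly as follows. This is one possible reading, not the author's code: the name agetline2, the status codes, and the AGL_FIRST starting size are all invented for illustration. It grows the buffer by a factor of about 1.5, uses a fixed first allocation as the "extra wrinkle" that makes the growth self-starting, frees the partial buffer when realloc fails, and reports out-of-memory separately from end of file.

```c
#include <stdio.h>
#include <stdlib.h>

#define AGL_FIRST 16    /* hypothetical size of the first allocation */

/* Hypothetical status codes; not part of the original routine. */
enum agl_status { AGL_OK, AGL_EOF, AGL_NOMEM };

/* Read one line of any length. Returns malloc'ed memory that the
   caller frees, or NULL with *status saying why. */
char *agetline2(FILE *fp, enum agl_status *status)
{
    char *base = NULL, *tmp;
    size_t len = 0, allocated = 0;
    int c;

    for (;;) {
        c = getc(fp);
        if (c == EOF && len == 0) {
            *status = AGL_EOF;          /* nothing left to read */
            return NULL;
        }
        if (len + 1 > allocated) {
            /* Grow by about 1.5x; start from a fixed size,
               because 0 * 1.5 would stay 0. */
            size_t want = allocated ? allocated + allocated / 2
                                    : AGL_FIRST;
            tmp = realloc(base, want);
            if (tmp == NULL) {
                free(base);             /* drop the partial line */
                *status = AGL_NOMEM;
                return NULL;
            }
            base = tmp;
            allocated = want;
        }
        if (c == '\n' || c == EOF) {
            base[len++] = '\0';
            break;
        }
        base[len++] = (char)c;
    }

    /* Trim to the exact size; if shrinking fails the original
       block is still valid, so return that instead. */
    tmp = realloc(base, len);
    *status = AGL_OK;
    return tmp ? tmp : base;
}
```

A caller would loop until the function returns NULL and then check the status to tell a normal end of file from an allocation failure. A read error is still reported as end of file here; checking ferror would be a further refinement.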