[comp.lang.c] ftell fseek II

vanaards%t7@uk.ac.man.cs (12/14/90)

  What is the quickest way of determining the size of a file using Standard
ANSI C functions? Currently I'm using fseek to locate the end of the file,
followed by ftell to interrogate the position indicator.

  Also, I have to read strings from a file - full file path-names, in fact.
In a hierarchically structured system these vary in size according to the
location of the file. The format of each line is defined as:

  full-filename   number  full-filename number number \n 

  When reading strings it seems you have to allocate an array (buffer) whose
size caters for the largest possible string - but in my case that isn't
really possible. Apart from reading chars with fgetc - either on a two-pass
basis, with a first pass to determine the length and a second to fill a
dynamically allocated buffer, or reading into a linked list - how would you
recommend I get the best performance?
 

+--------------------------------+-----------------------------------------+
|   ()()TEVEN         ()         |  Are you sure it isn't time for another |
|  ()                ()()        | colorful metaphor?   -- Spock,          |
|   ()()   ()  ()AN ()  ()       | "The Voyage Home,"  stardate 8390.      |
|      ()   ()()   ()()()()      +-----------------------------------------+
|   ()()     ()   ()      ()ARDT |JANET E-mail : vanaards@uk.ac.man.cs.p4  |
+--------------------------------+-----------------------------------------+

wollman@emily.uvm.edu (Garrett Wollman) (12/18/90)

Hmmm...

-----begin frgets.c----
/* WARNING: typed from memory, untested on this machine */

#include <stdio.h>
#include <stdlib.h>	/* for malloc() */

/* Read the rest of the current line recursively: recurse to the end
   of the line to learn its length, allocate once, then fill the
   buffer in on the way back up.  Returns NULL if malloc fails; at
   end-of-file it returns an empty string rather than NULL. */
static char *frgets_help(FILE *fp, int len) {
	int c = fgetc(fp);
	char *res;

	if((EOF == c) || ('\n' == c)) {
		res = malloc(len + 1);
		if(res)
			res[len] = '\0';
		return(res);
	}

	if(res = frgets_help(fp, len + 1))
		res[len] = (char)c;

	return(res);
}

char *frgets(FILE *fp) {
	return(frgets_help(fp, 0));
}

------end frgets.c-----

-GAWollman
	
Garrett A. Wollman - wollman@emily.uvm.edu

Disclaimer:  I'm not even sure this represents *my* opinion, never
mind UVM's, EMBA's, EMBA-CF's, or indeed anyone else's.

scs@adam.mit.edu (Steve Summit) (12/18/90)

In article <1990Dec14.133848.14932@cns.umist.ac.uk> vanaards%t7@uk.ac.man.cs writes:
> What is the quickest way of determining the size of a file using Standard
> ANSI C functions?

If by "size of a file" you mean the number of characters that the
C function getc would return, the only precise, portable way is
to read the file, a character at a time, counting them.
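That character-at-a-time count can be packaged as a small function; here is a
sketch (the name fsize_portable is mine, not a standard function):

```c
#include <stdio.h>

/* Return the number of characters getc() would deliver from fp,
 * reading from the current position to end-of-file.  This is the
 * only fully portable notion of "text file size" in standard C.
 * (Illustrative sketch; fsize_portable is an invented name.)
 */
long fsize_portable(FILE *fp)
{
    long count = 0;

    while (getc(fp) != EOF)
        count++;
    return count;
}
```

On Unix this will agree with stat(); on MS-DOS or VMS it generally will not,
for the reasons given below.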

> Currently I'm using fseek to locate the end of file, 
> followed by ftell to interrogate the position index.

ftell is _not_ guaranteed to return a byte offset, and on many
non-Unix systems, it does not.

If you have a Unixy system, with a stat() call, you may be
tempted to use the st_size field.  This will work on real Unix,
but will give you incorrect results on systems such as MS-DOS and
VMS which have a line/record structure in text files which is not
implemented in terms of single newline characters.  (MS-DOS uses
the two-character sequence CRLF.  VMS stores no explicit end-of-
line character, but precedes each record, or line, with a two-
byte length.  On both of these systems, the operating system will
report that the file size is bigger, by a number equal to the
number of lines, than it appears to be based on a C character-at-
a-time read.  The fact that, for these systems, the format of a
text file does not correspond to a C byte stream also explains
why ftell may not return a byte offset.)

Assumptions about byte-offset seeks and/or exact byte-size
st_size fields can lead to real portability problems for programs
originally written on Unix systems.  For instance, the tar
program (which was originally intended, based on its name, to be
a tape archiver, but is widely used for interchange purposes) is
painful to implement on non-Unix systems.  The tar format
consists of a file header, including the size of the file in
bytes, followed by the text of the file (padded to a block
boundary with 0's), followed by the next header, followed by the
next file, etc.  The file size in the header must be an exact
byte count, both so that the next header can be located, and so
that the correct number of trailing 0's can be stripped off of
files being extracted.
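The layout of the original (Version 7) tar header, reconstructed here for
illustration (consult the tar documentation for the authoritative layout),
shows why: the size field is a fixed-width ASCII octal string, so the exact
byte count must be known before the header can be written.

```c
#include <stddef.h>

/* Sketch of the original (Version 7) tar header layout.  All fields
 * are ASCII; the size field is an octal string giving the exact file
 * size in bytes.  (Reconstructed for illustration -- not a drop-in
 * replacement for a real tar implementation's header definition.)
 */
struct tar_hdr {
	char name[100];
	char mode[8];
	char uid[8];
	char gid[8];
	char size[12];		/* exact file size, octal ASCII */
	char mtime[12];
	char chksum[8];
	char linkflag;
	char linkname[100];
	/* the remainder of the 512-byte block is padding */
};
```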

Tar implementations on non-Unix systems typically have to read
each file being archived twice -- once to determine the exact
size, and a second time to copy the file to the output medium.
This is a "specification bug" -- there's nothing a non-Unix
implementation can do about it.  It arose because the original
tar implementation, on Unix, could make use of the stat() call
which, on Unix, does return an exact byte count.  (I use the
pejorative term "bug" because the format also makes recovery
from tape errors difficult.)

The best thing to do is to avoid setting things up so you need to
know file sizes in advance.

> When reading strings it seems you have to allocate an array (buffer) whose
> size will cater for the largest string possible - but in my case that isn't
> exactly possible. Apart from reading chars using fgetc either on a two pass
> basis - first pass to determine length, and second to fill a dynamically
> allocated buffer,

Well, here you can get away without a two-pass read.  I use
routines like the following to read lines of arbitrary length,
returning pointers to the malloc'ed memory.  (The caller is
responsible for freeing it.)

	#include <stdio.h>
	#include <stdlib.h>	/* realloc(); free() for the caller */

	#define CHUNK 16

	char *
	agetline(fd)
	FILE *fd;
	{
	char *base = NULL;
	char *p = base;
	int len = 0;
	int allocated = 0;
	int c;

	while(1)
		{
		c = getc(fd);

		if(c == EOF && len == 0)
			return NULL;

		len++;

		if(len > allocated)
			{
			allocated += CHUNK;
			if((base = realloc(base, allocated)) == NULL)
				{
				/* error */
				return NULL;
				}

			p = &base[len - 1];
			}

		if(c == '\n' || c == EOF)
			{
			*p++ = '\0';
			break;
			}

		*p++ = c;
		}

	return realloc(base, len);
	}

(This routine could be improved in several ways; see the
postscript.)  The heart of the routine is the realloc function;
this is a textbook usage.  (It also illustrates the utility of
the ANSI guarantee that realloc(NULL, size) behaves like malloc.)

                                            Steve Summit
                                            scs@adam.mit.edu

P.S. It is usually considered best to implement an incremental
allocator with a multiplicative rather than additive growth
process, that is, the allocation would be increased using
something like

	allocated *= 2;

I generally use a factor of 1.5, if anything; in either case the
multiplicative approach is a bit trickier, because it isn't
self-starting without an extra wrinkle, and because it is more
vulnerable to running out of (contiguous) memory as the space
needed becomes large.
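The self-starting wrinkle amounts to special-casing the empty buffer; one
way to write it is the following sketch ("grow" is an invented helper name,
not part of the routine above):

```c
#include <stdlib.h>

#define CHUNK 16	/* initial allocation; doubles thereafter */

/* Grow a buffer geometrically, handling the empty (self-start) case.
 * Returns the new base pointer, or NULL on failure (the old block,
 * if any, is left intact).  "grow" is an illustrative name.
 */
char *grow(char *base, int *allocated)
{
	int newsize = (*allocated == 0) ? CHUNK : *allocated * 2;
	char *p = realloc(base, newsize);	/* acts as malloc when base is NULL */

	if (p != NULL)
		*allocated = newsize;
	return p;
}
```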

The error checking could be improved: for the simple routine
illustrated, an out-of-memory error is indistinguishable from
EOF.  It might also be appropriate to free the so-far-accumulated
partial hunk when realloc fails.

Under a pre-ANSI library, it may be necessary to call malloc
rather than realloc to allocate the first chunk.
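Under such a library, the first allocation can be special-cased, along these
lines (grow_compat is an invented name; with an ANSI library a bare
realloc(NULL, size) suffices):

```c
#include <stdlib.h>

/* Pre-ANSI-safe (re)allocation: call malloc for the first chunk and
 * realloc thereafter, since old libraries may not accept
 * realloc(NULL, size).  grow_compat is an illustrative name.
 */
char *grow_compat(char *base, size_t newsize)
{
	if (base == NULL)
		return malloc(newsize);
	return realloc(base, newsize);
}
```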

If the calling program only needs access to one line at a time,
the base and allocated variables can be made static, and the final
realloc deleted, so that the routine returns the canonical
"pointer to static memory which is overwritten with each call."
The difference is that the "static memory" is always big enough.
(In this case, of course, the caller must _not_ pass the returned
pointer to free().)