[comp.lang.c] C, string, array, big file

jimmy@blake.acs.washington.edu (Jim Li) (11/16/89)

I need to manipulate (searching, sorting, ...) some big files (>500K)
What is the best way to store the contents of the files?
Suppose I have a file that is 5000 lines long, the simplest decl. would be

char array[5000][80]; 

Any better ideas?           
(Speed is a key issue.)

Thanks in advance.
Jim.
.

dskim@mordor.eng.umd.edu (Daeshik Kim) (11/16/89)

	Semaphore!

	How about disk-buffer along with shared-memory-buffer having
	2 processes - one for using buffer, the other filling it in.

	Conserve Memory!

CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
C									      C
C	Daeshik Kim			(301) 445-0475/2147		      C
C									      C
C	SCHOOL	: dkim@cscwam.umd.edu, dskim@eng.umd.edu	              C
C	WORK	: uunet!daffy!dkim	(703) 689-7308 (Mon, Wed, Fri)	      C
C									      C
CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC

lee@sq.sq.com (Liam R. E. Quin) (11/24/89)

In article <4473@blake.acs.washington.edu> jimmy@blake.acs.washington.edu (Jim Li) writes:
>I need to manipulate (searching, sorting, ...) some big files (>500K)
>What is the best way to store the contents of the files?
I sent an example by mail, but decided that it was probably worth posting.
Enclosed is a program that uses one method (it works, and passes lint, but
I only just typed it and profiling showed it could be made faster).
Also enclosed is an outline for another method, and understanding the program
will enable you to write the second program youself, I think.

>Suppose I have a file that is 5000 lines long, the simplest decl. would be
>char array[5000][80]; 
>Any better ideas?           
>(Speed is a key issue.)

Fixed buffers are a bad idea.  It is because of this alone that made me post!
Consider the file with a line of 81 characters (maybe produced by a program
with a bug, or a Unix text editor...).  For that matter, an 80 character
plus a newline or trailing \0...

One not very general method (complete routine given below) is as follows:

    determine the size of the file (e.g. Unix stat(2) system call)
    allocate a buffer large enough to hold the file
    open the file and read it into the buffer, then close it
    count the number of lines in the buffer
    make an array large enough to hold a pointer to the start of each line
    set the pointers in the array to point to the line starts.

This has running time
	O(FileSize * 2 + read)
Read will probably be roughly O(FileSize / BUFSIZ), so at least the
algorithm is within a constant factor of O(n), the best possible.
If we knew that all lines contained exactly 80 characters, we could be
much, much faster.  But we don't.

I wrote a routine to do this (see below), and then wrote some scaffolding
to test it.  This found a couple of bugs, but then it worked fine.
See Jon Bently's books `Programming Pearls' and `More Programming Pearls'
for ideas on scaffolding to test programs -- particularly the More... book.

Now, this can be improved:
    I could use realloc() for the array of pointers to lines, and that way
I don't need to know hpw many there are in advance.  This saves one pass
over the data.
    By the same reasoning, I don't really need to know the size of the file,
because I can simply expand the array with realloc().
Here is a scheme which does this, but which minimises the amount of
copying that realloc() has to do, and which also provides a useful
general purpose routine getline(), which returns a line of text with
a null terminator, all ready in a malloc'd buffer, so you don't need
to strcpy() it anywhere...

If this sounds hard, look at the program PROGRAM below.  Here is the
second algorithm in pseudo-C:

	allocate an array (say enough for 500 lines)
	/* but you could use stat(2) where the information is available */
	while readline() != EOF {
	    if (nlines > length of array) { /* (keep a length count) */
		grow(array) /* with realloc() */
	    }
	    array[nlines++] = line; /* no need to copy */
	}

	readline() {
	    allocate a buffer /* say, 80 characters */
	    while getchar() {
		EOF && buffer length zero --> return EOF
		if (linelength > buffer length) {
		    expand buffer by (say) 50 characters;
		}
		*p++ = ch;
	    }
	    *p = '\0';
	    /* use realloc here to set the buffer to the right size,
	     * avoiding wastage.
	     */
	    return p;
	}
This has the advantage of working with stdin, and is probably better.
But understand both methods and decide for yourself.

Here is the original program.
Delete the .signature at the end before compiling!

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/errno.h>
#include <stdio.h>
#include <fcntl.h> /* for open modes, O_RDONLY etc */
#ifdef MALLOC
# include <malloc.h>
  /* use faster malloc where available, compile with -DMALLOC and
   * link with -lmalloc -- i.e.,
   * cc -DMALLOC -o prog prog.c -lmalloc
   */
#endif

#define E_WARN 1
#define E_FATAL 2

#define Error(lev, msg, a1, a2, a3) \
	{ (void) fprintf(stderr, "%s: ", progname); \
	(void) fprintf(stderr, msg, a1, a2, a3); perror(""); \
	if (lev != E_WARN) exit(1); } (void) 1

extern char *malloc();
int FindNewLines();
int ReadFile();

char *progname;

main(argc, argv)
    int argc;
    char *argv[];
{
    char *BigBuf;
    char **Lines;
    int n;

    progname = argv[0]; /* for errors */

    /* print lines of a file in reverse order */
    while (--argc > 0) {
	if ((n = ReadFile(*++argv, &BigBuf, &Lines)) > 0L) {
	    register int i;

	    for (i = n - 1; i >= 0; i--) { /* arrays start at zero... */
		(void) printf("%d: %s\n", i + 1, Lines[i]);
	    }
	    (void) free(BigBuf);
	    (void) free((char *) Lines);
	} else {
	    Error(E_FATAL, "Can't open file \"%s\"", *argv, 0, 0);
	}
    }
}

/* Read a file into memory.  Allocates a buffer for the text in the (char *)
 * pointed to by Bufferp, and an array of pointers to the start of each
 * line in the (char **) pointed to by Linep.
 * A return value less than 1 is an error.
 * BUG: fails on very large files (bigger than MAXINT), due to integer
 *     wrap-around.  Hence, be careful on DOS, Xenix or Micropork, with
 *     lunatic memory models...
 * Uses FindNewLines().
 */
int
ReadFile(Name, Bufferp, Linep)
    char *Name;
    char **Bufferp;
    char ***Linep;
{
    /* typing as I go... */
    extern int errno;

    struct stat StatBuf;
    int AmountRead;
    int fd;

    if (stat(Name, &StatBuf) < 0) {
	return -1; /* no such file, or permission, or... */
    }

    if ((fd = open(Name, O_RDONLY, 0)) < 0) {
	return -2; /* (these magick numbers should be defines) */
    }

    /* allocate enough memory to read the file */
    if ((*Bufferp = malloc((unsigned) (StatBuf.st_size + 1))) == (char *) 0) {
	(void) close(fd);
	return -3; /* not enough memory */
    }

    if ((AmountRead = read(fd, *Bufferp, StatBuf.st_size)) < 0) {
	(void) close(fd);
	(void) free(*Bufferp);
	*Bufferp = (char *) 0;
	return -4; /* read failed (I wonder why?) */
    }

    if (AmountRead != StatBuf.st_size) {
	Error(E_WARN, "Only read %d of %ld bytes from file \"%s\"",
			    AmountRead, StatBuf.st_size, Name);

	/* I carry on here -- yu might want to return an error */
    }

    (void) close(fd);

    /* Now we have a copy of the entire contents of the file in
     * memory.  We did this with four system calls:
     *    stat, to find out how much to read
     *    open
     *    read
     *    close
     * Now we have to find the newlines.
     */
    return FindNewLines(Name, Bufferp, Linep);
}

/*ARGSUSED*/
int
FindNewLines(Name, Bufferp, Linep)
    char *Name;  /* in case I want to report errors... */
    char **Bufferp;
    char ***Linep;
{
    register char *p;
    int LineCount = 1L; /* remember to count the first line! */
    register int AtStart;

    /* This is where the program spends most of its time, but it is still
     * reasonably fast.
     */
    if (!Bufferp || !*Bufferp) {
	return -1L; /* defensive ... */
    }

    /* Expensive -- count the lines in the input
     * It might be faster to use bsearch or strchr() or something.
     */
    for (p = (*Bufferp); *p; p++) {
	if (*p == '\n') {
	    LineCount++;
	}
    }

    /*NOSTRICT*/
    (*Linep) = (char **) malloc((unsigned)(LineCount * sizeof(char *)));
    
    if ((char *) *Linep == (char *) 0) {
	(void) free(*Bufferp);
	return -3; /* not enough memory */
    }

    /* now have an array containing room for a pointer to the start of
     * each line, so let's fill it.
     */

    LineCount = 0; /* arrays start at 0 in C */
    AtStart = 1; /* we're at the start of a line */

    for (p = (*Bufferp); *p; p++) {
	if (AtStart) {
	    (*Linep)[LineCount++] = p;
	    AtStart = 0;
	}
	if (*p == '\n') {
	    AtStart = 1;
	    *p = '\0'; /* strings are now null terminated */
	}
    }
    return LineCount; /* number of lines */
}

Hope this helps.  Delete this and the rest of the file
before compiling (or many syntax errors will result :-)

Lee

-- 
Liam R. Quin, Unixsys (UK) Ltd [note: not an employee of "sq" - a visitor!]
lee@sq.com (Whilst visiting Canada from England)
People caught shopping are warned that they will be fined by an amount not
exceeding the total value of their purchases, plus sales tax.