jimmy@blake.acs.washington.edu (Jim Li) (11/16/89)
I need to manipulate (searching, sorting, ...) some big files (>500K) What is the best way to store the contents of the files? Suppose I have a file that is 5000 lines long, the simplest decl. would be char array[5000][80]; Any better ideas? (Speed is a key issue.) Thanks in advance. Jim. .
dskim@mordor.eng.umd.edu (Daeshik Kim) (11/16/89)
Semaphore! How about disk-buffer along with shared-memory-buffer having 2 processes - one for using buffer, the other filling it in. Conserve Memory! CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC C C C Daeshik Kim (301) 445-0475/2147 C C C C SCHOOL : dkim@cscwam.umd.edu, dskim@eng.umd.edu C C WORK : uunet!daffy!dkim (703) 689-7308 (Mon, Wed, Fri) C C C CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC
lee@sq.sq.com (Liam R. E. Quin) (11/24/89)
In article <4473@blake.acs.washington.edu> jimmy@blake.acs.washington.edu (Jim Li) writes: >I need to manipulate (searching, sorting, ...) some big files (>500K) >What is the best way to store the contents of the files? I sent an example by mail, but decided that it was probably worth posting. Enclosed is a program that uses one method (it works, and passes lint, but I only just typed it and profiling showed it could be made faster). Also enclosed is an outline for another method, and understanding the program will enable you to write the second program youself, I think. >Suppose I have a file that is 5000 lines long, the simplest decl. would be >char array[5000][80]; >Any better ideas? >(Speed is a key issue.) Fixed buffers are a bad idea. It is because of this alone that made me post! Consider the file with a line of 81 characters (maybe produced by a program with a bug, or a Unix text editor...). For that matter, an 80 character plus a newline or trailing \0... One not very general method (complete routine given below) is as follows: determine the size of the file (e.g. Unix stat(2) system call) allocate a buffer large enough to hold the file open the file and read it into the buffer, then close it count the number of lines in the buffer make an array large enough to hold a pointer to the start of each line set the pointers in the array to point to the line starts. This has running time O(FileSize * 2 + read) Read will probably be roughly O(FileSize / BUFSIZ), so at least the algorithm is within a constant factor of O(n), the best possible. If we knew that all lines contained exactly 80 characters, we could be much, much faster. But we don't. I wrote a routine to do this (see below), and then wrote some scaffolding to test it. This found a couple of bugs, but then it worked fine. See Jon Bently's books `Programming Pearls' and `More Programming Pearls' for ideas on scaffolding to test programs -- particularly the More... book. Now, this can be improved: I could use realloc() for the array of pointers to lines, and that way I don't need to know hpw many there are in advance. This saves one pass over the data. By the same reasoning, I don't really need to know the size of the file, because I can simply expand the array with realloc(). Here is a scheme which does this, but which minimises the amount of copying that realloc() has to do, and which also provides a useful general purpose routine getline(), which returns a line of text with a null terminator, all ready in a malloc'd buffer, so you don't need to strcpy() it anywhere... If this sounds hard, look at the program PROGRAM below. Here is the second algorithm in pseudo-C: allocate an array (say enough for 500 lines) /* but you could use stat(2) where the information is available */ while readline() != EOF { if (nlines > length of array) { /* (keep a length count) */ grow(array) /* with realloc() */ } array[nlines++] = line; /* no need to copy */ } readline() { allocate a buffer /* say, 80 characters */ while getchar() { EOF && buffer length zero --> return EOF if (linelength > buffer length) { expand buffer by (say) 50 characters; } *p++ = ch; } *p = '\0'; /* use realloc here to set the buffer to the right size, * avoiding wastage. */ return p; } This has the advantage of working with stdin, and is probably better. But understand both methods and decide for yourself. Here is the original program. Delete the .signature at the end before compiling! #include <sys/types.h> #include <sys/stat.h> #include <sys/errno.h> #include <stdio.h> #include <fcntl.h> /* for open modes, O_RDONLY etc */ #ifdef MALLOC # include <malloc.h> /* use faster malloc where available, compile with -DMALLOC and * link with -lmalloc -- i.e., * cc -DMALLOC -o prog prog.c -lmalloc */ #endif #define E_WARN 1 #define E_FATAL 2 #define Error(lev, msg, a1, a2, a3) \ { (void) fprintf(stderr, "%s: ", progname); \ (void) fprintf(stderr, msg, a1, a2, a3); perror(""); \ if (lev != E_WARN) exit(1); } (void) 1 extern char *malloc(); int FindNewLines(); int ReadFile(); char *progname; main(argc, argv) int argc; char *argv[]; { char *BigBuf; char **Lines; int n; progname = argv[0]; /* for errors */ /* print lines of a file in reverse order */ while (--argc > 0) { if ((n = ReadFile(*++argv, &BigBuf, &Lines)) > 0L) { register int i; for (i = n - 1; i >= 0; i--) { /* arrays start at zero... */ (void) printf("%d: %s\n", i + 1, Lines[i]); } (void) free(BigBuf); (void) free((char *) Lines); } else { Error(E_FATAL, "Can't open file \"%s\"", *argv, 0, 0); } } } /* Read a file into memory. Allocates a buffer for the text in the (char *) * pointed to by Bufferp, and an array of pointers to the start of each * line in the (char **) pointed to by Linep. * A return value less than 1 is an error. * BUG: fails on very large files (bigger than MAXINT), due to integer * wrap-around. Hence, be careful on DOS, Xenix or Micropork, with * lunatic memory models... * Uses FindNewLines(). */ int ReadFile(Name, Bufferp, Linep) char *Name; char **Bufferp; char ***Linep; { /* typing as I go... */ extern int errno; struct stat StatBuf; int AmountRead; int fd; if (stat(Name, &StatBuf) < 0) { return -1; /* no such file, or permission, or... */ } if ((fd = open(Name, O_RDONLY, 0)) < 0) { return -2; /* (these magick numbers should be defines) */ } /* allocate enough memory to read the file */ if ((*Bufferp = malloc((unsigned) (StatBuf.st_size + 1))) == (char *) 0) { (void) close(fd); return -3; /* not enough memory */ } if ((AmountRead = read(fd, *Bufferp, StatBuf.st_size)) < 0) { (void) close(fd); (void) free(*Bufferp); *Bufferp = (char *) 0; return -4; /* read failed (I wonder why?) */ } if (AmountRead != StatBuf.st_size) { Error(E_WARN, "Only read %d of %ld bytes from file \"%s\"", AmountRead, StatBuf.st_size, Name); /* I carry on here -- yu might want to return an error */ } (void) close(fd); /* Now we have a copy of the entire contents of the file in * memory. We did this with four system calls: * stat, to find out how much to read * open * read * close * Now we have to find the newlines. */ return FindNewLines(Name, Bufferp, Linep); } /*ARGSUSED*/ int FindNewLines(Name, Bufferp, Linep) char *Name; /* in case I want to report errors... */ char **Bufferp; char ***Linep; { register char *p; int LineCount = 1L; /* remember to count the first line! */ register int AtStart; /* This is where the program spends most of its time, but it is still * reasonably fast. */ if (!Bufferp || !*Bufferp) { return -1L; /* defensive ... */ } /* Expensive -- count the lines in the input * It might be faster to use bsearch or strchr() or something. */ for (p = (*Bufferp); *p; p++) { if (*p == '\n') { LineCount++; } } /*NOSTRICT*/ (*Linep) = (char **) malloc((unsigned)(LineCount * sizeof(char *))); if ((char *) *Linep == (char *) 0) { (void) free(*Bufferp); return -3; /* not enough memory */ } /* now have an array containing room for a pointer to the start of * each line, so let's fill it. */ LineCount = 0; /* arrays start at 0 in C */ AtStart = 1; /* we're at the start of a line */ for (p = (*Bufferp); *p; p++) { if (AtStart) { (*Linep)[LineCount++] = p; AtStart = 0; } if (*p == '\n') { AtStart = 1; *p = '\0'; /* strings are now null terminated */ } } return LineCount; /* number of lines */ } Hope this helps. Delete this and the rest of the file before compiling (or many syntax errors will result :-) Lee -- Liam R. Quin, Unixsys (UK) Ltd [note: not an employee of "sq" - a visitor!] lee@sq.com (Whilst visiting Canada from England) People caught shopping are warned that they will be fined by an amount not exceeding the total value of their purchases, plus sales tax.