roy@alanine.phri.nyu.edu (Roy Smith) (09/18/90)
Is there a standard, portable, efficient way to read a file in reverse? I'm doing random seeks in a file and occassionally want to be able to find the beginning of the line into the middle of which I just seeked. Right now, I'm using a simple-minded revgetc() I wrote which is basically: fseek (fp, -2L, FSEEK_CURRENT); getc (fp); but which turns out to loose big, since it looks like it does a read(2) for every call to getc(). Life is made worse by the fact that occasionally I have very long (over 1kbyte) lines. My application is currently spending about 98% of its CPU time in revgetc(), according to gprof. Obviously, the thing to do is write a buffered revgetc(). Modifying the standard getc() macro to traverse the buffer in the opposite direction would be easy, but the result would be non-portable. To make it portable, I want to write it using the stdio functions, and that's the hitch. The warning in the fseek/ftell man page about stdio read pointers possibly being magic cookies makes it hard for me to calculate how far back to seek to read in the preceeding, say, 1024 bytes. What to do? -- Roy Smith, Public Health Research Institute 455 First Avenue, New York, NY 10016 roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy "Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"
hutch@fps.com (Jim Hutchison) (09/21/90)
In <1990Sep18.152818.1303@phri.nyu.edu> roy@alanine.phri.nyu.edu (Roy Smith): > Is there a standard, portable, efficient way to read a file in >reverse? No, yes, and pretty much. There isn't much call for this, atleast visibly, so there isn't enough mass for a standard. You can use constants defined for your system, this makes it relatively portable. You can get efficiency along with portability, really. > I'm doing random seeks in a file and occassionally want to be able >to find the beginning of the line into the middle of which I just seeked. >Right now, I'm using a simple-minded revgetc() I wrote which is basically: > > fseek (fp, -2L, FSEEK_CURRENT); > getc (fp); This will spend a lot of time calling fseek and read. By "a lot" I mean you do (line-length / 4) reads & seeks per line you want to read. Remember that stdio trys to read-ahead, so each time you back-up it has to start over. It could be smarter, but depending on that would be very non-portable. >but which turns out to loose big, since it looks like it does a read(2) for >every call to getc(). Life is made worse by the fact that occasionally I >have very long (over 1kbyte) lines. My application is currently spending >about 98% of its CPU time in revgetc(), according to gprof. > > Obviously, the thing to do is write a buffered revgetc(). Modifying >the standard getc() macro to traverse the buffer in the opposite direction >would be easy, but the result would be non-portable. To make it portable, I >want to write it using the stdio functions, and that's the hitch. The >warning in the fseek/ftell man page about stdio read pointers possibly being >magic cookies makes it hard for me to calculate how far back to seek to read >in the preceeding, say, 1024 bytes. What to do? The ideal block size for reading your file can be gotten by doing a stat() or fstat() on it, and using stat->st_blksize (see stat(2)). If you setbuffer() with a buffer that size, you should have a quasi-optimal block size. If you backup to the previous block boundary. You should be able to getc to the current offset, remembering the last time you found the start of the line, at worst you may have to repeat this step many times. This minimizes the disk I/O and system calls, which are expensive. start_of_string = 0 -- If nowhere, it must start at 0 trim = offset % blksize; -- how far from block boundary do { end = offset; -- current end of search if (trim) offset -= trim; -- trim back to boundary else offset -= blksize; -- next boundary is previous block fseek(fp, offset, FSEEK_ABSOLUTE); -- backup while (offset != end) { -- while position is not to the end c = get(c); -- get char from our buffer if (start_of_string(c)) -- decide if you have found a start start_of_string = offset; -- remember current offset offset++; -- move on to next position } } while (start_of_string == 0 && offset > 0); -- find it until BOF You'd get better performance if you read() the file yourself, and used your own buffer instead of using getc(). That way you could proceed in your direction of choice, backwards. I used getc() here because you seem to want to use stdio. You can also back down from "blksize" if it gets to be to gross in terms of your average line length (I'd guess greater than 4x is probably too hefty). - -- - Jim Hutchison {dcdwest,ucbvax}!ucsd!fps!hutch Disclaimer: I am not an official spokesman for FPS computing
dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (09/26/90)
The question from Roy Smith was: Is there a standard, portable, efficient way to read a file in reverse? The word "portable" is the catch here. There is *no* truly portable way to seek to a place in a file where you haven't been before. Worse, there is *no* truly portable way to seek to *any* place in a text file whether or not you have been there before. (I'm sure some people will protest at the second claim and mention magic cookies, but they may be assuming that "portable" means "ANSI-conformant". This is not a portable assumption :-) Since this is a comp.unix.* discussion, however, we may be able to modify the meaning of "portable" somewhat by restricting ourselves to solutions that work in a **IX environment. In <11217@celit.fps.com> hutch@fps.com (Jim Hutchison) writes: >The ideal block size for reading your file can be gotten by doing a stat() or >fstat() on it, and using stat->st_blksize (see stat(2)). If you setbuffer() >with a buffer that size, you should have a quasi-optimal block size. Alas, neither stat->st_blksize nor setbuffer() are universally available, not even in a **IX environment. You may need to use setvbuf() instead, which itself isn't universally available. Modifying the meaning of "portable" still further, perhaps we could settle for it to mean "works under 4.3BSD and System V Release 4". Now you're talking! -- Rahul Dhesi <dhesi%cirrusl@oliveb.ATC.olivetti.com> UUCP: oliveb!cirrusl!dhesi