[comp.unix.wizards] Need efficient way to read file in reverse

roy@alanine.phri.nyu.edu (Roy Smith) (09/18/90)

	Is there a standard, portable, efficient way to read a file in
reverse?  I'm doing random seeks in a file and occassionally want to be able
to find the beginning of the line into the middle of which I just seeked.
Right now, I'm using a simple-minded revgetc() I wrote which is basically:

	fseek (fp, -2L, FSEEK_CURRENT);
	getc (fp);

but which turns out to loose big, since it looks like it does a read(2) for
every call to getc().  Life is made worse by the fact that occasionally I
have very long (over 1kbyte) lines.  My application is currently spending
about 98% of its CPU time in revgetc(), according to gprof.

	Obviously, the thing to do is write a buffered revgetc().  Modifying
the standard getc() macro to traverse the buffer in the opposite direction
would be easy, but the result would be non-portable.  To make it portable, I
want to write it using the stdio functions, and that's the hitch.  The
warning in the fseek/ftell man page about stdio read pointers possibly being
magic cookies makes it hard for me to calculate how far back to seek to read
in the preceeding, say, 1024 bytes.  What to do?
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

hutch@fps.com (Jim Hutchison) (09/21/90)

In <1990Sep18.152818.1303@phri.nyu.edu> roy@alanine.phri.nyu.edu (Roy Smith):
>	Is there a standard, portable, efficient way to read a file in
>reverse?

No, yes, and pretty much.  There isn't much call for this, atleast visibly,
so there isn't enough mass for a standard.  You can use constants defined
for your system, this makes it relatively portable.  You can get efficiency
along with portability, really.

>          I'm doing random seeks in a file and occassionally want to be able
>to find the beginning of the line into the middle of which I just seeked.
>Right now, I'm using a simple-minded revgetc() I wrote which is basically:
>
>	fseek (fp, -2L, FSEEK_CURRENT);
>	getc (fp);

This will spend a lot of time calling fseek and read.  By "a lot" I mean you
do (line-length / 4) reads & seeks per line you want to read.  Remember that
stdio trys to read-ahead, so each time you back-up it has to start over.  It
could be smarter, but depending on that would be very non-portable.

>but which turns out to loose big, since it looks like it does a read(2) for
>every call to getc().  Life is made worse by the fact that occasionally I
>have very long (over 1kbyte) lines.  My application is currently spending
>about 98% of its CPU time in revgetc(), according to gprof.
>
>	Obviously, the thing to do is write a buffered revgetc().  Modifying
>the standard getc() macro to traverse the buffer in the opposite direction
>would be easy, but the result would be non-portable.  To make it portable, I
>want to write it using the stdio functions, and that's the hitch.  The
>warning in the fseek/ftell man page about stdio read pointers possibly being
>magic cookies makes it hard for me to calculate how far back to seek to read
>in the preceeding, say, 1024 bytes.  What to do?

The ideal block size for reading your file can be gotten by doing a stat() or
fstat() on it, and using stat->st_blksize (see stat(2)).  If you setbuffer()
with a buffer that size, you should have a quasi-optimal block size.  If you
backup to the previous block boundary.  You should be able to getc to the
current offset, remembering the last time you found the start of the line,
at worst you may have to repeat this step many times.  This minimizes the
disk I/O and system calls, which are expensive.

start_of_string = 0			-- If nowhere, it must start at 0
trim = offset % blksize;		-- how far from block boundary
do {
    end = offset;			-- current end of search
    if (trim)
	offset -= trim;			-- trim back to boundary
    else
	offset -= blksize;		-- next boundary is previous block

    fseek(fp, offset, FSEEK_ABSOLUTE);	-- backup
    while (offset != end) {		-- while position is not to the end
	c = get(c);			-- get char from our buffer
	if (start_of_string(c))		-- decide if you have found a start
	    start_of_string = offset;	-- remember current offset
	offset++;			-- move on to next position
    }
} while (start_of_string == 0 && offset > 0);	-- find it until BOF

You'd get better performance if you read() the file yourself, and used your
own buffer instead of using getc().  That way you could proceed in your
direction of choice, backwards.  I used getc() here because you seem to
want to use stdio.

You can also back down from "blksize" if it gets to be to gross in terms of
your average line length (I'd guess greater than 4x is probably too hefty).
-
--
-
Jim Hutchison		{dcdwest,ucbvax}!ucsd!fps!hutch
Disclaimer:  I am not an official spokesman for FPS computing

dhesi%cirrusl@oliveb.ATC.olivetti.com (Rahul Dhesi) (09/26/90)

The question from Roy Smith was:

     Is there a standard, portable, efficient way to read a file in
     reverse?

The word "portable" is the catch here.  There is *no* truly portable
way to seek to a place in a file where you haven't been before.  Worse,
there is *no* truly portable way to seek to *any* place in a text file
whether or not you have been there before.  (I'm sure some people will
protest at the second claim and mention magic cookies, but they may be
assuming that "portable" means "ANSI-conformant".  This is not a
portable assumption :-)

Since this is a comp.unix.* discussion, however, we may be able to
modify the meaning of "portable" somewhat by restricting ourselves to
solutions that work in a **IX environment.

In <11217@celit.fps.com> hutch@fps.com (Jim Hutchison) writes:

>The ideal block size for reading your file can be gotten by doing a stat() or
>fstat() on it, and using stat->st_blksize (see stat(2)).  If you setbuffer()
>with a buffer that size, you should have a quasi-optimal block size.

Alas, neither stat->st_blksize nor setbuffer() are universally
available, not even in a **IX environment.  You may need to use
setvbuf() instead, which itself isn't universally available.

Modifying the meaning of "portable" still further, perhaps we could
settle for it to mean "works under 4.3BSD and System V Release 4".  Now
you're talking!
--
Rahul Dhesi <dhesi%cirrusl@oliveb.ATC.olivetti.com>
UUCP:  oliveb!cirrusl!dhesi