[comp.lang.c] Flat ASCII File Data Access

mcdonald@uxe.cso.uiuc.edu (01/23/89)

>                                                  ... I have rammed into
>a wall in trying to access a flat ascii data file with 14,000 records in
>it.  Naturally, I could read the file one record at a time, but the 
>end user would probably expire due to old age if I wrote this program
>in that manner.


>I am not familiar with any of the "Ctree" type file managers, but I did
>have a similar problem on a UNIX system.  We had a file full of 4-byte records.

>My solution?  Buffer the stuff up.  Instead of reading 4 bytes at a time,
>I read 512 bytes (128 records) at a time.  This reduced the number of disk
>accesses/syscalls from roughly 4000 (one per record) to about 30.  Runtime
>is now 15 minutes (good conditions) to 45 minutes (bad conditions).

I have tried this sort of stuff on MS-DOS, and it doesn't seem to 
do much good. Has anyone else gotten improvements this way? What
DOES do some good is to get a good disk cache program. I think the
previous two quoted paragraphs may only apply to (certain)
multitasking OS's.

Doug McDonald

bradb@ai.toronto.edu (Brad Brown) (01/26/89)

In article <225800111@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>
>>                                                  ... I have rammed into
>>a wall in trying to access a flat ascii data file with 14,000 records in
>>it.  Naturally, I could read the file one record at a time, but the 
>>end user would probably expire due to old age if I wrote this program
>>in that manner.
>>[...]
>
>I have tried this sort of stuff on MS-DOS, and it doesn't seem to 
>do much good. Has anyone else gotten improvements this way? What
>DOES do some good is to get a good disk cache program.

I have done things like this in MS-DOS and it works *really well*.  I have
a tiny flatfile manager that uses lseek() and read() to seek to and read
specific records, and it works much faster than using streams.  (That is, I
use open() to open a file, *not* fopen().)
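
For what it's worth, here is a bare-bones sketch of the idea.  The
80-byte record size and the file name are just assumptions for
illustration; plug in your own layout:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define RECSIZE 80   /* assumed fixed record length */

    /* Seek to record 'recno' (0-based) and read it into 'buf'.
       Returns 1 on success, 0 on a failed seek or short read. */
    int get_record(int fd, long recno, char *buf)
    {
        if (lseek(fd, recno * (long)RECSIZE, SEEK_SET) == (off_t)-1)
            return 0;
        return read(fd, buf, RECSIZE) == RECSIZE;
    }

    int main(void)
    {
        char rec[RECSIZE];
        int fd = open("data.dat", O_RDONLY);    /* hypothetical file */

        if (fd != -1 && get_record(fd, 13999L, rec))  /* last of 14,000 */
            fwrite(rec, 1, RECSIZE, stdout);
        return 0;
    }

Two syscalls per record, and no stdio buffering layered in between.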

If you really have to move through a lot of data, why not write your
program so that it reads a large batch of records (the larger the better)
at once and then processes them in memory?  This should help you a lot,
because most of your overhead is going to be time spent waiting for the
disk if you do an individual disk read for each record.
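
Something along these lines, say.  The 128-record batch size is
arbitrary, and the record counting stands in for whatever processing
you actually do:

    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>

    #define RECSIZE 80    /* assumed fixed record length */
    #define BATCH   128   /* records fetched per read()  */

    /* Make one pass over the file, BATCH records per syscall,
       processing each record from memory. */
    long scan_file(int fd)
    {
        char buf[RECSIZE * BATCH];
        long nrec = 0;
        int  n, i;

        while ((n = read(fd, buf, sizeof buf)) > 0)
            for (i = 0; i + RECSIZE <= n; i += RECSIZE)
                nrec++;            /* process buf + i here */
        return nrec;
    }

    int main(void)
    {
        int fd = open("data.dat", O_RDONLY);    /* hypothetical file */

        if (fd != -1)
            printf("%ld records\n", scan_file(fd));
        return 0;
    }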

Caching may or may not help you, depending on the type of processing you
do.  If you are just making a single pass through the data, a cache will
not get you anything beyond what reading several records at a time
already does: you still have to read each record from disk once, and it
never gets read again from the cache, so you save nothing.

If you skip around the database a lot, you might want to think about
writing a record cache into the database part of your program.  A record
cache keeps a large pool of record slots and fills in an empty one each
time you read a new record.  If you request a record that is still in the
cache, it returns a pointer to the copy already in memory without touching
the disk.  You should be able to go *even faster* this way than with a
disk cache of the same size, though writing an efficient cache can
be hairy.
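
A bare-bones version might just hash the record number into a fixed
pool of slots, i.e. a direct-mapped cache, which dodges the hairy
replacement logic at some cost in hit rate.  The record size and pool
size below are assumptions:

    #include <fcntl.h>
    #include <unistd.h>

    #define RECSIZE 80     /* assumed fixed record length */
    #define NSLOTS  256    /* size of the cache pool      */

    static struct slot {
        int  valid;        /* 0 until the slot holds a record */
        long recno;        /* which record lives here         */
        char data[RECSIZE];
    } cache[NSLOTS];       /* statics start zeroed, so all invalid */

    /* Return a pointer to record 'recno', touching the disk only
       on a cache miss.  Returns NULL on I/O error. */
    char *cached_record(int fd, long recno)
    {
        struct slot *s = &cache[recno % NSLOTS];

        if (!s->valid || s->recno != recno) {       /* miss: fetch */
            if (lseek(fd, recno * (long)RECSIZE, SEEK_SET) == (off_t)-1
                || read(fd, s->data, RECSIZE) != RECSIZE)
                return NULL;
            s->recno = recno;
            s->valid = 1;
        }
        return s->data;    /* hit, or freshly filled slot */
    }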

					(-:  Brad Brown  :-)
					bradb@ai.toronto.edu

bagpiper@oxy.edu (Michael Paul Hunter) (01/27/89)

In article <225800111@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>
>>						    ... I have rammed into
>>a wall in trying to access a flat ascii data file with 14,000 records in
[stuff]
>>My solution?	Buffer the stuff up.  Instead of reading 4 bytes at a time,
>>I read 512 bytes (128 records) at a time.  This reduced the number of disk
>>accesses/syscalls from roughly 4000 (one per record) to about 30.  Runtime
>>is now 15 minutes (good conditions) to 45 minutes (bad conditions).
>
>I have tried this sort of stuff on MS-DOS, and it doesn't seem to
>do much good. Has anyone else gotten improvements this way? What
[stuff]
>Doug McDonald

Under MS-DOS, file buffering is already done for you.  One thing to try
is to read a whole lot more than 512 bytes (which I think is the size of
the file buffer) and see if you get any speedup.  But I don't think that
this will change the number of accesses, since MS-DOS just reads
sizeof(buffer) characters each time (assuming sequential access).  For
random access, determining adjacency and reading a large number of
adjacent items would probably help, if you could organize what you want
to do so that it works only with adjacent items (where a is adjacent to b
if a and b are read on the same pass).
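
As a sketch of one way to arrange that, assuming fixed-length records:
sort the record numbers you need, then a single pass picks them up in
file order, so requests that land in the same disk block arrive
together.  The callback is just a placeholder for your own per-record
work:

    #include <stdlib.h>
    #include <unistd.h>

    #define RECSIZE 80   /* assumed fixed record length */

    /* Compare two record numbers for qsort(). */
    static int cmplong(const void *a, const void *b)
    {
        long x = *(const long *)a, y = *(const long *)b;
        return (x > y) - (x < y);
    }

    /* Fetch the 'n' records listed in 'wanted' in ascending file
       order, handing each one to 'handle'. */
    void fetch_sorted(int fd, long *wanted, int n,
                      void (*handle)(long recno, char *rec))
    {
        char buf[RECSIZE];
        int i;

        qsort(wanted, n, sizeof *wanted, cmplong);
        for (i = 0; i < n; i++)
            if (lseek(fd, wanted[i] * (long)RECSIZE, SEEK_SET) != (off_t)-1
                && read(fd, buf, RECSIZE) == RECSIZE)
                handle(wanted[i], buf);
    }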

					  Mike

mcdonald@uxe.cso.uiuc.edu (01/28/89)

>I have tried this sort of stuff on MS-DOS, and it doesn't seem to 
>do much good. Has anyone else gotten improvements this way? What
>DOES do some good is to get a good disk cache program.

>I have done things like this in MS-DOS and it works *really well*.  I have
>a tiny flatfile manager that uses lseek() and read() to seek to and read
>specific records, and it works much faster than using streams.  (That is, I
>use open() to open a file, *not* fopen().)

Perhaps our mileage varies because we are driving different kinds of disk
drives. I volunteer to make a scientific survey. Any computer mag
out there want to pay me? It would make a nice article. No, I won't
do it for free.

Doug McDonald