[net.unix-wizards] Speed of seeks

peter@baylor.UUCP (Peter da Silva) (05/14/86)

> Wait a sec... A Seek() call on the Amiga HAS to read every sector.
> 
> Each sector on the amiga contains a (byte, word, long, whatever) indicating
> how much of the sector is used. So, if you want to read byte 600, you can't
> just go to the second record-- you have to go to the first, find out how
> much space it uses, go to the second, find out how much, etc.

If this is the case, and if it is also the case that the Amiga directories
don't contain the file names (what do they contain? a hash table?), then C=
Amiga has some serious redesigning to do before I or anybody I know actually
go out and buy this thing.

Incidentally, despite the poor design of the files a seek() does not have to
read every sector... a mistake often made by library writers is to try to
make seek offsets simple integers. According to the library, the argument
to an absolute seek() (lseek(fd, off, 0) or lseek(fd, off, 2)) only needs
to be the returned value from a tell() call: it may indeed be a magic cookie
like a sector/offset pair (and in fact "magic cookie" is the way it's described
in the manual). It is under RSX/11M and on the ATARI 800.

This error is not restricted to relative newcomers: there's an IBM mainframe
implementation of 'C' that copies all files into fixed record length files
when you open them just so you can use UNIX-like seeks. If you want to do
a UNIX-like seek, build UNIX-like files (either one long "record" or a bunch
of maximum length records) so your offset calculations work. It's not
meaningful to seek to an unknown depth in a text file or other weird file
anyway.

The Lattice 'C' runtime library on the IBM-PC has an obscure bug related to
this, by the way, so I'm not surprised they screwed up their Amiga library
too...
-- 
-- Peter da Silva
-- UUCP: ...!shell!{baylor,graffiti}!peter; MCI: PDASILVA; CIS: 70216,1076

chris@umcp-cs.UUCP (Chris Torek) (05/16/86)

In article <645@baylor.UUCP> peter@baylor.UUCP (Peter da Silva) writes:
>a mistake often made by library writers is to try to make seek
>offsets simple integers. According to the library, the argument to
>an absolute seek() (lseek(fd, off, 0) or lseek(fd, off, 2)) only
>needs to be the returned value from a tell() call: it may indeed
>be a magic cookie like a sector/offset pair (and in fact "magic
>cookie" is the way it's described in the manual). It is under
>RSX/11M and on the ATARI 800.

Add `f' in front of the names (fseek, ftell) and you are correct;
Unix systems, however---or at least 4.3beta---claim that lseek
offsets are in `bytes'.  (There is no `tell' system call.)

Actually, those who write libraries such that fseek takes a byte
index either have missed that point, or are very clever:  Because
PDP-11 and Vax Unix systems have always had fseek and ftell return
byte indicies, there is no doubt quite a bit of code that depends
on that behaviour.  These programs are technically broken, but it
is hard to keep a customer happy by informing it [note non-sexist
word :-)] of this fact.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

zben@umd5.UUCP (Ben Cranston) (05/17/86)

In article <645@baylor.UUCP> peter@baylor.UUCP (Peter da Silva) writes:

>Incidentally, despite the poor design of the files a seek() does not have to
>read every sector... a mistake often made by library writers is to try to
>make seek offsets simple integers. According to the library, the argument
>to an absolute seek() (lseek(fd, off, 0) or lseek(fd, off, 2)) only needs
>to be the returned value from a tell() call: it may indeed be a magic cookie
>like a sector/offset pair (and in fact "magic cookie" is the way it's described
>in the manual). It is under RSX/11M and on the ATARI 800.

>This error is not restricted to relative newcomers: there's an IBM mainframe
>implementation of 'C' that copies all files into fixed record length files
>when you open them just so you can use UNIX-like seeks. If you want to do
>a UNIX-like seek, build UNIX-like files (either one long "record" or a bunch
>of maximum length records) so your offset calculations work. It's not
>meaningful to seek to an unknown depth in a text file or other weird file
>anyway.

The Software Tools NOTE/SEEK design uses two Fortran integers to store SEEK
addresses.  The predominant text data format on the Sperry 1100 system is a
variable length record, with the record length in a four byte header area.

My implementation of the Tools for the Sperry uses the first of the two
Fortran integers as the "character address within file" (i.e. 4 X wordaddr)
and the second Fortran integer as "character number within this record",
that is, how many characters back to go to get to the record header.  The
code uses this value to get "back in sync" after a random seek.

This has the advantage that the first word of the address appears to be a
normally-incrementing address, with 4-7 spaces between records.  It would
be possible to optimize NOTE address storage: if one knew that positions
stored would always be at the beginning of record and the file was always
ASCII one could keep just the first integer and supply "4" for the second.

Oh, and if the character code is "Fieldata" (tm) rather than ASCII then
the second word is negative.  For historical reasons only...

-- 
"We're taught to cherish what we have   |          Ben Cranston
 by what we have no longer..."          |          zben@umd2.umd.edu
                          ...{seismo!umcp-cs,ihnp4!rlgvax}!cvl!umd5!zben  

campbell@maynard.UUCP (Larry Campbell) (05/17/86)

> Incidentally, despite the poor design of the files a seek() does not have to
> read every sector... a mistake often made by library writers is to try to
> make seek offsets simple integers. According to the library, the argument
> to an absolute seek() (lseek(fd, off, 0) or lseek(fd, off, 2)) only needs
> to be the returned value from a tell() call: it may indeed be a magic cookie
> like a sector/offset pair (and in fact "magic cookie" is the way it's described
> in the manual). It is under RSX/11M and on the ATARI 800.
> -- Peter da Silva
> -- UUCP: ...!shell!{baylor,graffiti}!peter; MCI: PDASILVA; CIS: 70216,1076

Wrongo.  First, the manual describes the offset as a long (NOT a "simple
integer" nor a "magic cookie").  Second, if the offsets AREN'T integers,
then almost any database library or package won't work (like dbm(3)),
because they often hash keys into offsets in the index file.
-- 
Larry Campbell                                 The Boston Software Works, Inc.
ARPA: maynard.UUCP:campbell@harvard.ARPA       120 Fulton Street
UUCP: {harvard,cbosgd}!wjh12!maynard!campbell  Boston MA 02109

breuel@h-sc1.UUCP (thomas breuel) (05/17/86)

||Wait a sec... A Seek() call on the Amiga HAS to read every sector.
||
||Each sector on the amiga contains a (byte, word, long, whatever) indicating
||how much of the sector is used. So, if you want to read byte 600, you can't
||just go to the second record-- you have to go to the first, find out how
||much space it uses, go to the second, find out how much, etc.
|
|If this is the case, and if it is also the case that the Amiga directories
|don't contain the file names (what do they contain? a hash table?), then C=
|Amiga has some serious redesigning to do before I or anybody I know actually
|go out and buy this thing.

No, this is not true. A Seek() call has to read the block allocation
list, a linked list of pointers to blocks allocated to a file.
Each block, except for the last one, is filled (or at least considered
to be filled for the purpose of Seek's). This means that for every 72
blocks in the file (36k) there is another block in the block allocation
list. Therefore, for large files (720k), a seek may have to go
through as many as 20 blocks in the block allocation list, which
can be quite slow. This was the point of my original complaint.
Supposedly, something is being done about this in 1.2, although
nobody from C/A has said anything specific as to *what* is being
done. My proposals are: (1) change the file system to use allocation
trees (2) buffer all allocation pointers for a file in memory (as
a backwards compatible fix; on a two drive system, this can mean
at most 20k of memory and gives the fastest seeks possible).

|Incidentally, despite the poor design of the files a seek() does not have to
|read every sector... a mistake often made by library writers is to try to
|make seek offsets simple integers. According to the library, the argument
|to an absolute seek() (lseek(fd, off, 0) or lseek(fd, off, 2)) only needs
|to be the returned value from a tell() call: it may indeed be a magic cookie
|like a sector/offset pair (and in fact "magic cookie" is the way it's described
|in the manual). It is under RSX/11M and on the ATARI 800.

Both lseek and fseek are guaranteed to seek to byte offsets under
UN*X (2.9). There may be systems with other behaviours, but I would
not want to be forced to use such a disaster...

|It's not
|meaningful to seek to an unknown depth in a text file or other weird file
|anyway.

Sure it is. You can do a binary search (and gain by doing it) even if
you don't have a fixed record length.

|The Lattice 'C' runtime library on the IBM-PC has an obscure bug related to
|this, by the way, so I'm not surprised they screwed up their Amiga library
|too...

I'm not sure what you mean by this. The Lattice runtime library
has (fortunately) nothing to do with the Amiga file system.

						Thomas.

jpl@allegra.UUCP (John P. Linderman) (05/19/86)

    With a little help from Chris Torek, Peter da Silva noted
>>According to the library, the argument to [fseek] needs to be the
>>returned value from [an ftell()] call: it may indeed be a magic cookie
>>like a sector/offset pair (and in fact "magic cookie" is the way it's
>>described in the manual). It is under RSX/11M and on the ATARI 800.
    and Chris goes on to add
>Actually, those who write libraries such that fseek takes a byte
>index either have missed that point, or are very clever:  Because
>PDP-11 and Vax Unix systems have always had fseek and ftell return
>byte indicies, there is no doubt quite a bit of code that depends
>on that behaviour.  These programs are technically broken, but it
>is hard to keep a customer happy by informing it [note non-sexist
>word :-)] of this fact.

This points up the typical bind faced by those who wish to write programs
that are portable but not pathetic.  I wanted to write a sort program
that checkpointed itself... after sorting and writing a coreload of data,
the data is (kind of) safe, and, if the system crashes, it isn't
necessary to sort it again, if we can just seek past it in the input
file.  In general, we don't know how many complete records will fit in a
coreload until we are part way through a record that is too big to fit.
If seek addresses are byte offsets, no big deal.  Just keep track of how
many bytes have been read after each complete record.  But if seek
addresses are magic cookies, bad news.  We can use ftell to find the find
the address of the middle of the current, incomplete, record, but that
does us no good at all, since it is the address of the START of the
record that we need, and you cannot do cookie arithmetic.  Or, we can do
an ftell after every complete record, just in case the next record won't
fit.  Ughleee!  That ftell boils down to a system call, and after stdio
goes to all the bother of minimizing system calls, we come along and add
one after every record.  I have tried this, and it results in performance
that is anywhere from 10 to 100 times as bad!  Completely unacceptable.

In the particular case, there is a way out.  Since restarting from a
checkpoint is rare, and since input may be coming from a pipe anyway,
it is both tolerable and preferable to get to byte offset N by doing
N getc's (I like to call this a getceek.)  But the solution is peculiar
to this problem, and I would have chosen to ``technically break'' the
program if I had been forced to choose between complete portability and
acceptable performance.

John P. Linderman  Seek and Ye Shall Grind  allegra!jpl