[bionet.molbio.genbank] Will Gilbert's ideas from INFO-GCG

HARPER@CSC.FI ("Robert Harper Finland", CSC) (01/25/91)
After reading Jasper Rees's note, I used LDBASE to retrieve the
Will Gilbert message from the INFO-GCG notebooks.

************************ CLIP ********************
> PRINT 000326
>>> Item number 326, dated 90/07/17 18:30:00 -- ALL
>Date:         Tue, 17 Jul 90 18:30:00 +0100
>Reply-To:     Will Gilbert <GILBERT@EMBL.BITNET>
>Sender:       "INFO-GCG: GCG Genetics Software Discussion"
>              <INFO-GCG@UTORONTO.BITNET>
>From:         Will Gilbert <GILBERT@EMBL.BITNET>
>Subject:      Native Database files
 
INTRODUCTION
------------
Here's an idea that I've been kicking around for a couple of months and would
like to have the members of the Info-GCG board discuss it.
 
It relates to that age old problem of having to have the same database in
several different formats and the associated problem of having to build the
database files with each new release.
 
For those of you who buy the databases directly from GCG and only support the
GCG package, you may type "Delete" at this point, but are welcome to read on.
 
 
BACKGROUND
----------
Currently, the GCG package maintains each database as two large files, one
containing the text of the entries (the .REF file) and the other containing the
sequence information (the .SEQ file).  There are also several supporting files,
.NAMES, .OFFSET, .HEADER etc., which are used to access the large files.
These files are generally small and can be quickly built from the .REF and .SEQ
files.   Some of us, most(?), few(?) also have programs from the PIR.  These
programs also read data from the same two large data files, .REF and .SEQ, but
have their own set of supporting index files. This philosophy works out quite
well: the format of the large files is shared and each set of programs has its
own method of accessing them.
 
Each entry in the .OFFSET file has two byte-offset values.  A byte offset
is the position within the .REF or .SEQ file where the entry begins. This
address can be set in one step, as opposed to reading into the file until the
appropriate entry is found.   The .HEADER file does or could contain the
origin of the database, PIR, EMBL or GENBANK.  The PIR index file, .INX,
also contains similar byte offsets and uses them in much the same way.
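 
As a rough illustration of the byte-offset idea, here is a minimal Python
sketch.  The two in-memory files stand in for .REF and .SEQ, and the
dictionary stands in for an .OFFSET table; the entry names, the toy data and
the fetch() helper are all invented for the example, and the real GCG and PIR
index formats are not reproduced here.

```python
import io

# Toy stand-ins for the two large files; real .REF/.SEQ layouts differ.
ref_file = io.BytesIO()
seq_file = io.BytesIO()

# Hypothetical in-memory offset table: name -> (ref_offset, seq_offset).
offsets = {}

# Invented entries: (name, text line, sequence line).
entries = [
    ("TOYONE", b"First toy entry description\n", b"GAATTCTAAT\n"),
    ("TOYTWO", b"Second toy entry description\n", b"GACACCATCG\n"),
]

for name, text, seq in entries:
    # Record where each part starts before writing it.
    offsets[name] = (ref_file.tell(), seq_file.tell())
    ref_file.write(text)
    seq_file.write(seq)


def fetch(name):
    """Jump straight to an entry's text and sequence with one seek each."""
    ref_off, seq_off = offsets[name]
    ref_file.seek(ref_off)
    seq_file.seek(seq_off)
    text = ref_file.readline().decode().rstrip()
    seq = seq_file.readline().decode().rstrip()
    return text, seq
```

Calling fetch("TOYTWO") positions both files directly at the second entry
without reading the first one, which is the one-step addressing described
above.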
 
PROPOSAL
--------
What I am proposing for discussion is an extension of this philosophy of having
shared database files but with separate index files for each supported package
of programs.
 
The layout of these large database files should be that supplied by the vendor.
For the PIR protein databases nothing changes; they supply the database in the
familiar two-file format, .REF and .SEQ.  However, the GenBank and EMBL
databases are supplied with the text and data in the same file. Currently, in
order to create the .REF and .SEQ files one must run programs to essentially
split the raw data file and move the locus name and description field around.
As the databases have grown this process is taking longer, and the disk space
required to hold three copies (the existing database, the new raw data and the
newly created database files) is getting harder to find.
 
Cut to the chase -- The large database files should be used EXACTLY as they
come from the database supplier.  We, the consumer, should be able to read
the tape and build the index files which we need: GCG, PIR or both.
 
The concept of the .OFFSET or .INX files containing byte offsets would not
change; instead of pointing into two separate files, the offsets would point
into the same file.  The program would set them depending on which part of the
database entry, text or sequence, it needs to read at that moment.
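 
To make the single-file variant concrete, here is a hedged Python sketch.  It
assumes GenBank-style flat-file conventions (an entry starts at a LOCUS line,
the sequence follows the ORIGIN line, and "//" ends the entry); the two
abbreviated entries and the build_index(), read_text() and read_sequence()
helpers are invented for illustration and are not the GCG or PIR routines.

```python
import io

# Two abbreviated GenBank-style entries in one "native" file (invented data).
NATIVE = (
    b"LOCUS       TOYSEQ1   12 bp\n"
    b"DEFINITION  First toy entry.\n"
    b"ORIGIN\n"
    b"        1 gaattctaat ga\n"
    b"//\n"
    b"LOCUS       TOYSEQ2   8 bp\n"
    b"DEFINITION  Second toy entry.\n"
    b"ORIGIN\n"
    b"        1 gacaccat\n"
    b"//\n"
)


def build_index(handle):
    """Scan once and return {locus: (text_offset, sequence_offset)}."""
    index = {}
    name = text_off = None
    while True:
        pos = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b"LOCUS"):
            name, text_off = line.split()[1].decode(), pos
        elif line.startswith(b"ORIGIN") and name is not None:
            # The sequence starts on the line after ORIGIN.
            index[name] = (text_off, handle.tell())
    return index


def read_text(handle, text_off, seq_off):
    """Return the annotation text that lies between the two offsets."""
    handle.seek(text_off)
    return handle.read(seq_off - text_off).decode()


def read_sequence(handle, seq_off):
    """Seek to the sequence and read until the '//' terminator."""
    handle.seek(seq_off)
    bases = []
    for line in iter(handle.readline, b""):
        if line.startswith(b"//"):
            break
        # Drop the leading position number, keep the base blocks.
        bases.extend(line.split()[1:])
    return b"".join(bases).decode()


native = io.BytesIO(NATIVE)
index = build_index(native)
```

Both offsets for an entry point into the same file: a program that wants the
text seeks to the first, and one that wants only the sequence seeks to the
second and never touches the annotation lines.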
 
I realize that such a change could take time given that we would have to
convince the GCG, PIR and perhaps IG to modify the access routines in their
programs.  However, if the GCG leads the way in this effort, the other vendors
could be pressured to follow.
 
OBJECTIONS
----------
One obvious objection is, "But, Bill Pearson's FASTA program won't run if
there's text in the .SEQ file!!". Not so, I talked with Bill last May and he
has versions of both FASTA and TFASTA which read the native Genbank or EMBL
files.  Another question is, "What about performance? Is having to skip over all
that text going to slow things down??".  I just don't know the answer to this; if
sequences are truly being accessed via their byte offsets then, a priori, I would
think that the access times would be the same.  Then again, how many hours of CPU
time are spent reformatting database releases?  Not to mention human time in
trying to find enough disk space to actually do the conversion.
 
I'd be very interested to see this discussion proceed.  However, I would like
to ask one thing: that you think about this before you respond, either for or
against.  What we say here "IS" taken seriously by the people at both the GCG
and the PIR and I wouldn't want to ask for something just to create busy work
for them.
 
 
                                       William Gilbert, Ph.D.
                                       Whitehead Institute
                                       Cambridge, Ma
                                       Gilbert@MITWIBR
                                       Gilbert@EMBL (visitors programme)
********************  END  *******************