HARPER@CSC.FI ("Robert Harper Finland", CSC) (01/25/91)
After reading Jasper Rees' note I used LDBASE to retrieve the Will Gilbert
message from the INFO-GCG notebooks.

************************ CLIP ********************

> PRINT 000326
>>> Item number 326, dated 90/07/17 18:30:00 -- ALL
>Date:     Tue, 17 Jul 90 18:30:00 +0100
>Reply-To: Will Gilbert <GILBERT@EMBL.BITNET>
>Sender:   "INFO-GCG: GCG Genetics Software Discussion"
>          <INFO-GCG@UTORONTO.BITNET>
>From:     Will Gilbert <GILBERT@EMBL.BITNET>
>Subject:  Native Database files

INTRODUCTION
------------

Here's an idea that I've been kicking around for a couple of months and
would like the members of the Info-GCG board to discuss. It relates to
that age-old problem of having to keep the same database in several
different formats, and the associated problem of having to rebuild the
database files with each new release. Those of you who buy the databases
directly from GCG and only support the GCG package may type "Delete" at
this point, but are welcome to read on.

BACKGROUND
----------

Currently, the GCG package maintains each database as two large files:
one containing the text of the entries (the .REF file) and the other
containing the sequence information (the .SEQ file). There are also
several supporting files (.NAMES, .OFFSET, .HEADER, etc.) which are used
to access the large files. These files are generally small and can be
quickly built from the .REF and .SEQ files.

Some of us, most(?), few(?) also have programs from the PIR. These
programs read data from the same two large data files, .REF and .SEQ,
but have their own set of supporting index files. This philosophy works
out quite well: the format of the large files is shared, and each set of
programs has its own method of accessing them.

Each entry in the .OFFSET file has two byte-offset values. A byte offset
is the position within the .REF or .SEQ file where that entry begins.
The file position can be set in one step, as opposed to reading through
the file until the appropriate entry is found.
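[Editorial aside: the byte-offset idea described above can be sketched in a few lines. This is only an illustration of direct seeking versus linear scanning; the ">NAME" entry layout and file names below are invented for the demo and are not the actual GCG .OFFSET or .SEQ formats.]

```python
def build_offset_index(path):
    """Map entry name -> byte offset of the line where the entry begins."""
    index = {}
    with open(path, "rb") as f:
        while True:
            pos = f.tell()          # remember where this line starts
            line = f.readline()
            if not line:
                break
            if line.startswith(b">"):   # invented header convention: ">NAME"
                index[line[1:].strip().decode()] = pos
    return index

def read_entry(path, index, name):
    """Jump straight to an entry with one seek -- no linear scan."""
    lines = []
    with open(path, "rb") as f:
        f.seek(index[name])         # set the file position in one step
        f.readline()                # skip the ">NAME" line itself
        for line in iter(f.readline, b""):
            if line.startswith(b">"):   # next entry begins; stop here
                break
            lines.append(line.decode().strip())
    return "".join(lines)

# Toy sequence file in the spirit of .SEQ (contents invented).
with open("demo.seq", "w") as f:
    f.write(">ECOLAC\nGAATTC\nGGATCC\n>HUMHBB\nACGTACGT\n")

idx = build_offset_index("demo.seq")
print(read_entry("demo.seq", idx, "ECOLAC"))   # -> GAATTCGGATCC
```

A small supporting index like this is cheap to rebuild from the large file, which is exactly why the .NAMES/.OFFSET files can be regenerated quickly.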
The .HEADER file does, or could, contain the origin of the database:
PIR, EMBL or GenBank. The PIR index file, .INX, also contains similar
byte offsets and uses them in much the same way.

PROPOSAL
--------

What I am proposing for discussion is an extension of this philosophy of
having shared database files, but with separate index files for each
supported package of programs. The layout of these large database files
should be that supplied by the vendor. For the PIR protein databases
nothing changes; they supply the database in the familiar two-file
format, .REF and .SEQ. However, the GenBank and EMBL databases are
supplied with the text and data in the same file. Currently, in order to
create the .REF and .SEQ files one must run programs which essentially
split the raw data file and move the locus name and description fields
around. As the databases have grown, this process is taking longer, and
the disk space required to hold three copies (the existing database, the
new raw data, and the newly created database files) is getting harder to
find.

Cut to the chase -- the large database files should be used EXACTLY as
they come from the database supplier. We, the consumers, should be able
to read the tape and build the index files which we need: GCG, PIR or
both. The concept of the .OFFSET or .INX files containing byte offsets
would not change; instead of pointing into two separate files, the
offsets would point into the same file. The program would use whichever
offset is appropriate, depending on which part of the database entry,
text or sequence, is required at that moment.

I realize that such a change could take time, given that we would have
to convince the GCG, the PIR, and perhaps IG to modify the access
routines in their programs. However, if the GCG leads the way in this
effort, the other vendors could be pressured to follow.

OBJECTIONS
----------

One obvious objection is, "But Bill Pearson's FASTA program won't run if
there's text in the .SEQ file!!"
Not so. I talked with Bill last May, and he has versions of both FASTA
and TFASTA which read the native GenBank or EMBL files.

Another question is, "What about performance? Is having to skip over all
that text going to slow things down??" I just don't know the answer to
this; if sequences are truly being accessed via their byte offsets then,
a priori, I would think that the access times would be the same. Then
again, how many hours of CPU time are spent reformatting database
releases? Not to mention the human time spent trying to find enough disk
space to actually do the conversion.

I'd be very interested to see this discussion proceed. However, I would
like to ask one thing: that you think about this before you respond,
either for or against. What we say here "IS" taken seriously by the
people at both the GCG and the PIR, and I wouldn't want to ask for
something just to create busy work for them.

William Gilbert, Ph.D.
Whitehead Institute
Cambridge, Ma

Gilbert@MITWIBR
Gilbert@EMBL    (visitors programme)

******************** END *******************
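[Editorial aside: Gilbert's single-file proposal amounts to an index holding TWO byte offsets per entry, both pointing into the one native-format file: one for the start of the text and one for the start of the sequence. The sketch below illustrates that; the "ID"/"SQ"/"//" entry layout is a simplified EMBL-like stand-in, and the real .OFFSET/.INX binary formats are not reproduced.]

```python
def build_index(path):
    """Map name -> (text_offset, seq_offset), both into the SAME file."""
    index = {}
    name = text_off = None
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.startswith(b"ID "):     # text (annotation) begins here
                name = line.split()[1].decode()
                text_off = pos
            elif line.startswith(b"SQ"):    # sequence section begins here
                index[name] = (text_off, pos)
    return index

def read_sequence(path, index, name):
    """Seek past the text directly to the sequence lines."""
    seq = []
    with open(path, "rb") as f:
        f.seek(index[name][1])      # jump over the annotation entirely
        f.readline()                # skip the "SQ" line itself
        for line in iter(f.readline, b""):
            if line.startswith(b"//"):      # end-of-entry marker
                break
            seq.append(line.decode().strip())
    return "".join(seq)

# Toy native-format release file (contents invented).
with open("demo.dat", "w") as f:
    f.write("ID LACZ\nDE beta-galactosidase\nSQ\nGAATTC\nGGATCC\n//\n"
            "ID HBB\nDE beta-globin\nSQ\nACGT\n//\n")

idx = build_index("demo.dat")
print(read_sequence("demo.dat", idx, "HBB"))   # -> ACGT
```

Because the seek lands directly on the sequence section, the text is never read when only the sequence is wanted, which is why access times should be comparable to the split .REF/.SEQ scheme.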