[bionet.molbio.genbank] Why do the various index files have different formats?

roy@alanine.phri.nyu.edu (Roy Smith) (09/28/90)

	I'm working on some software to do keyword searches on genbank using
the distributed index (.idx) files.  Everything was going fine, until
somebody asked me to use my program to figure out what locus contained a
certain accession number; it was at that point that I realized that the
gbacc.idx file uses a different format from the other .idx files.  Reading the
docs, I see the gene index uses the same format as the acc index.

	Why?  I can see no advantage that the gbacc.idx format has over the
other format, and it has the big disadvantage that it is different (i.e.
programs that search the index files have to know which file they are
searching and adjust their parsing accordingly).  It seems to me that this
is just wanton lossage.  Am I missing something?

	It's certainly far too late to do anything about it now without
breaking a lot of existing software, but it sure is irritating.  Hopefully,
any additional index files that are invented in the future will stick to the
"standard" format (i.e. the one that gbkey.idx uses).  At least that way,
software developers will only have to special-case a finite (and fixed)
number of indices (currently 2).
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane?  Did you say arcane?  It wouldn't be Unix if it wasn't arcane!"

benton@genbank.bio.net (David Benton) (09/28/90)

There have been two formats for the indexes since IntelliGenetics
became part of the GenBank effort (Oct 1987).  The simple reason
for our doing it was that the contract obligated us to continue
to distribute the database in the format in which it had been
distributed by the previous contractor (BBN).  I suspect that
the original reason for the two formats was that it seemed wasteful to
spend a full line on an index key that was never going to be more
than 6 characters long (in the case of accession numbers), but using
a full line was necessary in the cases where the index key could be
very long (e.g., journal citations or keywords).  I further suspect
that these two formats have been used since indexes were put on the
GenBank distribution tapes and have been documented in the release
notes for that long as well.  Since the release notes don't seem to
get the kind of circulation we'd like, I'll copy the relevant sections
below.  The release notes are available by anonymous FTP from
genbank.bio.net and paper copies can be requested by e-mail from
genbank@genbank.bio.net.

For those who may not have been aware of the two index formats issue
Roy Smith raised, the formats are described below.  (These indexes are
distributed with the text-file (magnetic tape format) versions of
GenBank: the index formats of the floppy disk and CD ROM versions
are, we hope, much more rational and consistent.)

Sincerely,

David Benton
benton@karyon.bio.net

------------------------------------------------------------------------
3.3  Index Files

There are five files containing indices to the entries in this release:

     	* Accession number index file
     	* Keyword phrase index file
     	* Author name index file
     	* Journal citation index file
     	* Gene symbol index file

The accession numbers, keywords, authors, journals, or gene symbols (the index 
keys) of an index are sorted alphabetically. (The index keys for the keyword 
phrases and author names appear in uppercase characters even though they appear 
in mixed case in the sequence entries.) Under each index key, the names of the 
sequence entries containing that index key are listed alphabetically. Each 
sequence name is also followed by its data file division and primary accession 
number. The following codes are used to designate the data file divisions:

     	1. PRI 	- primates
     	2. ROD 	- rodents
     	3. MAM 	- other mammals
     	4. VRT 	- other vertebrates
     	5. INV 	- invertebrates
     	6. PLN 	- plants, fungi, and algae
     	7. ORG	- organelles
     	8. BCT 	- bacteria
    	9. RNA 	- structural RNAs
       10. VRL 	- viruses
       11. PHG 	- bacteriophage
       12. SYN 	- synthetic sequences
       13. UNA 	- unannotated sequences               
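
For programs that read the index files, the division codes above can be
held in a small lookup table. This is only a sketch: the names DIVISIONS
and division_name are illustrative and do not come from the release notes.

```python
# Division codes exactly as listed in the release notes above.
DIVISIONS = {
    "PRI": "primates",
    "ROD": "rodents",
    "MAM": "other mammals",
    "VRT": "other vertebrates",
    "INV": "invertebrates",
    "PLN": "plants, fungi, and algae",
    "ORG": "organelles",
    "BCT": "bacteria",
    "RNA": "structural RNAs",
    "VRL": "viruses",
    "PHG": "bacteriophage",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
}


def division_name(code):
    """Return the descriptive name for a division code, or None if unknown."""
    # Accepting lowercase or padded input is a convenience assumption.
    return DIVISIONS.get(code.strip().upper())
```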

The index key begins in column 1 of a record. An 11-character field for the 
sequence entry name starts in position 14 of a record, followed by a 
3-character field for the data file division, starting at position 25 and 
ending at position 27, and a 6-character field for the primary accession 
number, starting at position 29 and ending at position 34. All entries in the 
fields are left-justified.
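
The column positions above translate directly into string slices. The
sketch below assumes each record is available as a plain string with no
tab characters; the function name parse_group and its start parameter are
illustrative, not part of any GenBank software.

```python
def parse_group(record, start=14):
    """Slice one (entry name, division, accession) group out of a record.

    `start` is the 1-based column where the 11-character entry-name
    field begins (14 for the first group, per the text above).
    """
    i = start - 1                                # convert to a 0-based index
    name = record[i:i + 11].strip()              # 11-character entry name
    division = record[i + 11:i + 14].strip()     # 3-character division code
    accession = record[i + 15:i + 21].strip()    # 6 characters, after one blank
    return name, division, accession
```

With start=14 the slices cover columns 14-24, 25-27, and 29-34, matching
the field positions given in the paragraph above.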

Beginning at positions 36 and 58, the three fields repeat, so three sets of 
sequence information can appear in one record. If there are more than three 
entry names, the next records are used; the index key is not repeated. For the 
accession number and human gene symbol index files, the entry names begin in 
the same record as the index key, since the key is always less than 12 
characters. In the other index files, the entry names begin on the record 
following the index key record.
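
Because the two layouts differ only in where the first group appears, one
reader can handle both. The following is a rough sketch, not a tested
production parser. It relies on a heuristic of my own: a record whose
columns 12-13 are blank and which extends past column 13 is treated as
"key plus groups on one record". The names parse_groups and parse_index
are illustrative.

```python
GROUP_STARTS = (14, 36, 58)   # 1-based columns where each group begins


def parse_groups(record):
    """Return the (name, division, accession) groups found in one record."""
    groups = []
    for start in GROUP_STARTS:
        i = start - 1
        name = record[i:i + 11].strip()
        if name:
            groups.append((name,
                           record[i + 11:i + 14].strip(),
                           record[i + 15:i + 21].strip()))
    return groups


def parse_index(lines):
    """Yield (key, groups) pairs from the records of either index layout."""
    key, groups = None, []
    for raw in lines:
        record = raw.rstrip("\r\n")
        if not record.strip():
            continue                         # skip blank records
        if record[0] != " ":                 # a new index key starts in column 1
            if key is not None:
                yield key, groups
            # Heuristic (an assumption): blanks in columns 12-13 followed by
            # more text mean the groups share the record with a short key.
            if len(record) > 13 and record[11:13] == "  ":
                key, groups = record[:13].strip(), parse_groups(record)
            else:
                key, groups = record.strip(), []
        elif key is not None:                # continuation record
            groups.extend(parse_groups(record))
    if key is not None:
        yield key, groups
```

A caller can pass an open file object for any of the five index files;
the heuristic could misfire on a long key that happens to contain two
blanks in columns 12-13, so a stricter program might pass the layout in
explicitly instead.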

3.3.1  Accession Number Index File

Accession numbers consist of a single letter followed by five digits. They 
provide an unchanging designation for the data with which they are associated, 
and we encourage you to cite accession numbers whenever you refer to data from 
the data bank. The primary accession number is the first accession number of an 
entry. It is unique to that entry. Citation of that number will enable other 
investigators to locate the data no matter what entry name changes or other 
data bank reorganizations may occur. The accession numbers, however, carry no 
intrinsic information about the data.
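
The "one letter followed by five digits" rule is easy to check before
searching the index. This is a minimal sketch of that rule as stated
above; the name is_accession is illustrative, and accepting lowercase
input is my own assumption.

```python
import re

# One letter followed by five digits, as described in the text above.
ACCESSION_RE = re.compile(r"[A-Z][0-9]{5}")


def is_accession(text):
    """Return True if `text` looks like an accession number."""
    # Input is upper-cased and stripped first (a convenience assumption).
    return ACCESSION_RE.fullmatch(text.strip().upper()) is not None
```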

The following excerpt from the middle of the accession number index file 
illustrates its format:


1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------

J00316       HUMTBB11P  PRI J00316
J00317       HUMTBB46P  PRI J00317
J00318       HUMUG1     PRI J00318
J00319       HUMUG1PA   PRI J00319
J00320       HUMVIPMR1  PRI L00154 HUMVIPMR2  PRI L00155 HUMVIPMR3  PRI L00156
             HUMVIPMR4  PRI L00157 HUMVIPMR5  PRI L00158
J00321       BABA1AT    PRI J00321
J00322       CHPRSA     PRI J00322
J00323       AGMRSASPC  PRI J00323
J00324       BABATIII   PRI J00324
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 4. Accession Number Index File

If the same accession number is found in more than one entry (a result of the 
infrequent occasions when a single entry is split into two or more separate 
entries), then the additional entries and groups in which the number appears 
are also given.
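
Finding which entries carry a given accession number then amounts to
locating its key and collecting the groups that follow. The sketch below
assumes the records are supplied in file order with no tab characters;
the function name entries_for_accession is illustrative.

```python
def entries_for_accession(lines, accession):
    """Return [(entry name, division, primary accession), ...] for a key."""
    wanted = accession.strip().upper()
    found, in_key = [], False
    for raw in lines:
        record = raw.rstrip("\r\n")
        if not record.strip():
            continue                          # skip blank records
        if record[0] != " ":                  # a new key starts in column 1
            if in_key:
                break                         # past the wanted key: done
            in_key = record[:13].strip() == wanted
        if in_key:
            # Groups begin at columns 14, 36 and 58 (0-based 13, 35, 57).
            for i in (13, 35, 57):
                name = record[i:i + 11].strip()
                if name:
                    found.append((name,
                                  record[i + 11:i + 14].strip(),
                                  record[i + 15:i + 21].strip()))
    return found
```

Applied to the excerpt in Example 4, a lookup of J00320 would return the
five HUMVIPMR entries, whose primary accession numbers (L00154 through
L00158) differ from the key.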

3.3.2  Keyword Phrase Index File

Keyword phrases consist of names for gene products and other characteristics of 
sequence entries. There are approximately 9800 keyword phrases. An excerpt from 
the keyword phrase index file is shown below:


1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------

DNA HELICASE
             ECOHELIV   BCT J04726 ECOUVRD    BCT X00738
DNA INVERTASE
             ECOPIN     BCT K00676 ECOPINP    BCT K03521 PMUGINMOM  PHG V01463
DNA LIGASE
             ECOLIG     BCT M24278 ECOLIGA    BCT M30255 PT4G30     PHG X00039
             PT7CG      PHG J02518 YSCCDC9    PLN X03246 YSPCDC17   PLN X05107
DNA MATURATION
             HS1CAS     VRL M22962
DNA METHYLASE
             HEHMTS     BCT J02677
DNA METHYLATION
             HEHMTS     BCT J02677 HUMSPM1    PRI X06585 HUMSPM2    PRI X06586
             HUMSPM3    PRI X06587 HUMSPM4    PRI X06588 HUMSPM5    PRI X07490
             HUMSPM6    PRI X07491 HUMSPM7    PRI X07492 HUMSPM8    PRI X07493
             HUMSPM9    PRI X07494
DNA NUCLEOTIDYLEXOTRANSFERASE
             MUSTDTR    ROD X04123
DNA PACKAGING
             P29PRO     PHG X05973

---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 5. Keyword Phrase Index File


3.3.3  Author Name Index File

The author name index file lists all of the author names that appear in the 
citations. An excerpt from the author name index file is shown below:


1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------

LANDSMAN,D.
             CHKHMG14   VRT M20817 CHKHMG17   VRT Y00416 CHKHMG17A  VRT J03229
             HUMHMG14   PRI J02621 HUMHMG14A  PRI M21339 HUMHMG17   PRI M12623
             MUSHMG17   ROD X12944 X06353     UNA X06353 X06444     UNA X06444
             X13546     UNA X13546 X13929     UNA X13929 X13930     UNA X13930
LANDSMANN,J.
             LAMCG      PHG J02459 TRTHB      PLN Y00296
LANDY,A.
             ECOLAMATT  BCT J01638 ECOP80ATB  BCT M10892 ECOTGTUFB  BCT J01717
             ECOTGY1    BCT K01197 ECOTRY1    RNA K00266 ECOTRY2    RNA K00267
             ECOTRY3    RNA M10878 LAMCG      PHG J02459 LAMECOGAL  PHG M11151
             LAMPRCA    PHG M12458 LAMPRCB    PHG M12459 P22ATTP    PHG M10893
             P22INT     PHG X04052 P80ATTP    PHG M10891 STYP22ATB  BCT M10894

---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 6. Author Name Index File


3.3.4  Journal Citation Index File

The journal citation index file lists all of the citations that appear in the 
references. All citations are truncated to 80 characters. An excerpt from the 
citation index file is shown below:


1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------

(IN) THE CELL NUCLEUS, VOLUME VIII: 261-305; ACADEMIC PRESS, NEW YORK (1981).
             RATUR5A    RNA K00783
(IN) THE LENS: TRANSPARANCY AND CATARACT: 171-179; EURAGE, RIJSWIJK (1986)
             RANCRYG2A  VRT K02264 RANCRYG4A  VRT K02266 RANCRYG5A  VRT M22529
             RANCRYG6A  VRT M22530 RANCRYR    VRT X00659
(IN) VIRUS RESEARCH. PROCEEDINGS OF 1973 ICN-UCLA SYMPOSIUM: 533-544; ACADEMIC P
             LAMCG      PHG J02459
ACTA BIOCHIM. POL. 24, 301-318 (1977)
             LUPTRFJ    RNA K00345 LUPTRFN    RNA K00346
ACTA BIOCHIM. POL. 26, 369-381 (1979)
             BLYTRF1    RNA M10827

---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 7. Journal Citation Index File


3.3.5  Cross-Reference To Gene Symbol Libraries

The gene symbol file contains the gene symbols used in the Howard Hughes 
Medical Institute Human Gene Mapping Library and other gene symbols, such as 
those for the E. coli genes. The gene symbols are found only in the FEATURES 
table description field. The gene symbol references have the form:  
/gene="gene symbol"; an example is found in section 3.5.11.6. An example of 
the format of the gene symbol index file follows:


1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------

HPR          HUMHPARS2  PRI K03431 HUMHPRG1   PRI X01794 HUMHPRG2   PRI X01787
             HUMHPRG3   PRI X01788 HUMHPRG4   PRI X01790 HUMHPRG5   PRI X01792
HPRT         HUMHPRT1   PRI M27558 HUMHPRT2   PRI M27559 HUMHPRT3   PRI M27560
             HUMHPRT4   PRI M27561 HUMHPRT5   PRI M29753 HUMHPRT6   PRI M29754
             HUMHPRT7   PRI M29755 HUMHPRT8   PRI M29756 HUMHPRT9   PRI M29757
             HUMHPRTA   PRI M12452 HUMHPRTB   PRI M26434
HPX          HUMHXM     PRI X02537 HUMHXMA    PRI J03048
HRAS         HUMHRASA   PRI M19990
HRG          HUMBHRPA   PRI M18372 HUMHRGA    PRI M13149
HSDS         ECOHSDSK   BCT J01632

---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 8. Gene Symbol Index File
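
The /gene="gene symbol" form described in section 3.3.5 can be picked out
of FEATURES text with a short pattern. This is only a sketch: the name
gene_symbols is illustrative, and it assumes a symbol never contains a
double quote.

```python
import re

# Matches /gene="..." and captures the text between the quotes.
GENE_RE = re.compile(r'/gene="([^"]+)"')


def gene_symbols(features_text):
    """Return every gene symbol quoted in a /gene="..." reference."""
    return GENE_RE.findall(features_text)
```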