roy@alanine.phri.nyu.edu (Roy Smith) (09/28/90)
I'm working on some software to do keyword searches on GenBank using the distributed index (.idx) files. Everything was going fine until somebody asked me to use my program to figure out what locus contained a certain accession number; it was at that point that I realized that the gbacc.idx file is in a different format from the other .idx files. Reading the docs, I see the gene index is in the same format as the acc index.

Why? I can see no advantage that the gbacc.idx format has over the other format, and it has the big disadvantage that it is different (i.e. programs that search the index files have to know which file they are searching and adjust their parsing accordingly). It seems to me that this is just wanton lossage. Am I missing something?

It's certainly far too late to do anything about it now without breaking a lot of existing software, but it sure is irritating. Hopefully, any additional index files that are invented in the future will stick to the "standard" format (i.e. the one that gbkey.idx uses). At least that way, software developers will only have to special-case a finite (and fixed) number of indices (currently 2).

--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"
benton@genbank.bio.net (David Benton) (09/28/90)
There have been two formats for the indexes since IntelliGenetics became part of the GenBank effort (Oct 1987). The simple reason for our doing it was that the contract obligated us to continue to distribute the database in the format in which it had been distributed by the previous contractor (BBN).

I suspect that the original reason for the two formats was that it seemed wasteful to spend a full line on an index key that was never going to be more than 6 characters long (in the case of accession numbers), but using a full line was necessary in the cases where the index key could be very long (e.g., journal citations and keywords). I further suspect that these two formats have been used since indexes were first put on the GenBank distribution tapes and have been documented in the release notes for that long as well.

Since the release notes don't seem to get the kind of circulation we'd like, I'll copy the relevant sections below. The release notes are available by anonymous FTP from genbank.bio.net, and paper copies can be requested by e-mail from genbank@genbank.bio.net.

For those who may not have been aware of the two-index-format issue Roy Smith raised, the formats are described below. (These indexes are distributed with the text-file (magnetic tape format) versions of GenBank; the index formats of the floppy disk and CD ROM versions are, we hope, much more rational and consistent.)

Sincerely,
David Benton
benton@karyon.bio.net

------------------------------------------------------------------------

3.3 Index Files

There are five files containing indices to the entries in this release:

* Accession number index file
* Keyword phrase index file
* Author name index file
* Journal citation index file
* Gene symbol index file

The accession numbers, keywords, authors, journals, or gene symbols (the index keys) of an index are sorted alphabetically.
(The index keys for the keyword phrases and author names appear in uppercase characters even though they appear in mixed case in the sequence entries.) Under each index key, the names of the sequence entries containing that index key are listed alphabetically. Each sequence name is also followed by its data file division and primary accession number. The following codes are used to designate the data file divisions:

 1. PRI - primates
 2. ROD - rodents
 3. MAM - other mammals
 4. VRT - other vertebrates
 5. INV - invertebrates
 6. PLN - plants, fungi, and algae
 7. ORG - organelles
 8. BCT - bacteria
 9. RNA - structural RNAs
10. VRL - viruses
11. PHG - bacteriophage
12. SYN - synthetic sequences
13. UNA - unannotated sequences

The index key begins in column 1 of a record. An 11-character field for the sequence entry name starts in position 14 of a record, followed by a 3-character field for the data file division, starting at position 25 and ending at position 27, and a 6-character field for the primary accession number, starting at position 29 and ending at position 34. All entries in the fields are left-justified. Beginning at positions 36 and 58, the three fields repeat, so three sets of sequence information can appear in one record. If there are more than three entry names, the next records are used; the index key is not repeated.

For the accession number and human gene symbol index files, the entry names begin in the same record as the index key, since the key is always less than 12 characters. In the other index files, the entry names begin on the record following the index key record.

3.3.1 Accession Number Index File

Accession numbers consist of a single letter followed by five digits. They provide an unchanging designation for the data with which they are associated, and we encourage you to cite accession numbers whenever you refer to data from the data bank. The primary accession number is the first accession number of an entry. It is unique to that entry.
Citation of that number will enable other investigators to locate the data no matter what entry name changes or other data bank reorganizations may occur. The accession numbers, however, carry no intrinsic information about the data.

The following excerpt from the middle of the accession number index file illustrates the format of the accession number index file:

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
J00316       HUMTBB11P  PRI J00316
J00317       HUMTBB46P  PRI J00317
J00318       HUMUG1     PRI J00318
J00319       HUMUG1PA   PRI J00319
J00320       HUMVIPMR1  PRI L00154 HUMVIPMR2  PRI L00155 HUMVIPMR3  PRI L00156
             HUMVIPMR4  PRI L00157 HUMVIPMR5  PRI L00158
J00321       BABA1AT    PRI J00321
J00322       CHPRSA     PRI J00322
J00323       AGMRSASPC  PRI J00323
J00324       BABATIII   PRI J00324
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 4. Accession Number Index File

If the same accession number is found in more than one entry (a result of the infrequent occasions when a single entry is split into two or more separate entries), then the additional entries and groups in which the number appears are also given.

3.3.2 Keyword Phrase Index File

Keyword phrases consist of names for gene products and other characteristics of sequence entries. There are approximately 9800 keyword phrases.
An excerpt from the keyword phrase index file is shown below:

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
DNA HELICASE
             ECOHELIV   BCT J04726 ECOUVRD    BCT X00738
DNA INVERTASE
             ECOPIN     BCT K00676 ECOPINP    BCT K03521 PMUGINMOM  PHG V01463
DNA LIGASE
             ECOLIG     BCT M24278 ECOLIGA    BCT M30255 PT4G30     PHG X00039
             PT7CG      PHG J02518 YSCCDC9    PLN X03246 YSPCDC17   PLN X05107
DNA MATURATION
             HS1CAS     VRL M22962
DNA METHYLASE
             HEHMTS     BCT J02677
DNA METHYLATION
             HEHMTS     BCT J02677 HUMSPM1    PRI X06585 HUMSPM2    PRI X06586
             HUMSPM3    PRI X06587 HUMSPM4    PRI X06588 HUMSPM5    PRI X07490
             HUMSPM6    PRI X07491 HUMSPM7    PRI X07492 HUMSPM8    PRI X07493
             HUMSPM9    PRI X07494
DNA NUCLEOTIDYLEXOTRANSFERASE
             MUSTDTR    ROD X04123
DNA PACKAGING
             P29PRO     PHG X05973
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 5. Keyword Phrase Index File

3.3.3 Author Name Index File

The author name index file lists all of the author names that appear in the citations. An excerpt from the author name index file is shown below:

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
LANDSMAN,D.
             CHKHMG14   VRT M20817 CHKHMG17   VRT Y00416 CHKHMG17A  VRT J03229
             HUMHMG14   PRI J02621 HUMHMG14A  PRI M21339 HUMHMG17   PRI M12623
             MUSHMG17   ROD X12944 X06353     UNA X06353 X06444     UNA X06444
             X13546     UNA X13546 X13929     UNA X13929 X13930     UNA X13930
LANDSMANN,J.
             LAMCG      PHG J02459 TRTHB      PLN Y00296
LANDY,A.
             ECOLAMATT  BCT J01638 ECOP80ATB  BCT M10892 ECOTGTUFB  BCT J01717
             ECOTGY1    BCT K01197 ECOTRY1    RNA K00266 ECOTRY2    RNA K00267
             ECOTRY3    RNA M10878 LAMCG      PHG J02459 LAMECOGAL  PHG M11151
             LAMPRCA    PHG M12458 LAMPRCB    PHG M12459 P22ATTP    PHG M10893
             P22INT     PHG X04052 P80ATTP    PHG M10891 STYP22ATB  BCT M10894
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 6. Author Name Index File

3.3.4 Journal Citation Index File

The journal citation index file lists all of the citations that appear in the references. All citations are truncated to 80 characters. An excerpt from the citation index file is shown below:

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
(IN) THE CELL NUCLEUS, VOLUME VIII: 261-305; ACADEMIC PRESS, NEW YORK (1981).
             RATUR5A    RNA K00783
(IN) THE LENS: TRANSPARANCY AND CATARACT: 171-179; EURAGE, RIJSWIJK (1986)
             RANCRYG2A  VRT K02264 RANCRYG4A  VRT K02266 RANCRYG5A  VRT M22529
             RANCRYG6A  VRT M22530 RANCRYR    VRT X00659
(IN) VIRUS RESEARCH. PROCEEDINGS OF 1973 ICN-UCLA SYMPOSIUM: 533-544; ACADEMIC P
             LAMCG      PHG J02459
ACTA BIOCHIM. POL. 24, 301-318 (1977)
             LUPTRFJ    RNA K00345 LUPTRFN    RNA K00346
ACTA BIOCHIM. POL. 26, 369-381 (1979)
             BLYTRF1    RNA M10827
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 7. Journal Citation Index File

3.3.5 Cross-Reference To Gene Symbol Libraries

The gene symbol file contains the gene symbols used in the Howard Hughes Medical Institute Human Gene Mapping Library and other gene symbols, such as those for the E. coli genes. The gene symbols are only found in the FEATURES table description field. The gene symbol references have the form: /gene="gene symbol"; an example is found in section 3.5.11.6.
An example of the format of the gene symbol index file follows:

1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
HPR          HUMHPARS2  PRI K03431 HUMHPRG1   PRI X01794 HUMHPRG2   PRI X01787
             HUMHPRG3   PRI X01788 HUMHPRG4   PRI X01790 HUMHPRG5   PRI X01792
HPRT         HUMHPRT1   PRI M27558 HUMHPRT2   PRI M27559 HUMHPRT3   PRI M27560
             HUMHPRT4   PRI M27561 HUMHPRT5   PRI M29753 HUMHPRT6   PRI M29754
             HUMHPRT7   PRI M29755 HUMHPRT8   PRI M29756 HUMHPRT9   PRI M29757
             HUMHPRTA   PRI M12452 HUMHPRTB   PRI M26434
HPX          HUMHXM     PRI X02537 HUMHXMA    PRI J03048
HRAS         HUMHRASA   PRI M19990
HRG          HUMBHRPA   PRI M18372 HUMHRGA    PRI M13149
HSDS         ECOHSDSK   BCT J01632
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79

Example 8. Gene Symbol Index File
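
For readers writing software against these files, the column layout described in section 3.3 can be sketched as a small parser that handles both formats with a single flag. This is only an illustration, not code from GenBank or from either poster: the names parse_groups, parse_index, make_record, and key_inline are hypothetical, and the sketch assumes that continuation and entry records are blank in column 1 and that an inline key fits in columns 1-13.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Entry = Tuple[str, str, str]  # (entry name, division, primary accession)

# 0-based offsets of the three repeated groups (columns 14, 36, and 58).
GROUP_STARTS = (13, 35, 57)


def parse_groups(record: str) -> List[Entry]:
    """Extract up to three (name, division, accession) groups from a record."""
    entries: List[Entry] = []
    for start in GROUP_STARTS:
        name = record[start:start + 11].strip()            # 11-char name field
        division = record[start + 11:start + 14].strip()   # 3-char division
        accession = record[start + 15:start + 21].strip()  # 6-char accession
        if name:
            entries.append((name, division, accession))
    return entries


def parse_index(lines: Iterable[str], key_inline: bool) -> Dict[str, List[Entry]]:
    """Parse an index file into {index key: [entries]}.

    key_inline=True  : accession and gene symbol files (entries share the
                       key's record).
    key_inline=False : keyword, author, and journal files (entries start on
                       the record after the key).
    Assumption: any record that is blank in column 1 carries entries only.
    """
    index: Dict[str, List[Entry]] = {}
    key: Optional[str] = None
    for raw in lines:
        record = raw.rstrip("\r\n")
        if not record.strip():
            continue  # skip blank records
        if record[0] != " ":
            if key_inline:
                key = record[:13].strip()
                index.setdefault(key, []).extend(parse_groups(record))
            else:
                key = record.strip()
                index.setdefault(key, [])
        elif key is not None:
            index[key].extend(parse_groups(record))
    return index


def make_record(key: str, entries: List[Entry]) -> str:
    """Build one fixed-width record (helper for the demo only)."""
    groups = "".join(f"{n:<11}{d:<3} {a:<6} " for n, d, a in entries)
    return f"{key:<13}{groups}".rstrip()


if __name__ == "__main__":
    acc_lines = [
        make_record("J00320", [("HUMVIPMR1", "PRI", "L00154"),
                               ("HUMVIPMR2", "PRI", "L00155"),
                               ("HUMVIPMR3", "PRI", "L00156")]),
        make_record("", [("HUMVIPMR4", "PRI", "L00157"),
                         ("HUMVIPMR5", "PRI", "L00158")]),
    ]
    for entry in parse_index(acc_lines, key_inline=True)["J00320"]:
        print(entry)
```

A caller would pass key_inline=True for the accession number and gene symbol files and key_inline=False for the keyword, author, and journal files, which is exactly the per-file special-casing Roy Smith describes. Slicing by fixed columns rather than splitting on whitespace keeps keys that contain spaces (keyword phrases, citations) intact.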