roy@alanine.phri.nyu.edu (Roy Smith) (09/28/90)
I'm working on some software to do keyword searches on genbank using
the distributed index (.idx) files. Everything was going fine, until
somebody asked me to use my program to figure out what locus contained a
certain accession number; it was at that point that I realized that the
gbacc.idx file is a different format from the other .idx files. Reading the
docs, I see the gene index uses the same format as the acc index.
Why? I can see no advantage that the gbacc.idx format has over the
other format, and it has the big disadvantage that it is different (i.e.
programs that search the index files have to know which file they are
searching and adjust their parsing accordingly). It seems to me that this
is just wanton lossage. Am I missing something?
It's certainly far too late to do anything about it now without
breaking a lot of existing software, but it sure is irritating. Hopefully,
any additional index files that are invented in the future will stick to the
"standard" format (i.e. the one that gbkey.idx uses). At least that way,
software developers will only have to special case a finite (and fixed)
number of indices (currently 2).
--
Roy Smith, Public Health Research Institute
455 First Avenue, New York, NY 10016
roy@alanine.phri.nyu.edu -OR- {att,cmcl2,rutgers,hombre}!phri!roy
"Arcane? Did you say arcane? It wouldn't be Unix if it wasn't arcane!"

benton@genbank.bio.net (David Benton) (09/28/90)
There have been two formats for the indexes since IntelliGenetics
became part of the GenBank effort (Oct 1987). The simple reason
for our doing it was that the contract obligated us to continue
to distribute the database in the format in which it had been
distributed by the previous contractor (BBN). I suspect that
the original reason for the two formats was that it seemed wasteful to
spend a full line on an index key that was never going to be more
than 6 characters long (in the case of accession numbers), but using
a full line was necessary in the cases where the index key could be
very long (e.g., journal citations or keywords). I further suspect
that these two formats have been used since indexes were put on the
GenBank distribution tapes and have been documented in the release
notes for that long as well. Since the release notes don't seem to
get the kind of circulation we'd like, I'll copy the relevant sections
below. The release notes are available by anonymous FTP from
genbank.bio.net and paper copies can be requested by e-mail from
genbank@genbank.bio.net.
For those who may not have been aware of the two index formats issue
Roy Smith raised, the formats are described below. (These indexes are
distributed with the text-file (magnetic tape format) versions of
GenBank: the index formats of the floppy disk and CD ROM versions
are, we hope, much more rational and consistent.)
Sincerely,
David Benton
benton@karyon.bio.net
------------------------------------------------------------------------
3.3 Index Files
There are five files containing indices to the entries in this release:
* Accession number index file
* Keyword phrase index file
* Author name index file
* Journal citation index file
* Gene symbol index file
The accession numbers, keywords, authors, journals, or gene symbols (the index
keys) of an index are sorted alphabetically. (The index keys for the keyword
phrases and author names appear in uppercase characters even though they appear
in mixed case in the sequence entries.) Under each index key, the names of the
sequence entries containing that index key are listed alphabetically. Each
sequence name is also followed by its data file division and primary accession
number. The following codes are used to designate the data file divisions:
1. PRI - primates
2. ROD - rodents
3. MAM - other mammals
4. VRT - other vertebrates
5. INV - invertebrates
6. PLN - plants, fungi, and algae
7. ORG - organelles
8. BCT - bacteria
9. RNA - structural RNAs
10. VRL - viruses
11. PHG - bacteriophage
12. SYN - synthetic sequences
13. UNA - unannotated sequences
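For programs that read the index files, the division codes above can be kept in a small lookup table. The following is only a sketch: the names DIVISIONS and division_name are made up for illustration and are not part of any GenBank software.

```python
# Division codes exactly as listed in the release notes excerpt above.
# DIVISIONS and division_name are hypothetical names used for illustration.
DIVISIONS = {
    "PRI": "primates",
    "ROD": "rodents",
    "MAM": "other mammals",
    "VRT": "other vertebrates",
    "INV": "invertebrates",
    "PLN": "plants, fungi, and algae",
    "ORG": "organelles",
    "BCT": "bacteria",
    "RNA": "structural RNAs",
    "VRL": "viruses",
    "PHG": "bacteriophage",
    "SYN": "synthetic sequences",
    "UNA": "unannotated sequences",
}


def division_name(code):
    """Return the description for a three-letter division code, or None if unknown."""
    return DIVISIONS.get(code.strip().upper())
```

Returning None for an unknown code, rather than raising, is a design choice that lets a caller decide how strict to be about codes added in later releases.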
The index key begins in column 1 of a record. An 11-character field for the
sequence entry name starts in position 14 of a record, followed by a
3-character field for the data file division, starting at position 25 and
ending at position 27, and a 6-character field for the primary accession
number, starting at position 29 and ending at position 34. All entries in the
fields are left-justified.
Beginning at positions 36 and 58, the three fields repeat, so three sets of
sequence information can appear in one record. If there are more than three
entry names, the next records are used; the index key is not repeated. For the
accession number and human gene symbol index files, the entry names begin in
the same record as the index key, since the key is always less than 12
characters. In the other index files, the entry names begin on the record
following the index key record.
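The record layout described above can be read with plain string slicing. What follows is a minimal sketch, not GenBank's own code: the names parse_groups and parse_index and the key_inline flag are hypothetical, and the caller has to say which of the two formats a file uses. It assumes the lines passed in are index records only (no ruler or header lines) and does not handle an index key that wraps onto a second record, as a citation truncated at 80 characters can.

```python
from typing import Iterable, Iterator, List, Tuple

Entry = Tuple[str, str, str]  # (entry name, division, primary accession)

# 0-based offsets of the three groups, which start at positions 14, 36 and 58.
GROUP_STARTS = (13, 35, 57)


def parse_groups(record: str) -> List[Entry]:
    """Slice up to three (name, division, accession) groups out of one record."""
    record = record.rstrip("\n").ljust(78)
    entries = []
    for start in GROUP_STARTS:
        name = record[start:start + 11].strip()            # 11-character name field
        division = record[start + 11:start + 14].strip()   # 3-character division
        accession = record[start + 15:start + 21].strip()  # 6-character accession
        if name:
            entries.append((name, division, accession))
    return entries


def parse_index(lines: Iterable[str], key_inline: bool) -> Iterator[Tuple[str, List[Entry]]]:
    """Yield (index key, entries) pairs.

    key_inline=True  : accession and gene symbol files, where the entries
                       start on the same record as the key.
    key_inline=False : keyword, author and journal files, where the entries
                       start on the record after the key.
    """
    key, entries = None, []
    for line in lines:
        if not line.strip():
            continue
        if line[0] != " ":  # a non-blank column 1 marks a new index key
            if key is not None:
                yield key, entries
            if key_inline:
                key, entries = line[:13].strip(), parse_groups(line)
            else:
                key, entries = line.rstrip(), []
        else:  # continuation record: the key is not repeated
            entries.extend(parse_groups(line))
    if key is not None:
        yield key, entries
```

A caller would pass key_inline=True for the accession number and gene symbol files and key_inline=False for the other three, which is exactly the special-casing that the two formats force on every program reading them.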
3.3.1 Accession Number Index File
Accession numbers consist of a single letter followed by five digits. They
provide an unchanging designation for the data with which they are associated,
and we encourage you to cite accession numbers whenever you refer to data from
the data bank. The primary accession number is the first accession number of an
entry. It is unique to that entry. Citation of that number will enable other
investigators to locate the data no matter what entry name changes or other
data bank reorganizations may occur. The accession numbers, however, carry no
intrinsic information about the data.
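The accession number pattern described above (a single letter followed by five digits) is easy to check before searching the index. This is a sketch only: ACCESSION_RE and is_accession are illustrative names, and requiring an uppercase letter is an assumption based on the excerpts in these notes.

```python
import re

# One letter followed by five digits, per the description above.
# Uppercase-only is an assumption drawn from the index excerpts.
ACCESSION_RE = re.compile(r"[A-Z][0-9]{5}")


def is_accession(text: str) -> bool:
    """Return True if text looks like an accession number in the stated format."""
    return ACCESSION_RE.fullmatch(text) is not None
```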
The following excerpt from the middle of the accession number index file
illustrates its format:
1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
J00316       HUMTBB11P  PRI J00316
J00317       HUMTBB46P  PRI J00317
J00318       HUMUG1     PRI J00318
J00319       HUMUG1PA   PRI J00319
J00320       HUMVIPMR1  PRI L00154 HUMVIPMR2  PRI L00155 HUMVIPMR3  PRI L00156
             HUMVIPMR4  PRI L00157 HUMVIPMR5  PRI L00158
J00321       BABA1AT    PRI J00321
J00322       CHPRSA     PRI J00322
J00323       AGMRSASPC  PRI J00323
J00324       BABATIII   PRI J00324
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
Example 4. Accession Number Index File
If the same accession number is found in more than one entry (a result of the
infrequent occasions when a single entry is split into two or more separate
entries), then the additional entries and groups in which the number appears
are also given.
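Because one accession number can point at several entries, a lookup built from this index should map each key to a list rather than to a single locus. The sketch below hard-codes a few rows from Example 4 so that it stands alone; build_lookup and entries_for are hypothetical names.

```python
from collections import defaultdict

# Rows copied from Example 4:
# (index key, entry name, division, primary accession).
ROWS = [
    ("J00319", "HUMUG1PA", "PRI", "J00319"),
    ("J00320", "HUMVIPMR1", "PRI", "L00154"),
    ("J00320", "HUMVIPMR2", "PRI", "L00155"),
    ("J00320", "HUMVIPMR3", "PRI", "L00156"),
    ("J00320", "HUMVIPMR4", "PRI", "L00157"),
    ("J00320", "HUMVIPMR5", "PRI", "L00158"),
    ("J00321", "BABA1AT", "PRI", "J00321"),
]


def build_lookup(rows):
    """Group rows by index key, keeping every entry listed under that key."""
    lookup = defaultdict(list)
    for key, name, division, primary in rows:
        lookup[key].append((name, division, primary))
    return dict(lookup)


def entries_for(lookup, accession):
    """Return all (name, division, primary accession) tuples for an accession."""
    return lookup.get(accession, [])
```

In the excerpt, J00320 is such a case: it is listed under five entries whose primary accession numbers are L00154 through L00158.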
3.3.2 Keyword Phrase Index File
Keyword phrases consist of names for gene products and other characteristics of
sequence entries. There are approximately 9800 keyword phrases. An excerpt from
the keyword phrase index file is shown below:
1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
DNA HELICASE
             ECOHELIV   BCT J04726 ECOUVRD    BCT X00738
DNA INVERTASE
             ECOPIN     BCT K00676 ECOPINP    BCT K03521 PMUGINMOM  PHG V01463
DNA LIGASE
             ECOLIG     BCT M24278 ECOLIGA    BCT M30255 PT4G30     PHG X00039
             PT7CG      PHG J02518 YSCCDC9    PLN X03246 YSPCDC17   PLN X05107
DNA MATURATION
             HS1CAS     VRL M22962
DNA METHYLASE
             HEHMTS     BCT J02677
DNA METHYLATION
             HEHMTS     BCT J02677 HUMSPM1    PRI X06585 HUMSPM2    PRI X06586
             HUMSPM3    PRI X06587 HUMSPM4    PRI X06588 HUMSPM5    PRI X07490
             HUMSPM6    PRI X07491 HUMSPM7    PRI X07492 HUMSPM8    PRI X07493
             HUMSPM9    PRI X07494
DNA NUCLEOTIDYLEXOTRANSFERASE
             MUSTDTR    ROD X04123
DNA PACKAGING
             P29PRO     PHG X05973
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
Example 5. Keyword Phrase Index File
3.3.3 Author Name Index File
The author name index file lists all of the author names that appear in the
citations. An excerpt from the author name index file is shown below:
1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
LANDSMAN,D.
             CHKHMG14   VRT M20817 CHKHMG17   VRT Y00416 CHKHMG17A  VRT J03229
             HUMHMG14   PRI J02621 HUMHMG14A  PRI M21339 HUMHMG17   PRI M12623
             MUSHMG17   ROD X12944 X06353     UNA X06353 X06444     UNA X06444
             X13546     UNA X13546 X13929     UNA X13929 X13930     UNA X13930
LANDSMANN,J.
             LAMCG      PHG J02459 TRTHB      PLN Y00296
LANDY,A.
             ECOLAMATT  BCT J01638 ECOP80ATB  BCT M10892 ECOTGTUFB  BCT J01717
             ECOTGY1    BCT K01197 ECOTRY1    RNA K00266 ECOTRY2    RNA K00267
             ECOTRY3    RNA M10878 LAMCG      PHG J02459 LAMECOGAL  PHG M11151
             LAMPRCA    PHG M12458 LAMPRCB    PHG M12459 P22ATTP    PHG M10893
             P22INT     PHG X04052 P80ATTP    PHG M10891 STYP22ATB  BCT M10894
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
Example 6. Author Name Index File
3.3.4 Journal Citation Index File
The journal citation index file lists all of the citations that appear in the
references. All citations are truncated to 80 characters. An excerpt from the
citation index file is shown below:
1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
(IN) THE CELL NUCLEUS, VOLUME VIII: 261-305; ACADEMIC PRESS, NEW YORK (1981).
             RATUR5A    RNA K00783
(IN) THE LENS: TRANSPARANCY AND CATARACT: 171-179; EURAGE, RIJSWIJK (1986)
             RANCRYG2A  VRT K02264 RANCRYG4A  VRT K02266 RANCRYG5A  VRT M22529
             RANCRYG6A  VRT M22530 RANCRYR    VRT X00659
(IN) VIRUS RESEARCH. PROCEEDINGS OF 1973 ICN-UCLA SYMPOSIUM: 533-544; ACADEMIC
P
             LAMCG      PHG J02459
ACTA BIOCHIM. POL. 24, 301-318 (1977)
             LUPTRFJ    RNA K00345 LUPTRFN    RNA K00346
ACTA BIOCHIM. POL. 26, 369-381 (1979)
             BLYTRF1    RNA M10827
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
Example 7. Journal Citation Index File
3.3.5 Cross-Reference To Gene Symbol Libraries
The gene symbol file contains the gene symbols used in the Howard Hughes
Medical Institute Human Gene Mapping Library and other gene symbols, such as
those for the E. coli genes. The gene symbols are only found in the FEATURES
table description field. The gene symbol references have the form:
/gene="gene symbol"; an example is found in section 3.5.11.6. An example of
the format of the gene symbol index file follows:
1       10        20        30        40        50        60        70       79
---------+---------+---------+---------+---------+---------+---------+---------
HPR          HUMHPARS2  PRI K03431 HUMHPRG1   PRI X01794 HUMHPRG2   PRI X01787
             HUMHPRG3   PRI X01788 HUMHPRG4   PRI X01790 HUMHPRG5   PRI X01792
HPRT         HUMHPRT1   PRI M27558 HUMHPRT2   PRI M27559 HUMHPRT3   PRI M27560
             HUMHPRT4   PRI M27561 HUMHPRT5   PRI M29753 HUMHPRT6   PRI M29754
             HUMHPRT7   PRI M29755 HUMHPRT8   PRI M29756 HUMHPRT9   PRI M29757
             HUMHPRTA   PRI M12452 HUMHPRTB   PRI M26434
HPX          HUMHXM     PRI X02537 HUMHXMA    PRI J03048
HRAS         HUMHRASA   PRI M19990
HRG          HUMBHRPA   PRI M18372 HUMHRGA    PRI M13149
HSDS         ECOHSDSK   BCT J01632
---------+---------+---------+---------+---------+---------+---------+---------
1       10        20        30        40        50        60        70       79
Example 8. Gene Symbol Index File
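The /gene="gene symbol" form described in section 3.3.5 can be pulled out of feature table text with a regular expression. This is a sketch under assumptions: GENE_RE and gene_symbols are illustrative names, and the sample line in the usage note below is invented rather than copied from a real entry.

```python
import re

# Matches /gene="..." and captures the text between the double quotes.
GENE_RE = re.compile(r'/gene="([^"]*)"')


def gene_symbols(feature_text: str):
    """Return every gene symbol quoted in a /gene="..." reference, in order."""
    return GENE_RE.findall(feature_text)
```

For instance, gene_symbols('CDS 1..300 /gene="HPRT"') gives a one-element list containing HPRT, which could then be looked up as a key in the gene symbol index file.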