Peter.Stoehr%EMBL@PUCC.PRINCETON.EDU (Peter Stoehr) (08/10/90)
In view of recent discussions on the biosci bulletin-boards about changes to database formats, I am posting here text extracted from the release notes of the upcoming release of the EMBL Nucleotide Sequence Database (Rel 24, Aug 90).

Peter Stoehr
EMBL Data Library

-----------------------------------------------------------------------------

1  CHANGES AT THIS RELEASE (Release 24, August 1990)

1.1  New Feature Table

Experience in trying to represent some of the more complex features of nucleotide sequences led both ourselves and GenBank to the conclusion that the old style of feature table was inadequate. EMBL, GenBank and the DNA Data Bank of Japan have completed the design of a new, common, feature table format which we are introducing at this release. If you would like to receive details of the new feature table format then please contact us (by post, telephone or electronic mail) at the address shown on the cover page of this document. A brief introduction to the new format is supplied as the file FTABLE.DOC on the release tape.

1.2  New DR (Database Cross-Reference) Line

This new line type cross-references other databases which contain information related to entries in the EMBL nucleotide sequence database. For example, if the protein translation of a sequence exists in the SWISS-PROT or PIR databases there will be DR lines pointing to the relevant SWISS-PROT or PIR entries. If the atomic coordinates of these SWISS-PROT or PIR entries are stored in the Brookhaven Protein Data Bank (PDB) there will be DR line(s) pointing to the corresponding entry(ies) in that data bank.

The format of the DR line is as follows:

    DR   database_identifier; primary_identifier; secondary_identifier.

The first item on the DR line, the database identifier, is the abbreviated name of the data collection to which reference is made.

The initial set of cross-referenced databases are:

    Database ID    Fullname
    -----------    --------------------------------------------------------
    HIV            The HIV Sequence Database
    PDB            The Brookhaven Protein Data Bank (PDB)
    PIR            The Protein Sequence Database of the Protein
                   Identification Resource (PIR)
    SWISS-PROT     The SWISS-PROT Protein Sequence Database

The second item on the DR line, the primary identifier, is a pointer to the entry in the external database to which reference is being made. The data item used as the primary identifier depends on the database being referenced:

    Database ID    Primary Identifier
    -----------    ------------------
    HIV            Accession number
    PDB            Entryname
    PIR            Accession number
    SWISS-PROT     Accession number

The third item on the DR line, the secondary identifier, is used to complement the information given by the primary identifier. Again, the data item used depends on the database being referenced:

    Database ID    Secondary Identifier
    -----------    ----------------------------------------------
    HIV            Entryname
    PDB            Most recent revision date (last REVDAT record)
    PIR            Entryname
    SWISS-PROT     Entryname

Some examples of complete DR lines are shown below:

    DR   HIV; K02013; NEF$BRU.
    DR   PDB; 3ADK; 16-APR-88.
    DR   PIR; A02768; R5EC7.
    DR   SWISS-PROT; P03593; V90K$AMV.

2  FORTHCOMING CHANGES

2.1  RN Line Format

Each reference block in a database entry currently contains exactly one RN line which represents three different pieces of information: the number of the reference within the entry, the base span(s) covered by the reference, and an optional comment. The RN line is formatted as follows:

    RN   [n] (bases i-j, k-l, m-n, ...) comment

The restriction to one RN line per reference block imposes an arbitrary limit on the number of base spans which can be specified for a reference, and in order to remove this restriction we will change the RN line format at the next quarterly release (i.e. Release 25 in November 1990).
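For readers whose software reads the current RN line, the three pieces of information it carries can be separated roughly as follows. This is only a minimal Python sketch: the function name parse_old_rn, the exact whitespace handling, and the assumption that the "(bases ...)" group may be absent are illustrative choices, not part of the release notes.

```python
import re

# Assumed layout, based on the pattern quoted in the notes:
#   RN   [n] (bases i-j, k-l, m-n, ...) comment
# Spacing and the optionality of the "(bases ...)" group are assumptions.
RN_PATTERN = re.compile(
    r"^RN\s+\[(?P<number>\d+)\]"               # reference number in brackets
    r"(?:\s+\(bases\s+(?P<spans>[^)]*)\))?"    # optional list of base spans
    r"\s*(?P<comment>.*)$"                     # optional trailing comment
)


def parse_old_rn(line):
    """Split a current-style RN line into (number, spans, comment)."""
    match = RN_PATTERN.match(line.rstrip())
    if match is None:
        raise ValueError("not a recognisable RN line: %r" % line)
    spans = []
    for part in (match.group("spans") or "").split(","):
        part = part.strip()
        if part:
            start, end = part.split("-")
            spans.append((int(start), int(end)))
    return int(match.group("number")), spans, match.group("comment").strip()


if __name__ == "__main__":
    print(parse_old_rn("RN   [2] (bases 1-120, 300-450) revised sequence"))
```

Keeping the number, spans and comment as separate values makes it straightforward to write them back out in whatever line layout a later release uses.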

The current RN line will be replaced by three line types: a modified RN (Reference Number) line type containing just the reference number, a new RC (Reference Comment) line type containing just the reference comment, and a new RB (Reference Base) line type containing just the base spans covered by the reference.

    RN   [n]
    RC   comment
    RB   i-j, k-l, m-n, ...

Each reference block will continue to have exactly one RN line. As many RC lines as are needed to display the reference's comment will appear. If a reference has no comment then the RC line will not appear. As many RB lines as are needed to display the reference's base spans will appear. If a reference has no base spans then the RB line will not appear.

2.2  DT Line Format

We have decided to change the information we supply on DT lines, in order to satisfy two of the most common requests for enhancements we receive: to provide an easy way of determining when an entry first appeared in the database and when it was last updated. As from the next quarterly release (i.e. Release 25 in November 1990) each database entry will contain exactly two DT lines, which will indicate when the entry first appeared in the database and when it was last updated. Each entry will also receive a version number, which will be incremented by one every time the entry is updated. The DT lines will be formatted as follows:

    DT   DD-MMM-YYYY (Rel. #; Last updated; Version #)
    DT   DD-MMM-YYYY (Rel. #; Created)

For example:

    DT   12-APR-1990 (Rel. 23; Last updated; Version 3)
    DT   10-MAR-1990 (Rel. 22; Created)

Note that the format of the DT line is unchanged (i.e. a DD-MMM-YYYY date followed by parenthesised text); what we have done is to rigorously specify the text which appears in parentheses after the date. The version number will only appear on the "Last updated" DT line. If an entry has not been updated since it was created, it will still have two DT lines and the "Last updated" line will have the same date (and release number) as the "Created" line.
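Because the parenthesised text on the new DT lines is rigorously specified, it lends itself to mechanical parsing. The following is a minimal Python sketch of one way to do so; the function name parse_dt, the returned dictionary keys, and the tolerance for variable whitespace are assumptions made for illustration only.

```python
import re
from datetime import date

# English month abbreviations as used in the DD-MMM-YYYY dates.
MONTHS = {name: number for number, name in enumerate(
    "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1)}

# Assumed layout, based on the two patterns quoted in the notes:
#   DT   DD-MMM-YYYY (Rel. #; Last updated; Version #)
#   DT   DD-MMM-YYYY (Rel. #; Created)
DT_PATTERN = re.compile(
    r"^DT\s+(?P<day>\d{2})-(?P<month>[A-Z]{3})-(?P<year>\d{4})\s+"
    r"\(Rel\.\s*(?P<release>\d+);\s*(?P<kind>Last updated|Created)"
    r"(?:;\s*Version\s+(?P<version>\d+))?\)$"
)


def parse_dt(line):
    """Return the date, release number, kind and version of a new-style DT line."""
    match = DT_PATTERN.match(line.strip())
    if match is None or match.group("month") not in MONTHS:
        raise ValueError("not a recognisable DT line: %r" % line)
    version = match.group("version")
    return {
        "date": date(int(match.group("year")),
                     MONTHS[match.group("month")],
                     int(match.group("day"))),
        "release": int(match.group("release")),
        "kind": match.group("kind"),
        # Only the "Last updated" line carries a version number.
        "version": int(version) if version is not None else None,
    }


if __name__ == "__main__":
    print(parse_dt("DT   12-APR-1990 (Rel. 23; Last updated; Version 3)"))
    print(parse_dt("DT   10-MAR-1990 (Rel. 22; Created)"))
```

A fixed month table is used instead of locale-dependent date parsing so that the sketch behaves the same regardless of the system's language settings.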

The date supplied on each DT line indicates when the entry was created or updated; that will usually also be the date when the new or modified entry became publicly visible, via our file server. The release number indicates the first quarterly release made *after* the entry was created or last updated.

2.3  Lowercase Sequences

The EMBL Data Library and GenBank, along with many other groups who deal extensively with sequence data, have long noted that the presentation of sequences using lowercase letters significantly improves the accuracy of human readers who have to deal with them. Since the use of lowercase letters is now allowed in the IUPAC-IUB standard, we will switch to a lowercase presentation of sequences as from the next quarterly release (i.e. Release 25 in November 1990).

2.4  Taxonomic Information

We will make the following changes to the way in which taxonomic information is represented in the database as from Release 26 in February 1991.

2.4.1  New OG (Organelle) Line

A new linetype will be introduced, to indicate the location of non-nuclear sequences. It will only be present in entries containing non-nuclear sequences and will appear after the last OC line in such entries. The OG line will contain one data item, either "Mitochondrion", "Chloroplast", "Kinetoplast" or a plasmid name (e.g. "Plasmid pBR322").

OS lines of non-nuclear entries will no longer be prefixed by "Mitochondrion", "Chloroplast" or "Kinetoplast"; this information will only appear on the OG line. We will also abandon the use of separate taxonomic trees for chloroplastida and mitochondria. For example, the current:

    OS   Chloroplast Euglena gracilis (green algae)
    OC   Chloroplastida; Planta; Phycophyta; Euglenophyceae.

will become:

    OS   Euglena gracilis (green algae)
    OC   Eukaryota; Planta; Phycophyta; Euglenophyceae.
    XX
    OG   Chloroplast

2.4.2  Hybrids

Hybrids will be handled by repeating the OS/OC lines for each source organism in the hybrid.
A human/mouse hybrid, for example, will appear as follows:

    OS   Homo sapiens (human)
    OC   ... OC for humans ...
    XX
    OS   Mus musculus (mouse)
    OC   ... OC for mice ...

2.4.3  Unknown Sources

In cases where the source organism is unknown, the taxonomy on the OC line(s) will be as specific as possible and the OS line will be "OS Unknown". For example:

    OS   Unknown
    OC   Prokaryota; Bacteria.

2.4.4  Artificial Sequences

A new taxonomic node, "Artificial sequences", will be introduced at the same level as "Prokaryota", "Eukaryota", etc. It will have (at least initially) two child nodes: "Cloning vectors" and "Synthetic genes".

2.4.5  Plasmids

For naturally occurring plasmids the OS/OC lines will contain the source organism and the plasmid name will appear on an OG line. For example:

    OS   Escherichia coli
    OC   Prokaryota; ... Enterobacteriaceae.
    XX
    OG   Plasmid colE1

For artificial plasmids the OS line will be "OS None" and the sequence will be classified as a cloning vector. The plasmid name will appear on an OG line. For example:

    OS   None
    OC   Artificial sequences; Cloning vectors.
    XX
    OG   Plasmid pBR322

Where only a naturally occurring part of a plasmid is reported, the plasmid name will appear on the OG line and the OS/OC lines will describe the natural source. For example:

    OS   Escherichia coli
    OC   Prokaryota; ... Enterobacteriaceae.
    XX
    OG   Plasmid pUC8
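The move of the organelle prefix from the OS line to the new OG line (section 2.4.1) can be illustrated with a short Python sketch. The function name split_organelle and the three-space gap written after the line code are assumptions for illustration; the sketch handles only the three organelle prefixes named in the notes and does not attempt the accompanying change to the OC classification or the plasmid cases.

```python
# Organelle names that currently prefix the OS line, as listed in the notes.
ORGANELLES = ("Mitochondrion", "Chloroplast", "Kinetoplast")


def split_organelle(os_line):
    """Return (new_os_line, og_line), with og_line None for nuclear sequences.

    Assumes a single OS line of the form "OS   [Organelle ]Organism name";
    the spacing written on output is an assumption, not taken from the notes.
    """
    if not os_line.startswith("OS"):
        raise ValueError("not an OS line: %r" % os_line)
    organism = os_line[2:].strip()
    for organelle in ORGANELLES:
        prefix = organelle + " "
        if organism.startswith(prefix):
            # Drop the prefix from the OS text and report it on an OG line.
            return "OS   " + organism[len(prefix):], "OG   " + organelle
    return "OS   " + organism, None


if __name__ == "__main__":
    print(split_organelle("OS   Chloroplast Euglena gracilis (green algae)"))
    print(split_organelle("OS   Homo sapiens (human)"))
```

Software that currently detects organelle sequences by inspecting the start of the OS line would, after the change, look for an OG line instead.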