[bionet.molbio.embldatabank] EMBL Nucleotide Database - forthcoming changes

Peter.Stoehr%EMBL@PUCC.PRINCETON.EDU (Peter Stoehr) (08/10/90)
In view of recent discussions on the biosci bulletin-boards about changes
to database formats, I am posting here text extracted from the release notes
of the upcoming release of the EMBL Nucleotide Sequence Database (Rel 24,
Aug 90).

Peter Stoehr
EMBL Data Library

-----------------------------------------------------------------------------
1  CHANGES AT THIS RELEASE (Release 24, August 1990)

1.1  New Feature Table

Experience in  trying  to  represent  some  of  the  more  complex  features  of
nucleotide  sequences  led both ourselves and GenBank to the conclusion that the
old style of feature table was inadequate.  EMBL, GenBank and the DNA Data  Bank
of  Japan have completed the design of a new, common, feature table format which
we are introducing at this release.

If you would like to receive details of the new feature table format then please
contact  us  (by post, telephone or electronic mail) at the address shown on the
cover page of this document.

A brief introduction to the new format is supplied as the file FTABLE.DOC on the
release tape.


1.2  New DR (Database Cross-Reference) Line

This new line type cross-references other databases  which  contain  information
related to entries in the EMBL nucleotide sequence database.

For example, if the protein translation of a sequence exists in  the  SWISS-PROT
or  PIR  databases there will be DR lines pointing to the relevant SWISS-PROT or
PIR entries.  If the atomic coordinates of these SWISS-PROT or PIR  entries  are
stored  in  the  Brookhaven  Protein  Data  Bank  (PDB) there will be DR line(s)
pointing to the corresponding entry(ies) in that data bank.

The format of the DR line is as follows:

     DR   database_identifier; primary_identifier; secondary_identifier.

The first item on the DR line, the database identifier, is the abbreviated  name
of  the  data  collection  to  which  reference  is  made.   The  initial set of
cross-referenced databases is:

     Database ID    Full name
     -----------    --------------------------------------------------------
     HIV            The HIV Sequence Database
     PDB            The Brookhaven Protein Data Bank (PDB)
     PIR            The Protein Sequence Database of the Protein
                    Identification Resource (PIR)
     SWISS-PROT     The SWISS-PROT Protein Sequence Database

The second item on the DR line, the primary identifier,  is  a  pointer  to  the
entry  in the external database to which reference is being made.  The data item
used as the primary identifier depends on the database being referenced:

     Database ID    Primary Identifier
     -----------    ------------------
     HIV            Accession number
     PDB            Entryname
     PIR            Accession number
     SWISS-PROT     Accession number

The third item on the DR line, the secondary identifier, is used  to  complement
the  information  given  by  the  primary identifier.  Again, the data item used
depends on the database being referenced:

     Database ID    Secondary Identifier
     -----------    ----------------------------------------------
     HIV            Entryname
     PDB            Most recent revision date (last REVDAT record)
     PIR            Entryname
     SWISS-PROT     Entryname

Some examples of complete DR lines are shown below:

     DR   HIV; K02013; NEF$BRU.
     DR   PDB; 3ADK; 16-APR-88.
     DR   PIR; A02768; R5EC7.
     DR   SWISS-PROT; P03593; V90K$AMV.
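
As an illustration of the layout described above, a reader program could split
a DR line into its three items roughly as follows. This is only a sketch: the
names CrossRef and parse_dr_line are hypothetical, and it assumes that the
items themselves never contain a semicolon.

```python
from typing import NamedTuple


class CrossRef(NamedTuple):
    """The three items carried by one DR line (hypothetical container)."""
    database: str
    primary: str
    secondary: str


def parse_dr_line(line: str) -> CrossRef:
    """Split one DR line into database, primary and secondary identifiers."""
    text = line.strip()
    if not text.startswith("DR"):
        raise ValueError(f"not a DR line: {line!r}")
    body = text[2:].strip()
    # The line is terminated by a full stop, which is not part of the data.
    if not body.endswith("."):
        raise ValueError(f"DR line lacks terminating full stop: {line!r}")
    # Assumption: items are separated by semicolons and contain none themselves.
    items = [item.strip() for item in body[:-1].split(";")]
    if len(items) != 3:
        raise ValueError(f"expected three items, found {len(items)}: {line!r}")
    return CrossRef(*items)
```

Applied to the PDB example above, the function would return the database
identifier "PDB", the entryname "3ADK" and the revision date "16-APR-88".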



2  FORTHCOMING CHANGES

2.1  RN Line Format

Each reference block in a database entry currently contains exactly one RN  line
which  represents  three  different  pieces  of  information:  the number of the
reference within the entry, the base span(s) covered by the  reference,  and  an
optional comment.  The RN line is formatted as follows:

     RN   [n] (bases i-j, k-l, m-n, ...) comment

The restriction to one RN line per reference block imposes an arbitrary limit on
the number of base spans which can be specified for a reference, and in order to
remove this restriction we will change the RN line format at the next  quarterly
release (i.e. Release 25 in November 1990).

The current RN line will be  replaced  by  three  line  types:   a  modified  RN
(Reference  Number)  line  type  containing  just the reference number, a new RC
(Reference Comment) line type containing just the reference comment, and  a  new
RB  (Reference  Base)  line  type  containing just the base spans covered by the
reference.

     RN   [n]
     RC   comment
     RB   i-j, k-l, m-n, ...

Each reference block will continue to have exactly one  RN  line.   As  many  RC
lines  as  are  needed  to  display  the  reference's comment will appear.  If a
reference has no comment then the RC line will not appear.  As many RB lines  as
are  needed  to  display the reference's base spans will appear.  If a reference
has no base spans then the RB line will not appear.
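
Software that holds entries in the current format might convert an old RN line
into the new line types along the following lines. This is a sketch under
stated assumptions, not a description of our own conversion software: the name
split_rn_line is hypothetical, the old line is assumed to follow the pattern
shown above exactly, and long comments or span lists are not wrapped onto
several RC or RB lines.

```python
import re

# Old layout: RN   [n] (bases i-j, k-l, ...) comment
# Both the base spans and the comment are treated as optional.
OLD_RN = re.compile(r"^RN\s+\[(\d+)\]\s*(?:\(bases\s+([^)]*)\))?\s*(.*)$")


def split_rn_line(line: str) -> list[str]:
    """Turn one old-style RN line into new-style RN, RC and RB lines."""
    match = OLD_RN.match(line.strip())
    if match is None:
        raise ValueError(f"not an old-style RN line: {line!r}")
    number, spans, comment = match.groups()
    new_lines = [f"RN   [{number}]"]
    if comment:
        # No RC line is produced when the reference has no comment.
        new_lines.append(f"RC   {comment}")
    if spans:
        # No RB line is produced when the reference has no base spans.
        new_lines.append(f"RB   {spans.strip()}")
    return new_lines
```

A real converter would also have to split an overlong comment or span list
across several RC or RB lines, which is the point of the new format.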


2.2  DT Line Format

We have decided to change the information we supply on DT  lines,  in  order  to
satisfy two of the most common requests for enhancements we receive:  an easy way
of determining when an entry first appeared in the database, and an easy way  of
determining when it was last updated.

As from the next quarterly release  (i.e. Release  25  in  November  1990)  each
database  entry  will contain exactly two DT lines, which will indicate when the
entry first appeared in the database and when it was last updated.   Each  entry
will  also receive a version number, which will be incremented by one every time
the entry is updated.  The DT lines will be formatted as follows:

     DT   DD-MMM-YYYY (Rel. #; Last updated; Version #)
     DT   DD-MMM-YYYY (Rel. #; Created)

For example:

     DT   12-APR-1990 (Rel. 23; Last updated; Version 3)
     DT   10-MAR-1990 (Rel. 22; Created)

Note that the format of the DT line  is  unchanged  (i.e.   a  DD-MMM-YYYY  date
followed  by parenthesised text); what we have done is to rigorously specify the
text which appears in parentheses after the date.

The version number will only appear on the "Last updated" DT line.  If an  entry
has  not  been updated since it was created, it will still have two DT lines and
the "Last updated" line will have the same date  (and  release  number)  as  the
"Created"  line.  The date supplied on each DT line indicates when the entry was
created or updated; that will usually also be the date when the new or  modified
entry  became  publicly  visible  via  our  file  server.   The  release  number
indicates the first quarterly release made *after* the entry was created or last
updated.
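
Because the parenthesised text is now rigorously specified, it becomes possible
to read DT lines mechanically. The following sketch shows one way this might be
done; the name parse_dt_line and the dictionary keys are hypothetical, and the
month abbreviations are assumed to be the usual English three-letter forms in
uppercase, as in the examples above.

```python
import re
from datetime import date

MONTHS = {"JAN": 1, "FEB": 2, "MAR": 3, "APR": 4, "MAY": 5, "JUN": 6,
          "JUL": 7, "AUG": 8, "SEP": 9, "OCT": 10, "NOV": 11, "DEC": 12}

DT_LINE = re.compile(
    r"^DT\s+(\d{2})-([A-Z]{3})-(\d{4})\s+"
    r"\(Rel\. (\d+); (Created|Last updated)(?:; Version (\d+))?\)$"
)


def parse_dt_line(line: str) -> dict:
    """Extract date, release number, kind and version from one DT line."""
    match = DT_LINE.match(line.strip())
    if match is None:
        raise ValueError(f"not a new-style DT line: {line!r}")
    day, month, year, release, kind, version = match.groups()
    if month not in MONTHS:
        raise ValueError(f"unknown month abbreviation: {month!r}")
    # The version number appears only on the "Last updated" line.
    if (kind == "Created") != (version is None):
        raise ValueError(f"version number misplaced: {line!r}")
    return {
        "date": date(int(year), MONTHS[month], int(day)),
        "release": int(release),
        "kind": kind,
        "version": None if version is None else int(version),
    }
```

Comparing the "date" values of the two DT lines of an entry would then show
whether the entry has been updated since it was created.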


2.3  Lowercase Sequences

The EMBL Data Library and  GenBank,  along  with  many  other  groups  who  deal
extensively  with  sequence  data,  have  long  noted  that  the presentation of
sequences using lowercase letters significantly improves the accuracy  of  human
readers  who  have to deal with them.  Since the use of lowercase letters is now
allowed in the IUPAC-IUB standard, we will switch to a lowercase presentation of
sequences as from the next quarterly release (i.e. Release 25 in November 1990).


2.4  Taxonomic Information

We will make the following changes to the way in which taxonomic information  is
represented in the database as from Release 26 in February 1991.


2.4.1  New OG (Organelle) Line

A new line type will be introduced,  to  indicate  the  location  of  non-nuclear
sequences.   It will only be present in entries containing non-nuclear sequences
and will appear after the last OC line in such entries.

The OG line will contain one data item, either  "Mitochondrion",  "Chloroplast",
"Kinetoplast" or a plasmid name (e.g.  "Plasmid pBR322").

OS lines of non-nuclear entries will no longer be prefixed  by  "Mitochondrion",
"Chloroplast"  or  "Kinetoplast";  this  information  will only appear on the OG
line.   We  will  also  abandon  the  use  of  separate  taxonomic   trees   for
chloroplastida and mitochondria.

For example, the current:

     OS   Chloroplast Euglena gracilis (green algae)
     OC   Chloroplastida; Planta; Phycophyta; Euglenophyceae.

will become:

     OS   Euglena gracilis (green algae)
     OC   Eukaryota; Planta; Phycophyta; Euglenophyceae.
     XX
     OG   Chloroplast
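
The OS part of this change could be mimicked on existing entries roughly as
sketched below. The name split_organelle is hypothetical, and the sketch covers
only the three organelle prefixes named above; it does not attempt the
accompanying change to the OC line, nor the handling of plasmid names.

```python
from typing import Optional, Tuple

# Prefixes that will move from the OS line to the new OG line.
ORGANELLES = ("Mitochondrion", "Chloroplast", "Kinetoplast")


def split_organelle(os_line: str) -> Tuple[str, Optional[str]]:
    """Return the new OS line and, for non-nuclear entries, an OG line."""
    text = os_line.strip()
    if not text.startswith("OS"):
        raise ValueError(f"not an OS line: {os_line!r}")
    name = text[2:].strip()
    for organelle in ORGANELLES:
        prefix = organelle + " "
        if name.startswith(prefix):
            # Drop the prefix from OS and report it on a separate OG line.
            return f"OS   {name[len(prefix):]}", f"OG   {organelle}"
    # Nuclear sequences keep their OS line and receive no OG line.
    return f"OS   {name}", None
```

For the Euglena gracilis example above, this yields the new OS line together
with the line "OG   Chloroplast"; the OC line would still need to be rewritten
separately.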


2.4.2  Hybrids

Hybrids will be handled by repeating the OS/OC lines for each source organism in
the hybrid.  A human/mouse hybrid, for example, will appear as follows:

     OS   Homo sapiens (human)
     OC   ... OC for humans ...
     XX
     OS   Mus musculus (mouse)
     OC   ... OC for mice ...



2.4.3  Unknown Sources

In cases where the source organism is unknown, the taxonomy on  the  OC  line(s)
will  be  as  specific  as  possible  and the OS line will be "OS Unknown".  For
example:

     OS   Unknown
     OC   Prokaryota; Bacteria.



2.4.4  Artificial Sequences

A new taxonomic node, "Artificial sequences", will be  introduced  at  the  same
level  as "Prokaryota", "Eukaryota", etc.  It will have (at least initially) two
child nodes:  "Cloning vectors" and "Synthetic genes".



2.4.5  Plasmids

For naturally occurring  plasmids  the  OS/OC  lines  will  contain  the  source
organism and the plasmid name will appear on an OG line.  For example:

     OS   Escherichia coli
     OC   Prokaryota; ... Enterobacteriaceae.
     XX
     OG   Plasmid colE1

For artificial plasmids the OS line will be "OS None" and the sequence  will  be
classified  as  a  cloning  vector.  The plasmid name will appear on an OG line.
For example:

     OS   None
     OC   Artificial sequences; Cloning vectors.
     XX
     OG   Plasmid pBR322

Where only a naturally occurring part of a plasmid is reported, the plasmid name
will appear on the OG line and the OS/OC lines will describe the natural source.
For example:

     OS   Escherichia coli
     OC   Prokaryota; ... Enterobacteriaceae.
     XX
     OG   Plasmid pUC8