[bionet.molbio.genbank] how to submit discontinuous sequence?

gilbertd@cricket.bio.indiana.edu (Don Gilbert) (02/01/91)

Is there a consensus view on the proper way to enter discontinuous 
sequences to GenBank?  An otherwise continuous length of 
molecule contains regions which were not sequenced, of many bases
in length.  Options seem to be 
  a) enter in databank under one accession number, with feature 
     notations indicated where regions with no data exist. Drawback:
     users can miss feature info and incorrectly use such data
     as a continuous sequence.
  b) enter in databank under separate accession numbers for each
     continuous region.  Drawback: sequential nature of data is
     obscured by separate entries.
  c) enter as one accession, with unsequenced regions (whose size
     is known, I believe, by alignment with related sequences) 
     indicated with "N" or other symbol.  Drawback: the
     N symbol may not be appropriate.
     
-- Don

-- 
Don Gilbert    gilbertd@cricket.bio.indiana.edu
biocomputing office, biology dept., indiana univ., bloomington, in 47405

wmf@CAAT.LANL.GOV (Will Fischer) (02/02/91)

Don Gilbert asks:

>> Is there a consensus view on the proper way to enter discontinuous 
>> sequences to GenBank?  An otherwise continuous length of 
>> molecule contains regions which were not sequenced, of many bases
>> in length.  Options seem to be 
>>   a) enter in databank under one accession number, with feature 
>>      notations indicated where regions with no data exist. Drawback:
>>      users can miss feature info and incorrectly use such data
>>      as a continuous sequence.
>>   b) enter in databank under separate accession numbers for each
>>      continuous region.  Drawback: sequential nature of data is
>>      obscured by separate entries.
>>   c) enter as one accession, with unsequenced regions (whose size
>>      is known, I believe, by alignment with related sequences) 
>>      indicated with "N" or other symbol.  Drawback: the
>>      N symbol may not be appropriate.

GenBank generally handles this (rather common) case with option "b".

We strenuously avoid option "a" (making a continuous sequence out of sequences 
that are non-contiguous in vivo).  The only _common_ exception to this is that 
class of sequences that are presented as a fusion of a genomic promoter region
followed by a mRNA-derived cDNA sequence (a peculiar practice, but not an
uncommon one).  As Greg Lennon points out, collapsing spaces in a sequence will
generate an incorrect total length; it will also force pattern matching 
attempts to intersperse gaps (with some likelihood of error).
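
The length and coordinate problem described above can be illustrated with a
small Python sketch. The function name and the example coordinates are
hypothetical, chosen only for illustration; they do not come from any GenBank
entry or tool.

```python
def collapsed_coordinate(true_pos, regions):
    """Map a 1-based position in the real molecule to its position in a
    sequence built by concatenating only the sequenced regions.

    regions: list of (start, end) pairs, 1-based inclusive, in order.
    Returns None if true_pos falls in an unsequenced stretch.
    """
    offset = 0  # bases already emitted by earlier sequenced regions
    for start, end in regions:
        if start <= true_pos <= end:
            return offset + (true_pos - start + 1)
        offset += end - start + 1
    return None


# Hypothetical layout: bases 1-100 and 151-300 sequenced, 101-150 unknown.
regions = [(1, 100), (151, 300)]
true_span = regions[-1][1]                           # 300
collapsed_length = sum(e - s + 1 for s, e in regions)  # 250

print(true_span, collapsed_length)
print(collapsed_coordinate(200, regions))  # shifted by the 50-base gap
print(collapsed_coordinate(120, regions))  # inside the gap
```

In this made-up layout the collapsed sequence reports 250 bases for a region
that actually spans 300, and every position downstream of the gap is shifted
by 50.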

We occasionally use strings of "n"s to pad out gaps in a sequence, but only
when the length of the gap is _precisely_ known (e.g., sequenced but omitted in
a figure). Generally, we will break even such a sequence up if the string of
included "n"s would exceed 10% of the total sequence length, since such
lengths of n's cause considerable noise in similarity searches.
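
As a rough sketch of the rule of thumb above (not GenBank's actual software),
the pad-or-split decision might look like this in Python. The function name,
the use of None for a gap of unknown length, and the choice to measure the 10%
against the padded length are all assumptions made for illustration.

```python
def pad_or_split(segments, gap_lengths, max_n_fraction=0.10):
    """Decide whether to join sequenced segments with runs of "n".

    segments:    sequenced stretches, in order along the molecule.
    gap_lengths: one entry per gap between neighbours; an int when the
                 length is precisely known, None when it is not.
    Returns a one-element list (the padded sequence) or the segments
    unchanged, meaning "submit as separate entries".
    """
    gap_lengths = list(gap_lengths)
    if len(gap_lengths) != len(segments) - 1:
        raise ValueError("need exactly one gap length between each pair")

    # Any gap of unknown length rules out padding.
    if any(g is None for g in gap_lengths):
        return list(segments)

    total_n = sum(gap_lengths)
    padded_length = sum(len(s) for s in segments) + total_n

    # Assumption: the 10% is measured against the padded length.
    if padded_length and total_n / padded_length > max_n_fraction:
        return list(segments)

    parts = []
    for seg, gap in zip(segments, gap_lengths + [0]):
        parts.append(seg)
        parts.append("n" * gap)
    return ["".join(parts)]


print(pad_or_split(["acgtacgtac", "ttgattgatt"], [2]))  # 2 of 22 bases are n
print(pad_or_split(["acgt", "ttga"], [4]))              # 4 of 12 bases are n
print(pad_or_split(["acgt", "ttga"], [None]))           # gap length unknown
```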

As for the "sequential nature of data" being "obscured by separate entries",
the "SEGMENT" linetype does clearly specify that a group of entries occur
in linear sequence.  Any future format changes will continue to present this
type of data in a parsable fashion.
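
To illustrate what "parsable" can mean in practice, here is a minimal Python
sketch that reads SEGMENT lines and puts a group of entries back in linear
order. It assumes the line has the form "SEGMENT     1 of 3" (keyword,
whitespace, "n of m"); the accession numbers are invented placeholders and
the function names are hypothetical.

```python
import re

# Assumed line shape: "SEGMENT", whitespace, then "<n> of <m>".
SEGMENT_RE = re.compile(r"^SEGMENT\s+(\d+)\s+of\s+(\d+)\s*$")


def parse_segment(line):
    """Return (index, total) from a SEGMENT line, or None if it does not match."""
    m = SEGMENT_RE.match(line)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def order_segments(entries):
    """entries maps an accession to its SEGMENT line.

    Returns the accessions in linear order along the molecule, or raises
    ValueError if the group is incomplete or inconsistent.
    """
    by_index = {}
    totals = set()
    for accession, line in entries.items():
        parsed = parse_segment(line)
        if parsed is None:
            raise ValueError(f"no SEGMENT information for {accession}")
        index, total = parsed
        by_index[index] = accession
        totals.add(total)
    if len(totals) != 1:
        raise ValueError("entries disagree on the number of segments")
    total = totals.pop()
    if sorted(by_index) != list(range(1, total + 1)):
        raise ValueError("segment group is incomplete")
    return [by_index[i] for i in range(1, total + 1)]


# Placeholder accessions, for illustration only.
group = {
    "X00003": "SEGMENT     3 of 3",
    "X00001": "SEGMENT     1 of 3",
    "X00002": "SEGMENT     2 of 3",
}
print(order_segments(group))
```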

-- Will Fischer
   GenBank  (genbank@life.lanl.gov)