gilbertd@cricket.bio.indiana.edu (Don Gilbert) (02/01/91)
Is there a consensus view on the proper way to enter discontinuous sequences to GenBank? An otherwise continuous length of molecule contains regions which were not sequenced, of many bases in length. Options seem to be

a) enter in databank under one accession number, with feature notations indicated where regions with no data exist. Drawback: users can miss feature info and incorrectly use such data as a continuous sequence.

b) enter in databank under separate accession numbers for each continuous region. Drawback: sequential nature of data is obscured by separate entries.

c) enter as one accession, with unsequenced regions (whose size is known, I believe, by alignment with related sequences) indicated with "N" or other symbol. Drawback: the N symbol may not be appropriate.

-- Don
--
Don Gilbert gilbertd@cricket.bio.indiana.edu
biocomputing office, biology dept., indiana univ., bloomington, in 47405
wmf@CAAT.LANL.GOV (Will Fischer) (02/02/91)
Don Gilbert asks:

>> Is there a consensus view on the proper way to enter discontinuous
>> sequences to GenBank? An otherwise continuous length of
>> molecule contains regions which were not sequenced, of many bases
>> in length. Options seem to be
>> a) enter in databank under one accession number, with feature
>> notations indicated where regions with no data exist. Drawback:
>> users can miss feature info and incorrectly use such data
>> as a continuous sequence.
>> b) enter in databank under separate accession numbers for each
>> continuous region. Drawback: sequential nature of data is
>> obscured by separate entries.
>> c) enter as one accession, with unsequenced regions (whose size
>> is known, I believe, by alignment with related sequences)
>> indicated with "N" or other symbol. Drawback: the
>> N symbol may not be appropriate.

GenBank generally handles this (rather common) case with option "b". We strenuously avoid option "a" (making a continuous sequence out of sequences that are non-contiguous in vivo). The only _common_ exception to this is that class of sequences that are presented as a fusion of a genomic promoter region followed by an mRNA-derived cDNA sequence (a peculiar practice, but not an uncommon one).

As Greg Lennon points out, collapsing spaces in a sequence will generate an incorrect total length; it will also force pattern matching attempts to intersperse gaps (with some likelihood of error).

We occasionally use strings of "n"s to pad out gaps in a sequence, but only when the length of the gap is _precisely_ known (e.g., sequenced but omitted in a figure). Generally, we will break even such a sequence up if the string of included "n"s would exceed 10% of the total sequence length, since such lengths of "n"s cause considerable noise in similarity searches.

As for the "sequential nature of data" being "obscured by separate entries", the "SEGMENT" linetype does clearly specify that a group of entries occur in linear sequence.
Any future format changes will continue to present this type of data in a parsable fashion.

-- Will Fischer
GenBank (genbank@life.lanl.gov)
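
For anyone who wants to apply the 10% rule of thumb from the reply above to their own data, here is a minimal Python sketch. It is an illustration only, not GenBank's actual software: the function names (`n_fraction`, `split_at_gaps`, `plan_entries`) and the constant `MAX_N_FRACTION` are made up for this example, and it assumes that every "n" in the input is gap padding of precisely known length. A real sequence may also use "n" as an ambiguity code inside a sequenced region, which this sketch would wrongly treat as a gap.

```python
import re

# Threshold taken from the reply above: split the entry if padding
# "n"s would exceed 10% of the total sequence length.
MAX_N_FRACTION = 0.10


def n_fraction(seq):
    """Fraction of positions in seq that are 'n' (case-insensitive)."""
    if not seq:
        return 0.0
    return seq.lower().count("n") / len(seq)


def split_at_gaps(seq):
    """Return (start, end, bases) for each run of non-'n' bases.

    Coordinates are 1-based and inclusive, measured on the padded
    sequence, so the original spacing of the pieces is not lost.
    """
    return [(m.start() + 1, m.end(), m.group())
            for m in re.finditer(r"[^nN]+", seq)]


def plan_entries(seq, max_n_fraction=MAX_N_FRACTION):
    """Keep one padded entry if padding is small, else one piece per run."""
    if n_fraction(seq) > max_n_fraction:
        return split_at_gaps(seq)
    return [(1, len(seq), seq)]


if __name__ == "__main__":
    lightly_padded = "acgt" * 5 + "nn" + "acgt" * 5   # 2 of 42 bases are n
    heavily_padded = "acgtacgt" + "nnnn" + "ttgacc"   # 4 of 18 bases are n
    for seq in (lightly_padded, heavily_padded):
        print(round(n_fraction(seq), 3), plan_entries(seq))
```

The second sequence comes back as two pieces, at positions 1-8 and 13-18 of the padded sequence. Keeping those coordinates is one way to avoid the incorrect total length that comes from collapsing the gap; how the pieces would then be tied together (for instance with the "SEGMENT" linetype) is outside what this sketch covers.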