[bionet.general] Sequence accuracy and genbank

murphy@PHRI.NYU.EDU (Ellen Murphy) (12/09/89)

In article <8912080616.AA04908@genie.gene.com> botstein@gene.com
(David Botstein) writes:
>... the level of accuracy required is dependent upon the use to
>which the sequence is to be put.  More important ... is an agreement
> on a labeling scheme that associates with each sequenced base
> (or stretch of bases) an estimate of the expected reliability...

   I agree completely.  There is definitely a place in the databases
for less than perfect sequences.  I'm not sure what kind of accuracy
we're talking about, though:  95% sounds a bit low; one reading
of one gel on one strand should produce better than that.
(I would also be surprised if the databases were really only
99% accurate).  But how do you define accuracy?  What about the
1 kb of sequence that I have 99.8% confidence in (there are two
uncertain bases) that is almost 100% single-strand generated?
I had to walk that far to find the bit I was interested in, and
I'm unlikely to sequence the other strand in the near future.
In the meantime, that sequence could be useful to somebody else.

  I don't think the search algorithms will be too much bothered;
a frameshift would simply cause tfasta to score a hit against two
reading frames in the same entry.  It ought to be possible to
determine what level of inaccuracy, randomly distributed (which
is probably not usually the case), would be required to miss
scoring a hit at a given level of similarity.  This is the number
we should be aiming for.
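
   As a rough illustration of that calculation, here is a minimal
sketch in Python.  It rests on assumptions that are not part of the
argument above: errors are substitutions only (no frameshifts), they
fall uniformly at random, each wrong base is equally likely, and a
"hit" simply means that percent identity stays at or above some
cutoff.  The function names and the cutoff model are hypothetical;
this is not how tfasta or any real search program scores a match.

```python
import random

BASES = "ACGT"


def expected_identity(true_identity, error_rate):
    """Expected observed identity after random substitution errors.

    A matching base stays matching unless it is hit by an error; a
    mismatching base becomes a match one time in three when an error
    happens to substitute the query's base.
    """
    kept = true_identity * (1.0 - error_rate)
    restored = (1.0 - true_identity) * error_rate / 3.0
    return kept + restored


def max_error_rate(true_identity, threshold):
    """Largest error rate at which expected identity still meets threshold.

    Obtained by solving expected_identity(s, e) = t for e, which gives
    e = 3 (s - t) / (4 s - 1).  Only meaningful for s above 0.25, the
    identity expected between unrelated random sequences.
    """
    if not 0.25 < true_identity <= 1.0:
        raise ValueError("true_identity must be in (0.25, 1.0]")
    if threshold >= true_identity:
        return 0.0  # no room for any error at all
    rate = 3.0 * (true_identity - threshold) / (4.0 * true_identity - 1.0)
    return min(rate, 1.0)


def simulated_identity(true_identity, error_rate, length=20000, seed=0):
    """Monte Carlo check of expected_identity under the same assumptions."""
    rng = random.Random(seed)
    matches = 0
    for _ in range(length):
        query = rng.choice(BASES)
        # The true database base agrees with the query with
        # probability true_identity, otherwise it is a different base.
        if rng.random() < true_identity:
            entry = query
        else:
            entry = rng.choice([b for b in BASES if b != query])
        # A sequencing error replaces the base with a different one.
        if rng.random() < error_rate:
            entry = rng.choice([b for b in BASES if b != entry])
        matches += entry == query
    return matches / length


if __name__ == "__main__":
    # Hypothetical numbers: sequences that are truly 80% identical,
    # with a hit counted only at 70% observed identity or better.
    rate = max_error_rate(0.80, 0.70)
    print("tolerable error rate: %.3f" % rate)
    print("simulated identity at that rate: %.3f"
          % simulated_identity(0.80, rate))
```

Under these assumptions the tolerable error rate depends only on the
gap between the true similarity and the cutoff.  Non-random errors
and frameshifts would need a different model.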

   On the related subject of when people should be required to
submit sequences to the databases, I think that those who are
using "it's not perfect yet" as an excuse will just find themselves
some other reason for delay.  Submission of sequences upon
acceptance of a manuscript is something that has to be enforced
by the journals.  I find the "it's not our responsibility" attitude
of some respected journals quite offensive.

Ellen Murphy
murphy@phri.nyu.edu