[bionet.general] Woftrap missile response

botstein@gene.com (David Botstein) (12/08/89)

It seems to me that a key problem (possibly THE key problem) in the
issue of when sequencers should be obliged (as opposed to
volunteering) to release new sequences relates to the definition of
"finished sequence".

It seems to me also that the parent problem has to do with the level
of sequence accuracy that is:

	a. minimally acceptable for a database

	b. sufficient to be called "finished"

	c. "optimal", i.e. the ideal compromise between accuracy and
	cost, which will be directly related.

I think it clear that there is a level of accuracy that is too
expensive to obtain for every base of a mega-sequence project.
Further, the level of accuracy required is dependent upon the use to
which the sequence is to be put.

More important than agreement on the levels of accuracy to be aimed
at, it seems to me, is agreement on a labeling scheme that associates
with each sequenced base (or stretch of bases) an estimate of its
expected reliability. If this is done honestly and correctly, then
many nasty misunderstandings can be avoided and the sequencers can
properly say "caveat emptor": after all, if the user needs more
accuracy for his or her particular purpose, he or she can improve the
sequence.
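
To make this concrete, here is a minimal sketch of what such labeling
might look like (the representation, field names, and error-rate
figures are entirely hypothetical, offered only as an illustration,
not as a proposed standard):

# Hypothetical sketch: a database entry carrying an estimated error
# probability for every base, supplied by the sequencing group.
entry = {
    "sequence":   "ATGGCGTTAACC",
    "error_prob": [0.001, 0.001, 0.002, 0.010, 0.010, 0.001,
                   0.001, 0.050, 0.050, 0.002, 0.001, 0.001],
}

def expected_errors(entry):
    """Expected number of erroneous bases in the entry."""
    return sum(entry["error_prob"])

def low_confidence_positions(entry, threshold=0.02):
    """Positions whose estimated error rate exceeds the threshold."""
    return [i for i, p in enumerate(entry["error_prob"]) if p > threshold]

With labels like these, a user can decide directly whether a given
stretch is reliable enough for the purpose at hand, or whether it
needs to be resequenced first.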

I raise these issues because I think there are scientific questions
that should be answered before we become unduly emotional about who
will owe what to whom, and when. These include:

	1. What is the relationship between the error rate of a DNA
sequence and its usefulness in searches aimed at:

	a. homology at the amino acid sequence level

	b. homology at the DNA "consensus sequence" level

	2. What can be done to make search algorithms relatively
insensitive to sequence errors? For instance, I can envision programs
that, when searching DNA sequences labeled as having higher than
normal predicted error rates, check for continued open reading in
alternate frames.
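
To illustrate the kind of thing I mean, the following rough sketch
(in Python; the thresholds and the codon-count scoring are invented
for the example) scans a sequence in one reading frame and, at each
stop codon, asks whether reading continues much longer in a shifted
frame, which is the signature one would expect from a single
insertion or deletion error:

# Illustrative sketch only: a reading-frame scan that tolerates
# apparent frame shifts in error-prone sequence.
STOPS = {"TAA", "TAG", "TGA"}

def stop_free_run(seq, offset):
    """Number of consecutive non-stop codons read from 'offset'."""
    run = 0
    for i in range(offset, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            break
        run += 1
    return run

def frameshift_suspects(seq, min_gain=10):
    """At each stop codon in frame 0, flag positions where reading
    continues at least 'min_gain' codons longer in a shifted frame,
    as a single insertion/deletion error would cause."""
    suspects = []
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in STOPS:
            runs = {shift: stop_free_run(seq, i + 3 + shift)
                    for shift in (0, 1, 2)}
            best = max(runs, key=runs.get)
            if best != 0 and runs[best] >= runs[0] + min_gain:
                suspects.append((i, best))
    return suspects

A search program supplied with the error labels discussed above could
apply such a check only to stretches flagged as unreliable, and so
avoid discarding matches merely because of an apparent frame shift.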

This is only a partial list. I would like to encourage everyone to
think about accuracy and its impact, for the cost of having a sequence
is strongly affected by the needed accuracy. We should not set out to
determine the sequence too accurately, lest we make the cost prohibitive.
It seems quite possible to me that for most purposes much less than
the current publication standard will suffice as a first pass that
will allow extraction of most of the information; refinement could
then be done selectively as needed. Such a strategy might reduce the
cost significantly, and provide much of the value of having lots of
sequence much earlier. What I am missing is good estimates of the
consequences of having 2, 3, or even 5% error instead of the ca. 1% in
the current databases.
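
As a crude, purely illustrative piece of arithmetic (my own, not data
from any study), one can at least ask how often a stretch of a given
length is entirely error-free at various per-base error rates:

# Probability that a stretch of L bases is entirely correct when each
# base is wrong independently with probability p: (1 - p) ** L.
for p in (0.01, 0.02, 0.03, 0.05):
    for L in (30, 300, 1000):     # roughly: a probe, a small ORF, a gene
        print("p = %4.2f  L = %4d  P(error-free) = %5.3f"
              % (p, L, (1.0 - p) ** L))

Even at the current ca. 1% standard a 300-base stretch is error-free
only about 5% of the time, and at 5% error essentially never; so the
practical question is less whether errors occur than how gracefully
the analyses degrade as the expected number of errors in a 300-base
stretch rises from about 3 to 15.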

I look forward to replies. Quantitative estimates concerning what is
possible at what level of accuracy would be most valuable in laying a
rational foundation for policy in this arena.

Thanks for your attention

David Botstein