botstein@gene.com (David Botstein) (12/08/89)
It seems to me that a key problem (possibly THE key problem) in the issue of when sequencers should be obliged (as opposed to volunteering) to release new sequences relates to the definition of "finished sequence". It seems to me also that the parent problem has to do with the level of accuracy of sequence that is:

   a. minimally acceptable for a database
   b. sufficient to be called "finished"
   c. "optimal", i.e. the ideal compromise between accuracy and cost,
      since the two are directly related.

I think it is clear that there is a level of accuracy that is too expensive to obtain for every base of a mega-sequence project. Further, the level of accuracy required depends on the use to which the sequence will be put.

More important than agreement on the levels of accuracy to be aimed at, it seems to me, is agreement on a labeling scheme that associates with each sequenced base (or stretch of bases) an estimate of its expected reliability. If this is done honestly and correctly, then many nasty misunderstandings will be avoidable and the sequencers can properly say "caveat emptor"; after all, if the user needs more accuracy for his or her particular purpose, he or she can improve the sequence.

I raise these issues because I think there are scientific questions that should be answered before we become unduly emotional about who will owe what to whom, and when. These include:

   1. What is the relationship between the error rate of a DNA sequence
      and its usefulness in searches aimed at:
      a. homology at the amino acid sequence level
      b. homology at the DNA "consensus sequence" level
   2. What can be done to make search algorithms relatively insensitive
      to sequence errors? For instance, I can envision programs that
      search for continued open reading in alternate frames when
      searching DNA sequences labeled as having higher than normal
      predicted error rates. (A sketch of such a scan appears at the
      end of this note.)

This is only a partial list. I would like to encourage everyone to think about accuracy and its impact, for the cost of obtaining a sequence is strongly affected by the accuracy required. We should not set out to determine the sequence too accurately, lest we make the cost prohibitive. It seems quite possible to me that for most purposes much less than the current publication standard will suffice as a first pass that allows extraction of most of the information; refinement could then be done selectively as needed. Such a strategy might reduce the cost significantly and provide much of the value of having lots of sequence much earlier.

What I am missing is good estimates of the consequences of having 2, 3, or even 5% error instead of the ca. 1% in the current databases (a second sketch below outlines a crude way to begin estimating this). I look forward to replies. Quantitative estimates of what is possible at each level of accuracy would be most valuable in laying a rational foundation for policy in this arena.

Thanks for your attention.

David Botstein
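
A minimal sketch of the error-tolerant scan envisioned in point 2: when a stop codon interrupts open reading, try shifting the frame by one base in either direction (a crude stand-in for an insertion or deletion error) and continue if reading reopens. The function names, the 15-base rescue threshold, and the two-shift limit are all illustrative assumptions, not part of any existing program.

    STOPS = {"TAA", "TAG", "TGA"}

    def open_run(seq, start):
        # Length of open reading (in bases) from `start`, stopping at
        # the first in-frame stop codon or the end of the sequence.
        i = start
        while i + 3 <= len(seq):
            if seq[i:i+3] in STOPS:
                break
            i += 3
        return i - start

    def tolerant_orf_length(seq, start=0, max_shifts=2, rescue=15):
        # Open-reading length that tolerates up to `max_shifts` frame
        # shifts: on hitting a stop, try shifting +/- one base and
        # continue if at least `rescue` bases of open reading follow
        # in the shifted frame.
        length, i, shifts = 0, start, 0
        while i + 3 <= len(seq):
            if seq[i:i+3] in STOPS:
                if shifts >= max_shifts:
                    break
                for d in (1, -1):
                    j = i + d
                    if j >= 0 and open_run(seq, j) >= rescue:
                        i, shifts = j, shifts + 1
                        break
                else:
                    break          # no alternate frame reopens; give up
                continue
            length += 3
            i += 3
        return length

    # A stop that a +1 frame shift "reads through":
    # tolerant_orf_length("ATGAAATAAAGAAGAAGAAGAAGAA")  -> 24
    # open_run("ATGAAATAAAGAAGAAGAAGAAGAA", 0)          ->  6

A scan like this would naturally be enabled only for stretches whose reliability labels predict a higher than normal error rate, exactly as the note suggests.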
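
And a back-of-the-envelope way to begin answering the 1% versus 5% question: corrupt a coding sequence with random substitutions at per-base rate p and count how many codons survive intact. This is a rough illustration only; it models substitutions alone (indels, which destroy reading frames, would be considerably worse), and the names are made up for the sketch.

    import random

    BASES = "ACGT"

    def corrupt(seq, p, rng):
        # Substitute each base independently with probability p.
        return "".join(rng.choice(BASES.replace(b, "")) if rng.random() < p
                       else b
                       for b in seq)

    def fraction_codons_intact(seq, p, trials=1000, seed=0):
        # Monte Carlo estimate of the fraction of codons left unchanged
        # after corrupting `seq` with per-base error rate p.
        rng = random.Random(seed)
        n = len(seq) // 3
        hits = 0
        for _ in range(trials):
            s = corrupt(seq, p, rng)
            hits += sum(seq[i:i+3] == s[i:i+3] for i in range(0, 3 * n, 3))
        return hits / (trials * n)

Analytically, a codon survives with probability (1 - p)^3: about 97% of codons are intact at p = 0.01 but only about 86% at p = 0.05, before any account is taken of frameshifts. Synonymous substitutions soften the effect slightly at the amino acid level, which is one reason amino-acid-level homology searches may tolerate error better than DNA-level ones.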