BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe) (12/05/89)
David,

In regard to the recent discussions of Genome Data Submission and NIH Policy: I also was at the Wolf Trap meeting and was bothered by two points.

1. The submission issue and hoarding of data.
2. A discussion of how many errors should be tolerated in the sequence data in the human genome project.

This message addresses my concerns about the second issue: how accurate should data be before it is published and entered into the databases? The discussion at Wolf Trap ranged from 99% accuracy (1 mistake/100 bases) to 99.9% accuracy (1 mistake/1000 bases).

In this regard, recent mail was posted stating that the automated DNA sequencing instruments should submit the data directly to the databases. It should be pointed out that the accuracy of the data collected by these instruments ranges from 80% to 99%, depending on the quality of the DNA sequenced, the protocols used for sequencing, the instrument used and its associated programs, and the ability of the individual scientist to proofread his/her sequence. To suggest that these instruments should enter the data directly into the databases is absurd, as no one would want to submit data which was so filled with errors and sequenced only once.

It also should be pointed out that each base in a final sequence should be the result of several experiments in which each base is sequenced a minimum of twice in one orientation and at least once in the other orientation. This rule, meant to ensure the greatest accuracy in a sequenced region, was termed the "rule of 3" in Sanger's laboratory at the MRC while we were sequencing the human mitochondrial genome, and it has been widely adopted by sequencers.

Once a sequence has been completed and is submitted for publication and to one of the databases, one usually is convinced that it is 100% accurate.
However, in reality the sequence usually does contain errors, which can be minimized by sequencing separate isolates, following the "rule of 3", proofreading extensively, and altering the sequencing conditions so that gel artifacts (compressions) and sequence ambiguities are as few as possible.

With this in mind, the issue is the extent of acceptable error. A quick and dirty road map of the human genome would probably have an error of 1-2% if those of us working on this project were not concerned with minimizing the errors. But how minimal is minimal? What is an acceptable error to those doing evolutionary comparisons? The goal of keeping the errors to less than 1 in 10,000 or even 1 in 1000 may be unrealistic, but what good is the data if it is filled with errors?

At present we are sequencing both the human c-abl gene on chromosome 9 and the breakpoint cluster gene (bcr) region on chromosome 22. Sequencing both regions will entail determining the sequence of over 500,000 unique nucleotides. To follow the "rule of 3" we must obtain sequence data from over 1.5 million bases to complete the sequence of the 500,000 nucleotides. If our accuracy is 99.99%, then 50 nucleotides will be in error, which may well be an acceptable level. However, if our accuracy is 99.9%, 500 nucleotides will be incorrect. An error of 500 nucleotides out of 500,000 is, to me, unacceptable. Thus, methods will be needed to rapidly find these errors and correct them before the sequenced region is entered into the literature and made available in the databases.

Finding and correcting any sequencing errors takes the major portion of a sequencer's time. Usually 90% of the sequence data is collected in 10% of the time, and the remaining 90% of the time for a sequencing project goes to collecting the remaining 10% of the data, most of which is resolving artifacts and proofreading to minimize the errors.

To address this issue adequately, it is important that we also discuss the cost factors.
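For anyone who wants to check the error counts quoted above for the c-abl and bcr project, here is a minimal Python sketch. The function and constant names are illustrative and not part of the original message. Accuracy is written as "one error per N bases" so that the arithmetic stays in exact integers.

```python
# Expected number of wrong bases in a finished sequence at a given accuracy.
# Names are illustrative; the figures are the round numbers from the message.

UNIQUE_BASES = 500_000   # c-abl plus bcr regions
RULE_OF_3 = 3            # each base read twice in one orientation, once in the other


def raw_bases_needed(unique_bases, coverage=RULE_OF_3):
    """Total bases that must be read to cover every position `coverage` times."""
    return unique_bases * coverage


def expected_errors(unique_bases, bases_per_error):
    """Expected wrong bases when there is one error per `bases_per_error` bases."""
    return unique_bases // bases_per_error


if __name__ == "__main__":
    print("raw bases to read:", raw_bases_needed(UNIQUE_BASES))
    for label, n in [("99%", 100), ("99.9%", 1_000), ("99.99%", 10_000)]:
        print(label, "accuracy:", expected_errors(UNIQUE_BASES, n), "errors")
```

At 99% accuracy, the low end of the Wolf Trap range, the same 500,000 bases would carry 5,000 errors.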
Several groups have suggested that it now is possible to collect roughly 10,000 nucleotides of sequence data per automated DNA sequencing instrument per day, with one or more persons feeding the instrument with DNA sequencing reactions. At that rate, it will take 150 days to collect the 1.5 million nucleotides necessary for us to complete the sequence of the human genomic c-abl and bcr regions. But in light of my above discussion, these 150 days represent only 10% of the actual work, because clones have to be isolated, the sequence needs proofreading, etc. Thus it really will take a further 1350 instrument-days, 1500 in all, to complete the entire sequence with an error of roughly 500 nucleotides. That's about 4 years (depending on how you count weekends). If the human genome contains roughly 3 billion nucleotides, then it would take 24,000 years to complete the sequence of the entire human genome with one instrument, and 24 years using 1000 instruments.

At $100,000 per instrument, that's $100 million in instruments alone. Add to that the cost of the scientists feeding the instruments (say 2 scientists/instrument at $30k/year for salary/fringe, another $12k for supplies, and $8k for maintenance on the instrument) and you have another roughly $100,000 per year times 1000 instruments for 24 years. This adds $2.4 billion over the 24 years, for $2.5 billion with the instruments. Because the cost of supplies may be in excess of $50k/year/automated DNA sequencing instrument, these figures may be a little on the low side. Add another $100 million for administrative and other expenses and that's $2.6 billion to sequence the 3 billion nucleotides of the human genome at 99.9% accuracy, or roughly $1 a base.

The final figure of $1/base is, interestingly, close to what it costs my laboratory today to determine a final sequence. To reduce the cost/base to $0.10 (10 cents a base), suggested as a figure which would make the human genome project realistic, would require something to give.
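The time and cost estimate above can be retraced with a short Python sketch. This is only a back-of-envelope check that takes the message's round figures as given, including its rounding of one 500,000-base project to 4 instrument-years and of the yearly running cost to $100,000 per instrument. The variable names are illustrative.

```python
# Back-of-envelope time and cost estimate using the message's round figures.
# Variable names are illustrative; integers keep the arithmetic exact.

BASES_PER_INSTRUMENT_DAY = 10_000     # throughput suggested by several groups
RAW_BASES_NEEDED = 1_500_000          # 500,000 unique bases under the "rule of 3"
PROJECT_BASES = 500_000               # unique bases in the c-abl and bcr regions
COLLECTION_SHARE_PERCENT = 10         # raw collection as a share of all the work
YEARS_PER_PROJECT = 4                 # the message's rounding of the instrument-days
GENOME_BASES = 3_000_000_000
INSTRUMENTS = 1_000
INSTRUMENT_PRICE = 100_000            # dollars per instrument
RUNNING_COST_PER_YEAR = 100_000       # dollars per instrument per year (rounded)
ADMIN_COST = 100_000_000              # dollars


def estimate():
    """Return the intermediate and final figures of the estimate as a dict."""
    collection_days = RAW_BASES_NEEDED // BASES_PER_INSTRUMENT_DAY
    total_days = collection_days * 100 // COLLECTION_SHARE_PERCENT
    projects = GENOME_BASES // PROJECT_BASES
    instrument_years = projects * YEARS_PER_PROJECT
    calendar_years = instrument_years // INSTRUMENTS
    capital = INSTRUMENTS * INSTRUMENT_PRICE
    running = RUNNING_COST_PER_YEAR * INSTRUMENTS * calendar_years
    total_cost = capital + running + ADMIN_COST
    return {
        "collection_days": collection_days,
        "remaining_days": total_days - collection_days,
        "total_days": total_days,
        "instrument_years": instrument_years,
        "calendar_years": calendar_years,
        "total_cost": total_cost,
        "cost_per_base": total_cost / GENOME_BASES,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(name, value)
```

With these inputs the cost per base comes out at about $0.87, which the message rounds to $1. Changing `BASES_PER_INSTRUMENT_DAY` or `COLLECTION_SHARE_PERCENT` shows how sensitive the day counts are to the throughput and proofreading assumptions.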
Either the instruments become 10-fold more efficient, or the accuracy becomes 10-fold less, or some combination of the two. In any case, something has to give, and I am afraid that accuracy will suffer, because I do not see more than a 2-fold increase in efficiency of the automated DNA sequencing instruments in the near future.

This tome has broken all the rules of e-mail except those against using vulgar words and unauthorized advertising. I apologize for its length but not for its contents. The question remains: "How accurate is acceptable, and is the cost worth it?" I await comments from my colleagues.

Bruce A. Roe
Professor of Chemistry and Biochemistry
INTERNET: BROE@aardvark.ucs.uoknor.edu
AT&TNET: 405-325-4912
SnailNet: Department of Chemistry and Biochemistry
          University of Oklahoma
          620 Parrington Oval, Rm 208
          Norman, Oklahoma 73017