[bionet.general] Sequence accuracy

BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe) (12/05/89)
David,
In regard to the recent discussions of Genome Data Submission
and NIH Policy, I also was at the Wolf Trap meeting and was bothered
by two points. 

1. The submission issue and hoarding of data.
2. A discussion of how many errors should be tolerated in the sequence
   data in the human genome project.

	This message addresses my concerns about the second issue.

	How accurate should data be before it is published and entered
into the databases?  The discussion at Wolf Trap ranged from 99% accuracy
(1 mistake/100 bases) to 99.9% accuracy (1 mistake/1000 bases).
	In this regard, a recent posting stated that automated DNA
sequencing instruments should submit their data directly to the databases.
It should be pointed out that the accuracy of the data collected by
these instruments ranges from 80% to 99% depending on the quality of the
DNA sequenced, the protocols used for sequencing, the instrument used and
its associated programs, and the ability of the individual scientist to
proofread his/her sequence.  To suggest that these instruments should enter
the data directly into the databases is absurd, as no one would want to submit
data which was so filled with errors and sequenced only once.  It also
should be pointed out that each base in a final sequence should be the result
of several experiments in which each base is sequenced a minimum of twice
in one orientation and at least once in the other orientation.  This rule
to ensure the greatest accuracy in a sequenced region was termed the
"rule of 3" in Sanger's laboratory at the MRC while we were sequencing
the human mitochondrial genome and has been widely adopted by sequencers.
	Once a sequence has been completed and is submitted for publication
and to one of the databases, one usually is convinced that it is 100%
accurate.  However, in reality the sequence usually does contain errors
which can be minimized by sequencing separate isolates, sequencing following
the "rule of 3", extensive proofreading, and altering the sequencing
conditions such that gel artifacts (compressions) and sequence ambiguities
are as low as possible.
	With this in mind, the issue is to discuss the extent of acceptable
error.  A quick and dirty road map of the human genome would probably have
an error of 1-2% if those of us working on this project were not concerned
with minimizing the errors.  But how minimal is minimal?  What is an
acceptable error to those doing evolutionary comparisons?  The goal of
keeping the errors to less than 1 in 10,000 or even 1 in 1000 may be
unrealistic but what good is the data if it is filled with errors? 
	At present we are sequencing both the human c-abl gene on chromosome
9 and the break point cluster gene (bcr) region on chromosome 22.  Sequencing
both regions will entail determining the sequence of over 500,000 unique
nucleotides. To follow the "rule of 3" we must obtain sequence data from
over 1.5 million bases to complete the sequence of the 500,000 nucleotides.
If our accuracy is 99.99% then 50 nucleotides will be in error, which may
well be an acceptable level.  However, if our accuracy is 99.9%, 500
nucleotides will be incorrect.  An error of 500 nucleotides out of
500,000 is, to me, unacceptable. Thus, methods will be needed to rapidly
find these errors and correct them before the sequenced region is entered
into the literature and available in the databases.  Finding and correcting
any sequencing errors takes the major portion of a sequencer's time.
90% of the sequence data usually is collected in 10% of the time and the
remaining 90% of the time for a sequencing project entails collecting the
remaining 10% of the data, most of which is resolving artifacts and
proofreading to minimize the errors.
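
The error counts and the "rule of 3" workload above can be checked with a
short Python sketch. The function names are illustrative assumptions, not
part of any sequencing software; the figures are the ones quoted in the text.

```python
# Expected number of wrong bases in a finished sequence, and the number of
# raw bases that must be read under the "rule of 3".
# Function and constant names are hypothetical, chosen for illustration.

def expected_errors(unique_bases: int, bases_per_error: int) -> float:
    """Expected incorrect bases when there is 1 error per `bases_per_error`."""
    return unique_bases / bases_per_error

def raw_bases_needed(unique_bases: int, coverage: int = 3) -> int:
    """Bases to read when every position is sequenced `coverage` times."""
    return unique_bases * coverage

UNIQUE_BASES = 500_000  # c-abl plus bcr regions, as quoted in the text

# 99%, 99.9% and 99.99% accuracy expressed as bases per error
for bases_per_error in (100, 1_000, 10_000):
    errors = expected_errors(UNIQUE_BASES, bases_per_error)
    print(f"1 error per {bases_per_error:,} bases: about {errors:.0f} errors")

print(f"raw bases to read: {raw_bases_needed(UNIQUE_BASES):,}")
```

Expressing accuracy as "bases per error" keeps the arithmetic exact and
matches the way the Wolf Trap figures are stated.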

	To address this issue adequately, it is important that we also 
discuss the cost factors.  As several groups have suggested, it now
is possible to collect roughly 10,000 nucleotides of sequence data per
automated DNA sequencing instrument/day with one or more persons feeding
the instrument with DNA sequencing reactions. At that rate, it will take
150 days to collect the 1.5 million nucleotides necessary for us to
complete the sequence of the human genomic c-abl and bcr regions.  But
in light of my above discussion, these 150 days represent only 10% of
the actual work, because clones have to be isolated, and the sequence needs
proofreading, etc. Thus it really will take another 1350 instrument-days,
or 1500 in all, to complete the entire sequence with an error of roughly
500 nucleotides.  That's about 4 years (depending on how you count weekends).
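
The time estimate can be worked through as follows. This is a minimal
sketch that takes the 10%/90% split described above at face value; the
variable names are illustrative. If the 150 collection days are 10% of the
work, the total is 1500 instrument-days, of which 1350 fall outside data
collection.

```python
# Instrument-days for the c-abl/bcr project, using the figures in the text.
# Variable names are hypothetical, chosen for illustration.

RAW_BASES = 1_500_000              # 500,000 unique bases read 3 times
BASES_PER_INSTRUMENT_DAY = 10_000  # throughput quoted in the text
COLLECTION_SHARE_PERCENT = 10      # data collection as a share of all work

collection_days = RAW_BASES // BASES_PER_INSTRUMENT_DAY
total_days = collection_days * 100 // COLLECTION_SHARE_PERCENT
other_days = total_days - collection_days  # cloning, proofreading, etc.
years = total_days / 365                   # calendar days, no weekends off

print(f"collection: {collection_days} instrument-days")
print(f"other work: {other_days} instrument-days")
print(f"total:      {total_days} instrument-days, about {years:.1f} years")
```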
	If the human genome contains roughly 3 billion nucleotides,
then it would take 24,000 years to complete the sequence of the entire
human genome with one instrument and 24 years using 1000 instruments.
At $100,000 per instrument, that's $100 million in instruments alone.
Add to that the cost of scientists feeding the instruments (say 2
scientists/instrument at $30k/year for salary/fringe and another $12k
for supplies and $8k for maintenance on the instrument) and you have
another roughly $100,000 per year times 1000 instruments for 24 years.
That adds $2.4 billion over the 24 years, for $2.5 billion so far. Because
the cost of supplies may be in excess of $50k/year/automated DNA sequencing
instrument, these figures may be a little on the low side. Add another
$100 million for administrative and other expenses and that's $2.6 billion
to sequence the 3 billion nucleotides of the human genome at 99.9% accuracy,
or roughly $1 a base.  The final figure of $1/base is interestingly close
to what it costs my laboratory today to determine a final sequence.
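
For readers who want to vary the assumptions, the genome-scale estimate can
be laid out as a short Python sketch. It uses only the figures quoted above,
including the rounding of the itemized $80k running cost up to $100k per
instrument per year; the names are illustrative.

```python
# Scale the c-abl/bcr estimate to the whole genome and total the costs.
# All figures are those quoted in the text; names are hypothetical.

GENOME_BASES = 3_000_000_000
REGION_BASES = 500_000   # c-abl plus bcr
YEARS_PER_REGION = 4     # one instrument, rounded as in the text
INSTRUMENTS = 1_000

instrument_years = GENOME_BASES // REGION_BASES * YEARS_PER_REGION
project_years = instrument_years // INSTRUMENTS

INSTRUMENT_PRICE = 100_000
# Itemized yearly cost per instrument: 2 salaries, supplies, maintenance.
itemized_running_cost = 2 * 30_000 + 12_000 + 8_000
RUNNING_COST_PER_YEAR = 100_000  # the text rounds the itemized sum up
ADMIN_COST = 100_000_000

purchase_cost = INSTRUMENT_PRICE * INSTRUMENTS
running_cost = RUNNING_COST_PER_YEAR * INSTRUMENTS * project_years
total_cost = purchase_cost + running_cost + ADMIN_COST
cost_per_base = total_cost / GENOME_BASES

print(f"one instrument:    {instrument_years:,} years")
print(f"{INSTRUMENTS:,} instruments: {project_years} years")
print(f"itemized running cost: ${itemized_running_cost:,}/instrument/year")
print(f"total cost: ${total_cost:,}")
print(f"cost per base: ${cost_per_base:.2f}")
```

The result is a little under $1 a base, which the text rounds to $1.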
	To reduce the cost/base to $0.10 (10 cents a base), suggested as
being a figure which would make the human genome project realistic,
something would have to give.  Either the instruments become 10-fold more
efficient or the accuracy becomes 10-fold less, or some combination of
the two.  In any case, something has to give and I am afraid that accuracy
will suffer because I do not see more than a 2-fold increase in efficiency
of the automated DNA sequencing instruments in the near future.
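
The trade-off can be made concrete with a rough sketch. It assumes, purely
for illustration, that cost per base falls in proportion to the instrument
efficiency gain multiplied by the factor by which accuracy is relaxed; that
model is a simplification and is not stated in the text.

```python
# Assumed model (not from the text): the cost per base is divided by
# (efficiency gain) x (accuracy relaxation factor).
# The function name is hypothetical.

def accuracy_relaxation_needed(current_cost: float, target_cost: float,
                               efficiency_gain: float) -> float:
    """Factor by which the error rate must rise to reach the target cost."""
    required_reduction = current_cost / target_cost
    # A factor below 1 would mean efficiency alone is enough.
    return max(1.0, required_reduction / efficiency_gain)

factor = accuracy_relaxation_needed(current_cost=1.00, target_cost=0.10,
                                    efficiency_gain=2.0)
bases_per_error = 1_000 / factor  # starting from 1 error per 1000 bases

print(f"accuracy must be relaxed about {factor:.0f}-fold")
print(f"that is about 1 error per {bases_per_error:.0f} bases")
```

Under this assumed model, a 2-fold efficiency gain leaves a 5-fold gap that
would have to come out of accuracy.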
	This tome has broken all the rules of e-mail except those against
using vulgar words or posting unauthorized advertisements.  I apologize for its
length but not for its contents.  The question remains "How accurate is
acceptable and is the cost worth it?"  I await comments from my colleagues.

	Bruce A. Roe
	Professor of Chemistry and Biochemistry
        INTERNET: BROE@aardvark.ucs.uoknor.edu
	AT&TNET:  405-325-4912
	SnailNet: Department of Chemistry and Biochemistry
		  University of Oklahoma
		  620 Parrington Oval, Rm 208
		  Norman, Oklahoma 73017