daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Computer Applications in the Sequencing of Large Genomes
It is trivial to assert that computer programs of the type currently
available on BIONET and numerous other general purpose molecular biology
software packages will play a key role in the massive sequencing projects
envisioned for the near future. The difficult tasks of accumulating the
large number of fragment sequences required, of assembling these
fragments into coherent sequences, and of keeping track of the immense
volume of sequence data are natural applications for computers. However,
management of cloning and sequencing is only the smallest and most
mechanical aspect of the application of computers to the large genome
project. It would be more accurate to say that there would be no point in
accumulating large amounts of sequence data if electronic data processing
methods were not available to make the data useful to the scientific
community. Further, in anticipation of the glut of data to come, it is
likely that currently available software and hardware will not be adequate
to prevent us from drowning in our own data.
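
As an illustration of the assembly task, the sketch below (in C) shows the
core of a single assembly step: finding the longest exact overlap between
the tail of one sequencing fragment and the head of another. The fragments
and the exact-match criterion are of course simplifications; a practical
assembler must compare all pairs of fragments and tolerate sequencing
errors.

#include <stdio.h>
#include <string.h>

/* One step of fragment assembly: find the longest suffix of fragment a
 * that exactly matches a prefix of fragment b. Real assemblers must
 * tolerate sequencing errors and search all fragment pairs; this
 * illustrates only the core overlap test. */
static int overlap(const char *a, const char *b)
{
    int la = strlen(a), lb = strlen(b);
    int max = la < lb ? la : lb;
    for (int len = max; len > 0; len--)
        if (strncmp(a + la - len, b, len) == 0)
            return len;
    return 0;
}

int main(void)
{
    /* two hypothetical fragments sharing the 5-base overlap "CAGGT" */
    const char *a = "GATTACAGGT";
    const char *b = "CAGGTTTAAC";
    printf("overlap = %d\n", overlap(a, b));
    return 0;
}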
-------
At present, the Genbank database consists of approximately 14,000
sequences comprising ~15 Mb. Searching this entire database using IFIND on
BIONET is already impractical, and searches are often restricted to
subsets of the total database. For example, the mammalian and unannotated
lists comprise about 7000 sequences totaling more than 6 Mb. A
recent search of this segment of Genbank using a 1.6 kb probe required
approximately 3 hours of cpu time on the BIONET computer in batch mode.
The same program running on a VAX (about 5 times faster on the Sieve of
Eratosthenes benchmark) required about 45 minutes of cpu time. Using the
faster Lipman and Pearson algorithm (XFASTN on BIONET), 1.7 Mb in 1565
sequences were searched in 9 min, while another implementation of the
Wilbur and Lipman method on a VAX searched the mammalian and unannotated
lists in about 20 min. As the search time is highly dependent on the probe
size (smaller is faster) and the word size (larger is faster), these
searches were conducted with roughly comparable parameters. It was also
somewhat distressing that several of these searches returned different
lists of similar sequences. Clearly, attempts to apply these techniques
to the complete human genome (~30 Gb; 2000 times larger than the current
Genbank database) will strain all available facilities beyond the breaking
point. The recent proposal to begin accumulating sequence data of this
magnitude poses a clear challenge to the molecular biology software
community to develop new and faster algorithms for new and faster hardware
in order to provide tools capable of making practical use of
gigabase databases.
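
As an aside on why word size matters, the sketch below (in C) illustrates
the table-lookup idea underlying the Wilbur and Lipman and Lipman and
Pearson methods: every word of size K in the probe is entered in a table,
and the library sequence is scanned once against it. The word size, base
encoding and hit counting here are illustrative simplifications rather
than the published algorithms.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define K 4                      /* word size; table has 4^K entries */

static int code(char c)          /* 2-bit encoding of a base */
{
    switch (c) {
    case 'A': return 0;
    case 'C': return 1;
    case 'G': return 2;
    default:  return 3;          /* T (and anything else) */
    }
}

static int word_at(const char *s, int i)
{
    int w = 0;
    for (int j = 0; j < K; j++)
        w = (w << 2) | code(s[i + j]);
    return w;
}

int main(void)
{
    /* hypothetical probe and library sequences */
    const char *probe = "GGATCCTTAG";
    const char *library = "TTTGGATCCAACTTAGGATCC";
    int nwords = 1 << (2 * K);   /* 4^K possible words */
    char *seen = calloc(nwords, 1);

    /* index every K-word of the probe */
    for (int i = 0; i + K <= (int)strlen(probe); i++)
        seen[word_at(probe, i)] = 1;

    /* a single scan of the library, counting word hits */
    int hits = 0;
    for (int i = 0; i + K <= (int)strlen(library); i++)
        if (seen[word_at(library, i)])
            hits++;
    printf("%d word hits of size %d\n", hits, K);
    free(seen);
    return 0;
}

For random DNA, each additional base of word size cuts the chance that a
library word spuriously matches the table about four-fold, which is the
source of the speed gain noted above.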
-------
The current DNA and protein databases are unidimensional and
idiosyncratic. Future databases should be relational so that each sequence
is linked to: 1) biologically related DNA sequences, 2) physically related
DNA (map coordinates perhaps) and 3) derived sequence and other data
including protein sequences, regulatory elements and other features.
Moreover, these linkages will have to be standardized and pre-indexed, so
that each database query does not have to begin from scratch. By heavily
coding and indexing the data, search times can be brought to manageable
limits, and relational patterns amongst sequences will become evident. In
addition, considerable thought must be given to the mechanism of coding
"features" associated with sequence data. The present method of including
comments as keys to the structure of the sequence, as well as to the
location of functional sites and chemical modifications, is not optimal for
rapid searching and relational indexing. Two alternative schemes seem worth
considering: 1) an "obligate" feature table for all sequences with a
defined data structure which can be compactly coded (ASCII text tags waste
space and time) and rapidly analyzed, or 2) sequence punctuation in the
form of extended character sets or parenthetical signals. The criteria for
evaluation of sequence annotation should be focused on the generality,
openness and utility of the method, rather than on parochial
considerations involving current methods. In the gigabase future, there is
no place for random comments and ad hoc structure definitions. We must
choose rational and utilitarian methods for sequence data storage and
management or face the prospect of a modern Tower of Babel.
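
As a concrete (and entirely hypothetical) version of alternative 1), the
sketch below (in C) codes features as fixed-format binary records with a
one-byte type code and integer coordinates, in place of free-text
comments; the feature codes and field layout are invented for illustration
only, not a proposed standard.

#include <stdio.h>

/* An "obligate" feature table with a defined, compactly coded data
 * structure. Each record is a handful of bytes and can be scanned or
 * indexed without parsing English text. */

enum feature_type {              /* one-byte feature code */
    FEAT_CDS      = 1,           /* protein coding region */
    FEAT_PROMOTER = 2,
    FEAT_INTRON   = 3,
    FEAT_MOD_BASE = 4            /* chemically modified base */
};

struct feature {
    unsigned char type;          /* enum feature_type */
    unsigned int  start;         /* first base, 1-based */
    unsigned int  end;           /* last base, inclusive */
};

int main(void)
{
    /* feature table for one hypothetical sequence entry */
    struct feature table[] = {
        { FEAT_PROMOTER,   1,   60 },
        { FEAT_CDS,       90, 1250 },
        { FEAT_INTRON,   310,  480 },
    };
    int n = sizeof table / sizeof table[0];

    /* a typical query: find every feature covering base 400 */
    for (int i = 0; i < n; i++)
        if (table[i].start <= 400 && 400 <= table[i].end)
            printf("feature type %d spans base 400\n", table[i].type);
    return 0;
}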
-------
The recent major advances in molecular biology have demonstrated that
future progress in this discipline cannot occur without the coordinated
development of computer methods for data storage, data retrieval and
analysis. We stand now on the threshold of a new age where the volume and
density of data can be expected to increase by at least three orders of
magnitude. In large part, how this data is managed will determine whether
the large genome sequencing project will be worth the effort.
-------