[bionet.molbio.news] CSLG|COMMENTARY: from Ellis Golub

daemon@ig.UUCP (12/01/87)

From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

         Computer Applications in the Sequencing of Large Genomes

    It is trivial to assert that computer programs of the type currently 
available on BIONET and numerous other general purpose molecular biology 
software packages will play a key role in the massive sequencing projects 
envisioned for the near future. The difficult tasks of accumulating the
large number of fragment sequences required, of assembling these
fragments into coherent sequences, and of keeping track of the immense
volume of sequence data are natural applications for computers. However,
management of cloning and sequencing is only the smallest and most 
mechanical aspect of the application of computers to the large genome 
project. It would be more accurate to say that there would be no point to 
accumulating large amounts of sequence data if electronic data processing 
methods were not available to make the data useful to the scientific 
community. Further, in anticipation of the glut of data to come, it is 
likely that currently available software and hardware will not be adequate 
to prevent us from drowning in our own data.
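
To make the assembly task concrete, here is a minimal sketch of greedy
overlap assembly. It is written in Python for clarity, and the strategy
and names are illustrative only; they do not describe any BIONET program.
The sketch repeatedly merges the pair of fragments with the longest exact
suffix/prefix overlap; a real assembler must additionally cope with
sequencing errors, repeats, and both strands.

    # Illustrative sketch only: a naive greedy assembler that merges the
    # pair of fragments with the longest exact suffix/prefix overlap.

    def overlap(a, b, min_len=3):
        """Length of the longest suffix of a that is a prefix of b."""
        for n in range(min(len(a), len(b)), min_len - 1, -1):
            if a.endswith(b[:n]):
                return n
        return 0

    def assemble(fragments):
        frags = list(fragments)
        while len(frags) > 1:
            best = (0, None, None)            # (overlap length, i, j)
            for i, a in enumerate(frags):
                for j, b in enumerate(frags):
                    if i != j:
                        n = overlap(a, b)
                        if n > best[0]:
                            best = (n, i, j)
            n, i, j = best
            if n == 0:                        # no overlaps left; stop
                break
            merged = frags[i] + frags[j][n:]  # join at the overlap
            frags = [f for k, f in enumerate(frags) if k not in (i, j)]
            frags.append(merged)
        return frags

    print(assemble(["ATGGCCAT", "GCCATTGA", "TTGACGT"]))
    # -> ['ATGGCCATTGACGT']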

-------

    At present, the GenBank database consists of approximately 14,000
sequences comprising ~15 Mb. To search this entire database using IFIND on
BIONET is already not practical, and searches are often restricted to 
subsets of the total database. For example, the mammalian and unannotated
divisions comprise about 7000 sequences totaling more than 6 Mb. A
recent search of this segment of GenBank using a 1.6 kb probe required
approximately 3 hours of CPU time on the BIONET computer in batch mode.
The same program running on a VAX (about 5 times faster on the Sieve of
Eratosthenes benchmark) required about 45 minutes of CPU time. Using the
faster Lipman and Pearson algorithm (XFASTN on BIONET), 1.7 Mb in 1565
sequences were searched in 9 min, while another implementation of the
Wilbur and Lipman method on a VAX searched the mammalian and unannotated
lists in about 20 min. As the search time is highly dependent on the probe
size (smaller is faster) and the word size (larger is faster), these 
searches were conducted with comparable parameters. It was also
somewhat distressing that several of these searches returned different 
lists of similar sequences.  Clearly, attempts to apply these techniques 
to the complete human genome (~30 Gb; 2000 times larger than the current
GenBank database) will strain all available facilities beyond the breaking
point. The recent proposal to begin accumulating sequence data of this 
magnitude poses a clear challenge to the molecular biology software 
community to develop new and faster algorithms for new and faster hardware 
in order to provide tools capable of practical utilization of 
gigabase-scale databases.
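
The speed advantage of the word-based methods is easy to see in outline.
Below is a minimal sketch, in Python for clarity, of the k-tuple indexing
idea underlying the Wilbur and Lipman and the Lipman and Pearson methods;
it is illustrative only and is not how XFASTN is implemented. The database
is indexed once by words of length k, and the probe is then scanned
against that index. A larger word size yields fewer, more specific hits
(hence faster searches), and a shorter probe yields fewer words to look
up.

    # Illustrative sketch of k-tuple (word) indexing.  Word size k trades
    # sensitivity for speed: larger k means fewer spurious hits.

    from collections import defaultdict

    def build_index(sequence, k):
        """Map every word of length k to the positions where it occurs."""
        index = defaultdict(list)
        for i in range(len(sequence) - k + 1):
            index[sequence[i:i + k]].append(i)
        return index

    def word_hits(probe, index, k):
        """Return (probe_pos, db_pos) pairs sharing a word of length k."""
        hits = []
        for i in range(len(probe) - k + 1):
            for j in index.get(probe[i:i + k], ()):
                hits.append((i, j))
        return hits

    db = "ACGTACGTTAGCACGT"
    idx = build_index(db, k=4)
    print(word_hits("TACGTT", idx, k=4))
    # -> [(0, 3), (1, 0), (1, 4), (1, 12), (2, 5)]

Matching words that fall on a common diagonal (db_pos - probe_pos) anchor
a candidate alignment, and only those diagonals need be examined further.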

-------

    The current DNA and protein databases are unidimensional and 
idiosyncratic. Future databases should be relational so that each sequence 
is linked to: 1) biologically related DNA sequences, 2) physically related 
DNA (map coordinates perhaps) and 3) derived sequence and other data 
including protein sequences, regulatory elements and other features. 
Moreover, these linkages will have to be standardized and pre-indexed, so 
that each database query does not have to begin from scratch. By heavily 
coding and indexing the data, search times can be brought to manageable 
limits, and relational patterns amongst sequences will become evident. In 
addition, considerable thought must be given to the mechanism of coding 
"features" associated with sequence data. The present method of including 
comments as keys to the structure of the sequence, as well as to the
location of functional sites and chemical modifications, is not optimal for rapid
searching and relational indexing. Two alternative schemes seem worth 
considering: 1) an "obligate" feature table for all sequences with a 
defined data structure which can be compactly coded (ASCII text tags waste 
space and time) and rapidly analyzed, or 2) sequence punctuation in the 
form of extended character sets or parenthetical signals. The criteria for
evaluating sequence annotation should focus on the generality,
openness, and utility of the method, rather than on parochial
considerations involving current methods. In the gigabase future, there is 
no place for random comments and ad hoc structure definitions. We must 
choose rational and utilitarian methods for sequence data storage and 
management or face the prospect of a modern Tower of Babel. 
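
As an illustration of alternative (1), a fixed-format feature record can
be packed into a few bytes and read back without any text parsing. The
field layout and type codes below are hypothetical and sketched in Python
for brevity; no such standard is implied.

    # Hypothetical fixed binary feature record: type code, start, end,
    # strand flag.  Compare 10 bytes with dozens of ASCII comment bytes.

    import struct

    FEATURE_CODES = {"CDS": 1, "promoter": 2, "intron": 3}  # hypothetical

    RECORD = struct.Struct("<BIIB")  # 1 + 4 + 4 + 1 = 10 bytes

    def pack_feature(ftype, start, end, strand):
        return RECORD.pack(FEATURE_CODES[ftype], start, end, strand)

    def unpack_feature(record):
        code, start, end, strand = RECORD.unpack(record)
        names = {v: k for k, v in FEATURE_CODES.items()}
        return names[code], start, end, strand

    rec = pack_feature("CDS", 120, 1580, 0)
    print(len(rec))             # -> 10
    print(unpack_feature(rec))  # -> ('CDS', 120, 1580, 0)

Records of this form can be scanned and indexed directly, which is what
would make pre-indexed relational queries over gigabase databases
tractable.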

-------

    The recent major advances in molecular biology have demonstrated that 
future progress in this discipline cannot occur without the coordinated
development of computer methods for data storage, data retrieval, and
analysis. We stand now on the threshold of a new age where the volume and 
density of data can be expected to increase by at least three orders of 
magnitude. In large part, how this data is managed will determine whether 
the large genome sequencing project will be worth the effort. 


-------