daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

Computer Applications in the Sequencing of Large Genomes

It is trivial to assert that computer programs of the type currently available on BIONET, and the numerous other general-purpose molecular biology software packages, will play a key role in the massive sequencing projects envisioned for the near future. The difficult tasks of accumulating the large number of fragment sequences required, of assembling these fragments into coherent sequences, and of keeping track of the immense volume of sequence data are natural applications for computers. However, management of cloning and sequencing is only the smallest and most mechanical aspect of applying computers to the large genome project. It would be more accurate to say that there would be no point in accumulating large amounts of sequence data if electronic data processing methods were not available to make the data useful to the scientific community. Further, in anticipation of the glut of data to come, it is likely that currently available software and hardware will not be adequate to prevent us from drowning in our own data.
-------
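The fragment-assembly task mentioned above can be sketched in miniature. This is a minimal, illustrative greedy overlap assembler, not the method of any actual sequencing package: it repeatedly merges the pair of fragments sharing the longest exact suffix/prefix overlap. All names and the minimum-overlap parameter are invented for this sketch.

```python
# Hypothetical sketch of greedy fragment assembly: repeatedly merge the
# pair of fragments with the longest exact suffix/prefix overlap.
# Illustrative only -- real assemblers must also handle sequencing
# errors, reverse complements and repeats.

def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that is also a prefix of b."""
    best = 0
    for k in range(min_len, min(len(a), len(b)) + 1):
        if a[-k:] == b[:k]:
            best = k
    return best

def assemble(fragments):
    """Greedily merge fragments until no usable overlap remains."""
    frags = list(fragments)
    while len(frags) > 1:
        best = (0, None, None)  # (overlap length, i, j)
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b)
                    if k > best[0]:
                        best = (k, i, j)
        k, i, j = best
        if i is None:
            break  # no remaining overlaps; leave contigs separate
        merged = frags[i] + frags[j][k:]  # join, dropping duplicated bases
        frags = [f for n, f in enumerate(frags) if n not in (i, j)]
        frags.append(merged)
    return frags
```

Even this toy version is quadratic in the number of fragments per merge step, which hints at why assembling the fragment sets of a genome-scale project is a serious computational problem.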
At present, the GenBank database consists of approximately 14,000 sequences comprising ~15 Mb. Searching this entire database with IFIND on BIONET is already impractical, and searches are often restricted to subsets of the total database. For example, the mammalian and unannotated sequences comprise about 7,000 sequences consisting of more than 6 Mb. A recent search of this segment of GenBank using a 1.6 kb probe required approximately 3 hours of CPU time on the BIONET computer in batch mode. The same program running on a VAX (about 5 times faster on the Sieve of Eratosthenes benchmark) required about 45 minutes of CPU time. Using the faster Lipman and Pearson algorithm, XFASTN on BIONET, 1.7 Mb in 1,565 sequences were searched in 9 minutes, while another implementation of the Wilbur and Lipman method on a VAX searched the mammalian and unannotated lists in about 20 minutes. Because search time is highly dependent on probe size (smaller is faster) and word size (larger is faster), these searches were conducted with approximately similar parameters. It was also somewhat distressing that several of these searches returned different lists of similar sequences.

Clearly, attempts to apply these techniques to the complete human genome (~3 Gb; some 200 times larger than the current GenBank database) will strain all available facilities beyond the breaking point. The recent proposal to begin accumulating sequence data of this magnitude poses a clear challenge to the molecular biology software community: to develop new and faster algorithms for new and faster hardware, providing tools capable of practical use of gigabase databases.
-------
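The word-size effect noted above follows from the k-tuple lookup idea underlying the Wilbur-Lipman and Lipman-Pearson methods. The sketch below is a simplified illustration of that idea, not BIONET's actual code: index every length-k word of the database once, then find candidate matches for a probe by hash lookup rather than position-by-position comparison.

```python
# Minimal sketch of k-tuple ("word") search: build a word index over the
# database, then score sequences by the number of k-words they share
# with the probe. Function names and parameters are illustrative.

from collections import defaultdict

def build_word_index(sequences, k):
    """Map each k-letter word to (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index

def search(probe, index, k):
    """Count shared k-words per database sequence; high counts flag similarity."""
    hits = defaultdict(int)
    for i in range(len(probe) - k + 1):
        for seq_id, _ in index.get(probe[i:i + k], ()):
            hits[seq_id] += 1
    return sorted(hits.items(), key=lambda item: -item[1])
```

A larger word size means fewer spurious word matches and therefore fewer candidate hits to examine, which is why searches run faster as the word size grows (at some cost in sensitivity); a shorter probe simply contributes fewer words to look up.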
The current DNA and protein databases are unidimensional and idiosyncratic. Future databases should be relational, so that each sequence is linked to:

1) biologically related DNA sequences,
2) physically related DNA (map coordinates, perhaps), and
3) derived sequence and other data, including protein sequences, regulatory elements and other features.

Moreover, these linkages will have to be standardized and pre-indexed, so that each database query does not have to begin from scratch. By heavily coding and indexing the data, search times can be brought within manageable limits, and relational patterns among sequences will become evident.

In addition, considerable thought must be given to the mechanism for coding the "features" associated with sequence data. The present method of including comments as keys to the structure of the sequence, as well as to the location of functional sites and chemical modifications, is not optimal for rapid searching and relational indexing. Two alternative schemes seem worth considering:

1) an "obligate" feature table for all sequences, with a defined data structure that can be compactly coded (ASCII text tags waste space and time) and rapidly analyzed, or
2) sequence punctuation in the form of extended character sets or parenthetical signals.

The criteria for evaluating sequence annotation should be focused on the generality, openness and utility of the method, rather than on parochial considerations involving current practice. In the gigabase future, there is no place for random comments and ad hoc structure definitions. We must choose rational and utilitarian methods for sequence data storage and management, or face the prospect of a modern Tower of Babel.
-------
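Scheme (1) above, the compactly coded feature table, can be illustrated with a small sketch. The field layout, type codes and function names here are invented for illustration, not a proposed standard: each feature becomes a fixed-size binary record (type code, start, end) instead of a free-text comment.

```python
# Hedged illustration of an "obligate" feature table with a defined,
# compactly coded data structure: each feature is packed into a
# fixed-size 9-byte record rather than stored as ASCII commentary.
# Type codes and layout are invented for this sketch.

import struct

FEATURE_CODES = {"CDS": 1, "promoter": 2, "intron": 3}  # hypothetical codes
RECORD = struct.Struct("<BII")  # 1-byte type code, two 4-byte coordinates

def pack_features(features):
    """Encode (type, start, end) features into a compact byte string."""
    return b"".join(RECORD.pack(FEATURE_CODES[t], s, e) for t, s, e in features)

def unpack_features(blob):
    """Decode the byte string back into (type, start, end) tuples."""
    names = {code: name for name, code in FEATURE_CODES.items()}
    return [(names[c], s, e) for c, s, e in RECORD.iter_unpack(blob)]
```

Nine bytes per feature, versus dozens for an English comment; and because the records are fixed-width, they can be scanned and indexed directly, with no text parsing per query.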
The recent major advances in molecular biology demonstrate that future progress in this discipline cannot occur without the coordinate development of computer methods for data storage, retrieval and analysis. We stand on the threshold of a new age in which the volume and density of data can be expected to increase by at least three orders of magnitude. In large part, how these data are managed will determine whether the large genome sequencing project is worth the effort.
-------