[bionet.molbio.news] CSLG: COMMENTARY: From Ellis Golub

daemon@ig.UUCP (12/01/87)

From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

         Computer Applications in the Sequencing of Large Genomes


    What will we be searching for in the new, large databases?  One use 
for total genomic DNA will be to locate genomic sequences for isolated 
cDNA. Thus, after sequencing the cDNA for a message of interest, one would 
usually like to obtain the genomic sequence, including the structure of 
the gene and the upstream and downstream regulatory elements. Rather than 
having to fish a genomic clone out of a library, as is the current 
practice, it will be possible to find the gene, along with possible 
pseudogenes, in the database. Indeed, from what is now known, one might 
only need a small amount of cDNA sequence to deduce the entire message 
sequence and genomic structure from the database. In addition, the 
chromosomal location of the gene will be known, and any relationship of 
the gene to genetically mapped heritable diseases will be evident. The 
rapid availability of such information could well justify the difficulty 
and expense of building the database.

-------

    To make such information retrieval possible, computer programs capable 
of very rapid scanning and filtering of sequence similarities must be 
developed. First, unification of the present databases must begin now, 
before the task of integrating them becomes unmanageable. Second, new 
algorithms must be developed to scan the data at much faster rates than 
are available today. It may be necessary to develop secondary databases 
consisting of sequence "words" 10 to 20 nucleotides in length, thus 
reducing the number of comparisons made in individual searches. Another 
approach might involve a cumulative database which would "learn" from 
each new search the location of sequence patterns of interest, thus 
giving the next searcher a fast track to the same sequences. Third, new 
hardware will be brought online, including supercomputers, parallel 
processors and dedicated sequence search engines. These machines, when 
coupled to efficient integrated software, can allow useful data retrieval 
from the proposed databases.
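
    As a rough illustration of the secondary "word" database idea, the 
sketch below builds a table from every k-nucleotide word to the places it 
occurs, then uses that table to find candidate matches for a query without 
scanning each sequence in full. The function names, the toy sequences and 
the choice of k = 10 are assumptions made for this example only; it is a 
minimal sketch, not a description of any existing search program.

```python
from collections import defaultdict


def build_word_index(sequences, k=12):
    """Map each k-nucleotide word to the (sequence id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index


def candidate_hits(index, query, k=12):
    """Count shared words per (sequence id, diagonal).

    The diagonal is (database offset - query offset); many words on the
    same diagonal suggest a region worth aligning in detail.
    """
    query = query.upper()
    hits = defaultdict(int)
    for j in range(len(query) - k + 1):
        # Only sequences that contain this exact word are ever touched.
        for seq_id, i in index.get(query[j:j + k], ()):
            hits[(seq_id, i - j)] += 1
    return dict(hits)


# Hypothetical toy database and query, for illustration only.
database = {
    "geneA": "ACGTACGTTAGCCGATAGGCTTAACG",
    "geneB": "TTTTTTTTTTTTTTTTTTTT",
}
index = build_word_index(database, k=10)
print(candidate_hits(index, "TAGCCGATAGGC", k=10))
```

    The index is built once and reused for every search, which is where 
the saving in comparisons would come from. Exact word matching of this 
kind only nominates candidates; a real search would still have to follow 
up with a proper alignment.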

-------

    Easy access to coding and regulatory sequences for the entire human 
genome will also lead to an unprecedented growth in sequence-derived data 
such as consensus sequences for transcriptional regulatory sites, splice 
junctions and other RNA processing signals, as well as a windfall in 
putative protein sequence data from open reading frames in the nucleotide 
sequences. This latter class of data will bring pressure on biochemists to 
locate the proteins coded by sequences of interest, and to determine their 
properties. One route currently employed to approach such problems is to 
predict the structure of the protein from its sequence, and to use the 
predicted structure as a basis for designing useful probes to study the 
actual protein, either in its natural tissue of origin, or in engineered 
expression systems. For example, prediction of continuous epitope 
locations and synthesis of isosequential peptides have been successful in 
eliciting the production of antibodies which are specific for the protein 
from which the sequence data were derived. The state of the art in protein 
structure prediction is as yet quite primitive and essentially empirical. 
To fully take advantage of the large genome sequence database, much effort 
will have to be expended in the further development of these methods. 
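
    The "windfall" of putative protein sequences mentioned above comes 
from scanning nucleotide sequence for open reading frames. A minimal 
sketch of such a scan follows. It assumes the standard stop codons (TAA, 
TAG, TGA), looks only at the three forward-strand frames, and ignores 
introns entirely; the function name and the min_codons threshold are 
hypothetical choices for this example.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop run on the forward strand.

    start is the offset of the ATG, end is the offset just past the stop
    codon, and min_codons counts the codons before the stop (ATG included).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence one codon at a time in this frame.
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs


# Hypothetical toy sequence: ATG AAA TTT GGG TAA embedded in frame 2.
print(find_orfs("CCATGAAATTTGGGTAACC"))
```

    For genomic sequence, a reading frame found this way is only a 
candidate; splicing means that a real coding region may be spread across 
several such frames, which is part of why the proteins themselves still 
have to be located and characterized.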



-------