daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>
Computer Applications in the Sequencing of Large Genomes
What will we be searching for in the new, large databases? One use
for total genomic DNA will be to locate genomic sequences for isolated
cDNA. Thus, after sequencing the cDNA for a message of interest, one would
usually like to obtain the genomic sequence, including the structure of
the gene and the upstream and downstream regulatory elements. Rather than
having to fish a genomic clone out of a library, as is the current
practice, it will be possible to find the gene, along with possible
pseudogenes, in the database. Indeed, from what is now known, one might
only need a small amount of cDNA sequence to deduce the entire message
sequence and genomic structure from the database. In addition, the
chromosomal location of the gene will be known, and any possible
relationship of the gene to genetically mapped heritable diseases will
be evident. The rapid availability of such information could well
justify the difficulty and expense of building the database.
-------
To make such information retrieval possible, computer programs capable
of very rapid scanning and filtering of sequence similarities must be
developed. First, unification of the present
databases must begin now, before the task of integrating them becomes
unmanageable. Second, new algorithms must be developed to scan the data at
much faster rates than are available today. It may be necessary to develop
secondary databases consisting of sequence "words" 10-20
nucleotides in length, thus reducing the number of comparisons made in
individual searches. Another approach might involve a cumulative database
that would "learn" from each new search the location of sequence
patterns of interest, thus giving the next searcher a fast
track to the same sequences. Third, new hardware will be brought online,
including supercomputers, parallel processors, and dedicated sequence
search engines. These machines, when coupled to efficient integrated
software, can allow useful data retrieval from the proposed databases.
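
The two software ideas above, a secondary database of sequence "words" and
a cumulative database that remembers earlier searches, can be sketched
roughly as follows. This is only an illustration under stated assumptions:
the names build_word_index, candidate_hits and LearningSearcher are
hypothetical, the word length of 12 is an arbitrary pick from the 10-20
range, and only exact word matches are counted.

```python
from collections import defaultdict

WORD_LENGTH = 12  # assumption: any value in the 10-20 range mentioned above


def build_word_index(sequences, k=WORD_LENGTH):
    """Map every k-nucleotide word to the (name, offset) pairs where it occurs.

    `sequences` is a dict of {sequence name: nucleotide string}.
    """
    index = defaultdict(list)
    for name, seq in sequences.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((name, i))
    return index


def candidate_hits(index, query, k=WORD_LENGTH):
    """Count shared words per (sequence name, diagonal).

    The diagonal is (database offset - query offset); many words falling on
    one diagonal suggest a region worth aligning in full, so only those
    regions need the expensive comparison.
    """
    votes = defaultdict(int)
    for q in range(len(query) - k + 1):
        for name, s in index.get(query[q:q + k], ()):
            votes[(name, s - q)] += 1
    return sorted(votes.items(), key=lambda item: -item[1])


class LearningSearcher:
    """Hypothetical 'cumulative' layer: remembers the hits for each query
    so that the next searcher asking for the same pattern skips the scan."""

    def __init__(self, index, k=WORD_LENGTH):
        self.index = index
        self.k = k
        self.remembered = {}

    def search(self, query):
        if query not in self.remembered:
            self.remembered[query] = candidate_hits(self.index, query, self.k)
        return self.remembered[query]
```

A search then touches only the index entries for the words in the query,
rather than every position of every database sequence. A practical system
would also have to tolerate mismatches within words, which this sketch
does not attempt.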
-------
Easy access to coding and regulatory sequences for the entire human
genome will also lead to an unprecedented growth in sequence-derived data
such as consensus sequences for transcriptional regulatory sites, splice
junctions and other RNA processing signals, as well as a windfall in
putative protein sequence data from open reading frames in the nucleotide
sequences. This latter class of data will bring pressure on biochemists to
locate the proteins coded by sequences of interest, and to determine their
properties. One route currently employed to approach such problems is to
predict the structure of the protein from its sequence, and to use the
predicted structure as a basis for designing useful probes to study the
actual protein, either in its natural tissue of origin, or in engineered
expression systems. For example, prediction of continuous epitope
locations and synthesis of isosequential peptides has been successful in
eliciting the production of antibodies which are specific for the protein
from which the sequence data were derived. The state of the art in protein
structure prediction is as yet quite primitive and essentially empirical.
To take full advantage of the large genome sequence database, much effort
will have to be expended in the further development of these methods.
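
As a rough illustration of where the "putative protein sequence data"
mentioned above would come from, the sketch below scans a nucleotide
sequence for open reading frames in all six frames. It is a minimal
example under stated assumptions, not a gene finder: find_orfs and
min_codons are hypothetical names, an ORF is taken to run from an ATG to
the first in-frame stop codon, introns are ignored, and coordinates on the
"-" strand refer to the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_orfs(seq, min_codons=30):
    """Return (strand, start, end, orf) for each ATG...stop reading frame.

    `start` and `end` are 0-based, end-exclusive offsets on the strand that
    was scanned; `min_codons` counts codons before the stop codon.
    """
    seq = seq.upper()
    results = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame start codon opens an ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        results.append((strand, start, i + 3, s[start:i + 3]))
                    start = None  # look for the next ORF in this frame
    return results
```

Each reported ORF is only a candidate; deciding whether it encodes a real
protein is exactly the biochemical follow-up work the text anticipates.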
-------