daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

Computer Applications in the Sequencing of Large Genomes

What will we be searching for in the new, large databases? One use for total genomic DNA will be to locate the genomic sequence corresponding to an isolated cDNA. After sequencing the cDNA for a message of interest, one would usually like to obtain the genomic sequence, including the structure of the gene and its upstream and downstream regulatory elements. Rather than having to fish a genomic clone out of a library, as is the current practice, it will be possible to find the gene, along with possible pseudogenes, in the database. Indeed, from what is now known, one might need only a small amount of cDNA sequence to deduce the entire message sequence and genomic structure from the database. In addition, the chromosomal location of the gene will be known, and the possible relationship of the gene to genetically mapped heritable diseases will be evident. The rapid availability of such information could well justify the difficulty and expense of building the database.

-------
daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

Computer Applications in the Sequencing of Large Genomes

To make such information retrieval possible, computer programs capable of very rapid scanning and filtering of sequence similarities must be developed. First, unification of the present databases must begin now, before the task of integrating them becomes unmanageable. Second, new algorithms must be developed to scan the data at much faster rates than are possible today. It may be necessary to develop secondary databases consisting of sequence "words" 10-20 nucleotides in length, thus reducing the number of comparisons made in individual searches. Another approach might involve a cumulative database that would "learn" from each new search the location of sequence patterns of interest, thus giving the next searcher a fast track to the same sequences. Third, new hardware will be brought online, including supercomputers, parallel processors and dedicated sequence search engines. These machines, when coupled to efficient, integrated software, can allow useful data retrieval from the proposed databases.

-------
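The secondary database of sequence "words" mentioned in the post above can be illustrated with a small sketch. This is not the author's program: the function names (build_word_index, find_candidates), the toy sequences and the choice of 12-nucleotide words are all assumptions made for illustration. The idea is to record once where every word occurs, so that a query only touches the database positions sharing a word with it instead of being compared against every position.

```python
from collections import defaultdict

def build_word_index(sequences, k=12):
    """Map every k-nucleotide word to the (sequence id, offset) pairs where it occurs.

    `sequences` is a dict of id -> nucleotide string (hypothetical toy input).
    """
    index = defaultdict(list)
    for seq_id, seq in sequences.items():
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].append((seq_id, i))
    return index

def find_candidates(index, query, k=12):
    """Count shared words per (sequence id, diagonal).

    The diagonal is the database offset minus the query offset, so many hits
    on one diagonal suggest a contiguous region of similarity.
    """
    hits = defaultdict(int)
    query = query.upper()
    for j in range(len(query) - k + 1):
        for seq_id, i in index.get(query[j:j + k], ()):
            hits[(seq_id, i - j)] += 1
    # Best-supported candidates first.
    return sorted(hits.items(), key=lambda item: -item[1])

# Toy data, invented for this example.
database = {
    "genomic_A": "TTGACCATGGCTAGCTAGGATCCGATTACAGGCTTAACGT",
    "genomic_B": "CCCCGGGGAAAATTTTCCCCGGGGAAAATTTT",
}
word_index = build_word_index(database, k=12)
query = "GCTAGCTAGGATCCGATTACA"  # stands in for a short stretch of cDNA sequence
candidates = find_candidates(word_index, query, k=12)
print(candidates[0])  # (('genomic_A', 9), 10)
```

Only exact word matches are found this way, so a real search would presumably follow each candidate with a more careful alignment. The "learning" database described in the same post could be approximated by keeping the returned candidates keyed by query, but that is a further assumption rather than something the post specifies.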
daemon@ig.UUCP (12/01/87)
From: Sunil Maulik <MAULIK@BIONET-20.ARPA>

Computer Applications in the Sequencing of Large Genomes

Easy access to coding and regulatory sequences for the entire human genome will also lead to unprecedented growth in sequence-derived data, such as consensus sequences for transcriptional regulatory sites, splice junctions and other RNA processing signals, as well as a windfall of putative protein sequence data from open reading frames in the nucleotide sequences. This latter class of data will bring pressure on biochemists to locate the proteins coded by sequences of interest and to determine their properties. One route currently employed to approach such problems is to predict the structure of the protein from its sequence, and to use the predicted structure as a basis for designing useful probes to study the actual protein, either in its natural tissue of origin or in engineered expression systems. For example, prediction of continuous epitope locations and synthesis of isosequential peptides has been successful in eliciting antibodies specific for the protein from which the sequence data were derived. The state of the art in protein structure prediction is still quite primitive and essentially empirical. To take full advantage of the large genome sequence database, much effort will have to be expended on the further development of these methods.

-------
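As a rough illustration of how putative protein-coding stretches might be pulled out of nucleotide data, the sketch below scans all six reading frames for open reading frames. It is a minimal example under stated assumptions, not a method given in the post: the function names, the min_codons threshold and the test sequence are invented, an ORF is taken to run from the first ATG to the next in-frame stop codon, and introns in genomic DNA are ignored entirely.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def find_orfs(seq, min_codons=3):
    """Yield (strand, start, end, orf) for ATG...stop stretches in six frames.

    start and end are 0-based offsets on the scanned strand; end is exclusive
    and includes the stop codon. min_codons counts codons before the stop.
    """
    seq = seq.upper()
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            start = None
            for i in range(frame, len(s) - 2, 3):
                codon = s[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # open a candidate reading frame
                elif start is not None and codon in STOP_CODONS:
                    if (i - start) // 3 >= min_codons:
                        yield strand, start, i + 3, s[start:i + 3]
                    start = None  # look for the next ATG in this frame

# Invented test sequence containing one short ORF on the forward strand.
orfs = list(find_orfs("CCATGGCTGAACGTTAAGG"))
print(orfs)  # [('+', 2, 17, 'ATGGCTGAACGTTAA')]
```

In practice a much larger min_codons would be used, since short ORFs occur by chance; translating each reported stretch into amino acids would then give the putative protein sequences the post refers to.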