chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) (12/10/90)
Hi, I am a new reader to this group. I am wondering if somebody on the net can briefly explain this genome program or refer me to the associated articles/books. In particular, I am interested in knowing how a computer scientist can help in this project (e.g. database and CAD) and what the state of the art is. Thank you. tzi-cker
aoki@postgres.Berkeley.EDU (Paul M. Aoki) (12/10/90)
chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) writes:

> I am a new reader to this group. I am wondering if somebody on the net
> can briefly explain this genome program or refer me to the associated
> articles/books. In particular, I am interested in knowing how a computer
> scientist can help in this project (e.g. database and CAD) and what is
> the state of the art?

I'm going to take a stab at this, not because I claim to be an expert on the subject, but because I've been digging into this issue (database support for genome mapping) for a couple of weeks and want to bounce my results against the experts.

First, some specific advice: Frank Olken up on the hill (Lawrence Berkeley Labs) is involved with the project and can probably steer you in the appropriate direction. I assume you're still at UCB (I'm not), so he should be easy to find.

There are a bunch of pop-science books on the US Human Genome Project; you might get some background out of them, but none of them seem to focus much on the computational aspects of the problem :-)  One reference I found particularly useful was a journal called "Computer Applications in the Biosciences" (usually cited as CABIOS). The articles are written by a mishmash of molbio types, CS theorists and hackers. Not much background, but a lot of material on what people are working on in terms of real software. Other journals on computers in medical or biological science tend to focus on problems other than genome mapping.

The idea behind genome mapping is that if you can find the areas in a given section of DNA that control certain biological functions, you can perform all sorts of genetic engineering (elimination of genetic/inherited diseases, selection of "good" genetic traits, etc.). To this end many nations, including the US, are pouring hundreds of millions of dollars into their various national projects. The biggest problem appears to be sequence matching.
Once molecular biologists extract gene segments and identify the base-pair (nucleotide) or amino-acid sequences, you wind up with a bunch of long text strings ("ACCGTAA...") that describe these sequences. A variety of archives and formats exist for these sequences (for more information, poke around on genbank.bio.net); suffice it to say that people are compiling sequence maps of many different species and annotating the database entries with what they think different parts of the sequences do.

The function of new sequences can be determined by comparing them to current database entries, finding sequences that are "close" by some metric or that share many close subsequences. For example, one might find that the intestinal peptides of horses and humans are very similar, and that much of the mapping done on the horse protein can be applied to the corresponding human protein. Some algorithms to compute "closeness" are:

1. Alignment: compute the "evolutionary distance" between one protein and another in terms of simple operations (substitution, insertion, deletion) and cost functions for those operations. Various dynamic programming techniques are commonly used.

2. Subsequence search: proteins that share many subsequences get high scores. Hashing is useful here.

These methods are limited, however. First, applying them to long sequences and large databases is hard because the distance functions tend to be computationally expensive. There are many optimizations (hybrid dynamic programming/hashing, complex data structures like the one found in the latest issue of "ACM Trans. on Information Systems"), but it's still expensive to compare against the whole database. Second, plain sequence matching ignores a great deal of physical and chemical information. More complex comparison methods take these factors into account and are even slower. Third, distance (by any metric) is only a push in the right direction.
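For concreteness, point 1 above (alignment by dynamic programming) can be sketched roughly as follows. The unit costs here (every substitution, insertion, and deletion costs 1) are an illustrative assumption; real scoring uses biologically derived cost tables such as Dayhoff's PAM matrices.

```python
def edit_distance(a, b):
    """Evolutionary-distance-style alignment cost between two sequences,
    computed by the classic dynamic programming recurrence."""
    m, n = len(a), len(b)
    # dist[i][j] = minimum cost of aligning a[:i] with b[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i          # i deletions
    for j in range(n + 1):
        dist[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + sub,  # match/substitution
                             dist[i - 1][j] + 1,        # deletion
                             dist[i][j - 1] + 1)        # insertion
    return dist[m][n]

print(edit_distance("ACCGTAA", "ACGTTAA"))  # → 2 (one deletion, one insertion)
```

The O(mn) table is exactly the computation that parallelizes well and that gets expensive when run against every entry in a database, as discussed below.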
The "closest" match in terms of genetic function may well be one of the top 10 matches by (say) the Dayhoff PAM metrics, but not the top one. The insensitivity of most metrics to "weak matches" is commonly cited.

So, what is to be done? For some reason, AI techniques aren't very popular -- people like brute-force, optimal-cost methods. Parallel programming is popular, since the dynamic programming computations are easily parallelized (one group is using a Connection Machine, another uses a Sequent, yet another uses the ICL DAP array processor). Most database technology flies right out the window, because the databases are still small enough that a system that goes to disk a lot will have horrible performance relative to more ad hoc, main-memory-oriented search software. It seems to me the best bet for database research is in the development of data structures like the one in the TOIS article (but more space-efficient).

There are other computational problems, but most are like the basic form of sequence matching in that they involve some kind of text-string matching. (Computers are also being used to automate the sequencing process, controlling lab equipment and whatnot, but that isn't much of a CS research problem.)

Anyway, I'll stop here and let people take potshots at this. Let me know what I'm misunderstanding and what I'm missing in terms of major approaches or research areas. Do bear in mind that two weeks ago I didn't know a ribosome from an RFLP, so be patient with my misuse of biological terminology. Also, I'd be very interested in hearing what people are working on now. (I noticed at least one USENETter, Roy Smith at PHRI, published in CABIOS ..)
--
Paul M. Aoki | aoki@postgres.Berkeley.EDU | ...!ucbvax!aoki
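The hashing-based subsequence search mentioned in point 2 above, used as an in-memory prefilter so that expensive alignment only runs on the top-scoring candidates, can be sketched as follows. The toy database, the entry names, and the choice of k=3 are all illustrative assumptions.

```python
from collections import defaultdict

K = 3  # fixed subsequence (k-mer) length; 3 is an arbitrary toy choice

def kmers(seq):
    """All length-K subsequences of seq."""
    return {seq[i:i + K] for i in range(len(seq) - K + 1)}

def build_index(database):
    """Hash each k-mer to the set of database entries containing it."""
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq):
            index[kmer].add(name)
    return index

def candidates(query, index):
    """Database entries ranked by the number of k-mers shared with the query."""
    scores = defaultdict(int)
    for kmer in kmers(query):
        for name in index[kmer]:
            scores[name] += 1
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy two-entry database; names and sequences are made up for illustration.
database = {"horse_vip": "ACCGTAAGGT", "yeast_x": "TTTTCCCCAA"}
index = build_index(database)
print(candidates("ACCGTTA", index))  # → [('horse_vip', 3)]
```

The whole index lives in main memory, which is the point made above: at current database sizes, an ad hoc in-memory structure like this beats a disk-bound database system.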
hunter@work.nlm.nih.gov (Larry Hunter) (12/11/90)
In response to Tzi-cker Chiueh's query about what a computer scientist
can do for/with genomic data, Paul Aoki writes:
> For some reason, AI techniques aren't very popular -- people like
> brute-force, optimal-cost methods. Parallel programming is popular,
> since the dynamic programming computations are easily parallelized
> (one group is using a Connection Machine, another uses a Sequent, yet
> another uses the ICL DAP array processor). Most database technology
> flies right out the window because the databases are still small
> enough that a system that goes to disk a lot will have horrible
> performance relative to more ad-hoc, main-memory-oriented search
> software.
Although Aoki's opinions are a helpful beginning, I have to take issue
with a couple of points. There is actually quite a bit of AI being
done in genome-related areas, and the requirements of genome-related
databases (not only sequence, but protein structure, coarser grained
genetic maps, etc.) place significant pressure on existing database
technologies.
As for AI, I can point to more than 100 people listed in a database of
AI & molecular biology researchers that I maintain, doing work in very
diverse areas. I have an article which surveys some of this work
(based on the talks given at the 1990 AAAI Spring Symposium on AI &
Molecular Biology) which will appear in the next issue of the AI
Magazine. You may note that the predicted secondary structure of the
principal neutralization determinant of HIV-1 on the cover of the 24
August 1990 issue of Science was generated by a neural network.
BTW, the AI/MB database, which contains information on research
interests and current projects of many people from around the world,
is publicly available. It can be obtained by anonymous ftp from the
host lhc.nlm.nih.gov in the directory /pub/aimb-db, or by request to
the University of Houston email server.
Finally, although I am not an expert in database issues, I would
suggest contacting the National Center for Biotechnology Information
to find out about work in biosequence and other databases. You can
download information from ncbi.nlm.nih.gov using anonymous ftp, or
send mail to federhen@ncbi.nlm.nih.gov.
Good luck. There are many good computer science problems involved in
genome work; we need good computer scientists to attack them.
--
Lawrence Hunter, PhD.
National Library of Medicine
Bldg. 38A, MS-54
Bethesda, MD 20894
(301) 496-9300
(301) 496-0673 (fax)
hunter@nlm.nih.gov (internet)
hunter%nlm.nih.gov@nihcu (bitnet/earn)
kristoff@genbank.bio.net (David Kristofferson) (12/11/90)
Your question about what computer scientists can do is being debated, but so far not on-line (which is something of a pity, because the subscribership of this newsgroup is larger than that of any other BIOSCI group by now, I think, and must include computer scientists who are also interested in this issue).

I recently returned from a meeting in DC of the NIH/DOE Joint Informatics Task Force (JITF). The meeting was designed to investigate issues concerning sequence and mapping databases so that the JITF could make recommendations in this area. While several interesting talks were given, your posting concerning what computer scientists could do raises a point which I believe was not fully explored. We heard from the various groups involved with producing databases, and heard something about future plans, but there was not an extensive discussion of the suitability of the various database designs vis-a-vis potential genome-sized computing problems. Perhaps this is the topic of another JITF meeting (?), but it might be helpful if a discussion along these lines were raised here.

For example, everyone is currently familiar with the problem of having to keep a variety of formats of the sequence databases on the same machine, each format designed/optimized for a different specific task. What efforts are being made to reduce this problem in future database designs? I'm sure there must be other nagging computational questions/problems that people cannot get efficiently resolved at present (ones which will become even worse as the amount of data increases).

This newsgroup is a chance for you to provide your input on these issues and to make contacts with others who might help. I should also add that members of the JITF and NIH officials monitor these postings, so good ideas will not go unnoticed.
--
Sincerely,

Dave Kristofferson
GenBank Manager

kristoff@genbank.bio.net
lloyd@bruce.cs.monash.OZ.AU (lloyd allison) (12/11/90)
In <39971@ucbvax.BERKELEY.EDU> aoki@postgres.Berkeley.EDU (Paul M. Aoki) writes:

> computationally expensive. There are many optimizations (hybrid
> dynamic programming/hashing, complex data structures like that found
> in the latest issue of "ACM Trans. on Information Systems") but it's
> still expensive to compare against the whole database.

Anyone have the full reference to this, please?

Lloyd ALLISON                        UUCP: lloyd@bruce.cs.monash.edu.au
Department of Computer Science,        or: uunet!munnari!bruce.cs.monash.edu.au!lloyd
Monash University, Clayton,           Tel: 565-5205
VICTORIA 3168, AUSTRALIA              FAX: +61 3 565 5146