[bionet.molbio.genome-program] General Reference

chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) (12/10/90)

Hi, 

I am a new reader of this group.  I am wondering if somebody on the net can
briefly explain this genome program or refer me to the associated articles/books.
In particular, I am interested in knowing how a computer scientist can help
with this project (e.g. databases and CAD), and what is the state of the art?
Thank you.

tzi-cker

aoki@postgres.Berkeley.EDU (Paul M. Aoki) (12/10/90)

chiueh@sprite.Berkeley.EDU (Tzi-cker Chiueh) writes:
> I am a new reader of this group.  I am wondering if somebody on the net can
> briefly explain this genome program or refer me to the associated articles/books.
> In particular, I am interested in knowing how a computer scientist can help
> with this project (e.g. databases and CAD), and what is the state of the art?

I'm going to take a stab at this, not because I claim to be an expert on
the subject, but because I've been digging into this issue (database
support for genome mapping) for a couple of weeks and want to bounce my
results against the experts ..

First, some specific advice.  Frank Olken up on the hill (Lawrence
Berkeley Labs) is involved with the project and can probably steer
you in the appropriate direction.  I assume you're still at UCB (I'm
not) so he should be easy to find.

There are a bunch of pop-science books on the US Human Genome Project;
you might get some background out of them, but none of them seem to 
focus much on computational aspects of the problem :-)  One reference 
I found particularly useful was a journal called "Computer Applications 
in the Biological Sciences" (often cited as CABIOS).  The articles are 
written by a mishmash of molbio types, CS theorists and hackers.  Not 
much background, but a lot of stuff on what people are working on in 
terms of real software.  Other journals on computers in medical or
biological science tend to focus on problems other than genome mapping.

The idea behind genome mapping is that if you can find the areas in
a given section of DNA that control certain biological functions,
you can perform all sorts of genetic engineering (elimination of
genetic/inherited diseases, selection of "good" genetic traits,
etc.).  To this end many nations, including the US, are pouring
hundreds of millions of dollars into their various national projects.

The biggest problem appears to be sequence matching.  Once molecular 
biologists extract gene segments and identify the base-pair (nucleotide)
or amino acid sequences, you wind up with a bunch of long text
strings ("ACCGTAA...") that describe these sequences.  A variety of
archives and formats exist for these sequences (for more information,
poke around on genbank.bio.net); suffice it to say that people are 
compiling sequence maps of many different species and annotating
the database entries with what they think different parts of the
sequences do.  The function of new sequences can be determined by
comparing them to current database entries, finding sequences that
are "close" by some metric or that share many close subsequences.
For example, one might find that the intestinal peptides of horses
and humans are very similar and that much of the mapping done on
the horse protein can be applied to the corresponding human protein.
Some algorithms to compute "closeness" are:
	1. Alignment: compute the "evolutionary distance" between
	   one protein and another in terms of simple operations
	   (substitution, insertion, deletion) and cost functions
	   for the operations.  Various dynamic programming
	   techniques are commonly used (a rough sketch follows
	   this list).
	2. Subsequence search: proteins that share many subsequences
	   get high scores.  Hashing is useful here (sketched a bit
	   further below).
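
To make the first idea concrete, here is a rough sketch of the textbook
dynamic-programming distance computation.  I'm writing it in Python
purely for illustration -- the unit costs and the function name are toy
choices of mine, not anybody's production code (real systems use
substitution matrices like Dayhoff's PAM rather than unit costs):

    # Toy edit distance: minimum total cost to turn sequence a into b
    # using unit-cost substitution, insertion and deletion.
    def edit_distance(a, b, sub=1, indel=1):
        # d[i][j] = minimum cost to turn a[:i] into b[:j]
        d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(1, len(a) + 1):
            d[i][0] = i * indel
        for j in range(1, len(b) + 1):
            d[0][j] = j * indel
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                cost = 0 if a[i-1] == b[j-1] else sub
                d[i][j] = min(d[i-1][j-1] + cost,   # substitute (or match)
                              d[i-1][j] + indel,    # delete from a
                              d[i][j-1] + indel)    # insert into a
        return d[len(a)][len(b)]

    print(edit_distance("ACCGTAA", "ACGGTA"))       # -> 2

The table costs O(nm) time and space per pair of sequences, which is
exactly why running this against every entry in a growing database hurts.
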
These comparison methods are limited, however.  First, applying them to long sequences
and large databases is hard because the distance functions tend to be
computationally expensive.  There are many optimizations (hybrid
dynamic programming/hashing, complex data structures like that found
in the latest issue of "ACM Trans. on Information Systems") but it's
still expensive to compare against the whole database.  Second, plain
sequence matching ignores a great deal of physical and chemical
information.  More complex comparison methods take these factors into
account and are even slower.  Third, distance (using any distance
metric) is only a push in the right direction.  The "closest" match
in terms of genetic function may well be one of the top 10 matches
by (say) the Dayhoff PAM metrics but not the top.  The insensitivity
of most metrics to "weak matches" is commonly cited.
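
And to be concrete about the hashing idea (item 2 above), here is the
kind of k-tuple counting that makes a cheap prefilter.  Again a toy
Python sketch; the tuple length and the raw-count score are arbitrary
choices of mine:

    # Index every k-mer of sequence a in a hash table, then stream the
    # k-mers of b against it.  Pairs scoring well become candidates
    # for the expensive alignment step above.
    def shared_kmers(a, b, k=4):
        index = {}
        for i in range(len(a) - k + 1):
            kmer = a[i:i+k]
            index[kmer] = index.get(kmer, 0) + 1
        return sum(index.get(b[j:j+k], 0)
                   for j in range(len(b) - k + 1))
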

So, what is to be done?  For some reason, AI techniques aren't very
popular -- people like brute-force, optimal-cost methods.  Parallel
programming is popular, since the dynamic programming computations
are easily parallelized (one group is using a Connection Machine,
another uses a Sequent, yet another uses the ICL DAP array processor).
Most database technology flies right out the window because the
databases are still small enough that a system that goes to disk a lot
will have horrible performance relative to more ad-hoc, main-memory-
oriented search software.  It seems to me the best bet for database
research is in the development of data structures like that in the 
TOIS article (but more space-efficient).
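
To show what "easily parallelized" means here: in the distance table
above, every cell on an anti-diagonal (i + j = constant) depends only
on the two previous anti-diagonals, so all the cells on one diagonal
can go to different processors.  Here is the same toy recurrence in
wavefront order (written sequentially, but the inner loop is the
parallel part):

    # Same recurrence as the earlier sketch, evaluated one anti-diagonal
    # at a time; each iteration of the inner loop is independent.
    def edit_distance_wavefront(a, b, sub=1, indel=1):
        n, m = len(a), len(b)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i * indel
        for j in range(m + 1):
            d[0][j] = j * indel
        for k in range(2, n + m + 1):            # anti-diagonal i + j = k
            for i in range(max(1, k - m), min(n, k - 1) + 1):
                j = k - i
                cost = 0 if a[i-1] == b[j-1] else sub
                d[i][j] = min(d[i-1][j-1] + cost,
                              d[i-1][j] + indel,
                              d[i][j-1] + indel)
        return d[n][m]
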

There are other computational problems, but most are like the basic
form of sequence matching in that they involve some kind of text
string matching.  (Computers are also being used to automate the
sequencing process, controlling lab equipment and whatnot, but that 
isn't much of a CS research problem.)

Anyway, I'll stop here and let people take potshots at this.  Let
me know what I'm misunderstanding and what I'm missing in terms of
major approaches or research areas.  Do bear in mind that two weeks
ago I didn't know a ribosome from an RFLP, so be patient with my misuse
of biological terminology.  Also, I'd be very interested in hearing 
what people are working on now.  (I noticed at least one USENETter,
Roy Smith at PHRI, published in CABIOS ..)
--
    Paul M. Aoki   |   aoki@postgres.Berkeley.EDU   |   ...!ucbvax!aoki

hunter@work.nlm.nih.gov (Larry Hunter) (12/11/90)

In response to Tzi-cker Chiueh's query about what a computer scientist
can do for/with genomic data, Paul Aoki writes:

   For some reason, AI techniques aren't very popular -- people like
   brute-force, optimal-cost methods.  Parallel programming is popular,
   since the dynamic programming computations are easily parallelized
   (one group is using a Connection Machine, another uses a Sequent, yet
   another uses the ICL DAP array processor).  Most database technology
   flies right out the window because the databases are still small
   enough that a system that goes to disk a lot will have horrible
   performance relative to more ad-hoc, main-memory- oriented search
   software.

Although Aoki's opinions are a helpful beginning, I have to take issue
with a couple of points.  There is actually quite a bit of AI being
done in genome-related areas, and the requirements of genome-related
databases (not only sequence, but protein structure, coarser grained
genetic maps, etc.) place significant pressure on existing database
technologies.

As for AI, I can point to more than 100 people listed in a database of
AI & molecular biology researchers that I maintain, doing work in very
diverse areas.  I have an article which surveys some of this work
(based on the talks given at 1990 AAAI Spring Symposium on AI &
Molecular Biology) which will appear in the next issue of the AI
Magazine.  You may note that the predicted secondary structure of the
principal neutralization determinant of HIV-1 on the cover of the 24
August 1990 issue of Science was generated by a neural network.

BTW, the AI/MB database, which contains information on research
interests and current projects of many people from around the world,
is publicly available.  It can be obtained by anonymous ftp from the
host lhc.nlm.nih.gov in the directory /pub/aimb-db, or by request to
the University of Houston email server.

Finally, although I am not an expert in database issues, I would
suggest contacting the National Center for Biotechnology Information
to find out about work in biosequence and other databases.  You can
download information from ncbi.nlm.nih.gov using anonymous ftp, or
send mail to federhen@ncbi.nlm.nih.gov.

Good luck.  There are many good computer science problems involved in
genome work; we need good computer scientists to attack them.
--
Lawrence Hunter, PhD.
National Library of Medicine
Bldg. 38A, MS-54
Bethesda, MD 20894
(301) 496-9300
(301) 496-0673 (fax)
hunter@nlm.nih.gov (internet)
hunter%nlm.nih.gov@nihcu (bitnet/earn)

kristoff@genbank.bio.net (David Kristofferson) (12/11/90)

Your question about what computer scientists can do is being debated,
but so far not on-line (which is something of a pity because the
subscribership to this newsgroup is larger than that of any other
BIOSCI group by now, I think, and must include computer scientists who
are also interested in this issue).  I recently returned from a
meeting in DC of the NIH/DOE Joint Informatics Task Force (JITF).  The
meeting was designed to investigate issues concerning sequence and
mapping databases so that JITF could make recommendations in this
area.  While several interesting talks were given, your posting
concerning what computer scientists could do raises a point which I
believe was not fully explored.  We heard from the various groups
involved with producing databases and heard something about future
plans, but there was not an extensive discussion of the suitability of
the various database designs vis-a-vis potential genome-sized
computing problems.  Perhaps this is the topic of another JITF meeting
(?), but it might be helpful if a discussion along these lines were
started here.

For example, everyone is currently familiar with the problem of having
to keep the sequence databases in a variety of formats on the same
machine, each format designed/optimized for a different specific task.
What efforts are being made to reduce this problem in future database
designs?  I'm sure that there must be other nagging computational
questions/problems that people cannot get efficiently resolved at
present (ones which will become even worse as the amount of data
increases).  This newsgroup is a chance for you to provide your input
on these issues and to make contacts with others who might help.  I
should also add that members of the JITF and NIH officials also
monitor these postings, so good ideas will not go unnoticed.
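
To make the format problem concrete: every flat-file format needs its
own reader, and every analysis program wants its own input layout.  As
a toy illustration only (a FASTA-style reader sketched in Python; the
normalized in-memory form here is an arbitrary choice of mine, not
anything the databases mandate):

    # Parse FASTA-style records into (name, sequence) pairs.  GenBank,
    # EMBL and the rest would each need their own reader producing the
    # same normalized structure.
    def read_fasta(path):
        records, name, chunks = [], None, []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if line.startswith(">"):
                    if name is not None:
                        records.append((name, "".join(chunks)))
                    name, chunks = line[1:], []
                elif line:
                    chunks.append(line)
        if name is not None:
            records.append((name, "".join(chunks)))
        return records
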
-- 
				Sincerely,

				Dave Kristofferson
				GenBank Manager

				kristoff@genbank.bio.net

lloyd@bruce.cs.monash.OZ.AU (lloyd allison) (12/11/90)

In <39971@ucbvax.BERKELEY.EDU> aoki@postgres.Berkeley.EDU (Paul M. Aoki) writes:

>computationally expensive.  There are many optimizations (hybrid
>dynamic programming/hashing, complex data structures like that found
>in the latest issue of "ACM Trans. on Information Systems") but it's
>still expensive to compare against the whole database.  Second, plain

Anyone have the full reference for this, please?

Lloyd ALLISON
Department of Computer Science, UUCP:lloyd@bruce.cs.monash.edu.au
Monash University, Clayton,     or  :uunet!munnari!bruce.cs.monash.edu.au!lloyd
VICTORIA 3168, AUSTRALIA        Tel :565-5205               FAX: +61 3 565 5146