[bionet.molbio.bio-matrix] In defense of the Genome Boondoggle

gunnell@FCRFV1.NCIFCRF.GOV ("Gunnell, Mark") (02/12/91)

In article <9102111731.AA00773@genbank.bio.net>
Ellington@frodo.mgh.harvard.edu (Deaddog) writes:

> In article <9102111606.AA25622@genbank.bio.net> 
> gribskov@FCRFV1.NCIFCRF.GOV ("Gribskov, Michael") writes:
> > I suppose that the cataloging of galaxies is a similar boondoggle,
> > in spite of the fact that this effort is currently leading to some of
> > the most important and interesting progress in astrophysics.  I guess
> > the real problem with these kinds of projects is that the day-to-day
> > work is tedious, and results only come in the long term.  Strange how
> > much of science falls in that category isn't it?
> 
> Ah, Michael, you really should ask for my opinion rather than just
> making one up for me.
> 
> Catalogue them galaxies!  Discover the secrets of cosmology; see how stars
> form; determine the mass of the Universe and how it is distributed; find
> amazing physical phenomena never before observed by human eyes.  Yes,
> all these and more can be yours if you just continue to fund astrophysics.
> A noble and worthy cause.
> 
> Make me a list of similar worth that has to do with the Genome Boondoggle.

Catalogue all human genes! Discover the functions of mapped genes; see how 
genes evolve; evaluate molecular evolution theories and how species originate;
find amazing biological phenomena never before observed by human eyes.  Yes,
all these and more can ...  etc.,etc. 8-)

Mark A. Gunnell
gunnell@ncifcrf.gov

elmo@troi.cc.rochester.edu (Eric Cabot) (02/12/91)

In article <9102111942.AA08834@genbank.bio.net> gunnell@FCRFV1.NCIFCRF.GOV ("Gunnell, Mark") writes:
>In article <9102111731.AA00773@genbank.bio.net>
>Ellington@frodo.mgh.harvard.edu (Deaddog) writes:
>
>> 
>> Make me a list of similar worth that has to do with the Genome Boondoggle.
>
>Catalogue all human genes! Discover the functions of mapped genes; see how 
>genes evolve; evaluate molecular evolution theories and how species originate;
>find amazing biological phenomena never before observed by human eyes.  Yes,
>all these and more can ...  etc.,etc. 8-)

You *must* be either kidding us or yourself!

But seriously, item 1 is hardly possible, item
2 is probably not possible, and the remaining items are not even
close to possible from a mere sequence determination of the (a?)
human genome.


--
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
Eric Cabot                              |    elmo@uhura.cc.rochester.edu
      "insert your face here"           |    elmo@uordbv.bitnet
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=

Ellington@Frodo.MGH.Harvard.EDU (Deaddog) (02/12/91)

In article <9102111942.AA08834@genbank.bio.net> gunnell@FCRFV1.NCIFCRF.GOV 
("Gunnell, Mark") writes:
> Catalogue all human genes! Discover the functions of mapped genes; see how
> genes evolve; evaluate molecular evolution theories and how species originate;
> find amazing biological phenomena never before observed by human eyes.

Ah, finally some meat.

1) Discover the function of mapped genes.  The Genome Initiative is not
necessary for this.  If the phenotype is important, then a directed effort 
can be made to clone and sequence a given gene.  The sequence of the human
genome can of course be used to find the sequence of genes mapped *after*
the genome sequence has been determined.  However, I submit that mapping/
sequencing which targets specific human diseases will proceed much 
faster and with less waste than the Genome Initiative itself.  Further, I
submit that theories having to do with developmental biology, organization of 
transcription units, regulatory phenomena, and so forth are most readily
answered by testing specific hypotheses/cloning specific genes.  The
Genome Initiative is massive overkill for these answers (see also (3),
below.)  

2) See how genes evolve; evaluate molecular evolution theories and how
species originate.  As a card-carrying molecular evolutionist (burn him!), 
I agree that these are indeed mouth-watering problems.  It is sad that
they are receiving so little support now.  But, given the Scourge of AIDS,
this is perhaps understandable.  What is not understandable is why the 
Herculean efforts required to sequence the Human Genome will yield more
information than comparative sequence analysis of limited regions or of
specific genes.  Imagine a grant proposal in which I proposed to sequence
lactate dehydrogenase genes from a huge diversity of organisms; this 
proposal would be at the very bottom of the funding heap.  Yet it would
say an immense amount about how genes evolve.  Imagine a grant proposal in
which I proposed to clone/sequence "all zinc finger proteins from three
developmentally different organisms."  This might be more fundable, would
again yield an immense amount of information about protein evolution 
and perhaps transcription/gene regulation, but would not require the
Human Genome Initiative.

3) Find amazing biological phenomena never before observed by human eyes.
I think not.  The genotype is relatively uninterpretable without
corresponding phenotypes (which is why I support Drosophila, Coli, etc.
sequencing but cannot get behind the human effort).  The only biological
phenomena that I can see being elucidated by the sequence of the human 
genome would be instances where phenotype = genotype; that is,
selfish DNAs such as transposable elements.  A very interesting 
phenomenon, but one that would again probably be nuked by a review
section if the experiment proposed was "I want to sequence the human
genome so that I can know the distribution of transposable elements
in an individual human."

So:  back to (one of) my original point(s):  In an era of limited funding,
directed research is essential.  Let each of the putative benefits of the
Genome Initiative be put up against other research proposals.  Let the 
molecular evolutionists who want to study bacterial speciation fight for 
the same money as those that want the sequence of every repetitive 
element on each human chromosome.  Let the developmental Drosophilists 
who are feeling the crunch compete against those who would assert that 
we can decode the series of steps by which an organ takes shape from the 
sequence of the human genome.  Let biological phenomena from neural 
networks to nematodes compete against whatever 'new' biological phenomena
the Genome Initiative will purport to discover.

There is so much *INTERESTING* science to be done.  And so little of it
will come from this genetic telephone book.  The question comes back to
how to get the most bang out of your buck.  For each question you say
the Genome Initiative will answer, I believe I can give you a dozen that
are cheaper and more interesting.  

Non-woof

(Getting less vitriolic, but not less arrogant or obnoxious.  Who knows,
maybe I'll even start using smileys or something.)

(Hmmm.  I actually had a thought.  How about we put a little check-off
box on NIH grant proposals--you know, just like on your tax returns.  "I
would like to earmark $5 for the Human Genome Initiative."  That way,
everyone that thinks it's a great idea can devote some of their funding
to it.  Then they could get discounts on the sequences of genes
of interest when they start flowing out of the sequencing sweatshops.
Just an idea.)

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/13/91)

In article <12145@ur-cc.UUCP> elmo@troi.cc.rochester.edu (Eric Cabot) writes:
>In article <9102111942.AA08834@genbank.bio.net> gunnell@FCRFV1.NCIFCRF.GOV ("Gunnell, Mark") writes:
>>In article <9102111731.AA00773@genbank.bio.net>
>>Ellington@frodo.mgh.harvard.edu (Deaddog) writes:
>>
>>> 
>>> Make me a list of similar worth that has to do with the Genome Boondoggle.
>>
>>Catalogue all human genes! Discover the functions of mapped genes; see how 
>>genes evolve; evaluate molecular evolution theories and how species originate;
>>find amazing biological phenomena never before observed by human eyes.  Yes,
>>all these and more can ...  etc.,etc. 8-)

>You *must* be either kidding us or yourself!
>But seriously, item 1 is hardly possible, item
>2 is probably not possible, and the remaining items are not even
>close to possible from a mere sequence determination of the (a?)
>human genome.

I think that Mark is exactly correct, and you have missed the point.  Having a
huge database full of human sequences opens vistas for those of us who know how
to use statistical tools to analyse sequences.  There are many things that can
be done.  Some of them include learning how to identify genes from raw
sequences alone.  Predictions can be tested - which leads to rapid discovery of
new genes.  I have been involved in two cases of this already (see Stormo et al
NAR 10:2997 1982 for the first example of gene identification by computer; the
second one is in preparation), and it certainly will happen more as people
use neural nets more.
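
To make "identifying sites from raw sequence" concrete, here is a minimal
sketch of the classic perceptron learning rule applied to short DNA windows.
It is an illustration only, not the code or data from the paper cited above:
the toy sequences, the function names (one_hot, score, train_perceptron) and
the zero threshold are all assumptions made for the example.

```python
# Minimal perceptron sketch: learn weights that separate "site" windows
# from "non-site" windows. All sequences here are made-up toy data.

BASES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a flat 0/1 vector, four entries per base."""
    vec = []
    for base in seq:
        vec.extend(1 if base == b else 0 for b in BASES)
    return vec

def score(weights, seq):
    """Dot product of the weight vector with the encoded sequence."""
    return sum(w * x for w, x in zip(weights, one_hot(seq)))

def train_perceptron(positives, negatives, max_epochs=100):
    """Perceptron rule: add misclassified positives, subtract misclassified negatives."""
    weights = [0] * (4 * len(positives[0]))
    for _ in range(max_epochs):
        errors = 0
        for seq in positives:
            if score(weights, seq) <= 0:      # should score above threshold
                weights = [w + x for w, x in zip(weights, one_hot(seq))]
                errors += 1
        for seq in negatives:
            if score(weights, seq) > 0:       # should score at or below threshold
                weights = [w - x for w, x in zip(weights, one_hot(seq))]
                errors += 1
        if errors == 0:                       # every training example is classified correctly
            break
    return weights

# Hypothetical training windows (assumed, equal length).
positives = ["AGGAGGT", "AGGAGGA", "TGGAGGT"]
negatives = ["TTTTCCC", "CCCTTTA", "ACACACA"]
weights = train_perceptron(positives, negatives)
```

Once trained, score(weights, window) > 0 flags a candidate window in new
sequence; any such prediction would still have to be tested at the bench.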

A straight sequencing of the genome will avoid the terrible biases that we
currently have in the GenBank database.  For example, the database is missing
the insides of introns.  If you think that these are not important, then you
may well be in for some super surprises later.  The phrase "junk DNA" is a
statement of ignorance, not scientific fact.  People currently chop off the
bases near the 3' sides of introns and don't report them in the database.  The
proof is that they often end 10, 20 or 30 bases from the splice junction.  This
would not happen if people reported all their data.  Unfortunately, this means
that people have thrown out important parts of splice junctions BECAUSE THEY
THOUGHT THEY WERE UN-IMPORTANT.  Do you follow?  People think something is not
important, so they don't report it in the database, or limit the reports, so
nobody discovers that it IS important!  Another example is the reporting of
only the coding sequence of a procaryotic gene, even though we KNOW that there
is a region upstream (the Shine/Dalgarno) which is important for translational
initiation.  Any statistical analysis of human sequences must be done carefully
to avoid biases from the highly over-represented immunoglobulin and MHC
sequences.  I'm sure you can think of other examples.  A complete sequence,
without any bias is the best way to get around this.  I think that that alone
justifies the project.
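
One way to picture the over-representation problem is to compare base
frequencies computed naively with frequencies in which each sequence family
counts once. The sketch below is only an illustration under assumptions: the
family labels, the toy sequences and the 1/family-size weighting are invented
for the example and are not a description of how anyone actually corrects
GenBank.

```python
from collections import Counter, defaultdict

def base_frequencies(records, weight_by_family=True):
    """Return base -> frequency for a list of (family_name, sequence) records.

    With weight_by_family=True each sequence is weighted by 1/(family size),
    so a family with many entries contributes as much as a family with one.
    """
    family_sizes = Counter(family for family, _ in records)
    totals = defaultdict(float)
    for family, seq in records:
        weight = 1.0 / family_sizes[family] if weight_by_family else 1.0
        for base in seq:
            totals[base] += weight
    grand_total = sum(totals.values())
    return {base: count / grand_total for base, count in totals.items()}

# Hypothetical records: three copies of an over-represented family, one other.
records = [
    ("immunoglobulin", "GGGG"),
    ("immunoglobulin", "GGGG"),
    ("immunoglobulin", "GGGG"),
    ("other", "AAAA"),
]

naive = base_frequencies(records, weight_by_family=False)
weighted = base_frequencies(records, weight_by_family=True)
```

In this toy case the naive estimate is dominated by the repeated family, while
the weighted one treats the two families equally. A complete, unselected
sequence would make this kind of patch unnecessary.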

The second major justification is the enormous boost to sequencing technology
that the project is making.  We are eventually going to be able to sequence
everybody's DNA in a few minutes.  This will have enormous medical implications,
since it will remove much guess work from medicine.

I also used to think that the project was foolish, but these reasons have
convinced me that it is worthwhile.

There is also the spirit of adventure.  Fred Blattner once pointed out that it
would be really neat (my words, not his) to have the entire sequence of E. coli
- simply because it would be the first time that we knew the entire
specification of a living organism.  (Viruses don't count since they are
dependent on the host.)

>Eric Cabot elmo@uhura.cc.rochester.edu elmo@uordbv.bitnet

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

Ellington@Frodo.MGH.Harvard.EDU (Deaddog) (02/13/91)

In article <2050@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom 
Schneider) defends the faith:

> learning how to identify genes from raw
> sequences alone.  Predictions can be tested - which leads to rapid 
> discovery of new genes.

As does PCR amplification or hybridization:  the analogue versions of your
digital statistical analyses.  The question is not whether some genes will
be identified, the question is (a) how many could already be identified 
without the sequence of the genome, and (b) whether the (IMO paltry)
number that remain will be worth the enormous cost?

Statisticians drool at the mounds of data to be created.  Researchers
who go begging want to shoot the statisticians.  Rather than pistols
at ten paces, how about each side trying to justify expenditures for
the same set of money?  

>  avoid the terrible biases that we
> currently have in the GenBank database.

I'm sorry, but this does not seem like a terribly important
problem.  GenBank is skewed.  Big deal.  It gets the job done.
We find genes, we miss some stuff.  Science slops along and
we still find those self-splicing introns and centromeres and
other cool things.  Without the sequence of the human genome.
And with many people happily employed (for now) producing 
gobs of worthwhile data. 

I mean, what's a good example of what we have missed?  We know 
the Shine/Dalgarno sequences.  We have learned far more from 
mutation than we would by sequencing a bacterial genome (note:  
sequencing the Coli genome is indeed a cool thing to do). 

And will the "insides of introns" generate data for 2 PNAS papers and 
a TIBS review, or will it actually be worth the billions of 
dollars it will take to properly correct this horrific accounting
error?

> I think that that alone justifies the project.

Please, go speak to any faculty of any public university.  Wear body
armor.

> The second major justification is the enormous boost to sequencing 
> technology that the project is making.

Good sequencing technology stands on its own.  It does not need the Genome
Boondoggle to help it along.  

> We are eventually going to be able to sequence
> everybody's DNA in a few minutes.

Matrix-teers:  Is this nuts or what?  I've never seen this before, but
if it is even remotely true, I'll eat the small plastic rats that reside 
on the top of my terminal.

> There is also the spirit of adventure.

There is also the whiff of despair pouring out of research labs across
the U.S.  Alleviate that stench, then sequence your genome.

Non-woof

elmo@troi.cc.rochester.edu (Eric Cabot) (02/14/91)

In article <2050@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>I think that Mark is exactly correct, and you have missed the point.  Having a
>huge database full of human sequences opens vistas for those of us who know how
>to use statistical tools to analyse sequences.  There are many things that can
>be done.  Some of them include learning how to identify genes from raw

(much stuff deleted)
Ok, I agree that it is possible to use statistical methods to infer that
a given sequence contains a "gene". If I read your perspective correctly
(and ignoring the self-back patting) the main goal is to beef up the
database so that we can find new genes, whether functional or not.
I'm sorry, but I just don't see that as cost effective, given that we won't
have the slightest inkling of what most of these genes are supposed to
do. 

>
>A straight sequencing of the genome will avoid the terrible biases that we
>currently have in the GenBank database.  For example, the database is missing

Oh really? Wouldn't you say that concentrating on coli, fly, worm, yeast
human and maybe a plant species puts a bit of bias into the database?

>the insides of introns.  If you think that these are not important, then you
>may well be in for some super surprises later.  The phrase "junk DNA" is a
>statement of ignorance, not scientific fact.  People currently chop off the
>bases near the 3' sides of introns and don't report them in the database.  The
>proof is that they often end 10, 20 or 30 bases from the splice junction.  This
>would not happen if people reported all their data.  Unfortunately, this means
>that people have thrown out important parts of splice junctions BECAUSE THEY
>THOUGHT THEY WERE UN-IMPORTANT.  Do you follow?  People think something is not
>important, so they don't report it in the database, or limit the reports, so
>nobody discovers that it IS important!  Another example is the reporting of

(Nothing deleted because I am in complete agreement. Oh how I have ranted
and raved about missing intron sequences.)  But
frankly, I don't follow if this is part of the defense of the genome project.
Sure it'd be great to have chromosome long tracts of sequences to infer
genome organization but will we really be able to make sense out of
it all using the sequence data alone? Take the case of upstream control
regions, their significance was worked out for the most part by experimental
techniques.  Those results are the stuff that are used to generate rules
for sequence analysis. Not the other way around. 

>
>The second major justification is the enormous boost to sequencing technology
>that the project is making.  We are eventually going to be able to sequence
>everybody's DNA in a few minutes.  This will have enormous medical implications

Ok, that's a valid argument. There's nothing like technological advancement.

--
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=
Eric Cabot                              |    elmo@uhura.cc.rochester.edu
      "insert your face here"           |    elmo@uordbv.bitnet
=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=

BROE@AARDVARK.UCS.UOKNOR.EDU (Bruce Roe) (02/14/91)

Andrew,
	I'm sending you this directly rather than posting to the
net.  If you think it will add to the public discussion I'll post
it, or modify it before posting...............

	I've been very quiet during the recent rash of discussion
regarding funding, the Human Genome Project (HGP), etc. and enjoyed
reading all the FLAMES, SLAMS, and APOLOGIES as well as the more
serious discussions.

	As I see it, the HGP is a very innovative way to increase
funding opportunities for serious science and a lot of serious,
hard core science is being supported by the HGP.  Having been on
several Human Genome study sections I can assure the community that
these are of the same rigor as other study sections I have served
on (including Biochemistry and Physiological Chem. study sections).
Only the BEST science is being supported after a thorough peer review.

Andrew D. Ellington writes:
> Let these projects compete in the normal grant pool:
> if they are worthy they will be funded and science will be better off.

	This is the bottom line to his argument, and the REAL problem
he has with the HGP.  It is a fear that the HGP will drain funds from
other research programs.  The truth is that it will, but not in the way
he and others think.  Read on please.....

I would argue that Dr. Ellington really doesn't want the HGP grants
to be put in the general grant pool and compete with others BECAUSE,
the grants I've reviewed were so good and filled with so much real
science and would be given such high priority scores that they not
only would be funded but would be at the top of the heap.  Sure,
some of the grants were cow-dung and given low scores but others
contained extremely exciting science and were given priority scores
accordingly.  The rigor of grant review at the HGP study sections
is at the highest standards and even higher than other study sections
I've been on.  The members of the HGP study sections are scientists
such as ourselves with diverse interests but with the common goal of
rigorously reviewing each proposal and scoring on merit. Dr. Ellington
is way off base here and maybe should re-think his position based on
the above.

Dr. Ellington also writes:
> If not, at least we won't have the built-in impetus to clone and
> sequence random DNA for no good reason.

	This is pure crap.  No one is sequencing random DNA for no
good reason.

Andrew, the following is FYI:
	At this point in the HGP very little human DNA sequencing is
being supported.  The bulk of the efforts, on the ACTUAL SEQUENCING
projects is aimed at model organisms, E. coli, Yeast, C. elegans, and 
Mycobacteria.  The work on mammalian genomes mostly involves mapping,
determining STS's, etc.  A couple of groups, us included, are attempting
to sequence small (ca. several hundred thousand base pairs) regions of
extreme biological significance.  Our bit is the c-abl gene on chrom. 9
and the bcr gene on chrom. 22, which are involved in the chromosomal
translocation which causes the major forms of leukemia.  Our purpose
is to understand the sequences and subsequent events which result
in this biological phenomenon (chromosomal translocation).  That's
hard core science as I see it and that's the kind of stuff given
the highest scores by peer reviews.

	To address this issue we have to figure out how to sequence
almost half a million bases without going completely crazy and without
creating a "Sequencing Sweat Shop".  With the present sequencing 
approaches, we only can do this if we first design the experiments
aimed at understanding the physics of how we can improve the separation
of nested fragment sets generated during the dideoxynucleotide sequencing
reaction, the biochemical parameters (Km's, Vmax's, and Kd's, etc) for
various DNA polymerases and the reactions they catalyze, all the way
to developing new algorithms for obtaining relevant information from the
massive data we will generate.  There's science all along the way and
many questions for which we do not yet know the answer.

	The HGP also is bringing a new group of scientists into biology.
These are physicists, engineers, chemists and others who are investigating
new and more efficient ways of mapping and sequencing.  In the long term
it may be possible to "sequence a single human's genome in a few minutes"
but clearly this is a long way off.  Yes, PCR, hybridization blots, etc
are clinically useful today but for the future who knows.  Automated
sequencing of PCR fragments in every hospital within a decade?

	When you drive from Boston to New York City you need a road
map.  That's really what we will be getting from the HGP, a road map
of every gene, every alu sequence, every intron, every CG island, etc.
This will be extremely useful information and the cost will be much
less than if done piece meal.  Speaking of costs, there is a big debate
within the sequencing community and the HGP regarding the actual real
cost of sequencing.  Believe it or not, the REAL cost of sequencing
in the average molecular biology lab. is over $10/base final sequence.
You may not believe it, since we're not used to calculating costs including
indirect costs, equipment already in place, etc., but >$10/base is a good number.
For those of us whose labs are involved in lots of sequencing, the cost
is something between $2.50 and $3.50/base final sequence and it will take
a lot to cut the cost to < $0.50/base within the first 5 years of the HGP.
If we (the "professional sequencers") can cut the cost by a factor of 20,
think of the benefit to the average mol. biol. lab.  That's more money
for Joe Biologist to spend doing other experiments.
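
The per-base figures above are easy to turn into totals. The sketch below is
plain arithmetic on the numbers quoted in this post; the one outside figure is
the genome size of roughly 3 billion base pairs, which is an assumption added
for scale and is not stated here. Note that the "factor of 20" corresponds to
going from $10/base to $0.50/base.

```python
# Worked arithmetic for the per-base sequencing costs quoted in the post.
COST_AVERAGE_LAB = 10.00        # $/base, "average molecular biology lab"
COST_SEQ_LAB_LOW = 2.50         # $/base, labs doing lots of sequencing (low end)
COST_SEQ_LAB_HIGH = 3.50        # $/base, labs doing lots of sequencing (high end)
COST_TARGET = 0.50              # $/base, five-year target

REGION_BASES = 500_000          # "almost half a million bases"
GENOME_BASES = 3_000_000_000    # ASSUMPTION: ~3 billion bp, not given in the post

def total_cost(bases, cost_per_base):
    """Total dollars to produce `bases` of final sequence at a flat per-base rate."""
    return bases * cost_per_base

# Ratio between the average-lab cost and the target cost.
reduction_factor = COST_AVERAGE_LAB / COST_TARGET

rates = [
    ("average lab", COST_AVERAGE_LAB),
    ("sequencing lab, low", COST_SEQ_LAB_LOW),
    ("sequencing lab, high", COST_SEQ_LAB_HIGH),
    ("target", COST_TARGET),
]
for label, rate in rates:
    region = total_cost(REGION_BASES, rate)
    genome = total_cost(GENOME_BASES, rate)
    print(f"{label:22s} region: ${region:>13,.0f}   genome: ${genome:>17,.0f}")
```

At a flat rate, the half-megabase region costs $5 million at $10/base and
$250,000 at the $0.50 target; the same rates applied to the assumed genome
size give $30 billion and $1.5 billion.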

	I suggest that as a critic of the Human Genome Project you should:

1. Get your facts straight and see who and what is being supported.
2. Get involved in this project and help us solve the basic,
   fundamental questions which MUST be solved before we realistically
   can begin to actually sequence and understand what we've sequenced.

	There are many new scientific observations that will come out
of this project and many new scientific questions for which experiments
must be designed.  We now must begin to remove our heads from the sand
and start thinking of how we will address these scientific issues rather
than wasting our time on the political discussion.

	By the way, don't take those "remove my name from the list"
messages personally.  They occur all the time and I can't see how
this discussion has in any way prompted more of them.

Best regards,
--Bruce Roe

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/15/91)

In article <5714@husc6.harvard.edu> Ellington@Frodo.MGH.Harvard.EDU (Deaddog):
>In article <2050@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom 
>Schneider):

>> learning how to identify genes from raw
>> sequences alone.  Predictions can be tested - which leads to rapid 
>> discovery of new genes.
>
>As does PCR amplification or hybridization:  the analogue versions of your
>digital statistical analyses.

Wrong.  Those techniques only allow one to jump from previously identified
sequences in other species to the human sequence.  This is a wonderful thing,
but it does not allow one to take a pure raw sequence and identify the genetic
control systems in it.  The difference is that those techniques are only
techniques, not theoretical understanding.  And if you are going to pooh-pooh
theoretical understanding, then I have some papers for you to read!  Start
with:

@article{StormoPerceptron1982,
author = "G. D. Stormo
 and T. D. Schneider
 and L. Gold
 and A. Ehrenfeucht",
title = "Use of the `Perceptron' algorithm to distinguish translational
initiation sites in {E. coli.}",
year = "1982",
journal = "Nucl. Acids Res.",
volume = "10",
pages = "2997-3011"}

>  The question is not whether some genes will
>be identified, the question is (a) how many could already be identified 
>without the sequence of the genome, and (b) whether the (IMO paltry)
>number that remain be worth the enormous cost?

I'm sure that we can continue on the blind route we are following and find lots
of interesting things eventually.  The US road system comes to mind.  Sure, we
could have survived without a network of major roads.  But having started on
the big project, we were able to become much more integrated as a society, and
now it is hard to imagine not having superhighways (or are they merely
PARKways?  And why is the place one parks the car in the DRIVEway?? :-).
Similar things could be said about a uniform telephone system:  we have (had??)
the best in the world because people at Bell labs thought big.  A third example
is the improvement in making maps that Landsat and other satellites have given
us.  And, yes, Arpanet turned into internet.  In all these cases we start off
ad hoc and then eventually learn to do things systematically.  Consider the cow
paths you use to get to work!  (I refer to the roads of Boston.)  Would you
like to use muddy winding paths?  The genome project is merely a recognition
that we are close to the time that we can make our maps in a direct logical way
rather than piece meal.

>Statisticians drool at the mounds of data to be created.

And so might the rest of the biologists.  They can use the data to direct
their experiments more effectively.  If they are afraid of math and computers
(is that your problem?? :-) then there are plenty of theoretical-types whom
they can team up with.

>>  avoid the terrible biases that we
>> currently have in the GenBank database.
>
>I'm sorry, but this does not seem like a terribly important
>problem.  GenBank is skewed.  Big deal.  It gets the job done.
>We find genes, we miss some stuff.  Science slops along and
>we still find those self-splicing introns and centromeres and
>other cool things.  Without the sequence of the human genome.
>And with many people happily employed (for now) producing 
>gobs of worthwhile data. 

The problem is here, and getting worse.  You apparently haven't tried
to make a consistent dataset from the data in GenBank.  It's a tough job!
The point about the genome project is that we don't need to miss anything
anymore.  You seem to have the idea that some genes are not important,
and that 'junk' DNA exists in the genome.  Consider that this merely
is a way for you to express to the rest of us how ignorant you are.
(We are also, but we admit it.  Do you admit that you are ignorant?)

>I mean, what's a good example of what we have missed?  We know 
>the Shine/Dalgarno sequences.

Well, you missed the other statistically important features that were
discovered by looking at the sites more carefully.  See:

@article{Gold1981,
author = "L. Gold
 and D. Pribnow
 and T. Schneider
 and S. Shinedling
 and B. S. Singer
 and G. Stormo",
title = "Translational initiation in prokaryotes.",
year = "1981",
journal = "Annu. Rev. Microbiol.",
volume = "35",
pages = "365-403"}

@article{StormoInitiation1982,
author = "G. D. Stormo
 and T. D. Schneider
 and L. M. Gold",
title = "Characterization of translational initiation sites
in {{E. coli.}}",
year = "1982",
journal = "Nucl. Acids Res.",
volume = "10",
pages = "2971-2996"}

@article{Schneider1986,
author = "T. D. Schneider
 and G. D. Stormo
 and L. Gold
 and A. Ehrenfeucht",
title = "Information content of binding sites on nucleotide
sequences",
year = "1986",
journal = "J. Mol. Biol.",
volume = "188",
pages = "415-431"}
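
For readers who want a feel for what "looking at the sites more carefully"
can mean, here is a minimal sketch of one common way to quantify conservation
in a set of aligned sites: per-position information in bits, computed as
2 minus the Shannon entropy of the observed bases. This is a simplified
illustration, not necessarily the exact procedure of the papers above: it
assumes equiprobable background bases, applies no small-sample correction,
and the example sites and the function name are made up.

```python
import math
from collections import Counter

def information_per_position(sites):
    """Return the information (bits) at each column of equal-length aligned DNA sites.

    Information is 2 - H, where H is the Shannon entropy of the base
    frequencies in that column (2 bits is the maximum for four bases).
    """
    n = len(sites)
    info = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
        info.append(2.0 - entropy)
    return info

# Hypothetical aligned sites (assumed toy data).
sites = ["AGGA", "AGGT", "AGCA", "AGTC"]
per_position = information_per_position(sites)
total_bits = sum(per_position)
```

Columns where every site carries the same base contribute the full 2 bits;
variable columns contribute less. Summing over the site gives a single number
that can be compared between different kinds of binding sites.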

> We have learned far more from 
>mutation than we would by sequencing a bacterial genome (note:  
>sequencing the Coli genome is indeed a cool thing to do). 

This is a completely flip statement, with no foundation since you didn't
quantitate your answer and the experiment has not been done.  (But I do agree
that getting that sequence will be cool.)  Genetics is certainly a powerful way
to approach biological problems.  But once one has defined a biologically
interesting system, direct methods can produce answers that would be difficult
if not impossible to get by genetics.  For example, the sequence of a gene, or
exactly what bases are important for a promoter to function.  See:

@article{Schneider1989,
author = "T. D. Schneider
 and G. D. Stormo",
title = "Excess Information at Bacteriophage {T7} Genomic Promoters
Detected by a Random Cloning Technique",
year = "1989",
journal = "Nucl. Acids Res.",
volume = "17",
pages = "659-674"}

>And will the "insides of introns" generate data for 2 PNAS papers and 
>a TIBS review,

yes.  The work of Andrzej Konopka is an example you seem to have missed.

>or will it actually be worth the billions of 
>dollars it will take to properly correct this horrific accounting
>error?

Your mistake here is to suggest that the genome project would only give these
data.  It would give much other data also.

>> The second major justification is the enormous boost to sequencing 
>> technology that the project is making.
>
>Good sequencing technology stands on its own.  It does not need the Genome
>Boondoggle to help it along.

You have missed the point.  The project will focus more people on the
problems of sequencing, and the art will improve as a result.

>> We are eventually going to be able to sequence
>> everybody's DNA in a few minutes.
>
>Matrix-teers:  Is this nuts or what?  I've never seen this before, but
>if it is even remotely true, I'll eat the small plastic rats that reside 
>on the top of my terminal.

Ever heard of nanotechnology?  Well, bone up if you are ignorant.  I'll forgive
you, you don't need to eat those rats.

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

toms@fcs260c2.ncifcrf.gov (Tom Schneider) (02/15/91)

In article <12180@ur-cc.UUCP> elmo@troi.cc.rochester.edu (Eric Cabot) writes:
>In article <2050@fcs280s.ncifcrf.gov> toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:
>>I think that Mark is exactly correct, and you have missed the point.  Having a
>>huge database full of human sequences opens vistas for those of us who know how
>>to use statistical tools to analyse sequences.  There are many things that can
>>be done.  Some of them include learning how to identify genes from raw
>
>(much stuff deleted)

>Ok, I agree that it is possible to use statistical methods to infer that
>a given sequence contains a "gene". If I read your perspective correctly
>(and ignoring the self-back patting) the main goal is to beef up the

sorry about that.

>database so that we can find new genes, whether functional or not.
>I'm sorry, but I just don't see that as cost effective, given that we won't
>have the slightest inkling of what most of these genes are supposed to
>do. 

I would not take it as the main goal; it is one of many worthy goals.

Also, once a gene is located, it can be deleted (in mouse or wherever), and
then one can play the usual powerful genetic tricks to figure out the
function.  The sequence is merely a starting point.  Since it (supposedly!)
contains all the information about the biology, in encrypted form, it is nice
to have it for starters.  I've always been amazed when people would put off
sequencing "their" gene for a long time, since one gets such a huge amount of
solid data from the sequence.

>>A straight sequencing of the genome will avoid the terrible biases that we
>>currently have in the GenBank database.  For example, the database is missing
>
>Oh really? Wouldn't you say that concentrating on coli, fly, worm, yeast
>human and maybe a plant species puts a bit of bias into the database?

Interesting point.  I suppose it comes from my biased view of analyzing the
binding sites from one species at a time so as to avoid the assumption that the
recognizer (ie DNA binding protein, ribosome, polymerase, repressor or
whatever) is the same in all species.  (We know lots of cases where it's not.)
So I am happy if I have a complete genome to work within.  But after finishing
one, one needs to do the others to answer evolutionary questions, and you are
right, there is a huge diversity out there to be sequenced.  So until we can
sequence genomes quickly (minutes), I suppose the best we can do is to choose
the few organisms which have had lots of good genetics done on them.  I'm glad
to see that these other organisms are considered part of the project!  When I
first heard of the project I disliked it because I thought that coli wouldn't
get done first as a 'pilot'.

>>the insides of introns.  If you think that these are not important, then you
>>may well be in for some super surprises later.  The phrase "junk DNA" is a
>>statement of ignorance, not scientific fact.  People currently chop off the
>>bases near the 3' sides of introns and don't report them in the database.  The
>>proof is that they often end 10, 20 or 30 bases from the splice junction.  This
>>would not happen if people reported all their data.  Unfortunately, this means
>>that people have thrown out important parts of splice junctions BECAUSE THEY
>>THOUGHT THEY WERE UN-IMPORTANT.  Do you follow?  People think something is not
>>important, so they don't report it in the database, or limit the reports, so
>>nobody discovers that it IS important!  Another example is the reporting of

>(Nothing deleted because I am in complete agreement. Oh how I have ranted
>and raved about missing intron sequences.)  But
>frankly, I don't follow if this is part of the defense of the genome project.
>Sure it'd be great to have chromosome long tracts of sequences to infer
>genome organization but will we really be able to make sense out of
>it all using the sequence data alone? Take the case of upstream control
>regions, their significance was worked out for the most part by experimental
>techniques.  Those results are the stuff that are used to generate rules
>for sequence analysis. Not the other way around. 

That's because theoretical concepts have not been strong enough to date.  I
think that this will change.  Not to be back patting (will you excuse me?? :-),
but the example I know best is my own.  E. coli ribosome binding sites have
about 11.0 bits of pattern.  I was pretty surprised to find that the
information needed to locate the sites in the genome is about 10.6 bits!  This
correlation seems to hold for other genetic systems.  The idea (working
hypothesis) is that the amount of pattern at binding sites is in general just
enough to locate the sites in the genome.  Then I studied T7 RNA polymerase
promoters and found that they contained too much sequence pattern (35 bits of
pattern) compared to what is needed to locate them in the genome (16 to 17
bits).  This meant that either the hypothesis was wrong or something
interesting was happening at T7 promoters.  Perhaps another protein binds
there, and this accounts for the "excess" information.  If so, I should be able
to delete the excess information.  It took me a while, but I did the experiment
and found that 18 +/- 2 bits are all that the polymerase needs!  So the
hypothesis survived.  The experiment would not have been done without the
theoretical analysis.  I have another case like this that I'm writing up now.
So the idea of doing experiments first is only a tradition of molecular
biology.  Theoretical understanding can also play a role.  References to this
story can be found in:

@article{Schneider.Stephens.Logo,
author = "T. D. Schneider
 and R. M. Stephens",
title = "Sequence Logos: A New Way to Display Consensus Sequences",
journal = "Nucl. Acids Res.",
volume = "18",
pages = "6097-6100",
year = "1990"}
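
For readers who want to try the comparison described above, here is a minimal
sketch in Python.  It is not the code used in the work cited; the function
names and the toy inputs are assumptions made for illustration.  The "bits of
pattern" are taken as the sum over aligned positions of (2 - H), where H is
the Shannon entropy of the four bases at that position, with no small-sample
correction.  The "information needed to locate the sites" is taken as
log2(G / gamma) for gamma sites among G possible positions.

```python
import math


def bits_to_locate(possible_positions, num_sites):
    """Information (bits) needed to pick num_sites out of possible_positions."""
    return math.log2(possible_positions / num_sites)


def bits_of_pattern(aligned_sites):
    """Sum over positions of (2 - H), H = Shannon entropy of the base column.

    Assumes equal-length DNA strings over A, C, G, T and applies no
    small-sample correction, so small alignments will be overestimated.
    """
    length = len(aligned_sites[0])
    total = 0.0
    for pos in range(length):
        column = [site[pos] for site in aligned_sites]
        entropy = 0.0
        for base in "ACGT":
            freq = column.count(base) / len(column)
            if freq > 0:
                entropy -= freq * math.log2(freq)
        total += 2.0 - entropy
    return total


if __name__ == "__main__":
    # Made-up numbers, chosen only to show the calculation.
    print(bits_to_locate(4.7e6, 3000))
    # Toy alignment: two fully conserved positions, two variable ones.
    print(bits_of_pattern(["ACGT", "ACTA", "ACGA", "ACTT"]))
```

If the two numbers come out close, the pattern is just enough to find the
sites; a large excess, as with the T7 promoters above, is the signal that
something else may be going on.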

>Eric Cabot                              |    elmo@uhura.cc.rochester.edu
>      :-):-):-):-):-):-):-):-)          |    elmo@uordbv.bitnet

  Tom Schneider
  National Cancer Institute
  Laboratory of Mathematical Biology
  Frederick, Maryland  21702-1201
  toms@ncifcrf.gov

elliston@av8tr.UUCP (Keith Elliston) (02/16/91)

In article <2054@fcs280s.ncifcrf.gov>, toms@fcs260c2.ncifcrf.gov (Tom Schneider) writes:

> And if you are going to pooh-pooh
> theoretical understanding, then I have some papers for you to read!  Start
> with:
> 
> author = "G. D. Stormo
>  and T. D. Schneider
>  and L. Gold
>  and A. Ehrenfeucht",
> title = "Use of the `Perceptron' algorithm to distinguish translational
> initiation sites in {E. coli.}",
> year = "1982",
> journal = "Nucl. Acids Res.",
> volume = "10",
> pages = "2997-3011"}
> 
> author = "L. Gold
>  and D. Pribnow
>  and T. Schneider
>  and S. Shinedling
>  and B. S. Singer
>  and G. Stormo",
> title = "Translational initiation in prokaryotes.",
> year = "1981",
> journal = "Annu. Rev. Microbiol.",
> volume = "35",
> pages = "365-403"}
> 
> author = "G. D. Stormo
>  and T. D. Schneider
>  and L. M. Gold",
> title = "Characterization of translational initiation sites
> in {{E. coli.}}",
> year = "1982",
> journal = "Nucl. Acids Res.",
> volume = "10",
> pages = "2971-2996"}
> 
> author = "T. D. Schneider
>  and G. D. Stormo
>  and L. Gold
>  and A. Ehrenfeucht",
> title = "Information content of binding sites on nucleotide
> sequences",
> year = "1986",
> journal = "J. Mol. Biol.",
> volume = "188",
> pages = "415-431"}
> 
> author = "T. D. Schneider
>  and G. D. Stormo",
> title = "Excess Information at Bacteriophage {T7} Genomic Promoters
> Detected by a Random Cloning Technique",
> year = "1989",
> journal = "Nucl. Acids Res.",
> volume = "17",
> pages = "659-674"}

Nice C.V. Tom....   

Not to pick nits or anything, but this discussion seems to be going
overboard.  Perhaps it is time to drop the dueling keyboards for a moment, and
spend some time in the thought process.  I would dearly love to see a sort
of debate go on on this subject (genome sequencing), but would like to suggest
a more refined type of discussion.  Perhaps we could have an address by
the proponents of the genome sequence initiative, formulated collectively
by a group via e-mail.  We could then also have an address by a group
that is against the genome initiative, again collectively written by 
a group.  This might reduce the flaming, and bring us all to a more in-depth
understanding of the issues from both sides.  We could then discuss the
issues that are disparate and not have to be reduced to singular views
or flaming wars.

I suspect that a large number of the people reading these messages are like
me.... in that I see many advantages to genome sequencing, but also quite
a few shortcomings/disadvantages as well.

So... who wants to be the head of the Pro Genome sequencing group (Tom??)
How about the Anti Genome sequencing group? (Did I hear a woof.. or was
it a non-woof?)

Well....

Keith "where did Rob Harper go?" Elliston
elliston@msdrl.com

-- 
Keith O. Elliston          elliston@av8tr.UUCP           elliston@msdrl.com
AA5A N9734U                elliston@mbcl.rutgers.edu     elliston@biovax.bitnet

"Fly because you have to, to keep some semblance of sanity."

kristoff@genbank.bio.net (David Kristofferson) (02/16/91)

Keith,

	Sounds like an excellent suggestion if someone will take up
the challenge.  I suggest a chronological format such as the
following:

1) Initial Arguments for and against the Genome Project - these should
   be posted essentially simultaneously without either side getting to
   review the other's comments.  I would be happy to hold the finished 
   reports and release them together when both are completed.

2) Rebuttals to the above - again released simultaneously

3) return to free-for-all 8-)???

Any takers?

Dave