[bionet.molbio.evolution] Sequence similarity: statistical analysis of

MJB1@PHX.CAM.AC.UK (04/06/88)

From: MJB1%PHX.CAM.AC.UK@CUNYVM.CUNY.EDU

Surely the point here is that there are an infinite number of statistical
models of sequence similarity.  There is no problem in assigning significance
under a particular model, though there may well be a problem in assessing
its biological relevance.  I think the questions being asked should be
(1) What is a good model for the similarity of molecular sequences.
(2) How can one assess the biological relevance of statistical significance
in relation to a particular model.
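
As a minimal sketch of what assigning significance under one particular
model can look like, the Python below scores two equal-length sequences by
counting identical positions, then compares that score with the scores
obtained after shuffling one of them.  Everything in it is an assumption
made for illustration: the scoring rule, the shuffle null model (base
composition kept, order random), and the names identity_score and
shuffle_p_value come from neither post.

```python
import random


def identity_score(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))


def shuffle_p_value(a, b, trials=1000, seed=0):
    """Estimate P(score >= observed) when b is randomly shuffled.

    Assumed null model: b keeps its base composition, but the order
    of its bases is random.
    """
    rng = random.Random(seed)
    observed = identity_score(a, b)
    pool = list(b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(pool)
        if identity_score(a, pool) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)
```

A small value says only that the observed score is unusual under this
shuffle model; whether that matters biologically is question (2).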

Put in this way, one soon realises that the original problem has been framed
in too broad a way.  What are the conditions relating to the comparison,
surely not just that we have sequenced two bits of DNA and want to know
how similar they are (though it could be that if you insist).

People should worry more about the conditions relating to the particular
problem and try to get experimental evidence about biologically relevant
parameters.  To emphasise the point about conditions consider the old coin
tossing problem. We all know that we come up heads half the time and tails
half the time.  But do we... the coin rolled down the drain and the
result was indeterminate.  My friend has made a ballistic machine which
tosses the coin so that the way it lands depends which way it was placed
on the machine before tossing.

How much more complex then are the conditions under which DNA evolves.
Trying to improve our knowledge about that for specific gene families
would be a good thing to attempt.  A completely general model is
too broad and naive to be useful, I suspect.

dbd%benden@LANL.GOV (04/08/88)

From: dbd%benden@LANL.GOV (Dan Davison)


 Surely the point here is that there are an infinite number of statistical
 models of sequence similarity. 

Yes.  

 There is no problem in assigning significance
 under a particular model, though there may well be a problem in assessing
 its biological relevance. 

One of the most common mistakes I encounter among molecular biologists who
are looking at search results is "This result isn't statistically significant,
so I can ignore it".  Enhancer and TATA boxes are examples of statistically
insignificant matches that are biologically significant.  In these cases
there is an additional element--position--that determines the biological
significance.  Only the biologist (or an AI tool) can do such analysis.
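
The arithmetic behind this can be sketched under a deliberately crude
assumption: bases are independent and equally likely, and only exact
matches on one strand are counted.  The function name
expected_chance_matches and the example numbers (a 6-base motif, a
1,000,000-base sequence) are hypothetical, not taken from either post.

```python
def expected_chance_matches(motif_length, sequence_length, base_prob=0.25):
    """Expected number of exact matches to one fixed motif on one strand.

    Assumes independent bases, each matching with probability base_prob.
    """
    positions = sequence_length - motif_length + 1
    if positions <= 0:
        return 0.0
    return positions * base_prob ** motif_length


if __name__ == "__main__":
    # Hypothetical example: a 6-base motif against 1,000,000 bases.
    print(expected_chance_matches(6, 1_000_000))
```

Under these assumptions a 6-base motif is expected about 244 times by
chance in 1,000,000 bases, so a single match carries no statistical weight
on its own.  Position is exactly what this count ignores.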



 I think the questions being asked should be
 (1) What is a good model for the similarity of molecular sequences.

How about "what is a good model for assessing the similarity of molecular
sequences"? I think this is what you mean.

 (2) How can one assess the biological relevance of statistical significance
 in relation to a particular model.

 Put in this way, one soon realises that the original problem has been framed
 in too broad a way.  What are the conditions relating to the comparison,
 surely not just that we have sequenced two bits of DNA and want to know
 how similar they are (though it could be that if you insist).

As you can tell from my remarks above, I agree with this statement, with a
caveat.  I have next to my desk a printout 14 inches thick of 8 point type.
It is the result (accidental) of specifying too low a similarity criterion 
to a library search routine.  Suppose that this search was an enhancer core
against all of GenBank.  Every bit of that printout would be potentially
biologically significant.  However, it would take a month (or more) of effort
to check the biological significance of each result.  We must have ways of
sifting through the incredible amount of output that will be generated by
similarity comparisons.  The best method at the moment is by using statistical
significance.  The quality of the statistical model used will determine how
much of the search space is *productively* examined.  This certainly cuts
out much information that is of biological significance, but *at present there
is no automated way of assessing biological significance*.

 People should worry more about the conditions relating to the particular
 problem and try to get experimental evidence about biologically relevant
 parameters.  To emphasise the point about conditions consider the old coin
 tossing problem. We all know that we come up heads half the time and tails
 half the time.  But do we... the coin rolled down the drain and the
 result was indeterminate.  My friend has made a ballistic machine which
 tosses the coin so that the way it lands depends which way it was placed
 on the machine before tossing.

We use the statistical methods and the parameter choices in similarity
searching to do precisely this, i.e. to make up for the lack of time to
get the experimental evidence about biologically relevant parameters.
No one has the time or the necessary expertise in all 20,000 sequences
in the nucleic acid databanks.
	
 How much more complex then are the conditions under which DNA evolves.
 Trying to improve our knowledge about that for specific gene families
 would be a good thing to attempt.  A completely general model is
 too broad and naive to be useful, I suspect.
	
I am not sure what "a completely general model" refers to here, but if you
mean a completely general model of statistical similarity of genetic 
sequences: Yes, it would be naive, but not "too broad".  It would lack
the biological knowledge, which is what the "too broad" probably refers
to.  The quantification of knowledge is a risky business.  In this context,
biologists are not going to be unemployed for a long, long time.

Given the concerns we have both stated, can you imagine how much fun it is
going to be to have complete "real" (mycoplasma & up) genomes to analyze?

dan davison / theoretical biology / los alamos national laboratory