MJB1@PHX.CAM.AC.UK (04/06/88)
From: MJB1%PHX.CAM.AC.UK@CUNYVM.CUNY.EDU

Surely the point here is that there are an infinite number of statistical models of sequence similarity. There is no problem in assigning significance under a particular model, though there may well be a problem in assessing its biological relevance.

I think the questions being asked should be:

(1) What is a good model for the similarity of molecular sequences?
(2) How can one assess the biological relevance of statistical significance in relation to a particular model?

Put in this way, one soon realises that the original problem has been framed in too broad a way. What are the conditions relating to the comparison? Surely not just that we have sequenced two bits of DNA and want to know how similar they are (though it could be that, if you insist). People should worry more about the conditions relating to the particular problem and try to get experimental evidence about biologically relevant parameters.

To emphasise the point about conditions, consider the old coin-tossing problem. We all know that a coin comes up heads half the time and tails half the time. But does it... the coin rolled down the drain and the result was indeterminate. My friend has made a ballistic machine which tosses the coin so that the way it lands depends on which way it was placed on the machine before tossing.

How much more complex, then, are the conditions under which DNA evolves. Trying to improve our knowledge about that for specific gene families would be a good thing to attempt. A completely general model is too broad and naive to be useful, I suspect.
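The claim that significance is easy to assign once a model is fixed, but depends on which model is chosen, can be illustrated with a small sketch. Everything here is an assumption made for illustration: the similarity score (count of identical positions, no gaps), the two null models, the toy sequences, and all function names. It is not a method proposed in this thread.

```python
import random

def identity_score(a, b):
    """Count positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b))

def empirical_p_value(a, b, null_sampler, trials=2000, seed=0):
    """Fraction of null draws that score at least as high as the observed pair."""
    rng = random.Random(seed)
    observed = identity_score(a, b)
    hits = sum(identity_score(a, null_sampler(b, rng)) >= observed
               for _ in range(trials))
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (trials + 1)

def shuffled(b, rng):
    """Null model 1 (assumed): same base composition as b, order randomised."""
    chars = list(b)
    rng.shuffle(chars)
    return "".join(chars)

def uniform(b, rng):
    """Null model 2 (assumed): each base drawn independently and uniformly."""
    return "".join(rng.choice("ACGT") for _ in b)

if __name__ == "__main__":
    # Two hypothetical A-rich sequences that agree at 16 of 20 positions.
    a = "A" * 8 + "T" + "A" * 6 + "G" + "A" * 4
    b = "A" * 5 + "T" + "A" * 6 + "C" + "A" * 7
    print("shuffle model:", empirical_p_value(a, b, shuffled))
    print("uniform model:", empirical_p_value(a, b, uniform))
```

For this pair, every shuffle of the second sequence scores at least as well as the observed alignment, so the composition-preserving model calls the match unremarkable, while the uniform model calls the same match highly significant. Neither answer says anything about biological relevance.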
dbd%benden@LANL.GOV (04/08/88)
From: dbd%benden@LANL.GOV (Dan Davison)

> Surely the point here is that there are an infinite number of statistical models of sequence similarity.

Yes.

> There is no problem in assigning significance under a particular model, though there may well be a problem in assessing its biological relevance.

One of the most common mistakes I encounter among molecular biologists who are looking at search results is "This result isn't statistically significant, so I can ignore it". Enhancer and TATA boxes are examples of statistically insignificant matches that are biologically significant. In these cases there is an additional element--position--that determines the biological significance. Only the biologist (or an AI tool) can do such analysis.

> I think the questions being asked should be (1) What is a good model for the similarity of molecular sequences.

How about "what is a good model for assessing the similarity of molecular sequences"? I think this is what you mean.

> (2) How can one assess the biological relevance of statistical significance in relation to a particular model. Put in this way, one soon realises that the original problem has been framed in too broad a way. What are the conditions relating to the comparison, surely not just that we have sequenced two bits of DNA and want to know how similar they are (though it could be that if you insist).

As you can tell from my remarks above, I agree with this statement, with a caveat. I have next to my desk a printout 14 inches thick, in 8-point type. It is the (accidental) result of specifying too low a similarity criterion to a library search routine. Suppose that this search had been an enhancer core against all of GenBank. Every bit of that printout would be potentially biologically significant. However, it would take a month (or more) of effort to check the biological significance of each result. We must have ways of sifting through the incredible amount of output that will be generated by similarity comparisons.
The best method at the moment is to use statistical significance. The quality of the statistical model used will determine how much of the search space is *productively* examined. This certainly cuts out much information that is of biological significance, but *at present there is no automated way of assessing biological significance*.

> People should worry more about the conditions relating to the particular problem and try to get experimental evidence about biologically relevant parameters. To emphasise the point about conditions consider the old coin tossing problem. We all know that we come up heads half the time and tails half the time. But do we... the coin rolled down the drain and the result was indeterminate. My friend has made a ballistic machine which tosses the coin so that the way it lands depends which way it was placed on the machine before tossing.

We use the statistical methods and the parameter choices in similarity searching to do precisely this, i.e. to make up for the lack of time to get the experimental evidence about biologically relevant parameters. No one has the time or the necessary expertise in all 20,000 sequences in the nucleic acid databanks.

> How much more complex then are the conditions under which DNA evolves. Trying to improve our knowledge about that for specific gene families would be a good thing to attempt. A completely general model is too broad and naive to be useful, I suspect.

I'm not sure what "a completely general model" refers to here, but if you mean a completely general model of statistical similarity of genetic sequences: yes, it would be naive, but not "too broad". It would lack the biological knowledge, which is probably what "too broad" refers to. The quantification of knowledge is a risky business. In this context, biologists are not going to be unemployed for a long, long time.
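A back-of-the-envelope sketch shows why a short signal such as a TATA box cannot stand out statistically in a large library search. It assumes independent, equally likely bases and a hypothetical mean sequence length of 1,000 bases for the 20,000 sequences mentioned above. Both are assumptions for illustration, as are the function names.

```python
def expected_chance_matches(motif_length, n_positions, alphabet_size=4):
    """Expected exact matches of one fixed motif among n_positions start sites.

    Assumes independent, equally likely bases, which real DNA does not obey.
    """
    return n_positions / alphabet_size ** motif_length

def shortest_rare_motif(n_positions, threshold=0.05, alphabet_size=4):
    """Smallest motif length whose expected chance matches fall below threshold."""
    k = 1
    while expected_chance_matches(k, n_positions, alphabet_size) >= threshold:
        k += 1
    return k

if __name__ == "__main__":
    # Hypothetical library size: 20,000 sequences x 1,000 bases (assumed mean).
    n = 20_000 * 1_000
    print(expected_chance_matches(6, n))   # a 6-base motif
    print(shortest_rare_motif(n))          # length needed for < 0.05 expected hits
```

Under these assumptions a 6-base motif is expected several thousand times by chance alone, and an exact match has to be about 15 bases long before fewer than 0.05 chance occurrences are expected. Position relative to a gene, which the model ignores, is what makes the short match meaningful.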
Given the concerns we have both stated, can you imagine how much fun it is going to be to have complete "real" (mycoplasma & up) genomes to analyze?

dan davison / theoretical biology / los alamos national laboratory