[bionet.molbio.evolution] multivariate sequence analysis reply

CMATHEWS.KRAMER@BIONET-20.ARPA (05/14/88)
From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>

Joe,


I agree with almost everything you said, so I must not have made myself clear in the message to Dan.  I will try to answer your response paragraph by paragraph.  All statements are personal beliefs and are thrown out to maybe help liven up this board.


"David Sankoff did two things long ago that are relevant to your
inquiry.  One was to show that we can incorporate an arbitrary
distance D(i,j) between pairs of amino acids into sequence alignment
algorithms -- we don't just have to score them as the same or
different.  This distance could in principle take into account any
multidimensional data you wanted it to.  The reference -- not terribly
readable -- is Sankoff and Rousseau, Mathematical Programming, 9:240-246
1975.  A related paper is Sankoff, SIAM Journal of Applied Mathematics,
28: 35-42  1975."
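
For readers who want the idea in concrete form, here is a minimal sketch of an alignment distance that accepts an arbitrary pairwise cost D(i,j) in place of same/different scoring.  It is ordinary dynamic programming, not a reconstruction of the algorithm in the cited papers.  The names edit_distance, unit_cost, toy_cost and TOY_SCALE, the flat gap penalty, and the numbers in the table are all assumptions made up for illustration.

```python
def edit_distance(a, b, sub_cost, gap_cost=1.0):
    """Global alignment distance between sequences a and b.

    sub_cost(x, y) may be any pairwise distance D(i, j) between residues.
    gap_cost is a flat per-residue insertion/deletion penalty (an assumption).
    """
    rows, cols = len(a) + 1, len(b) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * gap_cost          # delete the first i residues of a
    for j in range(1, cols):
        dist[0][j] = j * gap_cost          # insert the first j residues of b
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + gap_cost,                          # deletion
                dist[i][j - 1] + gap_cost,                          # insertion
                dist[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]),  # substitution
            )
    return dist[-1][-1]


def unit_cost(x, y):
    """Same/different scoring: reduces edit_distance to the plain edit distance."""
    return 0.0 if x == y else 1.0


# Invented one-dimensional property values, NOT a published scale.
TOY_SCALE = {"A": 0.5, "V": 0.9, "L": 1.0, "D": -1.0, "E": -0.9, "K": -0.8}


def toy_cost(x, y):
    """Hypothetical graded distance: difference in the invented property."""
    return abs(TOY_SCALE[x] - TOY_SCALE[y])


if __name__ == "__main__":
    # L->V and L->D look identical under unit_cost but differ under toy_cost.
    for pair in (("AL", "AV"), ("AL", "AD")):
        print(pair, edit_distance(*pair, unit_cost), edit_distance(*pair, toy_cost))
```

With unit_cost the two substitutions are indistinguishable; with a graded cost the conservative one becomes cheap and the drastic one expensive, which is the whole point of allowing a general D(i,j).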

I believe this is true if one is only interested in applying the distance directly to problems such as phylogeny inference by distance matrix methods.  In other cases, such as searching for analogous subsequences that represent a biologically meaningful semantic pattern (a promoter, an active site, etc.), this may not be what is wanted.  If the goal is to apply a perceptron learning algorithm to find a weighting matrix which can discriminate between groups of ambiguous consensus subsequences (it is one of mine), then using only the magnitude of a property vector throws away some very important information.  All of the information in the individual coordinates seems to me to be lost when only the magnitude of the vector is used.  For example, in the 3-D case a sphere of a given radius is the degenerate representation of the whole set of individual 3-tuples that lie on it.  At each position in the sequence it may be a different subset of coordinates which defines the uniqueness of that position.  And, as is the case for amino acids (and nucleotide WORDS), the overlap of properties of monomers requires the separate weighting of property coordinates at each sequence position.  The scalar D(i,j,k,...) is not the same as the vector, and I do not think that this is always kept clear when computing sequence distances.  I would extend this even further and propose that the Levenshtein string edit distance is not the best representation of sequence distance unless higher order semantics of the strings beyond the alphabet are somehow taken into account.
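
A small sketch may make the vector-versus-magnitude point concrete.  It assumes two hypothetical monomers "x" and "y" whose 3-coordinate property vectors are invented so that they have the same length, and it uses the textbook perceptron update rule; the names PROPS, magnitude, weighted_score and train_perceptron are made up for this illustration and do not describe any existing program.

```python
import math

# Invented property vectors for two hypothetical monomers.
# They lie on the same sphere (length 5) but differ in coordinates 0 and 2.
PROPS = {"x": (3.0, 4.0, 0.0), "y": (0.0, 4.0, 3.0)}


def magnitude(v):
    """Euclidean length of a property vector (the scalar that discards direction)."""
    return math.sqrt(sum(c * c for c in v))


def weighted_score(seq, weights):
    """Sum over positions of (weight vector at that position) . (property vector)."""
    return sum(
        sum(w * p for w, p in zip(weights[pos], PROPS[m]))
        for pos, m in enumerate(seq)
    )


def train_perceptron(examples, n_pos, n_coord, epochs=20, rate=1.0):
    """Textbook perceptron rule; examples are (sequence, label) with label +1 or -1."""
    weights = [[0.0] * n_coord for _ in range(n_pos)]
    bias = 0.0
    for _ in range(epochs):
        for seq, label in examples:
            if label * (weighted_score(seq, weights) + bias) <= 0:  # misclassified
                for pos, m in enumerate(seq):
                    for k, p in enumerate(PROPS[m]):
                        weights[pos][k] += rate * label * p
                bias += rate * label
    return weights, bias


if __name__ == "__main__":
    # Per-position magnitudes are identical for "xy" and "yx" ...
    print([magnitude(PROPS[m]) for m in "xy"], [magnitude(PROPS[m]) for m in "yx"])
    # ... yet a separate weight per coordinate per position tells them apart.
    w, b = train_perceptron([("xy", +1), ("yx", -1)], n_pos=2, n_coord=3)
    print(w, b, weighted_score("xy", w) + b, weighted_score("yx", w) + b)
```

In this toy case any method that sees only the magnitudes cannot separate the two sequences, while the learned weights end up ignoring the coordinate the monomers share and using a different coordinate sign at each position.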

"The other thing he pointed out back then was that, contrary to your
assertion, one should NOT rigidly separate the stages of assessment
of homology and inference of phylogenies.  Sooner or later the people
who are so heavily into multiple sequence alignment will discover, to
their amazement, that he did the essential work on this in the mid '70s.
One should not align a set of sequences considering them symmetrically,
but one has to take account of the fact that they may be related, and
come to you in clusters.  I believe that Russell Doolittle has been
saying something similar recently, and I think David Lippman is too."


Here I agree again, but with another BUT.  The two should not be separated if the ultimate goal is detection of homology and its use to infer phylogeny.  I think that similarity is assessed, and homology and phylogeny are subsequently inferred.  If convergence occurs at the molecular level (I believe it must), then there is reason to do sequence analysis to search for analogy as well as homology.  I think the point I was trying to emphasize is that there are many ways to detect similarity and that the similarity can then be used to infer either homology or analogy.  In the case of analogy, using an inferred phylogeny may give misleading results because the possibility of convergence has been disregarded, intentionally or otherwise.  Application of combined similarity analysis and homology/phylogeny to known related families can be fairly safe.  But the extrapolation of the ideas and methods to cases where the probability of similarity due to convergence is real must be handled with much more caution.  From what I know of Doolittle's recent work I think he has stayed mostly within the safe bounds.  I am not as convinced of this in Lippman's case, where limited numbers of distantly related proteins with unknown affinities are analyzed.

I also think I detected a faint hint of a subliminal preference for a cladistic approach in your reference to symmetry, relatedness and clustering.  I am all for this approach, mostly because I think it helps explicitly identify problems in the construction of phylogenies which might otherwise fall through the cracks.  (I would add direction to relatedness and clustering.)


"There is a more recent chapter on this by Sankoff and Cedergren
in the Sankoff and Kruskal book on Time Warps, String Edits, and Macromolecules.

One has to carry out alignment at the same time as estimating the
phylogeny, finding that alignment and that phylogeny that optimize
whatever criterion you are using (such as likelihood or parsimony).
Only then can you see whether there is significant evidence for homology.

This of course ducks the issue of what is a good distance matrix D(i,j)
between amino acids, as well as all the algorithmic practicality
issues.  Therein lies much work."


Much of what I said above applies.  I might add that I am using my second copy of the Sankoff and Kruskal book, the first having literally disintegrated from use.  The book does have chapters which are more closely related to what I tried to convey to Dave; they are not in the molecular sequence sections, though, but in the chapters on continuous sequences, especially those concerned with the methodology of speech recognition.  This is the direction I am headed in my own investigations of molecular sequences.




Jack Kramer
Center for Gene Research
Oregon State University
cmathews.kramer@BIONET-20.ARPA

-------