CMATHEWS.KRAMER@BIONET-20.BIO.NET (Jack Kramer) (07/04/88)
Your reply covered several points, so I will address them separately.

> Another type of multidimensional sequence representation and objective
> analysis has been found convenient by us when you dont know what to search
> for, and where, in a sequence. This is a combination of discriminant method,
> k-tuple representation and multidimensional representation.
> Bougueleret et al. (1988) Nucl. Acids Res. 16, 1729-1738
> Claverie & Bougueleret (1986) Nucl. Acids Res. 14, 179-196.

Thanks for reminding me of your papers. They represent one type of initial approach to what I call the semantic analysis of NA sequences. One thing I might add, which is not always made clear in this type of study, is that any particular stretch of DNA probably has many superimposed functions, and this must be taken into account. In this particular case, properties of tuples were used to distinguish coding and non-coding regions. These same regions, whether coding or not, must also contain information that determines the structure of the DNA itself; for example, nucleosome binding preference information may be functionally independent of the coding requirement. Also, any structural requirements of the intermediate RNA must be simultaneously coded in the DNA template. Others that come readily to mind are the structural requirements for the functioning of the DNA as a chromosome, which include replication, mitosis, meiosis, etc.

Stretching this concept even further, I believe that lack of consideration of all the different types of phenotypic information contained in the DNA beyond protein coding is what allows the apparent "selective neutrality" hypotheses to exist. It is also probably the major reason for the failure of most (maybe all at present) simple attempts at pattern recognition.
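
For readers unfamiliar with the k-tuple representation mentioned above, here is a minimal Python sketch of the general idea: a sequence is turned into a vector of overlapping k-tuple frequencies, which can then be fed to a discriminant or other multivariate method. This is a generic illustration, not the procedure of the cited papers; the function name `ktuple_frequencies` and its parameters are assumptions made for the example.

```python
from collections import Counter
from itertools import product


def ktuple_frequencies(sequence, k=2, alphabet="ACGT"):
    """Return the relative frequency of every possible k-tuple.

    Hypothetical helper for illustration. The result is a list with
    len(alphabet) ** k entries, ordered as itertools.product orders
    the tuples (AA, AC, AG, AT, CA, ... for k=2).
    """
    seq = sequence.upper()
    # Count overlapping windows of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    tuples = ["".join(p) for p in product(alphabet, repeat=k)]
    # Only windows made entirely of alphabet symbols are counted.
    total = sum(counts[t] for t in tuples)
    if total == 0:
        return [0.0] * len(tuples)
    return [counts[t] / total for t in tuples]


if __name__ == "__main__":
    vector = ktuple_frequencies("ATGGCGTACGCTTAG", k=2)
    print(len(vector), "dimensions")
```

Each sequence window becomes one point in a 4^k-dimensional space, so several superimposed signals can in principle contribute to the same vector, which is the situation described above.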

I believe that the maintenance of the degeneracy of the genetic code throughout the eons of evolution, which is very expensive to the cell or organism on any energy budget, is directly attributable to the phenotypic requirements of the DNA molecule itself. By this I mean all of the interactions of the DNA molecule for reasons other than the direct coding of protein. I cannot believe that all of these simultaneous requirements are serially separated along any sequence; rather, they are superimposed and have been optimized and balanced over the course of evolution of both prokaryotic and eukaryotic genomes. It is precisely this class of problem that is best addressed by multivariate methods. If nothing else, the fact that more variables are needed can show up more readily.

> As for the debate on the need for a vectorial representation of a.a.: I feel
> it misses the point, because the main problem we are facing in sequence
> analysis is that the information is not STRICTLY linear, but only
> approximatively. Here are all the situations in real life:
> - at some precise position a given aa is needed (ex: catalysis)
> - around a position a given aa is needed (ex: the "conserved K in prot. kinases)
> - at some precise position everything BUT a given aa is needed (RAS activation)
> - at some precise position a given type (+,-,aromat) of aa is needed (binding)
> - around a position a given aa is needed (C-C link)
> - around a position a vague type of aa is needed (secondary structure conserv.)

The reasons you give here are exactly why I think the vector representation IS necessary. Each amino acid is a real molecule with many chemical and physical properties. What complicates the issues you address is the overlap of these properties among the residues.
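
The overlap of properties among residues can be made concrete with a small Python sketch. Each residue is a vector of a few properties, and a per-position weight vector expresses which properties matter there. The property table, the choice of three properties, and the names `PROPERTIES` and `weighted_distance` are all assumptions for illustration; the numbers are approximate values (hydropathy, residue volume in cubic angstroms, formal charge near neutral pH) and should not be read as a reference table.

```python
import math

# Hypothetical, approximate property vectors:
# (hydropathy, residue volume in cubic angstroms, formal charge near pH 7)
PROPERTIES = {
    "K": (-3.9, 168.6, 1.0),   # lysine
    "R": (-4.5, 173.4, 1.0),   # arginine
    "D": (-3.5, 111.1, -1.0),  # aspartate
    "L": (3.8, 166.7, 0.0),    # leucine
    "G": (-0.4, 60.1, 0.0),    # glycine
}

# Example weightings: which property a given position "cares" about.
CHARGE_ONLY = (0.0, 0.0, 1.0)
VOLUME_ONLY = (0.0, 1.0, 0.0)


def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between two residues' property vectors."""
    va, vb = PROPERTIES[a], PROPERTIES[b]
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, va, vb)))


if __name__ == "__main__":
    for a, b in [("K", "R"), ("K", "D"), ("K", "L")]:
        print(a, b,
              "charge-only:", weighted_distance(a, b, CHARGE_ONLY),
              "volume-only:", weighted_distance(a, b, VOLUME_ONLY))
```

Under the charge-only weighting, K and R are indistinguishable; under the volume-only weighting, K sits closer to L than to R. A single-letter (scalar) code cannot express both relationships at once, whereas a vector with position-specific weights can.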
At any position, for the functions you cite and many more (probably all), the protein doesn't know about our reasons for using single letters to represent the residue; it only knows that some property is needed. The protein, through evolution, has selected some combination of these properties, with differing emphasis on the individual properties at each position. If only strong basicity is needed, then lysine or arginine will be OK. But if the excluded volume of the side chain is also important, because it interferes with tertiary structure when the volumes of this and some other residue side chains are summed, then one or the other may be required, depending on the particular combinations of properties in other sequence elements, and so on throughout the protein.

Multivariate statistical and algebraic methods were developed to handle cases just such as this. They are available, they work, they are easy to use and understand, and they adapt very readily to serial processing on today's available computers. Even more importantly, the advent of parallel processing machines will alleviate many of the space-time restrictions inherent in the matrix solutions on serial processors. (This all assumes that every sequence element has some selective effect, i.e. that the appearance of neutrality is due to lack of knowledge to a sufficient depth. This is another entire subject which could be discussed on this board.)

> With one more difficulty: we dont know where is that "position" ...

The same multivariate methods allow for the horizontal as well as vertical (and almost every angle in between, via linear and time-warping type transformations) processing of vector coefficients. We just don't usually think that way.
(Maybe we do think that way in a parallel fashion, but we are severely hindered by the necessity to serialize the thoughts for written and verbal communication.) I agree that this is a difficulty, but only a technical one, and neither a theoretical nor an operational impossibility.

> It is clearly illusory to think that a new computer algorithm might solve
> all these problems at once, given the evolutionary noise stored in all
> these sequence which, dont forget, have all evolved from very few ancestors
> (may be only able to fold in a somewhat compact shape to resist hydrolysis).

The "noise" you introduce here is an excellent example of failing to consider a possible meaning for results that don't fit the theory applied at the time. Again, I believe that what you call noise does have some function which we just don't understand yet. This is what I was trying to convey above.

> For instance, how to instruct a multialignment program, very proud to have
> located a conserved KKKK, or LLLL motif, that it has almost no biological
> signification whatsoever, in MOST of the case. At the present, most of
> the alignment program will focus on those, and mis the more subtil, isolated
> and approximate, but functionaly relevant truly homologous positions.

If I restricted all attempts to analyse sequence data to the algorithms now in vogue in molecular biology, I would probably agree. Since very little of the reservoir of mathematics and computer science has yet entered this arena, I am much more hopeful and optimistic. Therefore I, and I assume many others, will continue to attack these very interesting problems, trying new and possibly unorthodox methods, even ones heretical to the "good old boys". Most probably will fail, but then ......

> Matrices, "perceptron" (a sexy name for the former) algorithms will fail
> lamentably anytime that a strict positional constraint is not to be respected.

Twenty years ago I would have agreed with this, but times have changed.
Matrices are one of the algebraic structures which are used in the perceptron METHOD. In their very simplest form, perceptrons were shown not to be able to solve some very easy problems. Since this proof was published by one of the established giants of the field, extensions of the concept were suppressed for many years. New connectionist models have overcome many of the previous restrictions of single-level perceptrons and have ushered in an explosion of new development and enthusiasm. If what you say of perceptrons is true, then many researchers' time and a great deal of money are being wasted on other pattern recognition problems such as visual and speech perception. Perhaps these workers are just imagining all the results they are achieving in these fields, and we should avoid wasting our time trying to extend the techniques to other signal analysis and pattern recognition analogs such as molecular sequence interpretation.

> Representing a.a by vectors will not alleviate this difficulty inherent to
> the fundamental fact that proteins are 3-D objects which only Nature (I wonder
> if even God knows how ...) knows how to fold from an evolutionary-slopy
> 1-D aa sequence.

I have no evidence on "Gods", so I'll leave that out of my discussion. Methods to predict higher-order structure of macromolecules from a sequence of scalar representations of the monomers are a priori destined to fail, because they neglect the real physical and chemical properties of the monomers. Even more important here is the degeneracy between different amino acids, which cannot be accounted for without a vector representation. Also, I don't think we should automatically attribute sloppiness to a process just because we don't understand all the ramifications.

> .. and please, dont talk about A.I. (Absolutely Incompetent) programming.

I think time will provide the only answer to this comment. Tradition and dogma die very slowly.

Jack Kramer
kramerj@ucs.orst.edu
cmathews.kramer@bionet-20.arpa
-------