[bionet.molbio.proteins] reply to CLAVERIE

CMATHEWS.KRAMER@BIONET-20.BIO.NET (Jack Kramer) (07/04/88)
Your reply covered several points so I will address them separately.  


>	Another type of multidimensional sequence representation and objective
>	analysis has been found convenient by us when you dont know what to search
>	for, and where, in a sequence. This is a combination of discriminant method,
>	k-tuple representation and multidimensional representation.

>	Bougueleret et al. (1988) Nucl. Acids Res. 16, 1729-1738
>	Claverie & Bougueleret (1986) Nucl. Acids Res. 14, 179-196.


Thanks for reminding me of your papers.  They represent one type of initial
approach to what I call the semantic analysis of NA sequences.  One thing I
might add which is not always made clear in this type of study is that any
particular stretch of DNA probably has many superimposed functions and this
must be taken into account.  In this particular case properties of tuples
were used to distinguish coding and non coding regions.  These same regions
whether coding or not also must contain information that determines the
structure of the DNA itself; for example, nucleosome binding preference
information may be independent functionally from the coding requirement.
Also any of the structure requirements of the intermediate RNA must be
simultaneously coded in the DNA template.  Others that come readily to
mind are the structural requirements for the functioning of the DNA as a
chromosome, which include replication, mitosis and meiosis, etc.  Stretching
this concept even further I believe that lack of consideration of all of
the different types of phenotypic information beyond protein coding contained
in the DNA is what allows the apparent "selective neutrality" hypotheses to
exist.  It is also probably the major reason for the failure of most
(maybe all at present) simple attempts at pattern recognition.  I believe that
the maintenance of the degeneracy of the genetic code, which is very expensive
to the cell or organism on any energy budget, throughout the eons of evolution
is directly attributable to the phenotypic requirements of the DNA molecule 
itself.  By this I include all of the interactions of the DNA molecule for 
reasons other than the direct coding of protein.
I cannot believe that all of these simultaneous requirements are serially
separated along any sequence; rather, I believe they are superimposed and have
been optimized and balanced over the course of evolution of both prokaryotic
and eukaryotic genomes.  It is precisely this class of problem that is
best addressed by multivariate methods.  If nothing else the fact that
more variables are needed can show up more readily.
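
As a minimal sketch of what a k-tuple representation can look like (this is
not the procedure of the papers cited above, only an illustration), each
window of a nucleotide sequence can be turned into a vector of k-tuple
frequencies, which is then available to any multivariate method.  The names
ktuple_vector and windows, and the choice k=3, are assumptions made for the
example.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"

def ktuple_vector(seq, k=3):
    """Return overlapping k-tuple frequencies of seq as a fixed-order list.

    Tuples containing symbols outside ACGT (e.g. N) are ignored.
    """
    seq = seq.upper()
    tuples = (seq[i:i + k] for i in range(len(seq) - k + 1))
    counts = Counter(t for t in tuples if set(t) <= set(ALPHABET))
    total = sum(counts.values()) or 1  # avoid division by zero on short input
    # Fixed ordering over all 4**k tuples so that windows are comparable.
    return [counts["".join(t)] / total for t in product(ALPHABET, repeat=k)]

def windows(seq, size, step):
    """Yield (start, subsequence) for sliding windows over seq."""
    for start in range(0, len(seq) - size + 1, step):
        yield start, seq[start:start + size]

if __name__ == "__main__":
    dna = "ATGGCGTACGTTAGCATGGCGTACGTTAGC"
    for start, win in windows(dna, size=12, step=6):
        vec = ktuple_vector(win, k=2)
        print(start, [round(x, 2) for x in vec])
```

Each window becomes one point in a 4**k dimensional space.  Superimposed
signals of the kind discussed above would show up as structure that a single
discriminant axis cannot explain.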



>	As for the debate on the need for a vectorial representation of a.a.: I feel
>	it misses the point, because the main problem we are facing in sequence
>	analysis is that the information is not STRICTLY linear, but only
>	approximatively. Here are all the situations in real life:

>	- at some precise position a given aa is needed (ex: catalysis)
>	- around a position a given aa is needed (ex: the "conserved K in prot. kinases)

>	- at some precise position everything BUT a given aa is needed (RAS activation)
>	- at some precise position a given type (+,-,aromat) of aa is needed (binding)
>	- around a position a given aa is needed (C-C link)
>	- around a position a vague type of aa is needed (secondary structure conserv.)

The reasons you give here are exactly why I think the vector representation
IS necessary.  Each amino acid is a real molecule with many chemical and
physical properties.  What complicates the issues you address is the 
overlap of these properties among the residues.  At any position, for the
functions you cite and many more (probably all), the protein doesn't know about
our reasons for using single letters to represent the residue at that
position; it only knows that some property is needed.  The protein through
evolution has selected some combination of these properties with differing
emphasis on the individual properties at each position.  If only strong
basicity is needed then lysine or arginine will be ok,  but if the
excluded volume of the side chain is also important because it interferes
with tertiary structure when the volumes of this and some other
residue side chains are summed, then one or the other may be required depending
on the particular combinations of properties in other sequence elements,
etc., etc., throughout the protein.  Multivariate statistical and algebraic
methods were developed to handle cases just such as this.  They are available,
they work, they are easy to use and understand, and they very readily
adapt to serial processing on today's available computers.  Even more
importantly, however, the advent of parallel processing machines will
alleviate many of the space-time restrictions inherent in the matrix
solutions on serial processors.
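
The lysine/arginine case can be sketched as follows.  Each residue is a
small property vector, and each position carries its own weights saying
which properties matter there.  The three properties chosen, the numeric
values (approximate charge near neutral pH, Kyte-Doolittle hydropathy, and
rough residue volumes in cubic angstroms) and the name weighted_distance are
assumptions for illustration only, not a recommended parameter set.

```python
import math

# (charge, hydropathy, volume); illustrative values, see the note above.
PROPS = {
    "K": (1.0, -3.9, 168.6),   # lysine
    "R": (1.0, -4.5, 173.4),   # arginine
    "D": (-1.0, -3.5, 111.1),  # aspartate
    "L": (0.0, 3.8, 166.7),    # leucine
}

def weighted_distance(a, b, weights):
    """Weighted Euclidean distance between the property vectors of a and b.

    weights expresses how strongly each property is constrained at one
    position; a weight of 0 means the property is irrelevant there.
    """
    return math.sqrt(sum(w * (x - y) ** 2
                         for w, x, y in zip(weights, PROPS[a], PROPS[b])))

if __name__ == "__main__":
    only_charge = (1.0, 0.0, 0.0)
    charge_and_volume = (1.0, 0.0, 1.0)
    # Only basicity matters: K and R are interchangeable, D is not.
    print(weighted_distance("K", "R", only_charge))
    print(weighted_distance("K", "D", only_charge))
    # Volume also matters: K and R are now distinguishable.
    print(weighted_distance("K", "R", charge_and_volume))
```

A single-letter (scalar) code can only say that K and R are the same or
different; the weighted vector form lets the answer depend on the position.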

(This all assumes that every sequence element has some selective
effect, i.e. that the appearance of neutrality is due to a lack
of sufficiently deep knowledge.  This is another entire subject
which could be discussed on this board.)

>	With one more difficulty: we dont know where is that "position" ...
 
 The same multivariate methods allow for the horizontal as well as
 vertical (and almost every angle in between via linear and time warping 
 type transformations) processing of vector coefficients.  We just don't usually
 think that way.  (Maybe we do think that way in a parallel fashion but we
 are severely hindered by the necessity to serialize the thoughts for written
 and verbal communication.)  I agree that this is a difficulty, but only
 a technical one, and neither a theoretical nor operational impossibility.
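
One reading of "time warping" here is the dynamic time warping used in
speech recognition, applied to sequences of property vectors instead of
acoustic frames.  The sketch below is a textbook form of that algorithm, not
a method taken from this discussion; the function name dtw and the
one-dimensional toy vectors are assumptions for the example.

```python
import math

def dtw(a, b):
    """Dynamic time warping cost between two sequences of vectors.

    a and b are lists of equal-length numeric tuples.  A position in one
    sequence may be matched to several consecutive positions in the other,
    so a feature need not sit at the same index in both.
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])  # local vector distance
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]

if __name__ == "__main__":
    x = [(0.0,), (1.0,), (2.0,)]
    y = [(0.0,), (1.0,), (1.0,), (2.0,)]  # same shape, one element repeated
    print(dtw(x, y))
```

The two toy sequences differ in length and in where the middle value falls,
yet the warped comparison treats them as the same profile.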

>	It is clearly illusory to think that a new computer algorithm might solve
>	all these problems at once, given the evolutionary noise stored in all
>	these sequence which, dont forget, have all evolved from very few ancestors
>	(may be only able to fold in a somewhat compact shape to resist hydrolysis).

The "noise" you introduce here is an excellent example of the lack of 
consideration of a possible meaning for results which don't fit the
theory which is applied at the time.  Again I believe that what you call
noise does have some function which we just don't understand yet.  This is what
I was trying to convey above.

>	For instance, how to instruct a multialignment program, very proud to have
>	located a conserved KKKK, or LLLL motif, that it has almost no biological
>	signification whatsoever, in MOST of the case. At the present, most of
>	the alignment program will focus on those, and mis the more subtil, isolated
>	and approximate, but functionaly relevant truly homologous positions.

If I restricted all attempts to analyse sequence data to the algorithms now 
in vogue in molecular biology I would probably agree.  Since very little of the 
reservoir of mathematics and computer science has yet entered this arena I am
much more hopeful and optimistic.  Therefore I, and I assume many others, will
continue to attack these very interesting problems, trying new and possibly
unorthodox methods, even ones heretical to the "good old boys".
Most probably will fail, but then ......


>	Matrices, "perceptron" (a sexy name for the former) algorithms will fail
>	lamentably anytime that a strict positional constraint is not to be respected.
 
Twenty years ago I would have agreed with this, but times have changed.
Matrices are one of the algebraic structures which are used in the perceptron
METHOD.  In their very simplest form, perceptrons were shown not to be able
to solve some very easy problems.  Since this proof was published by one of
the established giants of the field, extensions of the concept were
suppressed for many years.  New connectionist models have overcome
many of the previous restrictions of single level perceptrons and have
ushered in an explosion of new development and enthusiasm.  If what you say
of perceptrons is true then many researchers' time and a great deal of money
are being wasted on other pattern recognition problems such as visual and
speech perception.  Perhaps these workers are just imagining all the 
results they are achieving in these fields and we should avoid wasting
our time trying to extend the techniques to other signal analysis and
pattern recognition analogs such as molecular sequence interpretation.
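
The standard textbook example of an easy problem that a single level
perceptron cannot solve is exclusive-or (XOR), and it makes a compact
illustration of the point.  In the sketch below, the training loop is the
classical perceptron rule, and the two-layer network uses hand-chosen
weights rather than learned ones; all function names are assumptions made
for the example.

```python
def step(x):
    """Threshold unit: 1 if the input is positive, else 0."""
    return 1 if x > 0 else 0

XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def train_perceptron(data, epochs=100, lr=1.0):
    """Classical single-layer perceptron learning rule; returns (w, b)."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (x1, x2), target in data:
            err = target - step(w[0] * x1 + w[1] * x2 + b)
            w[0] += lr * err * x1
            w[1] += lr * err * x2
            b += lr * err
    return w, b

def single_layer(w, b):
    """Build a predictor from trained single-layer weights."""
    return lambda x: step(w[0] * x[0] + w[1] * x[1] + b)

def two_layer(x):
    """Two hidden threshold units (OR, AND) feeding one output unit."""
    x1, x2 = x
    h_or = step(x1 + x2 - 0.5)
    h_and = step(x1 + x2 - 1.5)
    return step(h_or - h_and - 0.5)  # OR but not AND, i.e. XOR

def n_correct(predict, data):
    """Number of cases in data that predict classifies correctly."""
    return sum(predict(x) == t for x, t in data)

if __name__ == "__main__":
    w, b = train_perceptron(XOR)
    print("single layer correct:", n_correct(single_layer(w, b), XOR), "of 4")
    print("two layer correct:", n_correct(two_layer, XOR), "of 4")
```

No choice of weights lets the single unit get all four cases right, because
the two classes cannot be separated by one line; one hidden layer removes
that restriction.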

>	Representing a.a by vectors will not alleviate this difficulty inherent to
>	the fundamental fact that proteins are 3-D objects which only Nature (I wonder
>	if even God knows how ...) knows how to fold from an evolutionary-slopy
>	1-D aa sequence.

I have no evidence on "Gods" so I'll leave that out of my discussion.
Methods to predict higher order structure of macromolecules from a
sequence of scalar representations of the monomers are a priori destined
to failure due to neglect of the real physical and chemical properties
of the monomers.  Even more important here is the degeneracy between
different amino acids, which cannot be accounted for without a vector
representation.  Also I don't think we should automatically attribute
sloppiness to a process just because we don't understand all the ramifications.


>	.. and please, dont talk about A.I. (Absolutely Incompetent) programming.

I think time will provide the only answer to this comment.  Tradition and 
dogma die very slowly.

Jack Kramer
kramerj@ucs.orst.edu
cmathews.kramer@bionet-20.arpa
