[bionet.molbio.evolution] Statistical significance of "PERCENT" homology.

SYEH@BIONET-20.ARPA (04/01/88)

From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>

Winston,

	Percent similarity can be used to imply homology (CELL 50:667 Aug 87)
as well as analogy.  The statistical significance of sequence similarity
has been extensively analysed. A couple of good places to start looking
are the appropriate chapters in von Heijne's "Sequence Analysis in Molecular
Biology", reviews and articles in Computer Applications in the Biosciences
(e.g. 3:1 Mar 87) and the Application of Computers to Research on Nucleic
Acids I, II and III supplements to NAR.

Jack
-------

CMATHEWS.KRAMER@BIONET-20.ARPA (04/06/88)

From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>

Mail-From: GOAD.DAVISON created at  1-Apr-88 18:52:09
Date: Fri 1 Apr 88 18:52:09-PST
From: Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>
Subject: Re: Statistical significance of "PERCENT" homology.
To: CMATHEWS.KRAMER@BIONET-20.ARPA
In-Reply-To: <12386849069.45.CMATHEWS.KRAMER@BIONET-20.ARPA>
Message-ID: <12387121770.15.GOAD.DAVISON@BIONET-20.ARPA>



You might check out the method of Manske and Chapman, in the second issue
of the current volume of the J. of Molecular Evolution.  It's an interesting
attempt in the general direction you describe.  

Would it be OK to post your note to MOLECULAR-EVOLUTION?  I wouldn't mind
starting a fight or two.  I also think it's unlikely that *anything*
other than Dr. Who's Tardis would allow "the detection of evolutionary
relationships objectively".  Perhaps I'm too jaded from the SUNY
Stony Brook wars.

dan davison
-------
-------

GOAD.DAVISON@BIONET-20.ARPA (04/07/88)

From: Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>

Mail-From: CMATHEWS.KRAMER created at 31-Mar-88 17:54:09
Date: Thu 31 Mar 88 17:54:09-PST
From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>
Subject: Re: Statistical significance of "PERCENT" homology.
To: GOAD.DAVISON@BIONET-20.ARPA
In-Reply-To: <12386835730.28.GOAD.DAVISON@BIONET-20.ARPA>
Message-ID: <12386849069.45.CMATHEWS.KRAMER@BIONET-20.ARPA>

Dan,

	I strongly second your "not well understood".

	A significant reason for this, in my opinion, is the scalar
alphabet used as the basis for almost all studies.  I am very
interested in the assignment of multiple-coefficient attribute vectors
to the alphabet and "words" and the application of neural net semantic
pattern recognition AI techniques to this problem.  Some very
rudimentary work along these lines has been done by
Stormo and Wold et al.  The great flurry of current activity in
massively parallel fine-grained hardware and software architectures will
eventually percolate into the molecular biology arena.  Most of the
advances now being made along these lines in speech recognition will
directly apply.  Elucidation of the syntactic and semantic patterns in
biomacromolecules should be a direct fallout from this work and, if
combined with cladistic clustering, will finally allow the detection
of evolutionary relationships objectively.  (My opinions - and I love
to argue, especially about mol evol.)

	I think it would be very interesting to participate in electronic
debate a la Farris/Felsenstein through this medium.  Anybody out there
want to start stirring things up?

Jack Kramer
Center for Gene Research
Oregon State University
-------
-------

CMATHEWS.KRAMER@BIONET-20.ARPA (05/13/88)

From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>

Mail-From: CMATHEWS.KRAMER created at 13-May-88 09:24:18
Date: Fri 13 May 88 09:24:18-PDT
From: Jack Kramer  <CMATHEWS.KRAMER@BIONET-20.ARPA>
Subject: Re: [Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>: Re: Statistical significance of "PERCENT" homology.]
To: CMATHEWS.KRAMER@BIONET-20.ARPA
In-Reply-To: <12388310066.28.CMATHEWS.KRAMER@BIONET-20.ARPA>
Message-ID: <12398017522.38.CMATHEWS.KRAMER@BIONET-20.ARPA>

Dan,

	I finally had a chance to go back and review the Manske and
Chapman article.  As I thought, it is not along the lines of my interests.

	The idea I was trying to convey was that a single letter is a
very poor representation of an amino acid for the purposes of all but the
very simplest sequence analysis tasks.  Each amino acid is a real molecule
which can be described by a set of many chemical and physical
parameters.  Thus I feel that the best way to represent this is by using a
vector which is some linear combination of all these properties for each
sequence element at the primary level.  At higher levels groups of
contiguous vectors can be used to extend syntactic analysis.  Abstractions
of these which cluster with biological properties would then open the
door to semantic analysis.
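
As a minimal sketch of the vector idea above (not anyone's published
method), each residue can be mapped to a small tuple of properties in
place of a single letter.  The three properties chosen here
(Kyte-Doolittle hydropathy, a crude formal charge near neutral pH, and
approximate residue mass in daltons), the z-score scaling, and the names
PROPERTIES, VECTORS, encode and distance are all assumptions made for
illustration.

```python
from statistics import mean, pstdev

# (hydropathy, charge, mass) per residue.  Hydropathy is the Kyte-Doolittle
# scale; charge is a crude +1/0/-1 assignment (an assumption); masses are
# approximate average residue masses in Da.
PROPERTIES = {
    "A": (1.8, 0.0, 71.08),   "R": (-4.5, 1.0, 156.19),
    "N": (-3.5, 0.0, 114.10), "D": (-3.5, -1.0, 115.09),
    "C": (2.5, 0.0, 103.14),  "Q": (-3.5, 0.0, 128.13),
    "E": (-3.5, -1.0, 129.12), "G": (-0.4, 0.0, 57.05),
    "H": (-3.2, 0.0, 137.14), "I": (4.5, 0.0, 113.16),
    "L": (3.8, 0.0, 113.16),  "K": (-3.9, 1.0, 128.17),
    "M": (1.9, 0.0, 131.19),  "F": (2.8, 0.0, 147.18),
    "P": (-1.6, 0.0, 97.12),  "S": (-0.8, 0.0, 87.08),
    "T": (-0.7, 0.0, 101.10), "W": (-0.9, 0.0, 186.21),
    "Y": (-1.3, 0.0, 163.18), "V": (4.2, 0.0, 99.13),
}


def _standardize(table):
    """Rescale each property to zero mean and unit spread over the 20 residues,
    so that no single property (e.g. mass) dominates distances."""
    columns = list(zip(*table.values()))
    mus = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return {
        aa: tuple((x - m) / s for x, m, s in zip(vec, mus, sds))
        for aa, vec in table.items()
    }


VECTORS = _standardize(PROPERTIES)


def encode(sequence):
    """Turn a one-letter protein sequence into a list of property vectors."""
    return [VECTORS[aa] for aa in sequence.upper()]


def distance(u, v):
    """Euclidean distance between two property vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


if __name__ == "__main__":
    # Ile is far closer to Leu than to Asp in this property space, a
    # distinction that plain letter identity cannot express.
    print(distance(VECTORS["I"], VECTORS["L"]))
    print(distance(VECTORS["I"], VECTORS["D"]))
```

With letters alone, I/L and I/D are both simply mismatches; in the vector
form the first pair sits much closer together than the second.
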
	To make the idea more concrete, the concept is very similar to the
manual graphically assisted analysis performed by plotting several
parameters such as hydropathy, structural propensity predictors, mass, etc.
for the concerned sequences on the same scale and then visually comparing
the graphs to detect any pattern similarity.  The Modelevsky article in the
recent April edition of CABIOS presents another first-step approach.
DeLisi's paper describes some efforts to use perceptrons as automated
learning machines to extend the concept to the pattern analysis of the
biological semantics level.  Many papers have recently described initial
attempts at using multivariate statistical analysis of vector representations
of sequences.  I cite Gribskov, Kubota, and Sjostrom.  The Venn diagram
classification approach of Taylor could provide a basis for initial
weighting matrices for learning based on the primary sequence elements.
Another paper by the same Taylor (which I remember but don't have the
citation at hand) describes another extension to abstract secondary
structure domains and the semantic database type analyses that would be
possible.
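
The graphical comparison described above can be roughed out in a few
lines.  This is only a sketch under stated assumptions: one parameter
(Kyte-Doolittle hydropathy), a sliding-window average, and a Pearson
correlation standing in for the visual comparison of two plots.  The
window size, the function names, and the two toy sequences are
hypothetical, and the sequences are assumed to be already aligned and of
equal length.

```python
# Kyte-Doolittle hydropathy scale.
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_profile(sequence, scale, window=7):
    """Sliding-window mean of one per-residue parameter along a sequence."""
    values = [scale[aa] for aa in sequence.upper()]
    if window < 1 or window > len(values):
        raise ValueError("window must be between 1 and the sequence length")
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]


def pearson(xs, ys):
    """Pearson correlation between two equal-length profiles."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("profiles must have equal length of at least 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Two toy sequences with no identical hydrophobic residues in the same
    # order, but the same hydrophobic-then-charged pattern.
    seq_a = "ILVILVDEKDEK"
    seq_b = "LIVLIVEDKEDR"
    prof_a = window_profile(seq_a, KD, window=3)
    prof_b = window_profile(seq_b, KD, window=3)
    print(pearson(prof_a, prof_b))
```

Repeating this for several parameters at once (structural propensity,
mass, and so on) gives one profile per parameter, which is the
multivariate picture the plots are meant to convey.
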
	I believe the increasing availability of supercomputers and
vector and array processors to those working in these areas will make
these multivariate techniques the basis of the next generation of
sequence analysis software.  (I'm sure that could elicit some response.)

Modelevsky and Akers(1988) 3-D Multivariate data display tool as a protein
design aid.  CABIOS 4:2 308 April 1988

DeLisi(1988)  Computers in Molecular Biology.  Science 240:47-52 April 1988

Gribskov et al  Profile analysis: Detection of distantly related proteins.
PNAS 84:4355-4358   July 1987

Kubota et al  Correspondence of homologies in amino acid sequence and
tertiary structure of protein molecules.  Biochim Biophys Acta 701:242-252
1982

Sjostrom and Wold(1987)  Signal peptide amino acid sequences in E. coli
contain information related to final protein localization. A multi-variate
data analysis.  EMBO Journal 6:3 823-831

Sjostrom and Wold(1985)  A multivariate study of the relationship between the
genetic code and the physical-chemical properties of amino acids.
J Mol Evol  22:272-277

	This list is by no means exhaustive but does provide a reasonable
intro to the possibilities and (current) limitations of the ideas I meant to
describe initially.

	If you feel that others may be interested please feel free to forward
this message to appropriate bboards.

Jack Kramer

PS  I had to add one comment on one of my pet peeves, the misuse of the
word homology as it applies to sequence comparisons.  The series of
messages on the statistical significance of sequence comparison demonstrates
the confusion that results when "homology" is diluted to include analogy
and similarity and even more.  Here it is even worse because not only
are similarity, homology and analogy indiscriminately intermixed but the
two different levels of sequence comparison and the subsequent phylogeny
inference were not distinguished.  Maintaining these distinctions is absolutely
necessary for these discussions (and those in the literature) to make
sense across the related multidisciplinary fields.

et al et et al  "Homology" in Proteins and Nucleic Acids: A terminology
muddle and a way out of it.  CELL 50:667 Aug 28, 1987
-------
-------