SYEH@BIONET-20.ARPA (04/01/88)
From: Jack Kramer <CMATHEWS.KRAMER@BIONET-20.ARPA> Winston, Percent similarity can be used to imply homology (CELL 50:667 Aug 87) as well as analogy. The statistical significance of sequence similarity has been extensively analysed. A couple of good places to start looking are the appropriate chapters in von Heijne's "Sequence Analysis in Molecular Biology", reviews and articles in Computer Applications in the Biosciences (e.g. 3:1 Mar 87) and the Application of Computers to Research on Nucleic Acids I, II and III supplements to NAR. Jack -------
CMATHEWS.KRAMER@BIONET-20.ARPA (04/06/88)
From: Jack Kramer <CMATHEWS.KRAMER@BIONET-20.ARPA> Mail-From: GOAD.DAVISON created at 1-Apr-88 18:52:09 Date: Fri 1 Apr 88 18:52:09-PST From: Dan Davison <GOAD.DAVISON@BIONET-20.ARPA> Subject: Re: Statistical significance of "PERCENT" homology. To: CMATHEWS.KRAMER@BIONET-20.ARPA In-Reply-To: <12386849069.45.CMATHEWS.KRAMER@BIONET-20.ARPA> Message-ID: <12387121770.15.GOAD.DAVISON@BIONET-20.ARPA> You might check out the method of Manske and Chapman, in the second issue of the current volume of the J. of Molecular Evolution. It's an interesting attempt in the general direction you describe. Would it be OK to post your note to MOLECULAR-EVOLUTION? I wouldn't mind starting a fight or two. I also think it's unlikely that *anything* other than Dr. Who's Tardis would allow "the detection of evolutionary relationships objectively". Perhaps I'm too jaded from the SUNY Stony Brook wars. dan davison ------- -------
GOAD.DAVISON@BIONET-20.ARPA (04/07/88)
From: Dan Davison <GOAD.DAVISON@BIONET-20.ARPA> Mail-From: CMATHEWS.KRAMER created at 31-Mar-88 17:54:09 Date: Thu 31 Mar 88 17:54:09-PST From: Jack Kramer <CMATHEWS.KRAMER@BIONET-20.ARPA> Subject: Re: Statistical significance of "PERCENT" homology. To: GOAD.DAVISON@BIONET-20.ARPA In-Reply-To: <12386835730.28.GOAD.DAVISON@BIONET-20.ARPA> Message-ID: <12386849069.45.CMATHEWS.KRAMER@BIONET-20.ARPA> Dan, I strongly second your "not well understood". A significant reason for this, in my opinion, is the scalar alphabet used as the basis for almost all studies. I am very interested in the assignment of multiple-coefficient attribute vectors to the alphabet and "words" and the application of neural net semantic pattern recognition AI techniques to this problem. Some very rudimentary work along these lines has been done by Stormo and Wold et al. The great flurry of current activity in massively parallel fine-grained hardware and software architectures will eventually percolate into the molecular biology arena. Most of the advances now being made along these lines in speech recognition will directly apply. Elucidation of the syntactic and semantic patterns in biomacromolecules should be a direct fallout from this work and, if combined with cladistic clustering, will finally allow the detection of evolutionary relationships objectively. (my opinions - and I love to argue, especially about mol evol). I think it would be very interesting to participate in electronic debate a la Farris/Felsenstein through this medium. Anybody out there want to start stirring things up? Jack Kramer Center for Gene Research Oregon State University ------- -------
CMATHEWS.KRAMER@BIONET-20.ARPA (05/13/88)
From: Jack Kramer <CMATHEWS.KRAMER@BIONET-20.ARPA> Mail-From: CMATHEWS.KRAMER created at 13-May-88 09:24:18 Date: Fri 13 May 88 09:24:18-PDT From: Jack Kramer <CMATHEWS.KRAMER@BIONET-20.ARPA> Subject: Re: [Dan Davison <GOAD.DAVISON@BIONET-20.ARPA>: Re: Statistical significance of "PERCENT" homology.] To: CMATHEWS.KRAMER@BIONET-20.ARPA In-Reply-To: <12388310066.28.CMATHEWS.KRAMER@BIONET-20.ARPA> Message-ID: <12398017522.38.CMATHEWS.KRAMER@BIONET-20.ARPA> Dan, I finally had a chance to go back and review the Manske and Chapman article. As I thought, it is not along the lines of my interests. The idea I was trying to convey was that a single letter is a very poor representation of an amino acid for the purposes of all but the very simplest sequence analysis tasks. Each amino acid is a real molecule which can be described by a set of many chemical and physical parameters. Thus I feel that the best way to represent this is by using a vector which is some linear combination of all these properties for each sequence element at the primary level. At higher levels, groups of contiguous vectors can be used to extend syntactic analysis. Abstractions of these which cluster with biological properties would then open the door to semantic analysis. To make the idea more concrete, the concept is very similar to the manual, graphically assisted analysis performed by plotting several parameters such as hydropathy, structural propensity predictors, mass, etc. for the sequences concerned on the same scale and then visually comparing the graphs to detect any pattern similarity. The Modelevsky article in the recent April edition of CABIOS presents another first-step approach. DeLisi's paper describes some efforts to use perceptrons as automated learning machines to extend the concept to pattern analysis at the level of biological semantics. Many papers have recently described initial attempts at using multivariate statistical analysis of vector representations of sequences.
I cite Gribskov, Kubota, and Sjostrom. The Venn diagram classification approach of Taylor could provide a basis for initial weighting matrices for learning based on the primary sequence elements. Another paper by the same Taylor (which I remember but don't have the citation at hand) describes another extension to abstract secondary structure domains and the semantic database type analyses that would be possible. I believe the increasing availability of supercomputers and vector and array processors to those working in these areas will make these multivariate techniques the basis of the next generation of sequence analysis software. (I'm sure that could elicit some response)

Modelevsky and Akers (1988) 3-D Multivariate data display tool as a protein design aid. CABIOS 4:2 308 April 1988
DeLisi (1988) Computers in Molecular Biology. Science 240:47-52 April 1988
Gribskov et al Profile analysis: Detection of distantly related proteins. PNAS 84:4355-4358 July 1987
Kubota et al Correspondence of homologies in amino acid sequence and tertiary structure of protein molecules. Biochim Biophys Acta 701:242-252 1982
Sjostrom and Wold (1987) Signal peptide amino acid sequences in E. coli contain information related to final protein localization. A multivariate data analysis. EMBO Journal 6:3 823-831
Sjostrom and Wold (1985) A multivariate study of the relationship between the genetic code and the physical-chemical properties of amino acids. J Mol Evol 22:272-277

This list is by no means exhaustive but does provide a reasonable intro to the possibilities and (current) limitations of the ideas I meant to describe initially. If you feel that others may be interested please feel free to forward this message to appropriate bboards. Jack Kramer

PS I had to add one comment on one of my pet peeves, the misuse of the word homology as it applies to sequence comparisons.
The series of messages on the statistical significance of sequence comparison demonstrates the confusion that results when "homology" is diluted to include analogy and similarity and even more. Here it is even worse because not only are similarity, homology and analogy indiscriminately intermixed but the two different levels of sequence comparison and the subsequent phylogeny inference were not distinguished. Maintaining these distinctions is absolutely necessary for these discussions (and those in the literature) to make sense across the related multidisciplinary fields. et al. "Homology" in proteins and nucleic acids: A terminology muddle and a way out of it. CELL 50:667 Aug 28, 1987 ------- -------
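A minimal sketch of the property-vector idea from the message above, not taken from any of the cited papers. Each residue is mapped to a small vector (Kyte-Doolittle hydropathy, approximate average residue mass, and a crude charge near neutral pH), each property is averaged over a sliding window, and the profiles of two equal-length sequences are compared by Pearson correlation as a numerical stand-in for comparing the plotted graphs by eye. The function names, the choice of three properties, the window size, and the use of correlation are all assumptions made for illustration.

```python
from math import sqrt

# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}

# Approximate average residue masses in daltons.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}

# Crude side-chain charge near neutral pH (assumption: His counted as neutral).
CHARGE = {"K": 1.0, "R": 1.0, "D": -1.0, "E": -1.0}

N_PROPERTIES = 3  # hydropathy, mass, charge


def property_vector(residue):
    """Replace a one-letter code with a (hydropathy, mass, charge) vector."""
    return (HYDROPATHY[residue], RESIDUE_MASS[residue], CHARGE.get(residue, 0.0))


def encode(sequence):
    """Encode a whole sequence as a list of property vectors."""
    return [property_vector(r) for r in sequence.upper()]


def window_profile(sequence, prop_index, window=5):
    """Sliding-window mean of one property along the sequence."""
    if len(sequence) < window:
        raise ValueError("sequence is shorter than the window")
    values = [vec[prop_index] for vec in encode(sequence)]
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either profile is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def profile_similarity(seq_a, seq_b, window=5):
    """Per-property profile correlation for two pre-aligned, equal-length sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be the same length (no gap handling here)")
    return tuple(
        pearson(window_profile(seq_a, k, window), window_profile(seq_b, k, window))
        for k in range(N_PROPERTIES)
    )
```

For example, `profile_similarity("MKTLLVAGDE", "MRTLIVAGDE")` compares two made-up peptides that differ by the conservative substitutions K to R and L to I. A letter-by-letter count sees two mismatches, while the hydropathy and charge profiles remain almost unchanged, which is the kind of information a single-letter alphabet discards. The sketch ignores alignment, gaps, and any statistical significance test.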