[bionet.molbio.evolution] Homology/similarity/identity: proper usage.

steffen@mbir.bcm.tmc.edu (David Steffen) (01/31/91)

  I am again struggling with the proper use of the words "homology",
"similarity", and "identity" in comparing sequences.  Specifically, we
have cloned and sequenced (a bit of) the rat homologue of the _lck_
gene.  The sequence of the mouse and human _lck_ genes is known.  How
do we know what we have is the rat homologue?  Because when we compare our
sequence to the published sequence, most of the nucleotides can be made
to match up with minimal futzing.  So how do I say that?  At present,
we are saying:
 "In all four cases, the inserts were found to contain sequences
  homologous to human and mouse lck..."
but one of my grad students points out that the word homologous is
incorrect, since it represents an inference about evolution rather
than a statement of fact.  My objection to replacing the word
"homologous" with the word "similar" is that it gives the impression
that the sequences don't match all that well.  My objection to
replacing the word "homologous" with the word "identical" is that the
sequences are not identical.  My objection to replacing the word
"homologous" with the words "##% identical" is that I would need four
different numbers for the four different tumors, making the sentence
practically unreadable.

  I guess if "similar" is the only correct word in this context, I
could live with that.  However, since I believe that we are dealing
with homologous sequences, is the word "homologous" really incorrect?
(I understand that "##% homologous" is always wrong; sequences are
either homologous or they are not.)

  Email me if you wish, but I suspect others may wonder about this as
well and that a discussion might be a "good thing".

-- 
David Steffen
Department of Cell Biology, Baylor College of Medicine, Houston TX 77030
Telephone = (713) 798-6655, FAX = (713) 790-0545
Internet = steffen@mbir.bcm.tmc.edu

boycet%frodob.dnet@ASMUS1.GENETICS.UGA.EDU (01/31/91)

Why don't you simply say that "based on the XX % similarity found 
among these sequences, we conclude that we have cloned the
gene homologous to the human _whatever_ gene"  ?

I personally feel that it's preferable to be as explicit as possible
when making hypotheses of homology among genes, structures, etc.

Good luck.

T. M. Boyce

wrp@biochsn.acc.Virginia.EDU (William R. Pearson) (01/31/91)

	I would feel very uncomfortable calling some cloned inserts
from an insect "homologous" to mammalian genes.  I would say that they
were "X - Y% identical", or that they were "very similar."  The point
you want to make, of course, is that you now know the sequence of the
insect homologues of some mammalian genes, but it is the insect "genes"
that are homologous. I would put the cloned inserts into a different
category.
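As a rough illustration of the "X - Y% identical" wording, the sketch
below computes percent identity for several aligned inserts against one
reference and reports the range.  The sequences are made-up toy strings
(not real lck data), the names percent_identity, reference, and inserts
are assumptions for illustration, and skipping gapped columns is only
one of several possible conventions.

```python
def percent_identity(a, b):
    """Percent of aligned, ungapped columns at which two sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    # Ignore any column in which either sequence has a gap character.
    columns = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not columns:
        raise ValueError("no ungapped columns to compare")
    matches = sum(x == y for x, y in columns)
    return 100.0 * matches / len(columns)


# Toy data only: an artificial 20-mer and four hypothetical aligned inserts.
reference = "ACGTACGTACGTACGTACGT"
inserts = {
    "tumor1": "ACGTACGTACGTACGTACGT",
    "tumor2": "ACGTACGAACGTACGTACGT",
    "tumor3": "ACGTACGTACCTACGTAGGT",
    "tumor4": "ACGTTCGT-CGTACGTACGT",
}

identities = {name: percent_identity(reference, seq)
              for name, seq in inserts.items()}
low, high = min(identities.values()), max(identities.values())
print(f"{low:.0f} - {high:.0f}% identical")  # prints "90 - 100% identical"
```

Reporting the range keeps the sentence readable while the individual
values can go into a table or figure legend.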

Bill Pearson

beckfdp@pallas.network.com (D. Pat Beckfield) (01/31/91)

In article <3824@gazette.bcm.tmc.edu> steffen@mbir.bcm.tmc.edu (David Steffen) writes:
>
>  I am again struggling with the proper use of the words "homology",
>"similarity", and "identity" in comparing sequences.  Specifically, we
>have cloned and sequenced (a bit of) the rat homologue of the _lck_
>gene.  The sequence of the mouse and human _lck_ genes is known.  How
>do we know what we have is the rat homologue?  Because when we compare our
>sequence to the published sequence, most of the nucleotides can be made
>to match up with minimal futzing.  So how do I say that?  
>
		  [discussion of rationale deleted]
>
>-- 
>David Steffen
>Department of Cell Biology, Baylor College of Medicine, Houston TX 77030
>Telephone = (713) 798-6655, FAX = (713) 790-0545
>Internet = steffen@mbir.bcm.tmc.edu

As you're discussing semantics, it seems appropriate that I (a writer and
BS in zoology) respond.

The first question is does the rat sequence you're working with execute
the same function as the similar sequences for mice and humans?  If you
don't know, is there a way to test it?  Until you know this, what you
have is a homomorphic string -- similar structure, but not necessarily
having the same function.

If they do carry out the same functions, but you're still concerned by
evolutionary relations, you can call them "analogous" -- having the
same function, but not necessarily the same origin.

I hope this helps.
-- 

D. Patrick Beckfield                          pat.beckfield@network.com
7600 Boone Ave N                              (612) 424-4888
Network Systems Corporation

ronald@uhunix1.uhcc.Hawaii.Edu (Ronald A. Amundson) (02/01/91)

In article <3824@gazette.bcm.tmc.edu> steffen@mbir.bcm.tmc.edu (David Steffen) writes:
>
>  I am again struggling with the proper use of the words "homology",
>"similarity", and "identity" in comparing sequences.  ...

I agree with some of the followup already given, and I'm not a
great expert on molecular genetics.  But there is an interesting
problem here.  The term "homology" clearly is being used differently
in molecular genetics from its usage in traditional evolutionary
biology.  Steve Gould comments on the issue in his Natural History
column for Feb. 1988, BTW, wishing that the molecular biologists would
talk more like macro-biologists.  

The problem with calling identical molecular sequences "homologies" is
not _just_ that it implies a common source for the two sequences.  One
of the commentators is correct that _any_ evolutionary use of
"homology" infers a common source on less-than-certain evidence.  The
problem is that the criteria by which the common source is identified
are different in the molecular and "macroscopic" inferences of
homology.  I can think of two differences -- forgive my ignorance if
I've got facts wrong.

1)  Good macroscopic evolutionary inferences of homology are based on
"shared derived" characteristics.  The nests of other sets of traits
disallow certain similarities to count as homologies.  Mere similarity
alone can never be used to judge two traits as homologous.  (Unless
I'm wrong) the "mere similarity" (i.e. molecular identity or
similarity, in the absence of evidence provided by other hierarchies
of traits) of molecular sequences is used as a sufficient criterion
for the term "homology" in molecular genetics.

2) It seems to me (insert disclaimer again) that when molecular
biologists call sequences homologous, they mean that the two were
copied from a similar ancestral _molecular sequence_.  But the
processes of copying molecular sequences are not identical to the
processes of reproducing organisms.  As I understand it, sequences can
be copied within a genome, and with manipulation (and maybe some kinds
of viral infection and other exotic stuff) between genomes.  So the
genealogical tree connecting up similar sequences with their molecular
ancestors will not be isomorphic with the genealogical tree connecting
organisms with their ancestors.  

So it looks as if the molecular use of "homology" is a _different_ use
from the normal evolutionary use of the same term.  Whether that is a
tragedy or not depends on how confused we get by it.  But I think it
is worth noting that different concepts are being used.  There are
_lots_ of cases in the history of biology where different uses of the
same word led to long futile disputes (e.g. the term "mutation" at the
beginning of the century).

Ron Amundson
Dept. of Philosophy
University of Hawaii at Hilo
Hilo, HI  96720-4091
ronald@uhunix.bitnet

owhite@nmsu.edu (smouldering dog) (02/01/91)

In article <3824@gazette.bcm.tmc.edu> steffen@mbir.bcm.tmc.edu (David Steffen) writes:
>     I am again struggling with the proper use of the words "homology",
>   "similarity", and "identity" in comparing sequences.  Specifically, we
>   have cloned and sequenced (a bit of) the rat homologue of the _lck_
>   etc......
this is a topic worth discussing.  my understanding is that when you
are referring to a sequence that has _some_ nucleotides in a close
approximation to another sequence you should say:
	sequence A has __% similarity to sequence B.  

alternatively you can say:
	sequence A has __% identity to sequence B.

to refer to two sequences being homologous means they are a  _strict_
(nucleotide for nucleotide) match.  as in:

	the cDNA sequence to gene A is homologous to region X of the
genomic clone of gene A.

alternatively you can say:
	the cDNA sequence to gene A is identical to region X of the
genomic clone of gene A.

however, I am curious if the rest of the community agrees with the
above usages of identity, similarity and homology.

owen white
--

	owen white		(owhite@nmsu.edu)

-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-*-=-=-*-=-
               got my head on a pole (for better reception)
-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-=-*-=-*-=-=-*-=-

fish@AMOEBA.LLNL.GOV (02/01/91)

This is in response to a recent query by David Steffen regarding the use of 
the term "homology."  This controversy has received much attention in recent
years as most of the subscribers of this bulletin board will attest to.  The
author should probably consult these commentaries on the subject: Reeck et 
al., "Homology in proteins and nucleic acids: A terminology muddle and a
way out of it," Cell 50: 667 (1987); Lewin, "When does homology mean 
something else?" Science 237: 1570 (1987).

I tend to agree with William Pearson's suggestions, i.e., without
prior knowledge of whether given genes are "orthologous" it is probably 
a good idea not to say there are homologies between them.  For the past
few years, I have been researching the evolution of the immunoglobulin
multigene family and have encountered terminology problems similar to
those described by Dr. Steffen.

Chris T. Amemiya
Lawrence Livermore National Laboratory
fish@amoeba.llnl.gov

owhite@nmsu.edu (smouldering dog) (02/01/91)

In article <1991Jan31.155713.27154@ns.network.com> beckfdp@pallas.network.com (D. Pat Beckfield) writes:
>   As you're discussing semantics, it seems appropriate that I (a writer and
>   BS in zoology) respond.
>   discussion deleted...
>   If they do carry out the same functions, but you're still concerned by
>   evolutionary relations, you can call them "analogous" -- having the
>   same function, but not necessarily the same origin.
>
>   D. Patrick Beckfield                          pat.beckfield@network.com

in the literature, two similar sequences are RARELY referred to as
"analogous" 
--

	owen white		(owhite@nmsu.edu)


wrp@biochsn.acc.Virginia.EDU (William R. Pearson) (02/02/91)

	Dr. Amundson from Hawaii is precisely correct: molecular biologists
use the term "homology" to denote "common ancestry."  In talking to
evolutionary biologists, it is my understanding that their use of
the term is precisely the same.  However, evolutionary biologists
often refer to two distinct types of homology:

	orthology - where the two sequences encode the same protein,
	e.g. a myoglobin in a human and a myoglobin in a whale

	paralogy - where the relationship between the two sequences
	is not consistent with the phylogeny, e.g. myoglobin in a human
	and beta-globin in a human (here, the ancestor of myoglobin
	and hemoglobin is much older than recent ancestors of humans).

	Evolutionary biologists discuss a similarity relationship that
is the converse of homology - analogy, or similarity due to convergent
evolution.  It is my opinion there are no good examples of convergent
evolution based on protein or DNA sequence.

	The main point, of course, is to distinguish between the
supposition - homology - and the fact - similarity or percent identity.

Bill Pearson

wrp@biochsn.acc.Virginia.EDU (William R. Pearson) (02/02/91)

In article <11223@uhccux.uhcc.Hawaii.Edu> ronald@uhunix1.uhcc.Hawaii.Edu (Ronald A. Amundson) writes:
>In article <3824@gazette.bcm.tmc.edu> steffen@mbir.bcm.tmc.edu (David Steffen) writes:
>>
>>  I am again struggling with the proper use of the words "homology",
>>"similarity", and "identity" in comparing sequences.  ...
>
>problem here.  The term "homology" clearly is being used differently
>in molecular genetics from its usage in traditional evolutionary
>biology.  Steve Gould comments on the issue in his Natural History
>column for Feb. 1988, BTW, wishing that the molecular biologists would
>talk more like macro-biologists.  

 I do not believe that the use of the term is different

>The problem with calling identical molecular sequences "homologies" is
>not _just_ that it implies a common source for the two sequences. 

	My understanding is that homology implies common
	ancestry (source) - nothing more or less.


>1)  Good macroscopic evolutionary inferences of homology are based on
>"shared derived" characteristics.  The nests of other sets of traits
>disallow certain similarities to count as homologies.  Mere similarity
>alone can never be used to judge two traits as homologous.

	I would argue that while "mere" similarity is
	insufficient, there are levels of similarity that
	allow one to infer homology and never be mistaken.
	Everyone accepts that two sequences that are 100%
	identical are homologous, and clearly one should
	not feel too uncomfortable with two sequences that
	are 90% identical (over their entire length).  The
	issue arises when two protein sequences share
	less than 20 - 25% identity.  But there are a
	series of tests, based on sequence similarity
	alone, that make it very unlikely that the
	inference of homology is incorrect.

> (Unless
>I'm wrong) the "mere similarity" (i.e. molecular identity or
>similarity, in the absence of evidence provided by other hierarchies
>of traits) of molecular sequences is used as a sufficient criterion
>for the term "homology" in molecular genetics.

>2) It seems to me (insert disclaimer again) that when molecular
>biologists call sequences homologous, they mean that the two were
>copied from a similar ancestral _molecular sequence_.  But the
>processes of copying molecular sequences are not identical to the
>processes of reproducing organisms.  As I understand it, sequences can
>be copied within a genome, and with manipulation (and maybe some kinds
>of viral infection and other exotic stuff) between genomes.  So the
>genealogical tree connecting up similar sequences with their molecular
>ancestors will not be isomorphic with the genealogical tree connecting
>organisms with their ancestors.  

	Here, the terms "homologous" and "orthologous" are being confused.
	Two sequences are homologous if they share a common ancestor, no
	matter how complex and exotic the evolutionary path between that
	ancestor and the present.

>So it looks as if the molecular use of "homology" is a _different_ use
>from the normal evolutionary use of the same term.

	This is not true. We all agree on common ancestry.
	We may disagree on the amount of evidence required
	to support the assertion of common ancestry, but
	we all mean the same thing.

Bill Pearson

frist@ccu.umanitoba.ca (02/02/91)

I think it may be useful, at this point in the discussion, to construct a
chart relating the terms under discussion. Comments, anyone?

                                 IDENTICAL
                When two structures share common substructures,
                those common substructures are said to be identical.
                                     |
                                  SIMILAR
                                     |
                 Similarity is the degree to which two structures share
                 common substructures.
                                     |
               +---------------------+---------------------+
               |                                           |
           HOMOLOGOUS                                  ANALOGOUS
    Similarity due to common                 Similarity due to convergent
          ancestry                                     evolution
               |
               +---------------------+         
               |                     |
           ORTHOLOGOUS           PARALOGOUS
    Homologous structure with  Homologous structure with
    conserved function         divergent function
                 
===============================================================================
Brian Fristensky                | "The important thing is to develop the
Dept. of Plant Science          |  capacity to see one kernel that is differ-
University of Manitoba          |  ent, and make that understandable. If it
Winnipeg, MB R3T 2N2  CANADA    |  doesn't fit, there's a reason, and you find
frist@ccu.umanitoba.ca          |  out what it is."
Office phone:   204-474-6085    | Barbara McClintock, from A FEELING FOR THE
FAX:            204-275-5128    | ORGANISM by Evelyn Fox Keller 
===============================================================================

Doug_Eernisse@UB.CC.UMICH.EDU (02/02/91)

This is another response to the recent query by David Steffen regarding the use of
the term "homology." I thought I might as well throw my two cents worth in.
 
 Many molecular biologists commonly use informal arbitrary criteria as 
 support for statements of the "homology" of two genes. For example, they 
 might suggest that if two peptide sequences in the same organism were 
 highly similar (e.g., 85 percent) then one could be confident that the
 proteins were "homologous", due to a gene duplication event, as opposed to
 similarity due to parallel evolution for similar function.
 
 It seems to me that hypotheses of homology are only relevant to phylogenetic
 inference at the level they are proposed to be synapomorphic (shared derived
 similarities) on a cladogram. Therefore, it is hopeless to try to provide 
 evidence for homology by the comparison of two taxa or sequences. The only 
 interesting evidence one can bring to bear on the issue of common ancestry 
 is shared "special" similarity relative to one or more outgroup taxa or 
 sequences. This issue is, I think, distinct from the issue of whether
 homology is used as many of us use the term synapomorphy, as a proposal 
 of homology, or as the actual similarity due to common ancestry which is 
 ultimately impossible to prove.
 
 With sequence data, there are also problems of specifying the level of
 homology. For example, Michael Ghiselin (Syst. Zool. 18: 148-149 (1969))
 uses the following hypothetical example:
 
 A  Asp-Val-Glu-Met-Ala
 B  Asp-Pro-Glu-Met-Ala
 C  Asp-Pro-Thr-Met-Ala
 D  Gly-Pro-Thr-Met-Ala
 E  Gly-Pro-Thr-Tyr-Ala
 F  Gly-Pro-Thr-Tyr-Ser
 
 Ghiselin argues that similarity is a relation between the peptides as
 wholes, which decreases from A to F, while homology is a relation between 
 the parts. He argues, for example, that Asp is hypothesized to be homologous
 to Gly in A and F, respectively, given this alignment of the sequences.
 He also argues that the peptide sequence A could be homologous to F even
 though they are completely dissimilar. One can speak of the correspondence
 between nucleotides or amino acids in terms of their position in a sequence
 which is hypothesized to be homologous. Although Ghiselin doesn't consider
 this use of homology, one more normally may also speak of the shared
 similarity of D, E and F at site 1, relative to A, B and C, which could
 be a synapomorphy (hypothesis of homology), depending on the outgroup(s)
 one selects which in turn determines the cladogram topology. One can also 
 hypothesize that peptide F is homologous to peptide A, or more precisely, 
 hypothesize that the shared ancestor of A and F had a single protein-coding 
 gene which is traceable, by descent, to the genes in A and F which produced 
 these peptides.
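
 The claim that similarity decreases from A to F can be checked by
 counting matching positions in the alignment above.  This is a minimal
 sketch that takes the alignment exactly as given; the helper name
 shared_positions is an assumption for illustration.

```python
# The six hypothetical peptides from Ghiselin's example, as aligned above.
PEPTIDES = {
    "A": "Asp-Val-Glu-Met-Ala",
    "B": "Asp-Pro-Glu-Met-Ala",
    "C": "Asp-Pro-Thr-Met-Ala",
    "D": "Gly-Pro-Thr-Met-Ala",
    "E": "Gly-Pro-Thr-Tyr-Ala",
    "F": "Gly-Pro-Thr-Tyr-Ser",
}


def shared_positions(p, q):
    """Count aligned positions at which two peptides carry the same residue."""
    return sum(x == y for x, y in zip(p.split("-"), q.split("-")))


# Similarity of each peptide to A: 5, 4, 3, 2, 1, 0 matching positions.
for name, seq in PEPTIDES.items():
    print("A vs", name, ":", shared_positions(PEPTIDES["A"], seq), "of 5")

# Each peptide differs from the next one in the series at exactly one position.
names = list(PEPTIDES)
steps = [shared_positions(PEPTIDES[a], PEPTIDES[b])
         for a, b in zip(names, names[1:])]
print(steps)  # [4, 4, 4, 4, 4]
```

 A and F share no residues at all, yet every neighboring pair differs
 at a single position, which is why the two ends of the series can
 still be hypothesized to be homologous.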
 
 Confusing, isn't it?
 
 Doug Eernisse
 usergdef@ub.cc.umich.edu
 usergdef@umichub.bitnet
 Museum of Zoology and Dept. of Biology
 University of Michigan
 

beckfdp@pallas.network.com (D. Pat Beckfield) (02/02/91)

In article <OWHITE.91Feb1084317@haywire.nmsu.edu> owhite@nmsu.edu (smouldering dog) writes:
>In article <1991Jan31.155713.27154@ns.network.com> beckfdp@pallas.network.com (D. Pat Beckfield) writes:
>>   If they do carry out the same functions, but you're still concerned by
>>   evolutionary relations, you can call them "analogous" -- having the
>>   same function, but not necessarily the same origin.
>>
>>   D. Patrick Beckfield                          pat.beckfield@network.com
>
>in the literature, two similar sequences are RARELY referred to as
>"analogous" 
>--
>
>	owen white		(owhite@nmsu.edu)
>
I must point out that this is a failing of the authors of the literature,
not the English language.  

If the author is not confident enough of the origins of the material, but
is confident that it is homomorphic with and carries out the same functions
as other material, then the correct word in the English language is
"analogous".  The word is ideally suited to the situation as described.
I can state this emphatically as a professional technical writer, 
condescension notwithstanding.

Respectfully,
D.P.B.


-- 
D. Patrick Beckfield                          pat.beckfield@network.com
7600 Boone Ave N                              (612) 424-4888
Network Systems Corporation
Minneapolis, MN  55428-1099

joe@evolution.u.washington.edu (Joe Felsenstein) (02/03/91)

In article <11223@uhccux.uhcc.Hawaii.Edu> ronald@uhunix1.uhcc.Hawaii.Edu (Ronald A. Amundson) writes:
> The
>problem is that the criteria by which the common source is identified
>are different in the molecular and "macroscopic" inferences of
>homology.  I can think of two differences -- forgive my ignorance if
>I've got facts wrong.
>
>1)  Good macroscopic evolutionary inferences of homology are based on
>"shared derived" characteristics.  The nests of other sets of traits
>disallow certain similarities to count as homologies.  Mere similarity
>alone can never be used to judge two traits as homologous.  (Unless
>I'm wrong) the "mere similarity" (i.e. molecular identity or
>similarity, in the absence of evidence provided by other hierarchies
>of traits) of molecular sequences is used as a sufficient criterion
>for the term "homology" in molecular genetics.

>Ron Amundson
>Dept. of Philosophy
>University of Hawaii at Hilo
>Hilo, HI  96720-4091
>ronald@uhunix.bitnet

As far as I can see "homology" as used by morphological systematists
is the same thing.  Many studies of morphology don't actually base
themselves on characters where ancestral and derived states can be
predetermined.  Instead they toss the data into a computer, get a tree
by (say) Wagner parsimony, and use an outgroup criterion to root the tree,
and in the process determine after the fact which states are ancestral and
where the synapomorphies are.  That's the same thing molecular evolutionists
do.

They will often use more than one sequence and judge "homology" by where
the sequence fits in on a phylogeny of the sequences, where they might
use (say) Wagner parsimony with outgroup-rooting.  If this is done, then I see
no real difference between the two processes.
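
The role of the outgroup can be shown at a single site.  The sketch
below is not Wagner parsimony and builds no tree; it only applies
outgroup comparison to site 1 of the six hypothetical peptides (A-F)
quoted earlier in this thread.  The outgroup states and the function
name derived_groups are assumptions for illustration.

```python
# Site 1 of the six hypothetical peptides discussed earlier in the thread.
SITE1 = {"A": "Asp", "B": "Asp", "C": "Asp",
         "D": "Gly", "E": "Gly", "F": "Gly"}


def derived_groups(states, outgroup_state):
    """Group taxa by shared states that differ from the outgroup state.

    A state found in the outgroup is treated as ancestral; any other
    state shared by two or more taxa is a candidate synapomorphy.
    """
    groups = {}
    for taxon, state in states.items():
        if state != outgroup_state:
            groups.setdefault(state, []).append(taxon)
    return {s: sorted(t) for s, t in groups.items() if len(t) >= 2}


# With an Asp outgroup, Gly is derived and unites D, E and F.
print(derived_groups(SITE1, "Asp"))  # {'Gly': ['D', 'E', 'F']}
# With a Gly outgroup, the polarity flips and Asp unites A, B and C.
print(derived_groups(SITE1, "Gly"))  # {'Asp': ['A', 'B', 'C']}
```

The same data support different groupings depending on the outgroup,
which is the sense in which ancestral and derived states are
determined after the fact.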

-----
Joe Felsenstein, Dept. of Genetics, Univ. of Washington, Seattle, WA 98195
 Internet:         joe@genetics.washington.edu     (IP No. 128.95.12.41)
 Bitnet/EARN:      felsenst@uwavm
 UUCP:             ... uw-beaver!evolution.genetics!joe

ahouse@BINAH.CC.BRANDEIS.EDU (02/04/91)

It seems to me that if you believe that these sequences are identical at many
positions because they derive from a common ancestor then you can feel
justified in using "homology" and do this knowing that you are implying an
inference.  You may in this case wish to distinguish paralogous from
orthologous.  (see Patterson's introduction to Molecules and Morphology in
Evolution: Conflict or Compromise)  
	Briefly these terms distinguish between identities that exist because
of a shared ancestor of 2 species (orthologous) and identities that come from a
gene duplication event within a lineage (think of all of the proteins that have
Ig like domains).  You seem to feel that you have orthologous sequences that
share many identities.
	As an aside, molecular biologists seem much more willing to use
homology to mean "it looks the same (= same sequences)" and morphologists
prefer to insist that homology indicates a stronger statement of the reason for
the similarity (homology -> derived from a common ancestor). 

Jeremy Ahouse
Brandeis University - Biophysics

ahouse@BINAH.CC.BRANDEIS.EDU (02/07/91)

In article <OWHITE.91Jan31142027@haywire.nmsu.edu>, owhite@nmsu.edu (smouldering dog) writes:
>to refer to two sequences being homologous means they are a  _strict_
>(nucleotide for nucleotide) match.  

Homology is often used with the goal of stating an inference about relatedness. 
You either have sequence identity or, if you've looked at it, amino acid
identity.  You may wish to indicate that some bases or amino acids are switched
but have the same function...  So all this means that you should be very
explicit about what you intend.  Your reading of homology as "strict" match
replaces a statement about inferred relationship with one about pattern and
seems too restrictive.

Jeremy

ahouse@BINAH.CC.BRANDEIS.EDU (02/07/91)

>	orthology - where the two sequences encode the same protein,
>	e.g. a myoglobin in a human and a myoglobin in a whale
>
>	paralogy - where the relationship between the two sequences
>	is not consistent with the phylogeny, e.g. myoglobin in a human
>	and beta-globin in a human (here, the ancestor of myoglobin
>	and hemoglobin is much older than recent ancestors of humans).
>

I think that you might want to distinguish these terms not with respect to
phylogenetic consistency but rather in terms of genetic events.  A gene
duplication (that gives rise to a pair of paralogous genes) may happen before a
lineage split.  If the 2 genes become functionally different in the 2
lineages you may be fooled into imagining a homology that is due to a shared
derived gene when in fact you don't have orthologous genes because of the
paralogy in the ancestor.
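
One way to make the event-based distinction concrete is to label each
internal node of a gene tree as a duplication or a speciation and to
classify a pair of genes by the event at their most recent common
ancestor.  The sketch below uses a hypothetical tree in which a
duplication precedes the split of two lineages, X and Y; all node and
gene names are invented for illustration.

```python
# Hypothetical gene tree, child -> parent.  One ancestral gene duplicates
# ("dup") into alpha and beta copies, then the lineages X and Y split.
PARENT = {
    "alpha_anc": "dup",
    "beta_anc": "dup",
    "X_alpha": "alpha_anc",
    "Y_alpha": "alpha_anc",
    "X_beta": "beta_anc",
    "Y_beta": "beta_anc",
}

# Event assigned to each internal node of the gene tree.
EVENT = {
    "dup": "duplication",
    "alpha_anc": "speciation",
    "beta_anc": "speciation",
}


def ancestors(node):
    """Return the ancestors of a node, nearest first."""
    path = []
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def relationship(a, b):
    """Classify two genes by the event at their most recent common ancestor."""
    anc_b = set(ancestors(b))
    for node in ancestors(a):
        if node in anc_b:
            if EVENT[node] == "speciation":
                return "orthologous"
            return "paralogous"
    raise ValueError("genes share no ancestor in this tree")


print(relationship("X_alpha", "Y_alpha"))  # orthologous
print(relationship("X_alpha", "Y_beta"))   # paralogous
```

X_alpha and Y_beta sit in different lineages, yet they trace back to
the duplication rather than to the lineage split, so they are
paralogous even though a two-species comparison might suggest
orthology.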

Jeremy Ahouse
Brandei