[comp.lang.c++] deepEquality and shallowEquality revisited.

steveb@tekcsc.WV.TEK.COM (04/20/91)

A few days ago I posted a query in comp.lang.smalltalk because I was confused
about deep and shallow equality.  I was hoping to either get references to
OOPSLA type papers or a good definition.  Judging from what I received, most
people either didn't understand my question or didn't understand deep and 
shallow equality.  I got all manner of answers including such things as
"equality is shallowEquality and identity is deepEquality".  So, in case
anyone is interested here is the best answer I received.  It is from Harry
Porter who teaches Object-Oriented Design at Portland State University.  He
addresses the subject with examples you might not fully understand unless
you've taken his class but they're relatively clear and obvious.  So, thanks
to Harry for taking the time to write this up, and thanks to everyone else who
responded.  I'm posting this to comp.lang.c++ also because it's related to
OOD in general.

Steve

==============================

There are several kinds of "equality" in object-oriented 
programming.  In an attempt to clarify the different underlying 
concepts, I use the following terms for the different meanings of the 
term "equality:"
	object identity
	shallow equality
	deep equality
	similiar-for-the-purposes-of-the-application

First, "object identity" is just a pointer comparison:  two objects are 
identical if they are the same object.  Often times, people say "two 
objects are identical if they are the same object."  By "the same 
object," we mean they are stored in the same memory location.  A 
change to one object will always affect an identical object since there 
is only one piece of memory involved.  When we say two variables (x 
and y) are identical, we mean that they contain the same pointer.  
That is, they point to the identical object. 

Second, there is the concept of "shallow equality."  Two objects are 
shallow equal if they look alike.  Two variables are shallow equal if 
they point to two regions of memory that are bit-wise equal to each 
other.  Thus, a change to one object will NOT be seen by processes 
holding pointers to the other.  The definition of shallow equality is 
usually enlarged to include "object identity" as well.  If two objects 
are identical then they are generally said to be shallow equal, too.  A 
formal definition might be something like: "Two objects are shallow 
equal if either: (1) they are identical, or (2) each corresponding pair 
of fields in the objects is identical."

Third, there is the concept of "deep equal."  The idea here is that the 
fields in the two object do not need to be identical (i.e., they don't 
need to point to the very same sub-objects) but, instead, they must 
point to objects that are themselves "equal."  And what do we mean 
by "equal?"  Namely, the sub-objects are themselves deep equal.  
This loose definition leads to a more formal recursive definition:  Two 
objects are deep equal if either: (1) they are identical, (2) each field 
is deep equal.  As an example, imagine two trees represented with 
instances of our Tree class.  If the two tree have the same shapes and 
each corresponding pair of nodes is the same, then the trees are deep 
equal.  Thus, we have two completely different sets of objects: one 
representing one tree and the other set representing the second tree.  
(The definition of deep equality given above is not entirely 
satisfactory, since it doesn't exactly state that the shapes must be the 
same.  It works well for our simple tree example, but imagine that 
we are representing a Directed Acyclic Graph (DAG) in which one 
node may be the child of several parents.  As a simple example, first 
consider a node (labeled "A") whose left-child and right-child 
pointers both point to the same (identical) node (labeled "B").  Now 
imagine a second set of nodes.  The top node (labeled "A") has a left-
child pointer which points to a node (labeled "B") and a right-child 
pointer that points to another, non-identical node also labeled "B".  
The two "A" nodes are not deep equal.  Thus, a rigorous definition of 
deep equality would require the same "shape" - or "co-reference 
structure" - among two deep-equal structures.)

Fourth, there is the concept of "similar-for-the-purposes-of-the-
application."  (Sometimes, this is called "equivalence," but Smalltalk 
uses that term to mean "identical" so I would avoid it.)  The example 
here is comparing a Fraction with a Float.  The two objects are very 
different - they belong to different classes and have differing 
numbers of fields - but one Fraction object might represent the same 
number as some Float object.  Here, the definition of equality is 
application dependent.  For example, we might want a Fraction be 
equal to a Float if they are within some small tolerance.

Finally, there is the implementation of == and = in Smalltalk.  == is 
implemented primitively as a pointer comparison and therefore == 
means object identity.  == is implemented in class Object and should 
not be overridden.

On the other hand, = is used to mean "similiar-for-the-purposes-of-
the-application."  An implementation of = is provided in various 
classes.  (As an example, take a look at the implementation of = in 
class Number. I actually found the method in a class called 
ArithmeticValue, which sits between Magnitude and Number.  Not all 
class hierarchies are (deep) equal!)  A default implementation is 
provided for = ("similiar-for-the-purposes-of-the-application") in 
class Object.  This default just invokes ==.

Shallow equality and deep equality are not provided in Smalltalk, at 
least not in class Object.  You may always implement them in any 
class.

Equality is related to copying.  When you make a copy of an object, 
what exactly do you get?  Again, there are several related concepts: 
"shallow equality", "deep equality" and "pointer copying."  In 
Smalltalk, the assignment statement (left-arrow) performs pointer 
copying: no new objects are allocated on the heap.  Pointer copying is 
also used for parameter passing and returned values.

If you want to create new objects, you can send a message to the 
object you want to copy.  All objects understand the messages 
"shallowCopy", "deepCopy" and "copy."  ShallowCopy will create 
exactly one new object and the fields of the original will be moved 
(via pointer copying) into the new object.  Thus, sub-objects will be 
shared.  The implementation of deepCopy in Smalltalk allocates a 
new object and then calls itself recursively for each field.  Thus, all 
sub-objects will be copied and not shared.  If there is no pointer co-
reference in the object being copied, this works fine.  On the other 
hand, the implementation of deepCopy in Smalltalk will fail to 
produce an exact copy of the DAG example mentioned above.  Finally, 
the message "copy" is intended to be used to create a copy as 
appropriate for the application.  The copy message is implemented in 
class Object (as a shallowCopy), but should be overridden in classes 
where other behavior is appropriate.

The importance of getting all these concepts straight cannot be over-
emphasized.  When constructing complex programs, you must know 
whether objects are identical or not.  Changes made to one object 
may be felt elsewhere if both parts of the program hold pointers to 
an identical object.  Also, some programs will use the identity 
comparison (==) for efficiency and clarity, but this can be confusing 
when objects are equal but not the same object.

jimad@microsoft.UUCP (Jim ADCOCK) (04/23/91)

In article <10646@orca.wv.tek.com| steveb@tekcsc.WV.TEK.COM () writes:
(quoting generally good stuff from Harry Porter)
|First, "object identity" is just a pointer comparison:  two objects are 
|identical if they are the same object.  Often times, people say "two 
|objects are identical if they are the same object."  By "the same 
|object," we mean they are stored in the same memory location.  

I disagree, rather:

"An object's identity is that which distinguishes an object from all other
 objects."

I caution C++ programmers that while defining "object identity" to 
be the same as a memory address works pretty well for Smalltalk, there
are many problems with such a definition in C++.  The reasons that using
a memory address as the representation of an object's identity works well
in Smalltalk include:  1) Smalltalk has garbage collection.  Thus if an
object no longer exists, we are guaranteed that its "identity" no longer
exists anywhere in the system, thus we are free to re-assign that identity
to some newly created object.  2) Smalltalk [generally] doesn't have
multiple inheritence, thus avoiding the C++ problem that a MI object
may have several addresses each of which can be considered as representing
the object.

Some examples of common "identity traps" that C++ programmers constantly
fall into by equating 'this' to be the same thing as an object's 'identity.'

* An embedded object may have the same address as an embedding object.
* MI objects may have several significant addresses.
* Compilers differ in how they lay out objects, thus these problems change
	with compiler.
* C++ has VERY ill-defined rules about when and where compilers may make
	a copy of an object, changing "its 'identity.'"
* Two objects may occupy the same address at different points in time,
	and C++ doesn't guarantee that all references to the prior object
	have gone away.

etc, etc, etc.

Thus I claim an object's address in C++ IS NOT its identity.  To again quote
Khoshofian and Abnous: "An object's identity is that which distiguishes
an object from all other objects."  Clearly, in C++, a memory address is
frequently insufficient to correctly make this distinction.

|The importance of getting all these concepts straight cannot be over-
|emphasized.  When constructing complex programs, you must know 
|whether objects are identical or not.  

Agreed.  Thus I warn C++ programmers that frequently it is inappropriate
to consider a C++ object's address[es] to be its "identity."

For more complete discussions see:

"Is 'this' identity?" in Feb 1991 'The C++ Report'

and

Object Orienteering 
by Khoshafian and Abnous

chip@tct.com (Chip Salzenberg) (05/01/91)

In an article with which I otherwise agree, Jim Adcock writes:

>* C++ has VERY ill-defined rules about when and where compilers may make
>	a copy of an object, changing "its 'identity.'"

Those rules describe the creation of new objects, not the changing of
old objects' identities.  So I would consider object copying rules to
be a separate issue from object identity determination.
-- 
Brand X Industries Sentient and Semi-Sentient Being Resources Department:
        Because Sometimes, "Human" Just Isn't Good Enough [tm]
     Chip Salzenberg         <chip@tct.com>, <uunet!pdn!tct!chip>

jimad@microsoft.UUCP (Jim ADCOCK) (05/04/91)

In article <281EE52A.672F@tct.com> chip@tct.com (Chip Salzenberg) writes:
|In an article with which I otherwise agree, Jim Adcock writes:
|
|>* C++ has VERY ill-defined rules about when and where compilers may make
|>	a copy of an object, changing "its 'identity.'"
|
|Those rules describe the creation of new objects, not the changing of
|old objects' identities.  So I would consider object copying rules to
|be a separate issue from object identity determination.

This kind of depends on one's perspective.  As a programmer, if you get
a copy made unexpectedly by some compiler, where you expect the original
instead, then the identity you get is not the one you expected.  The fact
that you get the wrong object along with the wrong identity doesn't help
you much.  What you'd really like is clearer rules about if and when
its legal for a compiler to make a copy.  The idea that a compiler can
introduce a "temporary" copy if and when it chooses kind of makes OOP
difficult -- or at least non-portable.

chip@tct.com (Chip Salzenberg) (05/10/91)

According to jimad@microsoft.UUCP (Jim ADCOCK):
>In article <281EE52A.672F@tct.com> chip@tct.com (Chip Salzenberg) writes:
>|I would consider object copying rules to be a separate issue from
>|object identity determination.
>
>This kind of depends on one's perspective.  As a programmer, if you get
>a copy made unexpectedly by some compiler, where you expect the original
>instead, then the identity you get is not the one you expected.  The fact
>that you get the wrong object along with the wrong identity doesn't help
>you much.

If the copy can be identified as a copy because its identity is
detectably different (which, in ARM C++, it is), then copying and
identity become entirely unrelated issues (which, in ARM C++, they
are).

>What you'd really like is clearer rules about if and when its legal for
>a compiler to make a copy.

I am happy with the clarity of the rules: I think that I understand
them.  I am not happy with their laxity, however, and wish that the
implementor's latitude in this area were reduced.

>The idea that a compiler can introduce a "temporary" copy if and when it
>chooses kind of makes OOP difficult -- or at least non-portable.

Some things are difficult, yes; though I tend to avoid circumstances
in which copying is optional.  Non-portable, no -- just be careful not
to ass_u_me that a copy will or will not be made in circumstances
which allow both.
-- 
Brand X Industries Sentient and Semi-Sentient Being Resources Department:
        Because Sometimes, "Human" Just Isn't Good Enough [tm]
     Chip Salzenberg         <chip@tct.com>, <uunet!pdn!tct!chip>