[comp.lang.c++] References vs Pointers

solomon@gjetost.cs.wisc.edu (Marvin Solomon) (07/09/89)

In article <PCG.89Jul8222705@thor.cs.aber.ac.uk> pcg@thor.cs.aber.ac.uk
(Piercarlo Grandi) discusses the Simula 67 ':-' operator and how it relates
to ':='.  (His note contains quotes several levels deep, and it wasn't clear
to me who wrote what, so I won't quote him here).  I've come to realize that
there is a basic concept here that can be used to classify languages and
clarify their semantics.

In 1975 I mentioned to a grad student at the University of Aarhus (Denmark)
that I found the distinction between :- and := in Simula very confusing.
It seemed to me that you used :- to assign pointers and := to assign
anything else.  The student, who was an acolyte of Christian Nygaard, one of
the chief gurus of Simula, heatedly informed me that Simula doesn't have
pointers, only references, and A REFERENCE IS NOT A POINTER!!.  After
a bit of arguing, I came to realize that he and I had different pictures
in our heads when we wrote
	ref(foo) p;
Consider what it means to declare (in C or C++)
	int i = 3;
This allocates a cell big enough to hold an integer, puts a three into the
cell, and binds the identifier 'i' to it.  The picture is
	    +---+
	i : | 3 |
	    +---+
Now consider the definition
	int *p = &i;
This allocates a cell big enough for an address, binds 'p' to it, and
initializes it with the address of the cell bound to i.  The picture is
	    +---+
	p : | *--------+
	    +---+      |
		       v
		     +---+
		 i : | 3 |
		     +---+
Given this picture, the C assignment operator '=' always means "copy":
In the expression 'e1 = e2', the expression 'e1' must evaluate to an
'lvalue'--i.e., it must designate a cell--and e2 must evaluate to a value
of the right size and shape to fit into the cell.  A copy of the value
of e2 is put into the cell denoted by e1.

Now consider the Simula statements
	ref(integer) p,q;
	p :- new integer;
	q :- p;
If you view the Simula 'ref(integer)' as a synonym for the C 'int *',
it would appear that the "assignment" operator :- has copy semantics.
If so, why are there two operators :- and := in Simula?  In fact, in almost
every case only one of
	p :-
or	p :=
is legal, depending on the type of p.  The lone exception concerns the type
'text', which is a built-in character-string type.  More on this later.

As it turned out, my informant had quite a different picture in his mind.
To him, a reference is a kind of movable label for a cell.  For the fragment
of Simula above, he would draw the picture
	+---+
     p: |   |
     q: |   |
	+---+
Thus := means "copy the contents", whereas :- means "move a label".
Restrictions on when :- or := is legal can be thought of as implementation
limitations rather than fundamental semantics concepts.  For example,
	integer i;
	ref(integer) p;
	p :- i;
	p := 3;
is not allowed, but if it were, it would lead to the picture
	+---+
     p: | 3 |
     i: |   |
	+---+
In the case of 'text', both :- and := are allowed, and the semantics follows
this pattern:
	text t,u,v;

	t := "hello";
	u :- t;  comment u refers to the same string as t;
	v := t;  comment v refers to a copy;
	t[1] := "H";
	comment Now t=u="Hello" and v="hello";

[DISCLAIMER:  I may not have the Simula syntax exactly right.  It's been
over 10 years since I've used the language and I don't have a manual handy.
But I'm pretty sure I have the semantics right.]

C++ aficianados will note the similarity of Simula "references" and C++
"references".  References in C++ are probably one of its most obscure and
confusing features for C programmers.  The confusion is exacerbated by the
fact that C++ has *both* pointers and references and by the fact that
the same symbol (=) is used for copy (in the context of an assignent) and
reference binding (in the context of an initialization).  Consider
	int i,j;
	int *p;
	int &jj = j;

	p = &i;
	*p = 3;
	jj = 4;

The picture is
	    +---+                 +---+
	p : | *--------+       j: | 4 |
	    +---+      |      jj: |   |
		       v          +---+
		     +---+
		 i : | 3 |
		     +---+

In the introduction to this note, I promised a classification of languages.
The classification concerns nesting of data structures; it is most easily
described in implementation terms:  Can a data structure contain instances
of other data strucutres, pointers to data structures, both, or neither?

FORTRAN is a langauge that does not admit nested structures at all.
COBOL allows nesting of structures, but does not support pointers.
LISP, Simula, SmallTalk, and CLU are all examples of languages that allow
data structures to contain pointers, but not actual instances (copies)
of data structures.  I gather from discussions in this group that Eiffel
falls into this category, although I do no know Eiffel well enough to be sure.

The fourth class of languages all seem to trace their descent from a languauge
called CPL (or perhaps Algol 68, which was designed about the same time).
They include PL/I, Algol W, Pascal, BCPL, C, and C++.  These allow a
"subobject" to be represented either by a local copy or by a pointer.
The programmer is require to choose the representation.

In all but the last class of languages, the distinction between a pointer
and a local copy can be largely hidden by the syntax and operational semantics
of the language, and by introducing limitations on how language primitives
can be combined.  For example, compare C++ with Simula:
	C++				Simula
	---				------
	class foo { int a,b,c; };	class foo; begin integer a,b,c; end;
	foo *p;				ref(foo) p;
	foo f;				foo f; comment illegal;
	p = new foo;			p :- new foo;
	(*p).a = 5;			p.a := 5;
	f.a = 6;			comment no counterpart;
In this respect at least, Pascal is like C++.  The equivalent program is
	type foo = record a,b,c : integer end;
	var p : ^foo;
	    f : foo;
	begin
		new(p);
		p^.a := 5;
		f.a := 6;
	end;
Since the Simula programmercannot get his hand directly on an instance, but
only a reference, the dot notation can be used unambiguously to mean
"dereference the pointer and index into the resulting structure" (or
"access the structure referred to by the reference", depending on your point
of view).  In C and Pascal, a finer distinction is made between an object
and a reference to it (this is a from of the "use/mention" distinction familiar
to linguists:  "the name of the song is called...").  Pointers must be
explicitly dereferenced before selection.  The syntax is clumsy enough
in C that the '->' operator was introduced as an abbreviation (no such
abbreviation is required in Pascal because "dereference" is a postfix operator).

The distinction between a pointer and a copy has important implications for
time and space efficency.  Consider

	struct large { char data[1000]; };
	struct large hello = {"hello"};
	struct large goodbye = {"goodbye"};

	struct {
		struct large a;
		struct large *b;
	} f,g;
	char c;

	f.a = hello;		// slow
	f.b = &goodbye;		// fast
	g = f;			// g.a=f.a (slow); g.b=f.b (fast)
	c = f.a.data[0];	// very fast (one memory reference)
	c = f.b->data[0]	// possibly two memory references

It also has important semantic ramifications in three contexts:
update, comparison, and deletion.

Structure-sharing can cause anomalous behavior in the presense of updates.
	f.a.data[0] = '*';
	f.b->data[0] = '*';
	puts(g.a.data);		// "hello"
	puts(g.b->data);	// "*oodbye"
This is the infamous "alias problem".  (Pure) LISP avoids the problem by
disallowing update-in-place.  In other languages, it causes no end of
headaches for (optimizing) compiler writers and serious pitfalls for
programmers.  Teaching introductory data structures in Simula in the late
70's, I found this alias problem to be one of the worst sources  of bugs
in students' programs.  ("Look, I print this message before calling f and
after the return and it changes, even though I don't pass it as a parameter!").

Comparison leads to the EQ vs EQUAL distinction mentioned in a previous note.
	struct { char *a; } f,g;
	f.a = "hello";
	g.a = "hello";
	bcmp(&f, &g, sizeof f); // returns false!

Deletion brings us to the subject of garbage collection, which has already
been beated to death in this group.  I have two cents to contribute to
this question too, but I've already rambled on far too long, so I'll restrain
myself.

In summary, structure-sharing is dangerous, but so absolutely essential
in so many applications that is cannot be left out of a language.  In cases
where the programmer chooses not to share, however, there is no reason
to suffer the additional time and space overhead of a pointer.
This sort of low-level efficiency question would be better left to
the compiler, but until we have compilers for "real" languages that can do
complete alias analysis (I've been waiting 20 years now), I'm afraid
the programmer is still going to be asked to lend a hand.
--
	Marvin Solomon
	Computer Sciences Department
	University of Wisconsin, Madison WI
	solomon@cs.wisc.edu, ...seismo!uwvax!solomon