solomon@gjetost.cs.wisc.edu (Marvin Solomon) (07/09/89)
In article <PCG.89Jul8222705@thor.cs.aber.ac.uk> pcg@thor.cs.aber.ac.uk (Piercarlo Grandi) discusses the Simula 67 ':-' operator and how it relates to ':='. (His note contains quotes several levels deep, and it wasn't clear to me who wrote what, so I won't quote him here). I've come to realize that there is a basic concept here that can be used to classify languages and clarify their semantics. In 1975 I mentioned to a grad student at the University of Aarhus (Denmark) that I found the distinction between :- and := in Simula very confusing. It seemed to me that you used :- to assign pointers and := to assign anything else. The student, who was an acolyte of Christian Nygaard, one of the chief gurus of Simula, heatedly informed me that Simula doesn't have pointers, only references, and A REFERENCE IS NOT A POINTER!!. After a bit of arguing, I came to realize that he and I had different pictures in our heads when we wrote ref(foo) p; Consider what it means to declare (in C or C++) int i = 3; This allocates a cell big enough to hold an integer, puts a three into the cell, and binds the identifier 'i' to it. The picture is +---+ i : | 3 | +---+ Now consider the definition int *p = &i; This allocates a cell big enough for an address, binds 'p' to it, and initializes it with the address of the cell bound to i. The picture is +---+ p : | *--------+ +---+ | v +---+ i : | 3 | +---+ Given this picture, the C assignment operator '=' always means "copy": In the expression 'e1 = e2', the expression 'e1' must evaluate to an 'lvalue'--i.e., it must designate a cell--and e2 must evaluate to a value of the right size and shape to fit into the cell. A copy of the value of e2 is put into the cell denoted by e1. Now consider the Simula statements ref(integer) p,q; p :- new integer; q :- p; If you view the Simula 'ref(integer)' as a synonym for the C 'int *', it would appear that the "assignment" operator :- has copy semantics. If so, why are there two operators :- and := in Simula? In fact, in almost every case only one of p :- or p := is legal, depending on the type of p. The lone exception concerns the type 'text', which is a built-in character-string type. More on this later. As it turned out, my informant had quite a different picture in his mind. To him, a reference is a kind of movable label for a cell. For the fragment of Simula above, he would draw the picture +---+ p: | | q: | | +---+ Thus := means "copy the contents", whereas :- means "move a label". Restrictions on when :- or := is legal can be thought of as implementation limitations rather than fundamental semantics concepts. For example, integer i; ref(integer) p; p :- i; p := 3; is not allowed, but if it were, it would lead to the picture +---+ p: | 3 | i: | | +---+ In the case of 'text', both :- and := are allowed, and the semantics follows this pattern: text t,u,v; t := "hello"; u :- t; comment u refers to the same string as t; v := t; comment v refers to a copy; t[1] := "H"; comment Now t=u="Hello" and v="hello"; [DISCLAIMER: I may not have the Simula syntax exactly right. It's been over 10 years since I've used the language and I don't have a manual handy. But I'm pretty sure I have the semantics right.] C++ aficianados will note the similarity of Simula "references" and C++ "references". References in C++ are probably one of its most obscure and confusing features for C programmers. The confusion is exacerbated by the fact that C++ has *both* pointers and references and by the fact that the same symbol (=) is used for copy (in the context of an assignent) and reference binding (in the context of an initialization). Consider int i,j; int *p; int &jj = j; p = &i; *p = 3; jj = 4; The picture is +---+ +---+ p : | *--------+ j: | 4 | +---+ | jj: | | v +---+ +---+ i : | 3 | +---+ In the introduction to this note, I promised a classification of languages. The classification concerns nesting of data structures; it is most easily described in implementation terms: Can a data structure contain instances of other data strucutres, pointers to data structures, both, or neither? FORTRAN is a langauge that does not admit nested structures at all. COBOL allows nesting of structures, but does not support pointers. LISP, Simula, SmallTalk, and CLU are all examples of languages that allow data structures to contain pointers, but not actual instances (copies) of data structures. I gather from discussions in this group that Eiffel falls into this category, although I do no know Eiffel well enough to be sure. The fourth class of languages all seem to trace their descent from a languauge called CPL (or perhaps Algol 68, which was designed about the same time). They include PL/I, Algol W, Pascal, BCPL, C, and C++. These allow a "subobject" to be represented either by a local copy or by a pointer. The programmer is require to choose the representation. In all but the last class of languages, the distinction between a pointer and a local copy can be largely hidden by the syntax and operational semantics of the language, and by introducing limitations on how language primitives can be combined. For example, compare C++ with Simula: C++ Simula --- ------ class foo { int a,b,c; }; class foo; begin integer a,b,c; end; foo *p; ref(foo) p; foo f; foo f; comment illegal; p = new foo; p :- new foo; (*p).a = 5; p.a := 5; f.a = 6; comment no counterpart; In this respect at least, Pascal is like C++. The equivalent program is type foo = record a,b,c : integer end; var p : ^foo; f : foo; begin new(p); p^.a := 5; f.a := 6; end; Since the Simula programmercannot get his hand directly on an instance, but only a reference, the dot notation can be used unambiguously to mean "dereference the pointer and index into the resulting structure" (or "access the structure referred to by the reference", depending on your point of view). In C and Pascal, a finer distinction is made between an object and a reference to it (this is a from of the "use/mention" distinction familiar to linguists: "the name of the song is called..."). Pointers must be explicitly dereferenced before selection. The syntax is clumsy enough in C that the '->' operator was introduced as an abbreviation (no such abbreviation is required in Pascal because "dereference" is a postfix operator). The distinction between a pointer and a copy has important implications for time and space efficency. Consider struct large { char data[1000]; }; struct large hello = {"hello"}; struct large goodbye = {"goodbye"}; struct { struct large a; struct large *b; } f,g; char c; f.a = hello; // slow f.b = &goodbye; // fast g = f; // g.a=f.a (slow); g.b=f.b (fast) c = f.a.data[0]; // very fast (one memory reference) c = f.b->data[0] // possibly two memory references It also has important semantic ramifications in three contexts: update, comparison, and deletion. Structure-sharing can cause anomalous behavior in the presense of updates. f.a.data[0] = '*'; f.b->data[0] = '*'; puts(g.a.data); // "hello" puts(g.b->data); // "*oodbye" This is the infamous "alias problem". (Pure) LISP avoids the problem by disallowing update-in-place. In other languages, it causes no end of headaches for (optimizing) compiler writers and serious pitfalls for programmers. Teaching introductory data structures in Simula in the late 70's, I found this alias problem to be one of the worst sources of bugs in students' programs. ("Look, I print this message before calling f and after the return and it changes, even though I don't pass it as a parameter!"). Comparison leads to the EQ vs EQUAL distinction mentioned in a previous note. struct { char *a; } f,g; f.a = "hello"; g.a = "hello"; bcmp(&f, &g, sizeof f); // returns false! Deletion brings us to the subject of garbage collection, which has already been beated to death in this group. I have two cents to contribute to this question too, but I've already rambled on far too long, so I'll restrain myself. In summary, structure-sharing is dangerous, but so absolutely essential in so many applications that is cannot be left out of a language. In cases where the programmer chooses not to share, however, there is no reason to suffer the additional time and space overhead of a pointer. This sort of low-level efficiency question would be better left to the compiler, but until we have compilers for "real" languages that can do complete alias analysis (I've been waiting 20 years now), I'm afraid the programmer is still going to be asked to lend a hand. -- Marvin Solomon Computer Sciences Department University of Wisconsin, Madison WI solomon@cs.wisc.edu, ...seismo!uwvax!solomon