[comp.lang.misc] BCPL Pointers

firth@sei.cmu.edu (Robert Firth) (09/22/87)
The current debate about BCPL, pointers, and the Amiga might
find some history useful.  The language itself is described
in [Martin Richards: BCPL, The Language and its Compiler].

BCPL is a typeless language.  That is, while the programmer
is free to consider a variable 'v' as holding an integer, a
character, an enumeral, a pointer, or even a function or
label, the language makes no such distinction.  The variable
holds a bit pattern, and one can interpret it however one
pleases, though of course programs that impose consistent
interpretations are more likely to do sensible things.

Likewise, the only storage unit is the word, which is a unit
big enough to hold a bit pattern of useful size.  The normal
size chosen is the size that will comfortably hold an integer
or a pointer, eg 16 bits on PDP-11, 32 bits on MC68000.

The abstract machine assumed by BCPL has a von Neumann store,
comprising a linear array of words numbered sequentially from
0 to some maximum, such as 2^15-1 or 2^30-1.  Hence, given a
pointer to one such word, 'p', say, you can generate a pointer
to the next word by writing 'p+1'.

An array (vector, in BCPL parlance) is represented by a variable
holding a pointer to its first word, so if v is a vector then the
value 'v' points to the first word, 'v+1' to the next, and 'v+i'
to the i'th.  The dereferencing operator in BCPL is written "!",
so
	!(v+i)

gets you the i'th component of the vector 'v'.  This can be written
also
	v!i

using the diadic form of "!".

Recall, however, that BCPL is typeless.  The compiler doesn't know
that 'v' is a vector and 'i' and index; they are both just bit patterns.
Hence, the expression '(v+i)' looks just like normal integer addition.
This has a strange consequence: if '(v+1)' is the same code, regardless
of whether 'v' is an integer or a vector, then the BCPL pointers to
adjacent words must indeed differ by 1.  On a machine with an underlying
byte addressing scheme, therefore, the BCPL pointers cannot be true
addresses, but must be scaled, eg by 2 on the PDP-11 or by 4 on the
MC68000.  They must therefore be rescaled (by a left shift, usually)
before being dereferenced.

If the language were typed, then the compiler, on seeing 'v+1', could
say "Hm, 'v' is of type pointer, so that "+" is not integer addition,
it is pointer offsetting, and what I should really do is add '4', since
one BCPL word occupiers four address units."  This, of course, is just
what a C compiler would do.  But the BCPL compiler can't, since there
is no type information.

One alternative, however, is to attach the type information to the
operators.  That, after all, is why we have "+" for integer addition,
and "#+" for floating addition: the compiler can't deduce the operation
from the types of the operands, so we have to tell it.  Well, suppose
we redefine "!" to mean "indexing", ie

	v!i

now means "access the i'th component of v, scaling i as necessary".  We
can then use the TRUE address as the pointer, and the compiler will
indeed know that i must be multiplied by 4.  And a good codegenerator
can then use the indexing mode of the MC68020 to generate much slicker
code.

This has several consequences.  First, the "!" operator is not symmetric;
'v!i' and 'i!v' no longer mean the same thing.  Secondly, we cannot
compute the address of 'v!i' just by saying 'v+i', since that is not
the address.  We could say 'v+4*i', or 'v+UnitsPerWord*i', but a better
form seems to be '@(v!i)'.  Nastiest, of course, are those FOR loops
that say things like

	FOR p=v TO v+vmax DO .... !p ...

since the step is no longer 1, but UnitsPerWord.

How hard would this be?  I edited these changes into the compiler and
a codegenerator in less than a day, making the required consequential
changes to the code that would allow the new compiler to compile
itself.  One necessary concession, was to interpret 'constant!variable'
as if it were the other way round, since the code was littered with
the idiom "constant_offset ! pointer_to_structure".  Otherwise, a
global scan of all occurrences of "+", "-", and "FOR" found everything.
On the VAX-11, the code size of the compiler decreased by about 6%,
because of better use of the indexed address modes and the omission of
right shifts for address generation.

The interface with assembler code is simplified, too, since addresses
were now true machine addresses.

Hope this puts the issue in perspective.