[comp.arch] BCPL Strings

firth@sei.cmu.edu (Robert Firth) (02/09/89)
BCPL Strings
------------

First, please let me apologise for the less than polite tone of my
previous post.  It was unwarranted and out of place.

By way of atonement, I've put together this fairly lengthy note on
BCPL strings, their evolution and implementation.  It does eventually
get back to the original topic, the possible use of string instructions.

The request was for an "authoritative" treatment, which I think obliges
me to cite references.  The works consulted were

[1] Martin Richards: BCPL - a tool for compiler writing and systems
    programming.  Proceedings AFIPS Spring JCC, 1969.

[2] Martin Richards: BCPL Programming Manual.  University of Cambridge
    Computing Laboratory, 1974.

    (Alas, my copy of the original 1969 version has vanished into the
     mists of time)

[3] Martin Richards & Colin Whitby-Strevens: BCPL - the Language and
    its Compiler.  Cambridge University Press, 1978.

[4] R D Eager: Draft BCPL Standard.  University of Kent, 1982.

I have also looked at the code and documentation of three BCPL systems
that, while not truly independent, were of relatively autonomous authorship.

However, any errors or omissions are mine.


Current Definition
------------------

The language feature is defined in [4, 3.2.5.3] thus:

	The value of a string is the BCPL address of the first of
	a set of contiguous cells which at runtime hold the actual
	characters of a string together with its length.  Each cell
	is divided into a number of 'bytes'.  Byte zero of cell zero
	of a string contains the length, and subsequent bytes contain
	the characters packed contiguously and in the given sequence.
	The unused bytes of the last cell are filled with binary zeros.

The same section also says (though probably not in the best place):

	The number of string characters allowed in a string depends
	on the implementation, but will be at least 127.  A null
	string is legal.

Evolution
---------

It is interesting to contrast the definition in [2, 2.2]:

	A  <string  constant>  consists of up to 255 characters
	enclosed in string quotes ("). The internal character set
	is EBCDIC (on the 370)...
 
	The machine representation of a string is the address of
	the region of store where the length and characters of the
	string are packed.

The following has "always" been the case

a. a string is represented by a (BCPL) address

b. this address identifies a place where the string proper is stored

c. the stored representation contains both a length and the text

d. the stored representation occupies a contiguous sequence of whole
   machine words starting with the addressed word.

Over time, the specification has evolved towards greater definiteness of
representation (a fact I find of some interest).  This is seen in three
places

a. it has become explicit that the stored string is null-padded to a
   word boundary

b. a null string is explicitly declared legal, and a lower bound is set
   on the maximum allowed length (it is set at 127 rather than 255 as
   a concession to implementors stuck with signed bytes, though in fact
   the limit is 255 on both the PDP-11 and the VAX-11)

c. the length is required to be stored in "byte zero"

This is further tightened by the most precise implementors' guide I
know (authored by Tim Ward of HPAC), that says:

	Implementations may be relied upon...
	... to make no assumptions about the unused bits in the last
	    word of a string
	... to accept any 8-bit byte as a valid string character

The string has thus been some flavour of "counted" string, and the length
is stored "with" the string, but the manner in which it is represented
has varied.  I know of two implementations that held the length in a
word rather than a byte; not surprisingly, both were on word addressed
machines.

Implementations differ on whether a string constant can be modified at
run time; my habit has been to put literal strings in read-only memory,
but the draft standard is silent on this issue and most of the older
implementations allow literal strings to be modified.


Operations on Strings
---------------------

Since the language is typeless, there are no operations on strings as
such.  Hence, a string value may be assigned, compared, multiplied
by 42, and so on, like any other value, though not all these operations
will yield sensible results.  Given that the value is an address, it
should be clear that assignment is reference assignment (the string
is not copied) and comparison is address comparison.

More significant are the standard library routines that operate on strings.
These are [2, 2.8.2]:

	PACKSTRING(V,S)  is a function which packs the characters
	V!1 to V!N into S, where N = V!0 & 255. The result is the
	subscript of the highest element of S used (i.e. N/4 on the 370).
 
	UNPACKSTRING(S,V)  is a routine to unpack characters from
	the string S into V!1 to V!N when N is the length of the
	string, and set V!0 = N.
 
	GETBYTE(S,I)  is a function whose result is the Ith character
	of the string S.  By convention the zeroth character of a string
	is its length.
 
	PUTBYTE(S,I,CH)  is a routine which will update the Ith 
	character of the string S with CH.
 
and, of course:

	WRITES(S) writes the string S to the current output stream.
 
Since BCPL normally operates only on whole-word quantities, the purpose
of the pack and unpack procedures is clearly to convert between the
string representation and one directly manipulable.  The reason PackString
returns a result is that, after having packed a string in a temporary,
one normally wishes to stuff it away into allocated storage.  The result
tells you exactly how much storage to allocate, and also tells you the
upper bound of the FOR-loop that will do the word-by-word copy.  The
code is then portable.

The get and put routines allow individual components of a string to be
fetched and stored, including the string length.

Note that at that time (1974) the exact representation of the string
length was still implementation dependent; it could be retrieved by
GetByte(S,0) but, strictly speaking, need not actually be stored in
the zeroth byte.  The form of words in [3], a few years later, was
rather:

	By convention, byte zero contains the number of characters
	in the string...

which is much closer to describing a physical storage layout.

When revising the BCPL library in 1975/76 I added two procedures that
never made it into the standard:

	StringLength(S)  returns the length of string S.

	CompareString(S1,S2)  returns +,0,- as S1 >,=,< S2.

The former was intended to replace GetByte(S,0), which has to be a special
case if byte 0 is not the length.  The latter was introduced because most
users wrote their own function and either got it wrong or relied on the
null-padding of strings.  The standard, of course, found alternative
solutions to both problems.


Language Extensions
-------------------

One language extension applies to strings.  In 1973/74 [I think; the exact
provenance is unclear from my references] the new syntax was introduced:

	S%I

to imply the selection of byte I from vector S (the vector presumably
containing a string).  The following are strictly equivalent:

	... S%I ...	... GetByte(S,I) ...
	S%I := CH	PutByte(S,I,CH)


Implementation Issues
---------------------

On a conventional byte-addressed machine, there is no real problem in
implementing the standard.  The mandated representation is probably the
best, and the operations are acceptably efficient once the inevitable
cost of rescaling BCPL addresses has been paid.  The byte selector '%'
generates simple open code; the library routines can be written as
simple assembler bodies, and the other operations don't care about the
format of a string.

On a word-addressed machine, the byte selector is usually translated as a
call of the library routine (an explicit constant index can be special
cased), and the library routines must perform whatever tedious work is
required in accessing individual characters.  On the NM4, for instance,
there are special double-indexing instructions that can fetch a byte
whose index is in one register from a vector whose (word) address is in
another - this maps almost exactly onto the BCPL form S%I.

To provide some perspective on the relative importance of the various
operations, I looked at the code of a BCPL front end and back end.  In
the two together, the string handling routines were invoked this often:

	PackString	:  2
	UnpackString	:  0
	GetByte		:  8
	PutByte		: 15
	StringLength	:  4
	CompareString	:  1
	WriteS(string)	: 31
	WriteF(format)	: 65

I couldn't count the number of times a string was assigned, passed as a
parameter, or returned as a result, but it was well over a thousand.

The above is a static count; the most deeply nested uses of the library
routines were:

	In the lexical scanner, one call of PutByte for each character
	of each multi-character input token

	In the symbol table builder, a few calls of CompareString for
	each occurrence of each BCPL identifier (one per link in a
	hash chain)

	In the codegenerator, one call of write-under-format for each
	Assembler operation and operand

Recall here the intended application domain of BCPL: compiler writing
and systems programming.  In other domains, more extensive string
handling might be required, but in this one the great majority of
operations on strings do not care about the string format or contents,
and so are best implemented as by-reference operations on a string
pointer or descriptor.  


Hardware String Instructions
----------------------------

If the majority of operations on BCPL strings manipulate their addresses
rather than their bodies, then they do not need string instructions.  Of the
library operations, the dominant ones are the plain write, which writes an
explicit string, and the write-under-format, which takes a format string
and must dismantle and decode it.  Neither of these is really assisted by
having string instructions.

Finally, there are two operations that could be helped by the hardware:
string copying and string lexicographic comparison.  I looked at these
on the Vax-11.

String copy can be done by MOVC3:

	MOVZBL (Rsrc),Rtemp		; get length from byte 0
	INCL Rtemp			; add 1 for length byte itself
	MOVC3 Rtemp,(Rsrc),(Rdst)	; and move the bytes

But BCPL strings are always word-aligned and padded to a whole number
of words, so the copy could also be done by MOVL in a loop, which is
faster.  Even if the representation were changed to a 16-bit inclusive
leading count (reducing the above code to one instruction) the loop
would still be faster.

String lexicographic comparison could likewise be done by CMPC5, but
again CMPB in a loop is almost always faster, in part because the loop
usually exits after a couple of iterations and in part because CMPC5
takes a long time to get moving.

My tentative conclusion is that hardware string instructions are not
relevant to the great majority of BCPL string operations, which don't
even look at the string, and are at best of marginal help to those
operations that do look at the string.  The big win is to use by-reference
semantics whenever possible.  Once you are committed to iterating over
the string, there seems no pervasive advantage either way between a counted
string and a sentinel-terminated string, though in my experience more
machines are slightly more comfortable with the latter.

Hope that helps, and thank you for your attention

Robert Firth