firth@sei.cmu.edu (Robert Firth) (02/09/89)
BCPL Strings ------------ First, please let me apologise for the less than polite tone of my previous post. It was unwarranted and out of place. By way of atonement, I've put together this fairly lengthy note on BCPL strings, their evolution and implementation. It does eventually get back to the original topic, the possible use of string instructions. The request was for an "authoritative" treatment, which I think obliges me to cite references. The works consulted were [1] Martin Richards: BCPL - a tool for compiler writing and systems programming. Proceedings AFIPS Spring JCC, 1969. [2] Martin Richards: BCPL Programming Manual. University of Cambridge Computing Laboratory, 1974. (Alas, my copy of the original 1969 version has vanished into the mists of time) [3] Martin Richards & Colin Whitby-Strevens: BCPL - the Language and its Compiler. Cambridge University Press, 1978. [4] R D Eager: Draft BCPL Standard. University of Kent, 1982. I have also looked at the code and documentation of three BCPL systems that, while not truly independent, were of relatively autonomous authorship. However, any errors or omissions are mine. Current Definition ------------------ The language feature is defined in [4, 3.2.5.3] thus: The value of a string is the BCPL address of the first of a set of contiguous cells which at runtime hold the actual characters of a string together with its length. Each cell is divided into a number of 'bytes'. Byte zero of cell zero of a string contains the length, and subsequent bytes contain the characters packed contiguously and in the given sequence. The unused bytes of the last cell are filled with binary zeros. The same section also says (though probably not in the best place): The number of string characters allowed in a string depends on the implementation, but will be at least 127. A null string is legal. Evolution --------- It is interesting to contrast the definition in [2, 2.2]: A <string constant> consists of up to 255 characters enclosed in string quotes ("). The internal character set is EBCDIC (on the 370)... The machine representation of a string is the address of the region of store where the length and characters of the string are packed. The following has "always" been the case a. a string is represented by a (BCPL) address b. this address identifies a place where the string proper is stored c. the stored representation contains both a length and the text d. the stored representation occupies a contiguous sequence of whole machine words starting with the addressed word. Over time, the specification has evolved towards greater definiteness of representation (a fact I find of some interest). This is seen in three places a. it has become explicit that the stored string is null-padded to a word boundary b. a null string is explicitly declared legal, and a lower bound is set on the maximum allowed length (it is set at 127 rather than 255 as a concession to implementors stuck with signed bytes, though in fact the limit is 255 on both the PDP-11 and the VAX-11) c. the length is required to be stored in "byte zero" This is further tightened by the most precise implementors' guide I know (authored by Tim Ward of HPAC), that says: Implementations may be relied upon... ... to make no assumptions about the unused bits in the last word of a string ... to accept any 8-bit byte as a valid string character The string has thus been some flavour of "counted" string, and the length is stored "with" the string, but the manner in which it is represented has varied. I know of two implementations that held the length in a word rather than a byte; not surprisingly, both were on word addressed machines. Implementations differ on whether a string constant can be modified at run time; my habit has been to put literal strings in read-only memory, but the draft standard is silent on this issue and most of the older implementations allow literal strings to be modified. Operations on Strings --------------------- Since the language is typeless, there are no operations on strings as such. Hence, a string value may be assigned, compared, multiplied by 42, and so on, like any other value, though not all these operations will yield sensible results. Given that the value is an address, it should be clear that assignment is reference assignment (the string is not copied) and comparison is address comparison. More significant are the standard library routines that operate on strings. These are [2, 2.8.2]: PACKSTRING(V,S) is a function which packs the characters V!1 to V!N into S, where N = V!0 & 255. The result is the subscript of the highest element of S used (i.e. N/4 on the 370). UNPACKSTRING(S,V) is a routine to unpack characters from the string S into V!1 to V!N when N is the length of the string, and set V!0 = N. GETBYTE(S,I) is a function whose result is the Ith character of the string S. By convention the zeroth character of a string is its length. PUTBYTE(S,I,CH) is a routine which will update the Ith character of the string S with CH. and, of course: WRITES(S) writes the string S to the current output stream. Since BCPL normally operates only on whole-word quantities, the purpose of the pack and unpack procedures is clearly to convert between the string representation and one directly manipulable. The reason PackString returns a result is that, after having packed a string in a temporary, one normally wishes to stuff it away into allocated storage. The result tells you exactly how much storage to allocate, and also tells you the upper bound of the FOR-loop that will do the word-by-word copy. The code is then portable. The get and put routines allow individual components of a string to be fetched and stored, including the string length. Note that at that time (1974) the exact representation of the string length was still implementation dependent; it could be retrieved by GetByte(S,0) but, strictly speaking, need not actually be stored in the zeroth byte. The form of words in [3], a few years later, was rather: By convention, byte zero contains the number of characters in the string... which is much closer to describing a physical storage layout. When revising the BCPL library in 1975/76 I added two procedures that never made it into the standard: StringLength(S) returns the length of string S. CompareString(S1,S2) returns +,0,- as S1 >,=,< S2. The former was intended to replace GetByte(S,0), which has to be a special case if byte 0 is not the length. The latter was introduced because most users wrote their own function and either got it wrong or relied on the null-padding of strings. The standard, of course, found alternative solutions to both problems. Language Extensions ------------------- One language extension applies to strings. In 1973/74 [I think; the exact provenance is unclear from my references] the new syntax was introduced: S%I to imply the selection of byte I from vector S (the vector presumably containing a string). The following are strictly equivalent: ... S%I ... ... GetByte(S,I) ... S%I := CH PutByte(S,I,CH) Implementation Issues --------------------- On a conventional byte-addressed machine, there is no real problem in implementing the standard. The mandated representation is probably the best, and the operations are acceptably efficient once the inevitable cost of rescaling BCPL addresses has been paid. The byte selector '%' generates simple open code; the library routines can be written as simple assembler bodies, and the other operations don't care about the format of a string. On a word-addressed machine, the byte selector is usually translated as a call of the library routine (an explicit constant index can be special cased), and the library routines must perform whatever tedious work is required in accessing individual characters. On the NM4, for instance, there are special double-indexing instructions that can fetch a byte whose index is in one register from a vector whose (word) address is in another - this maps almost exactly onto the BCPL form S%I. To provide some perspective on the relative importance of the various operations, I looked at the code of a BCPL front end and back end. In the two together, the string handling routines were invoked this often: PackString : 2 UnpackString : 0 GetByte : 8 PutByte : 15 StringLength : 4 CompareString : 1 WriteS(string) : 31 WriteF(format) : 65 I couldn't count the number of times a string was assigned, passed as a parameter, or returned as a result, but it was well over a thousand. The above is a static count; the most deeply nested uses of the library routines were: In the lexical scanner, one call of PutByte for each character of each multi-character input token In the symbol table builder, a few calls of CompareString for each occurrence of each BCPL identifier (one per link in a hash chain) In the codegenerator, one call of write-under-format for each Assembler operation and operand Recall here the intended application domain of BCPL: compiler writing and systems programming. In other domains, more extensive string handling might be required, but in this one the great majority of operations on strings do not care about the string format or contents, and so are best implemented as by-reference operations on a string pointer or descriptor. Hardware String Instructions ---------------------------- If the majority of operations on BCPL strings manipulate their addresses rather than their bodies, then they do not need string instructions. Of the library operations, the dominant ones are the plain write, which writes an explicit string, and the write-under-format, which takes a format string and must dismantle and decode it. Neither of these is really assisted by having string instructions. Finally, there are two operations that could be helped by the hardware: string copying and string lexicographic comparison. I looked at these on the Vax-11. String copy can be done by MOVC3: MOVZBL (Rsrc),Rtemp ; get length from byte 0 INCL Rtemp ; add 1 for length byte itself MOVC3 Rtemp,(Rsrc),(Rdst) ; and move the bytes But BCPL strings are always word-aligned and padded to a whole number of words, so the copy could also be done by MOVL in a loop, which is faster. Even if the representation were changed to a 16-bit inclusive leading count (reducing the above code to one instruction) the loop would still be faster. String lexicographic comparison could likewise be done by CMPC5, but again CMPB in a loop is almost always faster, in part because the loop usually exits after a couple of iterations and in part because CMPC5 takes a long time to get moving. My tentative conclusion is that hardware string instructions are not relevant to the great majority of BCPL string operations, which don't even look at the string, and are at best of marginal help to those operations that do look at the string. The big win is to use by-reference semantics whenever possible. Once you are committed to iterating over the string, there seems no pervasive advantage either way between a counted string and a sentinel-terminated string, though in my experience more machines are slightly more comfortable with the latter. Hope that helps, and thank you for your attention Robert Firth