[comp.arch] null-terminated strings are OK

mash@mips.UUCP (John Mashey) (12/21/87)

In article <261@ivory.SanDiego.NCR.COM> jan@ivory.UUCP (Jan Stubbs) writes:
.....
>Personally, I can't imagine any convenience a null terminated string would
>have over a string preceded by its length. The 'C' convention forces any
>string operation to examine the whole string, where alternative schemes
>would not.
>Even a machine with byte addressability only has to test each byte for zero
>as it goes. A string preceded by its length could be easily added or
>subtracted from another string as well, an operation present in some 
>dialects of Pascal.

Dmr hasn't commented yet, so he may not. As I recall, either from
old memos or discussions, here is some of the reasoning:

1) C doesn't have a string data type at all, on purpose.
If necessary, one can always do a macro package, or preprocessor,
to implement the type on top of the existing facilities (done many times,
with different choices of implementation.)

2) As recounted by Kernighan and Plauger in their tales of converting
Software Tools to Pascal, the fixed-length strings implicitly required by
the need to use many different Pascal implementations were fairly painful.

3) If you build in a string data-type, you usually end up with:
	length	string OR
	current-length	max-length string OR
	length	pointer-to-string OR
	current-length	max-length pointer-to-string
and you've definitely made this decision for the language user, as the
decision percolates around thru calling conventions, storage allocation,
etc. 
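
To make the four layouts above concrete, here is a rough sketch of how
each might look if spelled out as C structs. None of this comes from any
particular language implementation: the struct names, the MAXLEN capacity,
and the view_of() helper are all hypothetical, chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

#define MAXLEN 32   /* hypothetical capacity for the inline variants */

/* 1) length, string: the bytes are stored inline after the count. */
struct str_len {
    size_t len;
    char   data[MAXLEN];
};

/* 2) current-length, max-length, string: inline bytes plus a capacity. */
struct str_cur_max {
    size_t len;
    size_t max;
    char   data[MAXLEN];
};

/* 3) length, pointer-to-string: the descriptor refers to bytes elsewhere. */
struct str_len_ptr {
    size_t      len;
    const char *data;
};

/* 4) current-length, max-length, pointer-to-string. */
struct str_cur_max_ptr {
    size_t len;
    size_t max;
    char  *data;
};

/* Build a layout-3 descriptor that views an existing NUL-terminated string.
   Note that it has to scan the string once to learn the length. */
struct str_len_ptr view_of(const char *s)
{
    struct str_len_ptr v;
    v.len  = strlen(s);
    v.data = s;
    return v;
}
```

Whichever one a language picks shows up in every function signature and
every allocation that touches a string, which is the percolation described
above.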

4) Note that the choice of 1 of the 4 above versus C's choice can
interact very strongly with architectural features.  You can't win,
but some of these mesh horridly with some computer architectures'
string features, and hence are hard to use portably anyway.

5) C originally had the philosophy that execution time would more-or-less
reflect the code-size, specifically, that simple-looking statements
wouldn't surprise you by swelling gigantically. [This has somewhat been
violated, lately, by structure-assignments].

6) C's early emphasis on general-purpose systems programming argued for
representations that didn't clash with realities of things like I/O devices.
For example, when you read a "string" from a TTY it doesn't naturally
arrive with a length in front of it.

7) I've written code in languages (like PL/I) that had strings,
and while it helped some of the time, in other cases, the C code,
especially for parsing strings, doing I/O with them, or splitting
them into substrings, or passing pointers to suffixes, was actually
more natural, and often far more efficient.  Consider the act of
scanning a string, removing the prefix, and passing a pointer to the suffix.
This may well cause materialization of a copy of the entire suffix,
just in order to create another string descriptor for it [especially
if what you use is {length, string}].
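
A minimal sketch of the suffix case, assuming a ':' separator and a
made-up {length, string} type called lstr with the bytes stored inline
(a C99 flexible array member, so this is modern C rather than anything
from the period). The function names after_colon, lstr_new, and
lstr_suffix are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* NUL-terminated: the suffix is just a pointer into the same bytes.
   Returns the text after the first ':', or s itself if there is none. */
const char *after_colon(const char *s)
{
    const char *p = strchr(s, ':');
    return p ? p + 1 : s;
}

/* Hypothetical {length, string} layout with the bytes stored inline. */
struct lstr {
    size_t len;
    char   data[];          /* flexible array member (C99) */
};

/* Allocate an lstr holding a copy of n bytes; NULL on allocation failure. */
struct lstr *lstr_new(const char *bytes, size_t n)
{
    struct lstr *t = malloc(sizeof *t + n);
    if (t == NULL)
        return NULL;
    t->len = n;
    memcpy(t->data, bytes, n);
    return t;
}

/* The suffix needs its own length field in front of it, so with inline
   storage the only option is to allocate and copy the remaining bytes. */
struct lstr *lstr_suffix(const struct lstr *s, size_t start)
{
    assert(start <= s->len);
    return lstr_new(s->data + start, s->len - start);
}
```

With after_colon("host:file") the result points five bytes into the
original buffer and nothing is copied; lstr_suffix() on the same data
returns a fresh four-byte object that the caller must free. A
{length, pointer} descriptor would avoid the copy, at the cost of a
second word per string and questions about who owns the bytes.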

8) This probably belongs in comp.lang.c, although there are architectural
implications: for example, those who ported UNIX onto word-addressed
machines with special string instructions have had a bunch of fun with this.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086