[comp.arch] String lengths

PLS@cup.portal.com (Paul L Schauble) (02/05/89)

This deserves a new subject.

Since it was mentioned in the Endian Wars, does anyone know why C uses the
null terminated string rather than an explicit length? It seems like such
an odd choice considering that
  - It removes a character from the character set, a source of many C
    bugs, and
  - All machines I know of that have character string instructions want
    the length of the string. This forces the string primitives to first
    scan for null, a time wasting operation.

There must have been a reason. What is it?

  ++PLS

hascall@atanasoff.cs.iastate.edu (John Hascall) (02/06/89)

In article <14331@cup.portal.com> PLS@cup.portal.com (Paul L Schauble) writes:

>Since it was mentioned in the Endian Wars, does anyone know why C uses the
>null terminated string rather than an explicit length? It seems like such
>an odd choice considering that

>  - It removes a character from the character set, a source of many C
>    bugs, and

   Agreed.

>  - All machines I know of that have character string instructions want
>    the length of the string. This forces the string primitives to first
>    scan for null, a time wasting operation.

   I assume you mean something like:

      +------+---+---+---+---+---+---+---+---+---+---+---+---+---+
      |length| H | E | L | L | O | , |   | W | O | R | L | D | \n|
      +------+---+---+---+---+---+---+---+---+---+---+---+---+---+

  but, what size would you use for "length", a byte? a word? a longword?

  I suspect that some of these machines' instructions expect different
  sized operands for the length.

  Also, to quote K&R: "C was originally designed ... on the DEC PDP-11",
  a machine with no string instructions.
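
A minimal C sketch of the counted layout drawn above may make the width question concrete. It assumes a 16-bit length field; the names cstr16, cstr16_make and cstr16_len are hypothetical and come from no library. Whatever width is picked caps the maximum length and fixes the layout that any string instruction would have to agree with.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical counted string: a 16-bit length followed by the bytes.
   The field width is an assumption; it caps the length at 65535. */
typedef struct {
    uint16_t length;
    char     data[];   /* C99 flexible array member, no terminator needed */
} cstr16;

/* Build a counted string from n bytes. Returns NULL if n does not fit
   in the length field or if allocation fails. */
cstr16 *cstr16_make(const char *bytes, size_t n)
{
    assert(bytes != NULL);
    if (n > UINT16_MAX)
        return NULL;
    cstr16 *s = malloc(sizeof *s + n);
    if (s == NULL)
        return NULL;
    s->length = (uint16_t)n;
    memcpy(s->data, bytes, n);   /* embedded NUL bytes are copied like any other */
    return s;
}

/* Length is a field read, not a scan for a terminator. */
size_t cstr16_len(const cstr16 *s)
{
    return s->length;
}
```

With this layout a NUL byte is an ordinary character, but a string longer than 65535 bytes cannot be represented at all.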

  John Hascall
  ISU Comp Center

nagel@paris.ics.uci.edu (Mark Nagel) (02/06/89)

In article <14331@cup.portal.com>, PLS@cup (Paul L Schauble) writes:
|This deserves a new subject.
|
|Since it was mentioned in the Endian Wars, does anyone know why C uses the
|null terminated string rather than an explicit length? It seems like such
|an odd choice considering that
|  - It removes a character from the character set, a source of many C
|    bugs, and
|  - All machines I know of that have character string instructions want
|    the length of the string. This forces the string primitives to first
|    scan for null, a time wasting operation.
|
|There must have been a reason. What is it?

Hmm.  There are two things going on here.  One is that you want to
have truly variable-length strings.  You can do it the C way, or you
can adopt some more complicated method like having different string
types or a variable length string length indicator.  I think the
implementors chose the simplest approach, hoping that in the average
case, the overhead from scanning a string would be small (and
hopefully the value would be cached in whatever data structure needed
it).  The other thing (once the sentinel method was chosen) was to
select the proper terminating character.  I don't think NUL is used
much anywhere for anything and thus is a good candidate.  In addition,
I've heard that NUL was chosen as a way to help prevent overrunning
the ends of strings by too much in the case of a missing end-of-string
character.  What single byte value is more prevalent in machine code
than zero?

Mark Nagel @ UC Irvine, Dept of Info and Comp Sci
ARPA: nagel@ics.uci.edu              | Charisma doesn't have jelly in the
UUCP: {sdcsvax,ucbvax}!ucivax!nagel  | middle. -- Jim Ignatowski

aglew@mcdurb.Urbana.Gould.COM (02/06/89)

>>Since it was mentioned in the Endian Wars, does anyone know why C uses the
>>null terminated string rather than an explicit length?
>>  - All machines I know of that have character string instructions want
>>    the length of the string. This forces the string primitives to first
>>    scan for null, a time wasting operation.
>
>   I assume you mean something like:
>
>      +------+---+---+---+---+---+---+---+---+---+---+---+---+---+
>      |length| H | E | L | L | O | , |   | W | O | R | L | D | \n|
>      +------+---+---+---+---+---+---+---+---+---+---+---+---+---+
>
>  Also, to quote K&R: "C was originally designed ... on the DEC PDP-11",
>  a machine with no string instructions.


May I encourage people implementing string libraries to use an extra 
level of indirection? Instead of length immediately preceding the string,
let length be associated with a pointer to the string. Makes
substringing operations much easier, and has the ability to reduce
unnecessary copies (at the risk of increased aliasing).

       +------+---+
       |length|ptr|
       +------+---+
                |
         +------+
         |
         V
       +---+---+---+---+---+---+---+---+---+---+---+---+---+
       | H | E | L | L | O | , |   | W | O | R | L | D | \n|
       +---+---+---+---+---+---+---+---+---+---+---+---+---+

Machine instructions should not mandate the layout of strings in memory.
They should, instead, require length and start to be preloaded in registers
(where the machine'll have to put them anyway).
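
The descriptor idea above can be sketched in C roughly as follows. The type strview and the functions sv_from_cstr, sv_substr and sv_equal are hypothetical names chosen for illustration. The point is that a substring is just a new (length, pointer) pair over the same bytes, so nothing is copied, at the cost of the aliasing mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor: the length travels with the pointer,
   not with the character storage. */
typedef struct {
    size_t      length;
    const char *ptr;
} strview;

/* Wrap an existing NUL-terminated C string; the scan happens once. */
strview sv_from_cstr(const char *s)
{
    assert(s != NULL);
    strview v = { strlen(s), s };
    return v;
}

/* Substring of at most n characters starting at offset start.
   No bytes are copied; the result aliases the original storage. */
strview sv_substr(strview v, size_t start, size_t n)
{
    if (start > v.length)
        start = v.length;
    if (n > v.length - start)
        n = v.length - start;
    strview out = { n, v.ptr + start };
    return out;
}

/* Equality compares lengths first, then bytes. */
int sv_equal(strview a, strview b)
{
    return a.length == b.length && memcmp(a.ptr, b.ptr, a.length) == 0;
}
```

Because the views share storage, modifying or freeing the underlying bytes affects every descriptor that points into them.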

That is, of course, if you *have* string instructions. As I am fond of
pointing out, large integer operations can remove the need for many
string operations (ie. give me a 64 or 128 bit wide bus, and a 
"STORE BYTES UNDER MASK" operation, and I don't *want* string operations).


Andy "Krazy" Glew   aglew@urbana.mcd.mot.com   uunet!uiucdcs!mcdurb!aglew
   Motorola Microcomputer Division, Champaign-Urbana Design Center
	   1101 E. University, Urbana, Illinois 61801, USA.
   
My opinions are my own, and are not the opinions of my employer, or
any other organisation. I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

dmr@alice.UUCP (02/06/89)

The question arose: why does C use a terminating character for
strings instead of a count?

Discussion of the representation of strings in C is not fruitful
unless it is realized that there are no strings in C.  There
are character arrays, which serve a similar purpose, but no
strings.

Things very deep in the design of the language, and in the customs
of its use, make strings a mess to add.  The intent was that
the behavior of character arrays should be exactly like that
of other arrays, and the hope was that stringish operations
on these character arrays should be convenient enough.

The interplay of pointers and arrays, and the possible representations
for them, place strong constraints on what one can do if one wants
real strings (counted sequences of characters) in the context
of the existing language, in particular if types char* or char[]
are going to be counted strings.  In general it is hard to account for
the space in which to put the count, and also to make sure that
it can be updated properly under all operations.  For example,
'sizeof' is used for allocation and it is hard to make this
use compatible with a count.  Similarly, in practice,
most implementations make 'struct { char s1[3], s2[5]; }' say
something about the storage layout that doesn't mix well with
a count.
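
A small C sketch of the sizeof and layout point, with assumptions flagged: the struct tag pair and the three helper functions are hypothetical names, and the stated offsets are what common compilers produce rather than something the C standard promises.

```c
#include <assert.h>
#include <stddef.h>

/* The declaration from the text, with both member types spelled out. */
struct pair {
    char s1[3];
    char s2[5];
};

/* sizeof is computed from the declared type alone; it knows nothing
   about how many characters are currently in use. */
size_t capacity_of_s1(void)
{
    return sizeof ((struct pair *)0)->s1;   /* operand is not evaluated */
}

/* On common implementations s2 starts right after s1, so there is no
   hidden slot where a per-array count could live. This is typical
   behaviour, not a guarantee of the language. */
size_t offset_of_s2(void)
{
    return offsetof(struct pair, s2);
}

size_t size_of_pair(void)
{
    return sizeof(struct pair);
}
```

On such implementations the three functions return 3, 3 and 8: the arrays are packed back to back, which leaves no room for a count without changing what sizeof and the layout mean.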

Given the explicit use of character arrays, and explicit pointers to
sequences of characters, the conventional use of a terminating
marker is hard to avoid.  The history of this convention and
of the general array scheme had little to do with the PDP-11; it
was inherited from BCPL and B.

Of course, it is possible to imagine adding a primitive string
type to C, and to put in some useful operations like concatenation,
search, and substring.  This would not really be a good idea,
because this new primitive type would continually be at war with
the existing character pointers and arrays.  In the context
of C (even with ANSI function prototypes) it would be quite
difficult to make a string type usable in all the places it should
be.

In extensible languages like C++ and of course in languages
in which the notion is designed in from the start, strings are
fine.  (However, even in C++, where it is readily possible to define
your own string class, it would take quite a lot of work to
make this class work smoothly with the entire existing library).

In my opinion, C's array/pointer scheme for representation
of character strings has worked out reasonably well, although
it is somewhat clumsy when there are lots of string operations.
I don't think it has been demonstrated that the usual run of
C programs pays an extremely high cost in performance for their
string operations, though doubtless there are counterexamples
for particular machine architectures or particular programs.

	Dennis Ritchie
	att!research!dmr
	dmr@research.att.com

firth@sei.cmu.edu (Robert Firth) (02/06/89)

In article <8876@alice.UUCP> dmr@alice.UUCP writes:

["strings" in C]

>Given the explicit use of character arrays, and explicit pointers to
>sequences of characters, the conventional use of a terminating
>marker is hard to avoid.  The history of this convention and
>of the general array scheme had little to do with the PDP-11; it
>was inherited from BCPL and B.

A correction here: the C scheme was NOT inherited from BCPL.
BCPL strings are not confused with character arrays; their
implementation is not normally visible to the programmer, and
their semantics are respectably robust.

Much the most common implementation is the one proposed earlier -
have an initial length count followed by exactly that number
of characters.  Naturally, all characters are legal, including
NUL.

There are several reasons for the C 'design', but that its
perpetrators didn't know any better isn't one of them.

hascall@atanasoff.cs.iastate.edu (John Hascall) (02/07/89)

In article <8876@alice.UUCP> dmr@alice.UUCP writes:
>The question arose: why does C use a terminating character for
>strings instead of a count?

  [...discussion of the history of "strings" in C omitted...]

>In my opinion, C's array/pointer scheme for representation
>of character strings has worked out reasonably well, although
>it is somewhat clumsy when there are lots of string operations.

>I don't think it has been demonstrated that the usual run of
>C programs pays an extremely high cost in performance for their
>string operations, though doubtless there are counterexamples
>for particular machine architectures or particular programs.

   This is a rather circular argument.  This is rather like
   saying "I don't think it has been demonstrated that the usual
   automobile pays an extremely high cost in performance for their
   amphibious operations,..."

   Just like most people don't drive their cars across lakes,
   most C programs are not string operation intensive.

   If I had to cross a lake,
     I'd probably use a different tool than a car.
   If I had to write a string operation intensive program,
     I'd probably use a different tool than C.

John Hascall
ISU Comp Center
Rx:  Apply :-) above as needed for pain.

wsmith@m.cs.uiuc.edu (02/07/89)

On using machine string instructions for performing things such
as index() that need the length of the string ahead of time...

On the 80x86 family of chips, I believe it is possible for a routine to 
beat the fancy string instructions with clever coding that does not need
the length of the string so that the index() subroutine does not average 
O(length), but instead much less than that when the pattern is near the 
beginning.   Certainly on some of the low end VAXes this was true also 
because the special string instructions were executed by software.
Just because they give a special instruction, doesn't mean you have to 
use it.
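
As a rough illustration of the early-exit argument, here is a plain C sketch of an index()-style search that never computes the length. The names find_char and bytes_examined are hypothetical; this is not the 80x86 or VAX code referred to above, only the general shape of a scan that stops at the first match.

```c
#include <assert.h>
#include <stddef.h>

/* Search a NUL-terminated string for c. The loop ends at the first
   match or at the terminator, whichever comes first, so the length
   is never needed in advance. */
const char *find_char(const char *s, char c)
{
    assert(s != NULL);
    for (;; s++) {
        if (*s == c)
            return s;      /* also matches c == '\0' at the terminator */
        if (*s == '\0')
            return NULL;
    }
}

/* Same search, but reporting how many bytes were examined, to make
   the early exit visible. */
size_t bytes_examined(const char *s, char c)
{
    assert(s != NULL);
    size_t n = 0;
    for (;;) {
        n++;
        if (s[n - 1] == c || s[n - 1] == '\0')
            return n;
    }
}
```

A length-first approach would touch every byte of the string before the search even starts; this loop touches only the bytes up to the match.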

Bill Smith
wsmith@cs.uiuc.edu
uiucdcs!wsmith

haynes@ucscc.UCSC.EDU (Jim Haynes) (02/07/89)

In article <8876@alice.UUCP> dmr@alice.UUCP writes:
>The question arose: why does C use a terminating character for
>strings instead of a count?
>
Since dmr has spoken I probably shouldn't even think of trying to
add anything, but ...

If we think of machine architecture there are historically several
different ways that have been used to represent strings.
1) Put the characters in adjacent storage locations, and use a special
character to delimit the end of the string.  This goes back to at least
IBM 702 or thereabouts.  It is basically what is used in C; IBM just
happened to use a character other than null.
2) Put the characters in adjacent storage locations, and reserve a bit
per storage location to delimit string boundaries.  This is found in
IBM 1401-1410-7010 family and 1620.
3) Put the characters in adjacent storage locations, don't delimit the
boundaries at all, and absorb the information about string length into
the instruction stream.  This is what's done in IBM 360 and a lot of
other machines.  In contrast to 1) and 2) it requires something like a
simulation of 4) below to deal with varying length strings and dynamic
allocation of storage.
4) Put the characters in adjacent storage locations and store information
about starting address and length in a data structure for the purpose,
a string descriptor.  This is used in Burroughs B6500 and later
machines of that family.  The string is virtually delimited, in that
you can't access it, or aren't supposed to, without going through
the descriptor and observing the boundary.

What's the difference between a string and an array of characters?
Is it anything other than the set of operations that are provided
to operate on it?
haynes@ucscc.ucsc.edu
haynes@ucscc.bitnet
..ucbvax!ucscc!haynes

"Any clod can have the facts, but having opinions is an Art."
        Charles McCabe, San Francisco Chronicle

dmr@alice.UUCP (02/07/89)

Robert Firth justifiably corrects my misstatement about
BCPL strings; they were indeed counted.  I evidently edited
my memory.

Perhaps he or someone else can discuss authoritatively how they
fitted into the language and (to recall the original topic) how string
instructions might help in processing.  More nearly correct
memory--which applies only to 20-year-old implementations
on the IBM 7090 and GE 635/645--says that strings existed
in two forms: a one-byte count followed by the characters
of the string, and an expanded form in which the count
and the characters were blown out into words.

In these implementations, bytes were 9 bits, and so the compact
representation limited strings to length 511.  On the
other hand, the blown-out form occupied lots of space and
wasn't suitable for transferring directly to text files.
Originally, I believe, there were no operators for accessing
individual characters in a compact string, though they were
added later.
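
The 511 figure follows directly from the width of the count field. A one-function C sketch of the arithmetic (max_counted_length is a hypothetical name):

```c
#include <assert.h>

/* Largest value an unsigned count field of the given width can hold,
   which is also the longest string a one-"byte" count can describe.
   Assumes 0 < bits < 32. */
unsigned long max_counted_length(unsigned bits)
{
    assert(bits > 0 && bits < 32);
    return (1UL << bits) - 1UL;
}
```

For a 9-bit byte this gives 511; for the more familiar 8-bit byte it gives 255.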

	Dennis Ritchie
	att!research!dmr
	dmr@research.att.com

prc@maxim.ERBE.SE (Robert Claeson) (02/07/89)

In article <762@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:

> >  - All machines I know of that have character string instructions want
> >    the length of the string. This forces the string primitives to first
> >    scan for null, a time wasting operation.

>    I assume you mean something like:

>       +------+---+---+---+---+---+---+---+---+---+---+---+---+---+
>       |length| H | E | L | L | O | , |   | W | O | R | L | D | \n|
>       +------+---+---+---+---+---+---+---+---+---+---+---+---+---+

>   but, what size would you use for "length", a byte? a word? a longword?

A 16-bit word, just to remain compatible with all the 16-bit machines
out there.

>   I suspect that some of these machines' instructions expect different
>   sized operands for the length.

Some expect 16 bits, some expect 32 bits, and a few (such as the old
Z-80) expect 8 bits.

>   Also, to quote K&R: "C was originally designed ... on the DEC PDP-11",
>   a machine with no string instructions.

Well, what can we learn from this?
-- 
Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden
"No problems." -- Alf
Tel: +46 758-202 50  EUnet:    rclaeson@ERBE.SE  uucp:   uunet!erbe.se!rclaeson
Fax: +46 758-197 20  Internet: rclaeson@ERBE.SE  BITNET: rclaeson@ERBE.SE

colwell@mfci.UUCP (Robert Colwell) (02/07/89)

In article <765@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
=In article <8876@alice.UUCP= dmr@alice.UUCP writes:
==The question arose: why does C use a terminating character for
==strings instead of a count?
==In my opinion, C's array/pointer scheme for representation
==of character strings has worked out reasonably well, although
==it is somewhat clumsy when there are lots of string operations.
=
==I don't think it has been demonstrated that the usual run of
==C programs pays an extremely high cost in performance for their
==string operations, though doubtless there are counterexamples
==for particular machine architectures or particular programs.
=
=   This is a rather circular argument.  This is rather like
=   saying "I don't think it has been demonstrated that the usual
=   automobile pays an extremely high cost in performance for their
=   amphibious operations,..."
=
=   Just like most people don't drive their cars across lakes,
=   most C programs are not string operation intensive.

Dennis's comment *could* have been circular, but I don't think it 
was.  After all, the Unix OS has lots of places where exceptionally
poor string handling would be obvious very quickly, and there are 
several known installations of this OS nowadays...On the other hand,
there's Dhrystone -- if your string handling is poor, your 
Dhrystone number may well be pathetic.  And Dhrystone is supposed
to be a systems code benchmark (at least to this level of fidelity).

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
175 N. Main St.
Branford, CT 06405     203-488-6090

rds@n.sp.cs.cmu.edu (Robert Sansom) (02/08/89)

In article <8882@alice.UUCP>, dmr@alice.UUCP writes:
> Robert Firth justifiably corrects my misstatement about
> BCPL strings; they were indeed counted.  I evidently edited
> my memory.
> 
> Perhaps he or someone else can discuss authoritatively how they
> fitted into the language and (to recall the original topic) how string
> instructions might help in processing.

To quote 'BCPL - the language and its compiler' by Martin Richards and Colin
Whitby-Strevens:

    '... you can use UNPACKSTRING to lay out a string in a vector one
    character to a word, and PACKSTRING to pack it up again.  After
    unpacking your string, you will discover that the first word contains a
    count of the number of characters in the string proper, which starts at
    the second word.'

and:

    'Exactly how BCPL strings are stored depends, amongst other things, upon
    the implementation word size.  This dependency is concealed within the
    string access procedures GETBYTE and PUTBYTE.  The call "GETBYTE(S,I)"
    obtains the Ith byte of the string S.  By convention, byte 0 contains the
    number of characters in the string, which are stored consecutively from
    byte 1.  The call "PUTBYTE(S,I,C)" sets the Ith byte of the string S to
    contain the character C.'
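
For readers who have not used BCPL, the access procedures described in the quotation can be imitated in C along these lines. This is a sketch under stated assumptions, not the BCPL library: it assumes 32-bit words holding four 8-bit bytes, and it arbitrarily places byte 0 in the low-order bits of word 0. The names get_byte, put_byte and BYTES_PER_WORD are made up for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { BYTES_PER_WORD = 4 };   /* assumption: 32-bit words, 8-bit bytes */

/* Byte i of a packed string; by convention byte 0 holds the count.
   Putting byte 0 in the low-order bits is a choice made for this
   sketch; a real implementation picks whatever suits the machine. */
unsigned get_byte(const uint32_t *s, size_t i)
{
    unsigned shift = 8u * (unsigned)(i % BYTES_PER_WORD);
    return (unsigned)((s[i / BYTES_PER_WORD] >> shift) & 0xFFu);
}

/* Set byte i of a packed string to c, leaving its neighbours alone. */
void put_byte(uint32_t *s, size_t i, unsigned c)
{
    assert(c <= 0xFFu);
    unsigned shift = 8u * (unsigned)(i % BYTES_PER_WORD);
    uint32_t mask = (uint32_t)0xFFu << shift;
    uint32_t *w = &s[i / BYTES_PER_WORD];
    *w = (*w & ~mask) | ((uint32_t)c << shift);
}
```

The caller sees only byte indices; the word size and byte order stay hidden inside the two routines, which is the concealment the quotation describes.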


		Robert Sansom (rds@cs.cmu.edu)
		School of Computer Science
		Carnegie Mellon University
--

eric@snark.uu.net (Eric S. Raymond) (02/08/89)

In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu (Robert Firth) writes:
> In article <8876@alice.UUCP> dmr@alice.UUCP writes:
> >The history of this convention and of the general array scheme had little
> >to do with the PDP-11; it was inherited from BCPL and B.
> 
> A correction here: the C scheme was NOT inherited from BCPL.

I've seen bonehead idiocy on the net before, but this tops it all -- this takes
the cut-glass flyswatter. Mr. Firth, do you *read* what you're replying to
before you pontificate? Didn't the name `Dennis Ritchie' register in whatever
soggy lump of excrement you're using as a central nervous system? Do you
realize that the person you just incorrectly `corrected' on a point of C's
intellectual antecedents is the *inventor of C himself*!?!

Sheesh. No *wonder* Dennis doesn't post more often.

Next time dmr posts something, I suggest you shut up and listen. Respectfully.
-- 
      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)
      Email: eric@snark.uu.net                       CompuServe: [72037,2306]
      Post: 22 S. Warren Avenue, Malvern, PA 19355      Phone: (215)-296-5718

lm@snafu.Sun.COM (Larry McVoy) (02/08/89)

In article <enj91#24gKdg=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes:
$In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu (Robert Firth) writes:
$> In article <8876@alice.UUCP> dmr@alice.UUCP writes:
$> >The history of this convention and of the general array scheme had little
$> >to do with the PDP-11; it was inherited from BCPL and B.
$> 
$> A correction here: the C scheme was NOT inherited from BCPL.
$
$I've seen bonehead idiocy on the net before, but this tops it all -- this takes
$the cut-glass flyswatter. Mr. Firth, do you *read* what you're replying to
$before you pontificate? Didn't the name `Dennis Ritchie' register in whatever
$soggy lump of excrement you're using as a central nervous system? Do you
$realize that the person you just incorrectly `corrected' on a point of C's
$intellectual antecedents is the *inventor of C himself*!?!
$
$Sheesh. No *wonder* Dennis doesn't post more often.
$
$Next time dmr posts something, I suggest you shut up and listen. Respectfully.
$-- 
$      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)
$      Email: eric@snark.uu.net                       CompuServe: [72037,2306]
$      Post: 22 S. Warren Avenue, Malvern, PA 19355      Phone: (215)-296-5718

Is everyone else laughing as hard as I am?

Eric,

	Even the Gods make mistakes, OK?  And, although I don't pretend to speak
for Dennis Ritchie or anyone else besides myself, I'd suspect that he'd be the
last person that would want you to "shut up and listen".  The whole point of 
this newsgroup, and research in general, is to question the obvious, point out
the incorrect.  It's called learning.  Blind faith is called religion and has
no place in science.

"Sheesh", indeed.

Larry McVoy, Lachman Associates.			...!sun!lm or lm@sun.com

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (02/08/89)

In article <88850@sun.uucp> lm@sun.UUCP (Larry McVoy) writes:
>In article <enj91#24gKdg=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes:
>$In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu (Robert Firth) writes:
>$> In article <8876@alice.UUCP> dmr@alice.UUCP writes:
>$> >The history of this convention and of the general array scheme had little
>$> >to do with the PDP-11; it was inherited from BCPL and B.
>$> 
>$> A correction here: the C scheme was NOT inherited from BCPL.
>$
>$I've seen bonehead idiocy on the net before, ...
>$before you pontificate? Didn't the name `Dennis Ritchie' ...
>$
>$Next time dmr posts something, I suggest you shut up and listen. Respectfully.

>Is everyone else laughing as hard as I am?

Hopefully. It is quite amusing.
>
>...
>for Dennis Ritchie or anyone else besides myself, I'd suspect that he'd be the
>last person that would want you to "shut up and listen".  The whole point of 
>this newsgroup, and research in general, is to question the obvious, point out
>the incorrect.  It's called learning.  Blind faith is called religion and has
>no place in science.

In general, I agree with Mr. McV. This is, however, a special case.
The question is (to paraphrase) "What did the inventors of C think
about?" The Principal Inventor sez "bliff". Mr Poster sez "no it was boff".

I do not think it fair to characterize "boff" as a valid hypothesis,
unless the PI had died, and left no notes, or ambiguous ones. Since the
PI is very much alive, and has spoken, contradicting him is a bit out
of _my_ definition of scientific inquiry.

Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

jefu@pawl.rpi.edu (Jeffrey Putnam) (02/08/89)

In reference to the C representation of strings.

Note followups to comp.lang.c.

I like the C model for strings.  I like it mostly for its simplicity
and ease of use.  It may well be that a representation for strings
that includes string length as a part of a structure is better for
efficiency, or more modular or whatever.  But! the model used is
simple and introduces no magic into the language.

Magic?  Yup.  Magic is what happens when the language (or operating system
or hardware) does something odd that is not reachable by the user.  This
includes magic strings, magic arrays (arrays stored in the same way - that
is with extra information hidden from the user), magic library calls (like
some VMS calls) and so on.  

If language (hardware, os) designers want to do something, they should make
it evident and available to the user - because if they want to do it, the
user probably will as well.  

In the string question, adding the string length means that what is passed
may be a magic cookie that the language knows how to use, but that the user
is often denied access to. I have used languages that did lots of magic
(the worst was PL/I) and it was often quite difficult to decide what was
actually happening (in a function call, for example).   

The C choice was the simpler one, one with no magic, and the best for the
kind of programming that C encourages.  Further, if you want to add
counted strings, it can be done in C easily.  I believe that i have seen
a counted string library posted to the net - it might be interesting to
see if string handling programs actually run a lot faster with such a 
library instead of the standard string functions. 

jeff putnam        -- "Sometimes one must attempt the impossible if only to
jefu@pawl.rpi.edu  --  show it is merely inadvisable."

thomson@hub.toronto.edu (Brian Thomson) (02/09/89)

In article <37529@oliveb.olivetti.com> chase@Ozona.UUCP (David Chase) writes:
>
>You should also check out the string operations on the 360/370 sort of
>machines; BCPL was running there (rather well) a very long time ago.
>I think that those operations worked on at most 256 characters (and, I
>should add, NOT on 0-length strings).  It may well be another case of
>architecture influencing language design (note that a zero-length BCPL
>string actually contains one byte -- the zero count).
>

But the 360 implementation was not the first implementation.  It
is probably more correct to say that the maximum string length was
implementation-dependent, but tended to be 255 because many machines
have 8-bit bytes.

Also, I don't remember the 360 code generator producing any string
instructions, although it certainly could be persuaded to produce
byte (character) instructions.  By that time, the language had been
extended with the packed string selector operator "%", so

     GETBYTE(S, I)         was equivalent to the (inline)   S%I
and  PUTBYTE(S, I, C)      to      S%I := C

-- 
		    Brian Thomson,	    CSRI Univ. of Toronto
		    utcsri!uthub!thomson, thomson@hub.toronto.edu

chase@Ozona.orc.olivetti.com (David Chase) (02/09/89)

In article <88853@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:

In article <8876@alice.UUCP> dmr@alice.UUCP ["PI" below] writes:
>>$> >The history of this convention and of the general array scheme had little
>>$> >to do with the PDP-11; it was inherited from BCPL and B.
["bliff" below]

>>$In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu
  (Robert Firth) ["Mr Poster" below] writes:
>>$> A correction here: the C scheme was NOT inherited from BCPL.
["boff" below]

>The question is (to paraphrase) "What did the inventors of C think
>about?" The Principal Inventor sez "bliff". Mr Poster sez "no it was boff".
>
>I do not think it fair to characterize "boff" as a valid hypothesis,
>unless the PI had died, and left no notes, or ambiguous ones. Since the
>PI is very much alive, and has spoken, contradicting him is a bit out
>of _my_ definition of scientific inquiry.

Sigh.  Nonetheless, he (Dennis Ritchie) probably made a mistake in his
posting.  Other Principal Inventors are still alive and also left
unambiguous notes.  Also, I'd suggest that Robert Firth knows BCPL
better than Dennis Ritchie.  (I'd suggest that *I* know BCPL better
than Dennis Ritchie, too -- I've used it within the last 4 years.)
I'll give references.

From _BCPL -- The language and its compiler_ by Martin Richards and
Colin Whitby-Strevens, 1979
----------------
[PACKSTRING and UNPACKSTRING] "After unpacking your string, you will
discover that the first word contains a count of the number of
characters in the string proper, which starts at the second word.
  As an example, we give the library routines WRITES, UNPACKSTRING,
and PACKSTRING:

LET PACKSTRING(V,S) = VALOF
$(
    LET N = V!0 & #XFF // extract least significant 8 bits
    LET SIZE = N / BYTESPERWORD
    S!SIZE := 0        // pack out last word with zeroes
    FOR I = 0 TO N DO PUTBYTE(S,I, V!I)
    RESULTIS SIZE
$)
----------------

Note, too, that the zeros in the last word will only appear in those
cases where the bytes packed do not fill out the words in the string
(that is, consider packing a string containing 3 characters).
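
The padding remark can be checked with a little arithmetic, sketched here in C. The function names last_word_index and padding_bytes are hypothetical, and the worked numbers assume four bytes per word.

```c
#include <assert.h>
#include <stddef.h>

/* Index of the last word used by a packed string of n characters.
   The count occupies byte 0, so n + 1 bytes are stored in all.
   This mirrors SIZE = N / BYTESPERWORD in the routine above. */
size_t last_word_index(size_t n, size_t bytes_per_word)
{
    assert(bytes_per_word > 0);
    return n / bytes_per_word;
}

/* Number of zero padding bytes left over in the last word. */
size_t padding_bytes(size_t n, size_t bytes_per_word)
{
    size_t words = last_word_index(n, bytes_per_word) + 1;
    return words * bytes_per_word - (n + 1);
}
```

With four bytes per word, a 3-character string needs exactly one word (count plus three characters) and has no padding, while a 5-character string spills into a second word and leaves two zero bytes.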

From "The Portability of the BCPL Compiler" by Martin Richards in
_Software -- Practice and Experience_, volume 1, pp 135-146, 1971.
----------------
Strings are packed in BCPL and the packing is necessarily machine
dependent since it depends strongly on the word and byte sizes of the
object machine.  The usual internal representation of a string value
is as a pointer to the first of a set of words holding the length and
packed characters of the string.  The zeroth byte is usually justified
to the start of a word and holds the length of the string with
successive bytes holding the characters and padded with zeros (or
possibly spaces) at the end to fill the last word.  In order to handle
strings in as machine independent way [sic] as possible packing,
unpacking and writing of strings is done using library routines which
are defined in the machine dependent interface with the operating
system.
----------------

I think it is fair to say that C did NOT inherit its string
representation from BCPL.  I wish that some of you people would check
your facts before posting.

Linguistic comparisons belong elsewhere, so I won't make them.  As far
as implementation goes, I think it is a mixed bag.  Many operations
are "faster" on strings with counts, but if your maximum count is only
255 then everything is pretty fast whether it is counted or
terminated.

You should also check out the string operations on the 360/370 sort of
machines; BCPL was running there (rather well) a very long time ago.
I think that those operations worked on at most 256 characters (and, I
should add, NOT on 0-length strings).  It may well be another case of
architecture influencing language design (note that a zero-length BCPL
string actually contains one byte -- the zero count).

David

news@ism780c.isc.com (News system) (02/09/89)

In article <8882@alice.UUCP> dmr@alice.UUCP writes:
>                                        More nearly correct
>memory--which applies only to 20-year-old implementations
>on the IBM 7090 and GE 635/645--says that strings existed
>in two forms: a one-byte count followed by the characters
>of the string, and an expanded form in which the count
>and the characters were blown out into words.
>
>In these implementations, bytes were 9 bits, and so the compact
>representation limited strings to length 511.  On the
>other hand, the blown-out form occupied lots of space and
>wasn't suitable for transferring directly to text files.
>Originally, I believe, there were no operators for accessing
>individual characters in a compact string, though they were
>added later.

I cannot speak about the GE 635, but I can speak with some authority about
the IBM 7090.  I have before me both the 7090 reference manual and the micro
code that I wrote to emulate the 7090 on the Standard IC4000.

The 7090 provided no instructions for accessing individual characters.  The
notion of characters existed only at the I/O interface where 6 characters
(in 6 bit BCD) could be read into a single 36 bit word.  The minimum transfer
was two words.  I am not aware of any languages made available by IBM that
provided a string data type.  In fact the limit of 6 character identifiers in
FORTRAN was due to the fact that 6 characters could be manipulated as a word
on the 7090 (and on the 704 the original "FORTRAN" machine).

IBM did produce a 7040 machine (similar to but not an extension of the 7090).
This machine provided character load and store.  But the use of the facility
was limited by the fact that a character address could not be put into an
index register.

I implemented an extended version of 7090 called the EX02 by Standard
Computer that did support strings.  Because the words had room in them for
two pointers and a character, I represented a string as a linked list with
one character per word.  The ends of the string were marked by null pointers.
There were instructions to traverse the linked list and to convert from a
packed array of characters to a string and back again.  The language that
took advantage of these strings was called IMPLAN (implementation language).
An interesting feature of the language was that an expression like i*s, where
i was an integer and s was a string, meant the first i characters of s.  And
s*i meant the last i characters of the string.

   Marv Rubinstein

barmar@think.COM (Barry Margolin) (02/09/89)

In article <88853@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes:
]In article <88850@sun.uucp> lm@sun.UUCP (Larry McVoy) writes:
]>In article <enj91#24gKdg=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes:
]>$Next time dmr posts something, I suggest you shut up and listen. Respectfully.
]
]>Is everyone else laughing as hard as I am?
]Since the
]PI is very much alive, and has spoken, contradicting him is a bit out
]of _my_ definition of scientific inquiry.

BUT -- DMR later replied and said that the person contradicting him
was RIGHT!

Even the great god :-) Ritchie is permitted to have a memory lapse.

Barry Margolin
Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (02/09/89)

In article <37529@oliveb.olivetti.com> chase@Ozona.UUCP (David Chase) writes:
>....
>Sigh.  Nonetheless, he (Dennis Ritchie) probably made a mistake in his
>posting.  Other Principal Inventors are still alive and also left
>unambiguous notes.  Also, I'd suggest that Robert Firth knows BCPL
>better than Dennis Ritchie.  (I'd suggest that *I* know BCPL better
>than Dennis Ritchie, too -- I've used it within the last 4 years.)
>I'll give references.
>

It was pointed out to me (private mail) that what made the situation
so funny was that dmr recanted, and that then someone came along and
castigated Firth. 

I had failed to keep all the threads in mind. I do not claim to know
BCPL, and I don't claim to be expert on dmr's thoughts.

As I saw it, the argument was what was going on in dmr's
thoughts....and since dmr is very much alive speculating is not very
fruitful. 

As it happens, the speculator was right......

Just goes to show.

I hereby repudiate my position. Debating what went on in somebody's
mind a decade+ ago is valid...because the thinker probably doesn't
remember it correctly! :>

khb
Keith H. Bierman
It's Not My Fault ---- I Voted for Bill & Opus

aglew@mcdurb.Urbana.Gould.COM (02/09/89)

>/* Written  3:15 pm  Feb  6, 1989 by GQ.RLG@forsythe.stanford.edu in mcdurb:comp.arch */
>->[Me] aglew@mcdurb.Urbana.Gould.COM writes:
>->May I encourage people implementing string libraries to use an extra
>->level of indirection? Instead of length immediately preceding the string,
>->let length be associated with a pointer to the string. Makes
>->substringing operations much easier, and has the ability to reduce
>->unnecessary copies (at the risk of increased aliasing).
>->
>->       +------+---+
>->       |length|ptr|
>->       +------+---+
>->                |
>->         +------+
>->         |
>->         V
>->       +---+---+---+---+---+---+---+---+---+---+---+---+---+
>->       | H | E | L | L | O | , |   | W | O | R | L | D | \n|
>->       +---+---+---+---+---+---+---+---+---+---+---+---+---+
>
>Such an implementation has adverse effects when the string is sent
>to/from an external device, such as a file.  The 'length' must be
>with the string, or the string needs a terminator character.

If you are sending directly to an output device, I doubt that
your output device accepts your internal format. If you have 
to reformat anyway...

Oh, you mean storing data in a file. What's a file? You mean this
memory-mapped object... 
    Sorry, I don't live in that environment, unfortunately. 
Yep, you have to decide either way. For text strings, ASCII files
or binary files are fine by me. Leading counts are fine.
Nothing says that ptr could not point to the very next location.

>what happens to the 'length' information for the old string?

I sure would hope it got changed appropriately!
And I sure would hope that the use was wrapped in a library routine
or macro or C++ type object interface so that nobody ever accessed
the length and ptr explicitly!

Look, null terminated is fine by me, I use it every day. It just has
the embedded null drawback, and the fact that it encourages dumb
code. Several examples of which (dumb code that scans the string twice)
are on my list of things to fix real soon now - one is taking up 10%
of a loaded system. And, yes, good coding practices can avoid double
scanning, so all that you're left with is the embedded null problem.
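
A minimal sketch of what "scanning the string twice" can look like, next to a single-pass version. Both function names are hypothetical; this is an illustration of the pattern, not the code from the loaded system mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Double scan: strlen() walks src to the NUL, then strcpy() walks it again. */
size_t copy_two_pass(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t n = strlen(src);   /* first pass: find the terminator */
    strcpy(dst, src);         /* second pass: copy up to the terminator */
    return n;
}

/* Single scan: copy and count in the same loop. */
size_t copy_one_pass(char *dst, const char *src)
{
    assert(dst != NULL && src != NULL);
    size_t n = 0;
    while ((dst[n] = src[n]) != '\0')   /* copies the NUL too, then stops */
        n++;
    return n;
}
```

Both return the length and leave the same bytes in dst (the caller must supply a large enough buffer); only the number of passes over src differs.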

(Talking about dumb code - has anyone else seen things like

     #define TERM_ESCAPE_CODE "\e[foo\0bar"
     puts(TERM_ESCAPE_CODE);	/* Do escape code magic with terminal? */

particularly in things where the escape code is computed?)
And I sure would never let any oiu

beyer@houxs.ATT.COM (J.BEYER) (02/10/89)

In article <22036@ism780c.isc.com>, news@ism780c.isc.com (News system) writes:
> The 7090 provided no instructions for accessing individual characters.  The
> notion of characters existed only at the I/O interface where 6 characters
> (in 6 bit BCD) could be read into a single 36 bit word.  The minimum transfer
> was two words.  I am not aware of any languages made available by IBM that
> provided a string data type.  In fact the limit of 6 character identifiers in
> FORTRAN was due to the fact that 6 characters could be manipulated as a word
> on the 7090 (and on the 704 the original "FORTRAN" machine).
This may be quibbling, but did not the Convert By Replacement From MQ
and similar instructions deal with 6-bit characters on the 7090? I know
they were not in the 704. Perhaps they did not get there until the 7094.
But they were not full-fledged character manipulation primitives, I agree.
It has been at least 20 years since I programmed one of those.

As I recall, the GE635 and 645 did have ways to address either 6 or 9 bit
characters (programmer's choice), using clever addressing and tally modes.
I am even more hazy about that machine, though.


-- 
Jean-David Beyer
A.T.&T., Holmdel, New Jersey, 07733
houxs!beyer

seanf@sco.COM (Sean Fagan) (02/10/89)

In article <499@maxim.ERBE.SE> prc@maxim.ERBE.SE (Robert Claeson) writes:
>>   but, what size would you use for "length", a byte? a word? a longword?
>A 16-bit word, just to remain compatible with all the 16-bit machines
>out there.

You would limit all those who are on VAXen, 68k's, 32k's (both WE and NS),
Sparc's, 88k's, CDC Cybers (180 state), Cray's (1, 2, 3, X and Y MP), ARM's,
29k's, Elxsi's, ad nauseam, just to retain backwards compatibility?

If you *must* use a <length, pointer> combination, make the <length>
attribute the same size as a normal char *.  However, I've gotten along
quite well without them, as have thousands of other people.

-- 
Sean Eric Fagan  | "What the caterpillar calls the end of the world,
seanf@sco.UUCP   |  the master calls a butterfly."  -- Richard Bach
(408) 458-1422   | Any opinions expressed are my own, not my employers'.

news@ism780c.isc.com (News system) (02/11/89)

[M Rubinstein]
>> The 7090 provided no instructions for accessing individual characters.

[J BEYER]
>This may be quibbling, but did not the Convert By Replacement From MQ
>and similar instructions deal with 6-bit characters on the 7090? I know
>they were not in the 704. Perhaps they did not get there until the 7094.
>But they were not full-fledged character manipulation primitives, I agree.
>It has been at least 20 years since I programmed one of those.

[M Rubinstein]
J BEYER is right about convert instructions on the 709/7090/7094 operating on
6 bit fields.  However the fields accessed were fields in a register.  There
were no instructions for accessing 6 bit fields from memory.  There were
instructions for accessing two 15 bit fields from memory.  The fields were
called the address and decrement.  It is my understanding that the Lisp names
CAR and CDR come from the 15 bit field names.  CAR==contents of address
register, and CDR==contents of decrement register.
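
As a rough illustration of the two 15 bit fields, the sketch below models a 36 bit word in a uint64_t. The bit positions are an assumption based on the commonly described word layout (3 bit prefix, 15 bit decrement, 3 bit tag, 15 bit address, from high to low); the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define FIELD15 UINT64_C(077777)   /* mask for one 15 bit field */

/* Address field: assumed to be the low-order 15 bits of the word. */
uint64_t word_address(uint64_t w)
{
    return w & FIELD15;
}

/* Decrement field: assumed to sit above the address and the 3 bit tag. */
uint64_t word_decrement(uint64_t w)
{
    return (w >> 18) & FIELD15;
}

/* Build a word holding two 15 bit values, with prefix and tag left zero. */
uint64_t make_word(uint64_t decrement, uint64_t address)
{
    assert(decrement <= FIELD15 && address <= FIELD15);
    return (decrement << 18) | address;
}
```

Under that assumption, a Lisp-style cell is just make_word(cdr, car), and the two accessors play the role of CDR and CAR.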

   Marv Rubinstein

chris@mimsy.UUCP (Chris Torek) (02/11/89)

More or less incidentally: it is easy to build counted strings out
of C strings, if you prefer counts.  For instance:

	typedef struct string {
		int	len;
		char	*str;
	} string;

	#define STRING(s) { sizeof(s) - 1, s }

	string foo = STRING("foo");

In pANS C this works for automatic variables as well as statics.
These strings are not normal expressions, though, and cannot be
anonymous---that is,

	result = some_string_function(str, STRING("lima beans"))

is illegal (though gcc has an extension that will do it).
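
For what it is worth, the later C99 standard added compound literals, which make the anonymous form legal. A sketch under that assumption follows; STRING_LIT and string_equal are hypothetical names, not part of the original example.

```c
#include <assert.h>
#include <string.h>

typedef struct string {
    int   len;
    char *str;
} string;

/* Initializer form from the example above: usable only in a declaration. */
#define STRING(s)     { sizeof(s) - 1, s }

/* C99 compound literal: an anonymous string usable inside an expression. */
#define STRING_LIT(s) ((string){ (int)(sizeof(s) - 1), s })

/* Compare by length first, then by bytes; no scan for a terminator. */
int string_equal(string a, string b)
{
    assert(a.len >= 0 && b.len >= 0);
    return a.len == b.len && memcmp(a.str, b.str, (size_t)a.len) == 0;
}
```

With this, a call such as string_equal(foo, STRING_LIT("lima beans")) compiles under a C99 compiler. Both macros work only on string literals or arrays, since sizeof applied to a char pointer would give the pointer size.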
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

mac@uvacs.cs.Virginia.EDU (Alex Colvin) (02/11/89)

In article <6239@saturn.ucsc.edu>, haynes@ucscc.UCSC.EDU (Jim Haynes) writes:
...<history of string representations>...

> What's the difference between a string and an array of characters?
> Is it anything other than the set of operations that are provided
> to operate on it?

In the indirect representation (via length & pointer) any substring is also
a string.

In the end-delimited representation (as in C) only a tail substring is a
string.  This still allows you to consume a string from start to end, but
makes it difficult to pull a string off the start.

In both representations, catenation is messy.  That's when we turn to
buffer chains.
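
The substring point can be sketched in C as follows. The strref type and the function names are hypothetical; this is one possible length-and-pointer layout, not a particular library's.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Indirect representation: a length plus a pointer into shared storage. */
typedef struct {
    size_t      len;
    const char *ptr;
} strref;

strref strref_from_c(const char *s)
{
    assert(s != NULL);
    strref r = { strlen(s), s };
    return r;
}

/* Any substring is again a strref: no copy, no terminator required. */
strref strref_sub(strref s, size_t start, size_t n)
{
    assert(start <= s.len && n <= s.len - start);
    strref r = { n, s.ptr + start };
    return r;
}

/* End-delimited representation: only a tail is a string for free. */
const char *c_tail(const char *s, size_t start)
{
    assert(s != NULL && start <= strlen(s));
    return s + start;
}
```

Taking the first five characters of "HELLO, WORLD\n" as a strref costs nothing and shares storage with the original (which is where the aliasing risk comes from). Doing the same with a C string means copying the five characters elsewhere or overwriting the comma with a NUL.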

childers@avsd.UUCP (Richard Childers) (02/14/89)

hascall@atanasoff.cs.iastate.edu (John Hascall) writes:

>In article <8876@alice.UUCP> dmr@alice.UUCP writes:

>>I don't think it has been demonstrated that the usual run of
>>C programs pays an extremely high cost in performance for their
>>string operations, though doubtless there are counterexamples
>>for particular machine architectures or particular programs.

>   This is a rather circular argument.  This is rather like
>   saying "I don't think it has been demonstrated that the usual
>   automobile pays an extremely high cost in performance for their
>   amphibious operations,..."

It's a lot easier to criticize two decades after the fact, to say, 'You
oughta do this ... why didn't you do THAT ?' instead of merely accepting
it as a foible of whichever language is under discussion.

I think what Mssr. Ritchie was trying to say was something along the lines
of, "OK, we messed up ... we tried for an elegant solution, treating text
as an array of characters. It was abstract, it was clean, we thought it'd
fly. It did. Not everybody likes it ... but it's not going to change, as
there are some subterranean assumptions that it would be wise to take into
account before this conversation goes much further."

>John Hascall
>ISU Comp Center
>Rx:  Apply :-) above as needed for pain.

-- richard

-- 
 *       "Do not look at my outward shape, but take what is in my hand."      *
 *                            -- Jalaludin Rumi, 1107-1173                    *
 *      ..{amdahl|decwrl|octopus|pyramid|ucbvax}!avsd.UUCP!childers@tycho     *
 *          AMPEX Corporation - Audio-Visual Systems Division, R & D          *

childers@avsd.UUCP (Richard Childers) (02/14/89)

eric@snark.uu.net (Eric S. Raymond) writes:

>I've seen bonehead idiocy on the net before, but this tops it all --"

Indeed, it does.

>Next time dmr posts something, I suggest you shut up and listen. Respectfully.

Gag me with mindless worshipping twits. Ride on someone else's coattails, eh?

>      Eric S. Raymond                     (the mad mastermind of TMN-Netnews)
>      Email: eric@snark.uu.net                       CompuServe: [72037,2306]
>      Post: 22 S. Warren Avenue, Malvern, PA 19355      Phone: (215)-296-5718

-- richard

-- 
 *       "Do not look at my outward shape, but take what is in my hand."      *
 *                            -- Jalaludin Rumi, 1107-1173                    *
 *      ..{amdahl|decwrl|octopus|pyramid|ucbvax}!avsd.UUCP!childers@tycho     *
 *          AMPEX Corporation - Audio-Visual Systems Division, R & D          *

mash@mips.COM (John Mashey) (02/14/89)

In article <637@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
....
>Dennis's comment *could* have been circular, but I don't think it 
>was.  After all, the Unix OS has lots of places where exceptionally
>poor string handling would be obvious very quickly, and there are 
>several known installations of this OS nowadays...On the other hand,
>there's Dhrystone -- if your string handling is poor, your 
>Dhrystone number may well be pathetic.  And Dhrystone is supposed
>to be a systems code benchmark (at least to this level of fidelity).

As has been discussed before in this group, Dhrystone's str* behavior
seems to differ a bit from "more typical" C programs.  Everything I've
ever seen from analyzing C programs backs up what Dennis says.
When we did the first MIPS UNIX port, I wrote all of the str* routines
in assembler; we threw most of them out in favor of the portable C programs,
because the testing time to find the (must have been at least) one bug in the
lot wouldn't have been worth it, according to the program statistics we saw.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

mac@uvacs.cs.Virginia.EDU (Alex Colvin) (02/16/89)

In article <1185@houxs.ATT.COM>, beyer@houxs.ATT.COM (J.BEYER) writes:
> In article <22036@ism780c.isc.com>, news@ism780c.isc.com (News system) writes:
> > The 7090 provided no instructions for accessing individual characters.  The

> As I recall, the GE635 and 645 did have ways to address either 6 or 9 bit

It did.  Unfortunately, there were several incompatible ways,
the new (EIS string, all memory-memory),
the old (tally words, all memory to register)
and the incredibly old (fixed character/byte, memory-register).

I've heard of a cute way to implement character addressing on a 7090 using
one half register for the word address, another half for character offset,
and XECs to choose a load/store, with an XED chain to wrap the character
count into the word address.  Almost microcode (cf Clipper macrocode?)

			Alex Colvin
			old DTSS programmer