PLS@cup.portal.com (Paul L Schauble) (02/05/89)
This deserves a new subject. Since it was mentioned in the Endian Wars, does anyone know why C uses the null terminated string rather than an explicit length? It seems like such an odd choice considering that - It removes a character from the character set, a source of many C bugs, and - All machines I know of that have character string instructions want the length of the string. This forces the string primitives to first scan for null, a time wasting operation. There must have been a reason. What is it? ++PLS
hascall@atanasoff.cs.iastate.edu (John Hascall) (02/06/89)
In article <14331@cup.portal.com> PLS@cup.portal.com (Paul L Schauble) writes: >Since it was mentioned in the Endian Wars, does anyone know why C uses the >null terminated string rather than an explicit length? It seems like such >an odd choice considering that > - It removes a character from the character set, a source of many C > bugs, and Agreed. > - All machines I know of that have character string instructions want > the length of the string. This forces the string primitives to first > scan for null, a time wasting operation. I assume you mean something like: +------+---+---+---+---+---+---+---+---+---+---+---+---+---+ |length| H | E | L | L | O | , | | W | O | R | L | D | \n| +------+---+---+---+---+---+---+---+---+---+---+---+---+---+ but, what size would you use for "length", a byte? a word? a longword? I suspect that some of these machines' instructions expect different sized operands for the length. Also, to quote K&R: "C was originally designed ... on the DEC PDP-11", a machine with no string instructions. John Hascall ISU Comp Center
nagel@paris.ics.uci.edu (Mark Nagel) (02/06/89)
In article <14331@cup.portal.com>, PLS@cup (Paul L Schauble) writes: |This deserves a new subject. | |Since it was mentioned in the Endian Wars, does anyone know why C uses the |null terminated string rather than an explicit length? It seems like such |an odd choice considering that | - It removes a character from the character set, a source of many C | bugs, and | - All machines I know of that have character string instructions want | the length of the string. This forces the string primitives to first | scan for null, a time wasting operation. | |There must have been a reason. What is it? Hmm. There are two things going on here. One is that you want to have truly variable-length strings. You can do it the C way, or you can adopt some more complicated method like having different string types or a variable length string length indicator. I think the implementors chose the simplest approach, hoping that in the average case, the overhead from scanning a string would be small (and hopefully the value would be cached in whatever data structure needed it). The other thing (once the sentinel method was chosen) was to select the proper terminating character. I don't think NUL is used much anywhere for anything and thus is a good candidate. In addition, I've heard that NUL was chosen as a way to help prevent overrunning the ends of strings by too much in the case of a missing end-of-string character. What single byte value is more prevalent in machine code than zero? Mark Nagel @ UC Irvine, Dept of Info and Comp Sci ARPA: nagel@ics.uci.edu | Charisma doesn't have jelly in the UUCP: {sdcsvax,ucbvax}!ucivax!nagel | middle. -- Jim Ignatowski
aglew@mcdurb.Urbana.Gould.COM (02/06/89)
>>Since it was mentioned in the Endian Wars, does anyone know why C uses the >>null terminated string rather than an explicit length? >> - All machines I know of that have character string instructions want >> the length of the string. This forces the string primitives to first >> scan for null, a time wasting operation. > > I assume you mean something like: > > +------+---+---+---+---+---+---+---+---+---+---+---+---+---+ > |length| H | E | L | L | O | , | | W | O | R | L | D | \n| > +------+---+---+---+---+---+---+---+---+---+---+---+---+---+ > > Also, to quote K&R: "C was originally designed ... on the DEC PDP-11", > a machine with no string instructions. May I encourage people implementing string libraries to use an extra level of indirection? Instead of length immediately preceding the string, let length be associated with a pointer to the string. Makes substringing operations much easier, and has the ability to reduce unnecessary copies (at the risk of increased aliasing). +------+---+ |length|ptr| +------+---+ | +------+ | V +---+---+---+---+---+---+---+---+---+---+---+---+---+ | H | E | L | L | O | , | | W | O | R | L | D | \n| +---+---+---+---+---+---+---+---+---+---+---+---+---+ Machine instructions should not mandate the layout of strings in memory. They should, instead, require length and start to be preloaded in registers (where the machine'll have to put them anyway). That is, of course, if you *have* string instructions. As I am fond of pointing out, large integer operations can remove the need for many string operations (ie. give me a 64 or 128 bit wide bus, and a "STORE BYTES UNDER MASK" operation, and I don't *want* string operations). Andy "Krazy" Glew aglew@urbana.mcd.mot.com uunet!uiucdcs!mcdurb!aglew Motorola Microcomputer Division, Champaign-Urbana Design Center 1101 E. University, Urbana, Illinois 61801, USA. My opinions are my own, and are not the opinions of my employer, or any other organisation. I indicate my company only so that the reader may account for any possible bias I may have towards our products.
dmr@alice.UUCP (02/06/89)
The question arose: why does C use a terminating character for strings instead of a count? Discussion of the representation of strings in C is not fruitful unless it is realized that there are no strings in C. There are character arrays, which serve a similar purpose, but no strings. Things very deep in the design of the language, and in the customs of its use, make strings a mess to add. The intent was that the behavior of character arrays should be exactly like that of other arrays, and the hope was that stringish operations on these character arrays should be convenient enough. The interplay of pointers and arrays, and the possible representations for them, place strong contraints on what one can do if one wants real strings (counted sequences of characters) in the context of the existing language, in particular if types char* or char[] are going to be counted strings. In general it is hard to account for the space in which to put the count, and also to make sure that it can be updated properly under all operations. For example, 'sizeof' is used for allocation and it is hard to make this use compatible with a count. Similarly, in practice, most implementations make 'struct { char s1[3]; s2[5]; }' say something about the storage layout that doesn't mix well with a count. Given the explicit use of character arrays, and explicit pointers to sequences of characters, the conventional use of a terminating marker is hard to avoid. The history of this convention and of the general array scheme had little to do with the PDP-11; it was inherited from BCPL and B. Of course, it is possible to imagine adding a primitive string type to C, and to put in some useful operations like concatenation, search, and substring. This would not really be a good idea, because this new primitive type would continually be at war with the existing character pointers and arrays. In the context of C (even with ANSI function prototypes) it would be quite difficult to make a string type usable in all the places it should be. In extensible languages like C++ and of course in languages in which the notion is designed in from the start, strings are fine. (However, even in C++, where it is readily possible to define your own string class, it would take quite a lot of work to make this class work smoothly with the entire existing library). In my opinion, C's array/pointer scheme for representation of character strings has worked out reasonably well, although it is somewhat clumsy when there are lots of string operations. I don't think it has been demonstrated that the usual run of C programs pays an extremely high cost in performance for their string operations, though doubtless there are counterexamples for particular machine architectures or particular programs. Dennis Ritchie att!research!dmr dmr@research.att.com
firth@sei.cmu.edu (Robert Firth) (02/06/89)
In article <8876@alice.UUCP> dmr@alice.UUCP writes: ["strings" in C] >Given the explicit use of character arrays, and explicit pointers to >sequences of characters, the conventional use of a terminating >marker is hard to avoid. The history of this convention and >of the general array scheme had little to do with the PDP-11; it >was inherited from BCPL and B. A correction here: the C scheme was NOT inherited from BCPL. BCPL strings are not confused with character arrays; their implemetation is not normally visible to the programmer, and their semantics are respectably robust. Much the most common implementation is the one proposed earlier - have an initial length count followed by exactly that number of characters. Naturally, all characters are legal, including NUL. There are several reasons for the C 'design', but that its perpetrators didn't know any better isn't one of them.
hascall@atanasoff.cs.iastate.edu (John Hascall) (02/07/89)
In article <8876@alice.UUCP> dmr@alice.UUCP writes: >The question arose: why does C use a terminating character for >strings instead of a count? [...discussion of the history of "strings" in C omitted...] >In my opinion, C's array/pointer scheme for representation >of character strings has worked out reasonably well, although >it is somewhat clumsy when there are lots of string operations. >I don't think it has been demonstrated that the usual run of >C programs pays an extremely high cost in performance for their >string operations, though doubtless there are counterexamples >for particular machine architectures or particular programs. This is a rather circular argument. This is rather like saying "I don't think it has been demonstrated that the usual automobile pays an extremely high cost in performance for their amphibious operations,..." Just like most people don't drive their cars across lakes, most C programs are not string operation intensive. If I had to cross a lake, I'd probably use a different tool than a car. If I had to write a string operation intensive program, I'd probably use a different tool than C. John Hascall ISU Comp Center Rx: Apply :-) above as needed for pain.
wsmith@m.cs.uiuc.edu (02/07/89)
On using machine string instructions for performing things such as index() that need the length of the string ahead of time... On the 80x86 family of chips, I believe it is possible for a routine to beat the fancy string instructions with clever coding that does not need the length of the string so that the index() subroutine does not average O(length), but instead much less than that when the pattern is near the beginning. Certainly on some of the low end VAXes this was true also because the special string instructions were executed by software. Just because they give a special instruction, doesn't mean you have to use it. Bill Smith wsmith@cs.uiuc.edu uiucdcs!wsmith
haynes@ucscc.UCSC.EDU (Jim Haynes) (02/07/89)
In article <8876@alice.UUCP> dmr@alice.UUCP writes: >The question arose: why does C use a terminating character for >strings instead of a count? > Since dmr has spoken I probably shouldn't even think of trying to add anything, but ... If we think of machine architecture there are historically several different ways that have been used to represent strings. 1) Put the characters in adjacent storage locations, and use a special character to delimit the end of the string. This goes back to at least IBM 702 or thereabouts. It is basically what is used in C; IBM just happened to use a character other than null. 2) Put the characters in adjacent storage locations, and reserve a bit per storage location to delimit string boundaries. This is found in IBM 1401-1410-7010 family and 1620. 3) Put the characters in adjacent storage locations, don't delimit the boundaries at all, and absorb the information about string length into the instruction stream. This is what's done in IBM 360 and a lot of other machines. In contrast to 1) and 2) it requires something like a simulation of 4) below to deal with varying length strings and dynamic allocation of storage. 4) Put the characters in adjacent storage locations and store information about starting address and length in a data structure for the purpose, a string descriptor. This is used in Burroughs B6500 and later machines of that family. The string is virtually delimited, in that you can't access it, or aren't supposed to, without going through the descriptor and observing the boundary. What's the difference between a string and an array of characters? Is it anything other than the set of operations that are provided to operate on it? haynes@ucscc.ucsc.edu haynes@ucscc.bitnet ..ucbvax!ucscc!haynes "Any clod can have the facts, but having opinions is an Art." Charles McCabe, San Francisco Chronicle
dmr@alice.UUCP (02/07/89)
Robert Firth justifiably corrects my misstatement about BCPL strings; they were indeed counted. I evidently edited my memory. Perhaps he or someone else can discuss authoritatively how they fitted into the language and (to recall the original topic) how string instructions might help in processing. More nearly correct memory--which applies only to 20-year-old implementations on the IBM 7090 and GE 635/645--says that strings existed in two forms: a one-byte count followed by the characters of the string, and an expanded form in which the count and the characters were blown out into words. In these implementations, bytes were 9 bits, and so the compact representation limited strings to length 511. On the other hand, the blown-out form occupied lots of space and wasn't suitable for transferring directly to text files. Originally, I believe, there were no operators for accessing individual characters in a compact string, though they were added later. Dennis Ritchie att!research!dmr dmr@research.att.com
prc@maxim.ERBE.SE (Robert Claeson) (02/07/89)
In article <762@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes: > > - All machines I know of that have character string instructions want > > the length of the string. This forces the string primitives to first > > scan for null, a time wasting operation. > I assume you mean something like: > +------+---+---+---+---+---+---+---+---+---+---+---+---+---+ > |length| H | E | L | L | O | , | | W | O | R | L | D | \n| > +------+---+---+---+---+---+---+---+---+---+---+---+---+---+ > but, what size would you use for "length", a byte? a word? a longword? A 16-bit word, just to remain compatible with all the 16-bit machines out there. > I suspect that some of these machines' instructions expect different > sized operands for the length. Some expects 16 bits, some expects 32 bits, and a few (such as the old Z-80) expects 8 bits. > Also, to quote K&R: "C was originally designed ... on the DEC PDP-11", > a machine with no string instructions. Well, what can we learn from this? -- Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden "No problems." -- Alf Tel: +46 758-202 50 EUnet: rclaeson@ERBE.SE uucp: uunet!erbe.se!rclaeson Fax: +46 758-197 20 Internet: rclaeson@ERBE.SE BITNET: rclaeson@ERBE.SE
colwell@mfci.UUCP (Robert Colwell) (02/07/89)
In article <765@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu (John Hascall) writes: =In article <8876@alice.UUCP= dmr@alice.UUCP writes: ==The question arose: why does C use a terminating character for ==strings instead of a count? ==In my opinion, C's array/pointer scheme for representation ==of character strings has worked out reasonably well, although ==it is somewhat clumsy when there are lots of string operations. = ==I don't think it has been demonstrated that the usual run of ==C programs pays an extremely high cost in performance for their ==string operations, though doubtless there are counterexamples ==for particular machine architectures or particular programs. = = This is a rather circular argument. This is rather like = saying "I don't think it has been demonstrated that the usual = automobile pays an extremely high cost in performance for their = amphibious operations,..." = = Just like most people don't drive their cars across lakes, = most C programs are not string operation intensive. Dennis's comment *could* have been circular, but I don't think it was. After all, the Unix OS has lots of places where exceptionally poor string handling would be obvious very quickly, and there are several known installations of this OS nowadays...On the other hand, there's Dhrystone -- if your string handling is poor, your Dhrystone number may well be pathetic. And Dhrystone is supposed to be a systems code benchmark (at least to this level of fidelity). Bob Colwell ..!uunet!mfci!colwell Multiflow Computer or colwell@multiflow.com 175 N. Main St. Branford, CT 06405 203-488-6090
rds@n.sp.cs.cmu.edu (Robert Sansom) (02/08/89)
In article <8882@alice.UUCP>, dmr@alice.UUCP writes: > Robert Firth justifiably corrects my misstatement about > BCPL strings; they were indeed counted. I evidently edited > my memory. > > Perhaps he or someone else can discuss authoritatively how they > fitted into the language and (to recall the original topic) how string > instructions might help in processing. To quote 'BCPL - the language and its compiler' by Martin Richards and Colin Whitby-Strevens: '... you can use UNPACKSTRING to lay out a string in a vector one character to a word, and PACKSTRING to pack it up again. After unpacking your string, you will discover that the first word contains a count of the number of characters in the sting proper, which starts at the second word.' and: 'Exactly how BCPL strings are stored depends, amongst other things, upon the implementation word size. This dependency is concealed within the string access procedures GETBYTE and PUTBYTE. The call "GETBYTE(S,I)" obtains the Ith byte of the string S. By convention, byte 0 contains the number of characters in the string, which are stored consecutively from byte. The call "PUTBYTE(S,I,C)" sets the Ith byte of the string S to contain the character C.' Robert Sansom (rds@cs.cmu.edu) School of Computer Science Carnegie Mellon University --
eric@snark.uu.net (Eric S. Raymond) (02/08/89)
In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu (Robert Firth) writes: > In article <8876@alice.UUCP> dmr@alice.UUCP writes: > >The history of this convention and of the general array scheme had little > >to do with the PDP-11; it was inherited from BCPL and B. > > A correction here: the C scheme was NOT inherited from BCPL. I've seen bonehead idiocy on the net before, but this tops it all -- this takes the cut-glass flyswatter. Mr. Firth, do you *read* what you're replying to before you pontificate? Didn't the name `Dennis Ritchie' register in whatever soggy lump of excrement you're using as a central nervous system? Do you realize that the person you just incorrectly `corrected' on a point of C's intellectual antecedents is the *inventor of C himself*!?! Sheesh. No *wonder* Dennis doesn't post more often. Next time dmr posts something, I suggest you shut up and listen. Respectfully. -- Eric S. Raymond (the mad mastermind of TMN-Netnews) Email: eric@snark.uu.net CompuServe: [72037,2306] Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718
lm@snafu.Sun.COM (Larry McVoy) (02/08/89)
In article <enj91#24gKdg=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes: $In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu (Robert Firth) writes: $> In article <8876@alice.UUCP> dmr@alice.UUCP writes: $> >The history of this convention and of the general array scheme had little $> >to do with the PDP-11; it was inherited from BCPL and B. $> $> A correction here: the C scheme was NOT inherited from BCPL. $ $I've seen bonehead idiocy on the net before, but this tops it all -- this takes $the cut-glass flyswatter. Mr. Firth, do you *read* what you're replying to $before you pontificate? Didn't the name `Dennis Ritchie' register in whatever $soggy lump of excrement you're using as a central nervous system? Do you $realize that the person you just incorrectly `corrected' on a point of C's $intellectual antecedents is the *inventor of C himself*!?! $ $Sheesh. No *wonder* Dennis doesn't post more often. $ $Next time dmr posts something, I suggest you shut up and listen. Respectfully. $-- $ Eric S. Raymond (the mad mastermind of TMN-Netnews) $ Email: eric@snark.uu.net CompuServe: [72037,2306] $ Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718 Is everyone else laughing as hard as I am? Eric, Even the Gods make mistakes, OK? And, although I don't pretend to speak for Dennis Ritchie or anyone else besides myself, I'd suspect that he'd be the last person that would want you to "shut up and listen". The whole point of this newsgroup, and research in general, is to question the obvious, point out the incorrect. It's called learning. Blind faith is called religion and has no place in science. "Sheesh", indeed. Larry McVoy, Lachman Associates. ...!sun!lm or lm@sun.com
khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (02/08/89)
In article <88850@sun.uucp> lm@sun.UUCP (Larry McVoy) writes: >In article <enj91#24gKdg=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes: >$In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu (Robert Firth) writes: >$> In article <8876@alice.UUCP> dmr@alice.UUCP writes: >$> >The history of this convention and of the general array scheme had little >$> >to do with the PDP-11; it was inherited from BCPL and B. >$> >$> A correction here: the C scheme was NOT inherited from BCPL. >$ >$I've seen bonehead idiocy on the net before, ... >$before you pontificate? Didn't the name `Dennis Ritchie' ... >$ >$Next time dmr posts something, I suggest you shut up and listen. Respectfully. >Is everyone else laughing as hard as I am? Hopefully. It is quite amusing. > >... >for Dennis Ritchie or anyone else besides myself, I'd suspect that he'd be the >last person that would want you to "shut up and listen". The whole point of >this newsgroup, and research in general, is to question the obvious, point out >the incorrect. It's called learning. Blind faith is called religion and has >no place in science. In general, I agree with Mr. McV. This is, however, a special case. The question is (to paraphrase) "What did the inventors of C think about?" The Principal Inventor sez "bliff". Mr Poster sez "no it was boff". I do not think it fair to characterize "boff" as a valid hypothesis, unless the PI had died, and left no notes, or ambigous ones. Since the PI is very much alive, and has spoken, contradicting him is a bit out of _my_ definition of scientific inquiry. Keith H. Bierman It's Not My Fault ---- I Voted for Bill & Opus
jefu@pawl.rpi.edu (Jeffrey Putnam) (02/08/89)
In reference to the C representation of strings. Note followups to comp.lang.c. I like the C model for strings. I like it mostly for its simplicity and ease of use. It may well be that a representation for strings that includes string length as a part of a structure is better for efficiency, or more modular or whatever. But! the model used is simple and introduces no magic into the language. Magic? Yup. Magic is what happens when the language (or operating system or hardware) does something odd that is not reachable by the user. This includes magic strings, magic arrays (arrays stored in the same way - that is with extra information hidden from the user), magic library calls (like some VMS calls) and so on. If language (hardware, os) designers want to do something, they should make it evident and available to the user - because if they want to do it, the user probably will as well. In the string question, adding the string length means that what is passed may be a magic cookie that the language knows how to use, but that the user is often denied access to. I have used languages that did lots of magic (the worst was PL/I) and it was often quite difficult to decide what was actually happening (in a function call, for example). The C choice was the simpler one, one with no magic, and the best for the kind of programming that C encourages. Further, if you want to add counted strings, it can be done in C easily. I believe that i have seen a counted string library posted to the net - it might be interesting to see if string handling programs actually run a lot faster with such a library instead of the standard string functions. jeff putnam -- "Sometimes one must attempt the impossible if only to jefu@pawl.rpi.edu -- show it is merely inadvisable."
thomson@hub.toronto.edu (Brian Thomson) (02/09/89)
In article <37529@oliveb.olivetti.com> chase@Ozona.UUCP (David Chase) writes: > >You should also check out the string operations on the 360/370 sort of >machines; BCPL was running there (rather well) a very long time ago. >I think that those operations worked on at most 256 characters (and, I >should add, NOT on 0-length strings). It may well be another case of >architecture influencing language design (note that a zero-length BCPL >string actually contains one byte -- the zero count). > But the 360 implementation was not the first implementation. It is probably more correct to say that the maximum string length was implementation-dependent, but tended to be 255 because many machines have 8-bit bytes. Also, I don't remember the 360 code generator producing any string instructions, although it certainly could be persuaded to produce byte (character) instructions. By that time, the language had been extended with the packed string selector operator "%", so GETBYTE(S, I) was equivalent to the (inline) S%I and PUTBYTE(S, I, C) to S%I := C -- Brian Thomson, CSRI Univ. of Toronto utcsri!uthub!thomson, thomson@hub.toronto.edu
chase@Ozona.orc.olivetti.com (David Chase) (02/09/89)
In article <88853@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes: In article <8876@alice.UUCP> dmr@alice.UUCP ["PI" below] writes: >>$> >The history of this convention and of the general array scheme had little >>$> >to do with the PDP-11; it was inherited from BCPL and B. ["bliff" below] >>$In article <8442@aw.sei.cmu.edu>, firth@bd.sei.cmu.edu (Robert Firth) ["Mr Poster" below] writes: >>$> A correction here: the C scheme was NOT inherited from BCPL. ["boff" below] >The question is (to paraphrase) "What did the inventors of C think >about?" The Principal Inventor sez "bliff". Mr Poster sez "no it was boff". > >I do not think it fair to characterize "boff" as a valid hypothesis, >unless the PI had died, and left no notes, or ambigous ones. Since the >PI is very much alive, and has spoken, contradicting him is a bit out >of _my_ definition of scientific inquiry. Sigh. Nonetheless, he (Dennis Ritchie) probably made a mistake in his posting. Other Prinicipal Inventors are still alive and also left unambiguous notes. Also, I'd suggest that Robert Firth knows BCPL better than Dennis Ritchie. (I'd suggest that *I* know BCPL better than Dennis Ritchie, too -- I've used it within the last 4 years.) I'll give references. From _BCPL -- The language and its compiler_ by Martin Richards and Colin Whhitby-Strevens, 1979 ---------------- [PACKSTRING and UNPACKSTRING] "After unpacking your string, you will discover that the first word contains a count of the number of characters in the string proper, which starts at the second word. As an example, we give the library routines WRITES, UNPACKSTRING, and PACKSTRING: LET PACKSTRING(V,S) = VALOF $( LET N = V!0 & #XFF // extract least significant 8 bytes LET SIZE = N / BYTESPERWORD S!SIZE := 0 // pack out last word with zeroes FOR I = 0 TO N DO PUTBYTE(S,I, V!I) RESULTIS SIZE $) ---------------- Note, too, that the zeros in the last word will only appear in those cases where the bytes packed do not fill out the words in the string (that is, consider packing a string containing 3 characters). From "The Portability of the BCPL Compiler" by Martin Richards in _Software -- Practice and Experience_, volume 1, pp 135-146, 1971. ---------------- Strings are packed in BCPL and the packing is necessarily machine dependent since it depends strongly on the word and byte sizes of the object machine. The usual internal representation of a string value is as a pointer to the first of a set of words holding the length and packed characters of the string. The zeroth byte is usually justified to the start of a word and holds the length of the string with successive bytes holding the characters and padded with zeros (or possibly spaces) at the end to fill the last word. In order to handle strings in as machine independent way [sic] as possible packing, unpacking and writing of strings is done using library routines which are defined in the machine dependent interface with the operating system. ---------------- I think it is fair to say that C did NOT inherit its string representation from BCPL. I wish that some of you people would check your facts before posting. Linguistic comparisons belong elsewhere, so I won't make them. As far as implementation goes, I think it is a mixed bag. Many operations are "faster" on strings with counts, but if your maximum count is only 255 then everything is pretty fast whether it is counted or terminated. You should also check out the string operations on the 360/370 sort of machines; BCPL was running there (rather well) a very long time ago. I think that those operations worked on at most 256 characters (and, I should add, NOT on 0-length strings). It may well be another case of architecture influencing language design (note that a zero-length BCPL string actually contains one byte -- the zero count). David
news@ism780c.isc.com (News system) (02/09/89)
In article <8882@alice.UUCP> dmr@alice.UUCP writes: > More nearly correct >memory--which applies only to 20-year-old implementations >on the IBM 7090 and GE 635/645--says that strings existed >in two forms: a one-byte count followed by the characters >of the string, and an expanded form in which the count >and the characters were blown out into words. > >In these implementations, bytes were 9 bits, and so the compact >representation limited strings to length 511. On the >other hand, the blown-out form occupied lots of space and >wasn't suitable for transferring directly to text files. >Originally, I believe, there were no operators for accessing >individual characters in a compact string, though they were >added later. I cannot speak about the GE 635, But I can speak with some authority about the IBM 7090. I have before me both the 7090 reference manual and the micro code that I wrote to emulate the 7090 on the Standard IC4000. The 7090 provided no instructions for accessing individual characters. The notion of characters existed only at the I/O interface where 6 characters (in 6 bit BCD) could be read into a single 36 bit word. The minimum transfer was two words. I am not aware of any languages made available by IBM that provided a string data type. In fact the limit of 6 character identifiers in FORTRAN was due to the fact that 6 characters could be manipulated as a word on the 7090 (and on the 704 the original "FORTRAN" machine). IBM did produce a 7040 machine (similar to but not an extension of the 7090). This machine provided character load and store. But the use of the facility was limited by the fact that a character address could not be put into an index register. I implemented an extended version of 7090 called the EX02 by Standard Computer that did support strings. Because the words had room in them for two pointers and a character, I represented a string as a linked list with one character per word. The ends of the string was marked by null pointers. There were instructions to traverse the linked list and to convert from a packed array of characters to a string and back again. The language that took advantage of these strings was called IMPLAN (implementation language). An interesting feature of the language was that an expression like i*s, where i was an integer and s was a string, meant the first i charcters of s. And s*i meant the last i characters of the string. Marv Rubinstein
barmar@think.COM (Barry Margolin) (02/09/89)
In article <88853@sun.uucp> khb@sun.UUCP (Keith Bierman - Sun Tactical Engineering) writes: ]In article <88850@sun.uucp> lm@sun.UUCP (Larry McVoy) writes: ]>In article <enj91#24gKdg=eric@snark.uu.net> eric@snark.uu.net (Eric S. Raymond) writes: ]>$Next time dmr posts something, I suggest you shut up and listen. Respectfully. ] ]>Is everyone else laughing as hard as I am? ]Since the ]PI is very much alive, and has spoken, contradicting him is a bit out ]of _my_ definition of scientific inquiry. BUT -- DMR later replied and said that the person contradicting him was RIGHT! Even the great god :-) Ritchie is permitted to have a memory lapse. Barry Margolin Thinking Machines Corp. barmar@think.com {uunet,harvard}!think!barmar
khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (02/09/89)
In article <37529@oliveb.olivetti.com> chase@Ozona.UUCP (David Chase) writes: >.... >Sigh. Nonetheless, he (Dennis Ritchie) probably made a mistake in his >posting. Other Prinicipal Inventors are still alive and also left >unambiguous notes. Also, I'd suggest that Robert Firth knows BCPL >better than Dennis Ritchie. (I'd suggest that *I* know BCPL better >than Dennis Ritchie, too -- I've used it within the last 4 years.) >I'll give references. > It was pointed out to me (private mail) that what made the situation so funny was that dmr recanted, and that then someone came along and castigated Firth. I had failed to keep all the threads in mind. I do not claim to know BCPL, and I don't claim to be expert on dmr's thoughts. As I saw it, the argument was what was going on in dmr's thoughts....and since dmr is very much alive speculating is not very fruitful. As it happens, the speculator was right...... Just goes to show. I hereby repudiate my position. Debating what when on in somebody's mind a decade+ ago is valid...because the thinker probably doesn't remember it correctly! :> khb Keith H. Bierman It's Not My Fault ---- I Voted for Bill & Opus
aglew@mcdurb.Urbana.Gould.COM (02/09/89)
>/* Written 3:15 pm Feb 6, 1989 by GQ.RLG@forsythe.stanford.edu in mcdurb:comp.arch */ >->[Me] aglew@mcdurb.Urbana.Gould.COM writes: >->May I encourage people implementing string libraries to use an extra >->level of indirection? Instead of length immediately preceding the string, >->let length be associated with a pointer to the string. Makes >->substringing operations much easier, and has the ability to reduce >->unnecessary copies (at the risk of increased aliasing). >-> >-> +------+---+ >-> |length|ptr| >-> +------+---+ >-> | >-> +------+ >-> | >-> V >-> +---+---+---+---+---+---+---+---+---+---+---+---+---+ >-> | H | E | L | L | O | , | | W | O | R | L | D | \n| >-> +---+---+---+---+---+---+---+---+---+---+---+---+---+ > >Such an implementation has adverse effects when the string is sent >to/from an external device, such as a file. The 'length' must be >with the string, or the string needs a terminator character. If you are sending directly to an output device, I doubt that your output device accepts your internal format. If you have to reformat anyway... Oh, you mean storing data in a file. What's a file? You mean this memory-mapped object... Sorry, I don't live in that environment, unfortunately. Yep, you have to decide either way. For text strings, ASCII files or binary files are fine by me. Leading counts are fine. Nothing says that ptr could not point to the very next location. >what happens to the 'length' information for the old string? I sure would hope it got changed appropriately! And I sure would hope that the use was wrapped in a library routine or macro or C++ type object interface so that nobody ever accessed the length and ptr explicitly! Look, null terminated is fine by me, I use it every day. It just has the embedded null drawback, and the fact that it encourages dumb code. Several examples of which (dumb code that scans the string twice) are on my list of things to fix real soon now - one is taking up 10% of a loaded system. And, yes, good coding practices can avoid double scanning, so all that you're left with is the embedded null problem. (Talking about dumb code - has anyone else seen things like #define TERM_ESCAPE_CODE "\e[foo\0bar" puts(TERM_ESCAPE_CODE); /* Do escape code magic with terminal? */ particularly in things where the escape code is computed?) And I sure would never let any oiu
beyer@houxs.ATT.COM (J.BEYER) (02/10/89)
In article <22036@ism780c.isc.com>, news@ism780c.isc.com (News system) writes: > The 7090 provided no instructions for accessing individual characters. The > notion of characters existed only at the I/O interface where 6 characters > (in 6 bit BCD) could be read into a single 36 bit word. The minimum transfer > was two words. I am not aware of any languages made available by IBM that > provided a string data type. In fact the limit of 6 character identifiers in > FORTRAN was due to the fact that 6 characters could be manipulated as a word > on the 7090 (and on the 704 the original "FORTRAN" machine). This may be quibbling, but did not the Convert By Replacement From MQ and similar instructions deal with 6-bit characters on the 7090? I know they were not in the 704. Perhaps they did not get there until the 7094. But they were not full-fledged character manipulation primitives, I agree. It has been at least 20 years since I programmed one of those. As I recall, the GE635 and 645 did have ways to address either 6 or 9 bit characters (programmer's choice), using clever addressing and tally modes. I am even more hazy about that machine, though. -- Jean-David Beyer A.T.&T., Holmdel, New Jersey, 07733 houxs!beyer
seanf@sco.COM (Sean Fagan) (02/10/89)
In article <499@maxim.ERBE.SE> prc@maxim.ERBE.SE (Robert Claeson) writes: >> but, what size would you use for "length", a byte? a word? a longword? >A 16-bit word, just to remain compatible with all the 16-bit machines >out there. You would limit all those who are on VAXen, 68k's, 32k's (both WE and NS), Sparc's, 88k's, CDC Cybers (180 state), Cray's (1, 2, 3, X and Y MP), ARM's, 29k's, Elxsi's, ad naseum, just to retain backwards compatibility? If you *must* use a <length, pointer> combination, make the <length> attribute the same size as a normal char *. However, I've gotten along quite well without them, as have thousands of other people. -- Sean Eric Fagan | "What the caterpillar calls the end of the world, seanf@sco.UUCP | the master calls a butterfly." -- Richard Bach (408) 458-1422 | Any opinions expressed are my own, not my employers'.
news@ism780c.isc.com (News system) (02/11/89)
[M Rubinstein] >> The 7090 provided no instructions for accessing individual characters. [J BEYER] >This may be quibbling, but did not the Convert By Replacement From MQ >and similar instructions deal with 6-bit characters on the 7090? I know >they were not in the 704. Perhaps they did not get there until the 7094. >But they were not full-fledged character manipulation primitives, I agree. >It has been at least 20 years since I programmed one of those. [M Rubinstein] J BEYER is right about convert instructions on the 709/7090/7094 operating on 6 bit fields. However the fields accessed were fields in a register. The were were no instructions for accessing 6 bit fields from memory. There were instructions for accessing two 15 bit fields from memory. The fields were called the address and decrment. It is my understanding the the Lisp names CAR and CDR come from the 15 bit field names. CAR==contents of address register, and CDR==contents of decrement register. Marv Rubinstein
chris@mimsy.UUCP (Chris Torek) (02/11/89)
More or less incidentally: it is easy to build counted strings out
of C strings, if you prefer counts. For instance:
typedef struct string {
int len;
char *str;
} string;
#define STRING(s) { sizeof(s) - 1, s }
string foo = STRING("foo");
In pANS C this works for automatic variables as well as statics.
These strings are not normal expressions, though, and cannot be
anonymous---that is,
result = some_string_function(str, STRING("lima beans"))
is illegal (though gcc has an extension that will do it).
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu Path: uunet!mimsy!chris
mac@uvacs.cs.Virginia.EDU (Alex Colvin) (02/11/89)
In article <6239@saturn.ucsc.edu>, haynes@ucscc.UCSC.EDU (Jim Haynes) writes: ...<history of string representations>... > What's the difference between a string and an array of characters? > Is it anything other than the set of operations that are provided > to operate on it? In the indirect representation (via length & pointer) any substring is also a string. In the end-delimited representation (as in C) only a tail substring is a string. This still allows you to consume a string from start to end, but makes it difficult to pull a string off the start. In both representations, catenation is messy. That's when we turn to buffer chains.
childers@avsd.UUCP (Richard Childers) (02/14/89)
hascall@atanasoff.cs.iastate.edu (John Hascall) writes: >In article <8876@alice.UUCP> dmr@alice.UUCP writes: >>I don't think it has been demonstrated that the usual run of >>C programs pays an extremely high cost in performance for their >>string operations, though doubtless there are counterexamples >>for particular machine architectures or particular programs. > This is a rather circular argument. This is rather like > saying "I don't think it has been demonstrated that the usual > automobile pays an extremely high cost in performance for their > amphibious operations,..." It's a lot easier to criticize two decades after the fact, to say, 'You oughta do this ... why didn't you do THAT ?' instead of merely accepting it as a foible of whichever language is under discussion. I think what Mssr. Ritchie was trying to say was something along the lines of, "OK, we messed up ... we tried for an elegant solution, treating text as an array of characters. It was abstract, it was clean, we thought it'd fly. It did. Not everybody likes it ... but it's not going to change, as there are some subterranean assumptions that it would be wise to take into account before this conversation goes much further." >John Hascall >ISU Comp Center >Rx: Apply :-) above as needed for pain. -- richard -- * "Do not look at my outward shape, but take what is in my hand." * * -- Jalaludin Rumi, 1107-1173 * * ..{amdahl|decwrl|octopus|pyramid|ucbvax}!avsd.UUCP!childers@tycho * * AMPEX Corporation - Audio-Visual Systems Division, R & D *
childers@avsd.UUCP (Richard Childers) (02/14/89)
eric@snark.uu.net (Eric S. Raymond) writes: >I've seen bonehead idiocy on the net before, but this tops it all --" Indeed, it does. >Next time dmr posts something, I suggest you shut up and listen. Respectfully. Gag me with mindless worshipping twits. Ride on someone else's coattails, eh? > Eric S. Raymond (the mad mastermind of TMN-Netnews) > Email: eric@snark.uu.net CompuServe: [72037,2306] > Post: 22 S. Warren Avenue, Malvern, PA 19355 Phone: (215)-296-5718 -- richard -- * "Do not look at my outward shape, but take what is in my hand." * * -- Jalaludin Rumi, 1107-1173 * * ..{amdahl|decwrl|octopus|pyramid|ucbvax}!avsd.UUCP!childers@tycho * * AMPEX Corporation - Audio-Visual Systems Division, R & D *
mash@mips.COM (John Mashey) (02/14/89)
In article <637@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes: .... >Dennis's comment *could* have been circular, but I don't think it >was. After all, the Unix OS has lots of places where exceptionally >poor string handling would be obvious very quickly, and there are >several known installations of this OS nowadays...On the other hand, >there's Dhrystone -- if your string handling is poor, your >Dhrystone number may well be pathetic. And Dhrystone is supposed >to be a systems code benchmark (at least to this level of fidelity). As has been discussed before in this group, Dhrystone's str* behavior seems to differ a bit from "more typical" C programs. Everything I've ever seen from analyzing C programs backs up what Dennis says. When we did the first MIPS UNIX port, I wrote all of the str* routines in assembler; we threw most of them out in favor of the portable C programs, because the testing time to find the (must have been at least) one bug in the lot wouldn't have been worth it, according to the program statistics we saw. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mac@uvacs.cs.Virginia.EDU (Alex Colvin) (02/16/89)
In article <1185@houxs.ATT.COM>, beyer@houxs.ATT.COM (J.BEYER) writes: > In article <22036@ism780c.isc.com>, news@ism780c.isc.com (News system) writes: > > The 7090 provided no instructions for accessing individual characters. The > As I recall, the GE635 and 645 did have ways to address either 6 or 9 bit It did. Unfortunately, there were several incompatible ways, the new (EIS string, all memory-memory), the old (tally words, all memory to register) and the incredibly old (fixed character/byte, memory-register). I've heard of a cute way to implement character addressing on a 7090 using one half register for the word address, another half for character offset, and XECs to choose a load/store, with an XED chain to wrap the character count into the word address. Almost microcode (cf Clipper macrocode?) Alex Colvin old DTSS programmer