srg@quick.COM (Spencer Garrett) (12/21/87)
In article <164@sdeggo.UUCP>, dave@sdeggo.UUCP (David L. Smith) writes:
> In article <261@ivory.SanDiego.NCR.COM>, jan@ivory.SanDiego.NCR.COM (Jan Stubbs) writes:
> > Personally, I can't imagine any convenience a null terminated string would
> > have over a string preceded by its length.
>
> Well, there is no limit imposed on the length of the string. It also takes
> up one less byte per string in overhead (unless you wanted to limit your
> string length to 255 characters) which isn't very important today, but
> probably was when C was first defined.

Ahem. What makes you think 2 bytes would be enough? The VAX is constantly
tripping over its own feet because the designers couldn't bring themselves
to make ALL fields big enough to describe anything architecturally
possible. (e.g. - 16-bit conditional branch offsets and string lengths)

I can think of two major advantages of null-terminated strings over
strings preceded by their lengths.

1) You can pass substrings around without copying or altering the
original string. (String tails, at least, but that's usually what
you want. "process some part of the string and pass the rest on")
This is why PL/I used separate "dope vectors" to describe strings,
but this is a level of complexity completely inappropriate as part
of the C language. You can write a library to do that in C.
(Try writing the inverse library in Pascal!)

2) Having a CHARACTER to mark the end of a string is ever so much
more convenient and efficient than having to compare lengths all the
time (assuming you're looking at the characters and not just copying
them, and even that is an implementation issue and not truly fundamental).
The general paradigm is:

    while (the next character is within the interesting range)
        do something interesting with it;
    now look at what the uninteresting character was;
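For readers who want the paradigm above in concrete form, here is a minimal C sketch. It is an illustration only, not code from the post: the function name scan_number is hypothetical, "interesting" is assumed to mean "decimal digit", and overflow of the accumulated value is not handled. It also shows point 1, since the returned tail shares storage with the original string.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: consume a run of decimal digits at the front of s,
 * store their value in *out, and return a pointer to the unconsumed tail.
 * No copy is made; the tail points into the original string.
 * Overflow of the value is deliberately not handled in this sketch. */
const char *scan_number(const char *s, long *out)
{
    long value = 0;

    assert(s != NULL && out != NULL);

    /* "while (the next character is within the interesting range)" */
    while (isdigit((unsigned char)*s)) {
        value = value * 10 + (*s - '0');
        s++;
    }
    *out = value;

    /* *s is now the "uninteresting" character; it may be the '\0' itself,
     * so no separate length comparison was needed to stop the loop. */
    return s;
}
```

A caller can hand the returned pointer straight to the next stage of processing, which is the "process some part of the string and pass the rest on" pattern.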
barmar@think.COM (Barry Margolin) (12/22/87)
In article <174@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>2) Having a CHARACTER to mark the end of a string is ever so much
>more convenient and efficient than having to compare lengths all the
>time

The problem with this is that you must reserve a character. I've had
trouble with terminal emulators on machines whose keyboard-reading
system call returns 0 to indicate that no data is available, but 0 is
also the code for Control-@.

This type of problem occurs in file reading, too. If you are reading a
binary file, or perhaps a file of keyboard operations (perhaps you
recorded an Emacs session), as characters the NULL character can no
longer be used as a string terminator; such files can contain any 8-bit
code, so NO 8-bit character can be used as a termination indicator.

Strings with lengths ALWAYS work.

---
Barry Margolin
Thinking Machines Corp.

barmar@think.com
seismo!think!barmar
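A small C sketch of the counted representation argued for above may help. Everything here is an assumption made for illustration: the type name struct counted, the helper names, and the fixed 64-byte capacity are hypothetical and do not come from the post. The point shown is only that a zero byte is ordinary data once the length travels with the buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted-string type: the length is stored beside the data,
 * so any byte value, including 0, may appear in the contents. */
struct counted {
    size_t len;
    unsigned char data[64];   /* fixed capacity, chosen only for the sketch */
};

/* Copy n raw bytes into c. Returns 0 on success, -1 if they do not fit. */
int counted_set(struct counted *c, const void *bytes, size_t n)
{
    assert(c != NULL && bytes != NULL);
    if (n > sizeof c->data)
        return -1;
    memcpy(c->data, bytes, n);
    c->len = n;
    return 0;
}

/* Count occurrences of byte b, walking by length rather than to a sentinel. */
size_t counted_count(const struct counted *c, unsigned char b)
{
    size_t i, hits = 0;

    assert(c != NULL);
    for (i = 0; i < c->len; i++)
        if (c->data[i] == b)
            hits++;
    return hits;
}
```

With a recorded keystroke sequence such as 'a', 0, 'b', 0, 'c', counted_count sees all five bytes, whereas strlen on the same buffer would stop after the first one.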
baum@apple.UUCP (Allen J. Baum) (12/22/87)
>In article <174@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>The general paradigm is:
>
>    while (the next character is within the interesting range)
>        do something interesting with it;
>    now look at what the uninteresting character was;

While this is nice in theory, in practice there can be some very subtle
problems with lookahead, especially if the string ends on a page boundary,
and the lookahead of the next character causes an access violation.

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
jefu@pawl22.pawl.rpi.edu (Jeffrey Putnam) (12/22/87)
In article <.....> people write:
Lots of stuff about strings with null terminators and lengths and dope vectors
and....
There are a lot of ways to implement strings, each with its own
benefits and costs.
Granted that strings are necessary in a language, there are a number of
ways to implement them and there are serious tradeoffs in choosing any
particular implementation. When making this choice, I personally think
that the simplest method should be chosen if possible. The problems
with the dope vectors and embedded counts are that there is magic going on
behind the scenes somewhere and often the exact semantics become much
more difficult to describe. I have used languages that used such methods
and found that on occasion odd and unexplained things would happen
because things were not made explicit.
Another point that has been made, and should be made again is that C
is powerful enough to construct the other methods where needed.
One of the virtues of C is that much is made explicit and as simple as
possible and there is very little behind-the-scenes magic. The same
may be said for Unix (tm), and (perhaps more germanely to this
newsgroup) RISC machines. I would like to urge designers of systems,
languages, and machines to adopt the same general ideas - little magic,
and as simple as possible.
jeff putnam (userft4z%rpitsmts@itsgw.rpi.edu, though this account is
about to die)
gregg@a.cs.okstate.edu (Gregg Wonderly) (12/23/87)
in article <14116@think.UUCP>, barmar@think.COM (Barry Margolin) says:
>
> In article <174@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>>2) Having a CHARACTER to mark the end of a string is ever so much
>>more convenient and efficient than having to compare lengths all the
>>time
>
> The problem with this is that you must reserve a character. I've had
> trouble with terminal emulators on machines whose keyboard-reading
> system call returns 0 to indicate that no data is available, but 0 is
> also the code for Control-@. This type of problem occurs in file
> reading, too. If you are reading a binary file, or perhaps a file of
> keyboard operations (perhaps you recorded an Emacs session), as
> characters the NULL character can no longer be used as a string
> terminator; such files can contain any 8-bit code, so NO 8-bit
> character can be used as a termination indicator.
>
> Strings with lengths ALWAYS work.

Think about what you just said. You applied negative comments to the
use of the zero byte in cases where you WOULD NOT use it. All of the
above examples cite applications where you make use of system calls to
read a buffer of characters, and then process them one by one.

The point of the zero byte is NOT to make EVERYTHING possible. It is
supposed to make general string handling convenient, and that it does.

--------------
Gregg Wonderly
Department of Computing and Information Sciences
Oklahoma State University

UUCP: {cbosgd, ihnp4, rutgers}!okstate!gregg
Internet: gregg@A.CS.OKSTATE.EDU

IBM: Yesterday's technology, tomorrow, for incredible prices
blu@hall.cray.com (Brian Utterback) (12/23/87)
In article <14116@think.UUCP| barmar@sauron.think.com.UUCP (Barry Margolin) writes:
|
|The problem with this is that you must reserve a character. I've had
|trouble with terminal emulators on machines whose keyboard-reading
|system call returns 0 to indicate that no data is available, but 0 is
|also the code for Control-@. This type of problem occurs in file
|reading, too. If you are reading a binary file, or perhaps a file of
|keyboard operations (perhaps you recorded an Emacs session), as
|characters the NULL character can no longer be used as a string
|terminator; such files can contain any 8-bit code, so NO 8-bit
|character can be used as a termination indicator.
|Strings with lengths ALWAYS work.
|Barry Margolin

Amen to that. I just spent hours trying to find out what was wrong with
a rasterfile to laserprinter filter. It turned out that the problem is
that fprintf cannot output a null. At least the compiler should issue
a warning if it eats a null. I mean, what is the use of being able to
specify a character in a string (i.e. \000) if the compiler won't really
use it? And it KNEW it, and didn't tell me. Sheesh.
jrl@anuck.UUCP (j.r.lupien) (12/23/87)
In article <14116@think.UUCP>, barmar@think.COM (Barry Margolin) writes:
> In article <174@quick.COM> srg@quick.COM (Spencer Garrett) writes:
> >2) Having a CHARACTER to mark the end of a string is ever so much
> >more convenient and efficient than having to compare lengths all the
> >time
>
> The problem with this is that you must reserve a character. I've had

This is a good point, but it is not a fatal problem. Note that whatever
definition of strings you adopt, it is only relevant to string
operations and string libraries. This in no way prevents the programmer
from treating the strings they deal with as abstract data items, and
writing their own library to manipulate them.

> trouble with terminal emulators on machines whose keyboard-reading
> system call returns 0 to indicate that no data is available, but 0 is
> also the code for Control-@.

As far as terminals are concerned, a null is a null. A driver that
returns 0 on no data without setting ERRNO or something is broken,
and must be fixed.

> This type of problem occurs in file
> reading, too. If you are reading a binary file, or perhaps a file of
> keyboard operations (perhaps you recorded an Emacs session), as
> characters the NULL character can no longer be used as a string
> terminator; such files can contain any 8-bit code, so NO 8-bit
> character can be used as a termination indicator.

I have a program "strings(1)" which is able to distinguish null
terminated strings in a binary file and print them out. This program
is great for overcoming poor documentation and the like. The point
is, you just check if the characters preceding the null are printable.
If not, they are not part of the string.

>
> Strings with lengths ALWAYS work.
>

No they don't. Did you read the article you responded to?
Given a fixed format count (is it int? short? long?) there is
a length of string you can't give the length of, due to overflow.
Null terminated strings ALWAYS work in this regard.

> Barry Margolin
> Thinking Machines Corp.

Barry, I am not trying to flame you. I would like to think that
you have reasons for your attitude, but the ones given above
don't hold water as stated. Would you mind elaborating?

John Lupien
Computing Tools
ihnp4!mvuxa!anuxh!jrl
ralphw@TEMP.IUS.CS.CMU.EDU (Ralph Hyre) (12/24/87)
In article <422@anuck.UUCP> jrl@anuck.UUCP (j.r.lupien) writes:
>In article <14116@think.UUCP>, barmar@think.COM (Barry Margolin) writes:
>> In article <174@quick.COM> srg@quick.COM (Spencer Garrett) writes:
>> >2) Having a CHARACTER to mark the end of a string is ever so much
>> >more convenient and efficient than having to compare lengths all the
>> >time
>>
>> The problem with this is that you must reserve a character. I've had
>
>This is a good point, but it is not a fatal problem. Note that whatever
>definition of strings you adopt, it is only relevant to string
>operations and string libraries...

This is true, but think about how many operations involving strings there
are in *nix. One sloppy compiler or doprintf implementation (see below)
will kill you. I submit that it's easier to screw up with null-terminated
than with length + data (or some variant, like dope vectors).

In article <2447@hall.cray.com>, blu@hall.cray.com (Brian Utterback) writes:
>Amen to that. I just spent hours trying to find out what was wrong with
>a rasterfile to laserprinter filter. It turned out that the problem is
>that fprintf cannot output a null. At least the compiler should issue
>a warning if it eats a null. I mean, what is the use of being able to
>specify a character in a string (i.e. \000) if the compiler won't really
>use it? And it KNEW it, and didn't tell me. Sheesh.

Ideally string libraries that 'eat' characters would internally use
byte-stuffing to encode NULLs in the strings. I guess that would violate
the 'simple but stupid' C philosophy:-)

>As far as terminals are concerned, a null is a null. A driver that
>returns 0 on no data without setting ERRNO or something is broken,
>and must be fixed.

>> Strings with lengths ALWAYS work.
>>
>No they don't. Did you read the article you responded to?
>Given a fixed format count (is it int? short? long?) there is
>a length of string you can't give the length of, due to overflow.
>Null terminated strings ALWAYS work in this regard.

Strings with lengths work to the limits of practicality. A 32-bit length
will handle strings the size of a processor's address space. If you've
got a string that long you should probably use other data structures.

If your string usage requirements are different, perhaps you could
implement a string package the Lisp hacker's way:

string:
    implementation type ('length+data' or 'NULL-terminated' or 'complex')
        [this is where those tag bits in LISP-oriented processors come in handy]
    # of substrings
    address of substring #1
    length of substring #1
    address of substring #2
    length of substring #2

Anyway, I think that anything that can be said on this subject has been
said. Please stop.

--
- Ralph W. Hyre, Jr.
Internet: ralphw@ius2.cs.cmu.edu Phone:(412)268-{2847,3275} CMU-{BUGS,DARK}
Amateur Packet Radio: N3FGW@W2XO, or c/o W3VC, CMU Radio Club, Pittsburgh, PA
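The tagged descriptor outlined in the post could be rendered in C roughly as follows. This is a sketch under stated assumptions, not the poster's design: the names str_kind, piece, tagged_str and tagged_len are hypothetical, and the field layout is only one plausible reading of the outline.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The "implementation type" tag from the outline (names are hypothetical). */
enum str_kind { STR_COUNTED, STR_TERMINATED, STR_COMPLEX };

/* One (address, length) pair of a 'complex' string. */
struct piece {
    const char *addr;
    size_t len;
};

struct tagged_str {
    enum str_kind kind;
    size_t npieces;              /* STR_COMPLEX: number of substrings    */
    const struct piece *pieces;  /* STR_COMPLEX: the substring table     */
    const char *addr;            /* the two simple kinds: start of data  */
    size_t len;                  /* STR_COUNTED: explicit length         */
};

/* Total length in bytes, dispatching on the tag. */
size_t tagged_len(const struct tagged_str *s)
{
    size_t i, total = 0;

    assert(s != NULL);
    switch (s->kind) {
    case STR_COUNTED:
        return s->len;
    case STR_TERMINATED:
        return strlen(s->addr);      /* must scan for the terminator */
    case STR_COMPLEX:
        for (i = 0; i < s->npieces; i++)
            total += s->pieces[i].len;
        return total;
    }
    return 0;
}
```

Only the terminated kind needs to touch the character data to answer a length query; the other two read it from the descriptor, which is where the extra bookkeeping pays for itself.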
throopw@xyzzy.UUCP (Wayne A. Throop) (12/24/87)
> baum@apple.UUCP (Allen J. Baum)
>> srg@quick.COM (Spencer Garrett)
>>The general paradigm is:
>>    while (the next character is within the interesting range)
>>        do something interesting with it;
>>    now look at what the uninteresting character was;
> While this is nice in theory, in practice there can be some very subtle
> problems with lookahead, especially if the string ends on a page boundary,
> and the lookahead of the next character causes an access violation.

The string is ill-formed if the terminating character isn't there.
Such problems can happen in other string formats, such as a
<length,string> record where the length reaches off the end of the
valid address space.

Thus, this objection is not specific to <EOS-character> terminated
strings, but to any variable length string construct whatsoever. The
problem being "what happens if the length isn't specified correctly
and you wander off the end of the string".

--
There are some forms of insanity which, driven to an ultimate
expression, can become new models of sanity.
    --- Bureau of Sabotage {Frank Herbert}
--
Wayne Throop <the-known-world>!mcnc!rti!xyzzy!throopw
jack@cwi.nl (Jack Jansen) (12/24/87)
Even though I favour null-terminated strings myself there is a very
strong point against them: Very little error checking is possible in
the runtime system. Look at the bug reports in comp.bugs.4bsd for
instance: I would guess that about 25% of the bug reports are about
programs that silently assume that no mail-address/line/whatever will
ever be more than NSTR characters. I know, the programmer can cater
for this, but it shouldn't be her business.

Moreover, the approach where the count is kept with the pointer
has another *big* advantage: it unifies strings with other variable
dimension arrays. Again, this is not a point that I feel very strongly
about myself (I don't think I *ever* inverted a matrix), but
it *is* rather stupid that you have to specify the dimensions of
a matrix in a call to some routine while the compiler knows those
dimensions already....

--
Jack Jansen, jack@cwi.nl (or jack@mcvax.uucp)
The shell is my oyster.
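To make the NSTR complaint concrete, here is a hedged C sketch of the checking the programmer has to do by hand with null-terminated strings. The value chosen for NSTR and the helper name copy_checked are assumptions for illustration; the post names only the constant, not its value or any routine.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NSTR 16   /* stands in for the fixed limit the post mentions */

/* Copy src into dst, whose capacity cap includes room for the terminator.
 * Returns 0 if everything fit, -1 if the input had to be truncated,
 * so the caller cannot silently assume the limit was never exceeded. */
int copy_checked(char *dst, size_t cap, const char *src)
{
    size_t n;

    assert(dst != NULL && src != NULL && cap > 0);
    n = strlen(src);
    if (n >= cap) {
        memcpy(dst, src, cap - 1);
        dst[cap - 1] = '\0';
        return -1;
    }
    memcpy(dst, src, n + 1);   /* n + 1 copies the terminator as well */
    return 0;
}
```

Nothing in the language forces a caller to use such a routine or to look at its return value, which is the gap the post is pointing at.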
rw@beatnix.UUCP (Russell Williams) (12/24/87)
In article <178@imagine.PAWL.RPI.EDU> You cant get here from there. writes:
>In article <.....> people write:
>Lots of stuff about strings with null terminators and lengths and dope vectors
>and....
>
>There are a lot of ways to implement strings, each with its own
>benefits and costs.
>
>Another point that has been made, and should be made again is that C
>is powerful enough to construct the other methods where needed.
>
>One of the virtues of C is that much is made explicit and as simple as
>possible and there is very little behind-the-scenes magic. The same
>may be said for Unix (tm), and (perhaps more germanely to this
>newsgroup) RISC machines. I would like to urge designers of systems,

The only drawback is that with Unix, the tools have been built; with
RISC machines, the compilers do it for you. With C, there is no standard
package written to handle arbitrary-content strings, so everybody uses
the built in null terminated strings, and thus most programs fail on
files or data with embedded nulls, and many fail on files with lines
longer than MAXLINE. Sometimes this isn't a problem. With Emacs, for
example, it's a serious pain.

Further, the C convention is to take advantage of the knowledge of the
representation so there's no easy way to change programs to a different
string representation. Perhaps if there were a package in the standard
library, people would use it when appropriate. With C++ and its inline
functions, you could construct compatible libraries to handle things
both ways.

Our O/S (EMBOS) is written in Pascal extended to include dope vectored
strings. Except for very low-level routines (such as memory manager)
which can't handle having unexpected heap allocations happen, there have
been very few drawbacks or cases where we had to use another method of
character handling; in high-level code (editors, batch schedulers,
terminal drivers), the fact that strange magic may happen behind your
back has proven irrelevant.
Russell Williams ..{ucbvax!sun,lll-lcc!lll-tis,altos86}!elxsi!rw
roy@phri.UUCP (Roy Smith) (12/24/87)
In article <2447@hall.cray.com> blu@hall.UUCP (Brian Utterback) writes:
> At least the compiler should issue a warning if it eats a null. I mean,
> what is the use of being able to specify a character in a string (i.e.
> \000) if the compiler won't really use it?

Interesting question. Should a \000 in a string constant be flagged
as a warning by the compiler? On both my 4.3 Vax and 3.2 Suns, the
following program draws no complaint:

    main ()
    {
        printf ("this\000probably won't print\n");
    }

Even lint has nothing to say (other than complaining about printf's
return value being ignored on the Sun). Surely at least lint should
pick up this one. There is nothing strictly illegal about imbedding a
null in a string constant, but it is strange. One might want to do:

    printf ("this\000probably won't print\n"+5);

and the compiler should let you do it, although I can't think of any
valid reason offhand you would *want* to do such a thing.

--
Roy Smith, {allegra,cmcl2,philabs}!phri!roy
System Administrator, Public Health Research Institute
455 First Avenue, New York, NY 10016
aglew@ccvaxa.UUCP (12/24/87)
..> Strings

Oh, for Christopher Sake'! Both sentinel terminated strings (of which
null terminated are probably the most frequent example) and length
strings are useful. Like somebody else said, there's nothing new.

How about some architecturally oriented discussion?:

(1) What machines have had support for various formats?

(1.1) Has anybody combined the two?
    Myself, I tend to like dope strings for maximum length AND null byte.

(2) For the RISCers, what simple instruction sequences best support the
    various formats? Eg. somebody from ACORN described the general XOR
    trick to find the last byte in a string. What are the appropriate
    tricks for length strings and dope strings?

(2.1) What simple tricks on a vector machine make string manipulation
    easier?

Myself, I tend to lean towards NO explicit string copy, etc.
instructions. But, if the memory system has separate byte enable lines,
something like STORE-REGISTER-BYTES-UNDER-MASK, with simple sequences
for generating the mask from your sentinel and/or your length.
This leads to some possibilities:

(3) How do you encode the mask? 1 bit per byte, or a number saying the
    leading or trailing N bytes, or what?
    Myself, I prefer the mask to be all bits, with an error if some
    byte in the mask is not all 1s or 0s. But then, I want a bit
    addressed machine eventually.

(3.1) Is it worthwhile having special instructions to go from a word
    containing several bytes to a mask containing bits set for all the
    bytes up to the first null byte?
    Myself, I don't think so. A bit of logical arithmetic gives it
    to us in a few instructions. However, there is a real tradeoff
    here. If you need a few instructions on every register-full of
    bytes, then you have to have a larger register containing more
    bytes before doing mass string moves through large packets becomes
    worthwhile.

All this is in aid of the goal of moving as many characters as possible
in a single memory access, plus optimizing handling the heads and tails
of strings, rather than the middles, since most strings that my programs
manipulate are small. Are these worthwhile goals? Give me numbers:

(4) What are the distributions of lengths, (mis)alignments, etc., for
    frequently used string operations?

Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
aglew@mycroft.gould.com ihnp4!uiucdcs!ccvaxa!aglew aglew@gswd-vms.arpa

My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
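The post does not spell out the "bit of logical arithmetic" it has in mind. One widely known candidate, offered here as an assumption and not as the ACORN XOR trick or the poster's own sequence, detects a zero byte in a 32-bit word with a subtract, a complement and two ANDs. The function names below are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero iff some byte of w is zero. In the result, bit 7 of the lowest
 * zero byte's position is always set; positions above it may also be
 * flagged because of borrow propagation, so only the lowest flag is
 * reliable. Positions below the first zero byte are never flagged. */
uint32_t zero_byte_mask(uint32_t w)
{
    return (w - UINT32_C(0x01010101)) & ~w & UINT32_C(0x80808080);
}

/* Index of the first zero byte, counting from the least significant
 * byte as 0, or 4 if the word contains no zero byte. */
int first_zero_byte(uint32_t w)
{
    uint32_t m = zero_byte_mask(w);
    int i = 0;

    if (m == 0)
        return 4;
    while ((m & 0x80) == 0) {   /* walk up to the lowest flagged byte */
        m >>= 8;
        i++;
    }
    assert(i < 4);
    return i;
}
```

On a little-endian machine the least significant byte is the first character in memory, so the index maps directly onto string position; on a big-endian machine the scan order would need to be reversed. The few instructions per word are exactly the per-register cost the post weighs against wider registers.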
baum@apple.UUCP (Allen J. Baum) (12/25/87)
>In article <504@xyzzy.UUCP> throopw@xyzzy.UUCP (Wayne A. Throop) writes:
>The string is ill-formed if the terminating character isn't there.

I wasn't talking about the case of a non-existent character, but the case
where lookahead is employed to fetch a character that might not ever get used

--
{decwrl,hplabs,ihnp4}!nsc!apple!baum (408)973-3385
daveb@geac.UUCP (David Collier-Brown) (12/28/87)
In article <152@piring.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
| Even though I favour null-terminated strings myself...
  [enumeration of booby-traps that catch C programmers]
| Moreover, the approach where the count is kept with the pointer
| has another *big* advantage: it unifies strings with other variable
| dimension arrays. Again, this is not a point that I feel very strongly
| about myself (I don't think I *ever* inverted a matrix), but
| it *is* rather stupid that you have to specify the dimensions of
| a matrix in a call to some routine while the compiler knows those
| dimensions already....

Agreed, but we're drifting away from the architecture question:
should we provide facilities for supporting array operations (search
string/array for terminator) or array descriptors (assign dope-vector
slice n..m of array x)?

This question can become arbitrarily complex... and is herewith
cross-posted.

In the meantime, I try to use C as a language to code **into** and
not **in**, so I can often (but not always) back out of locally-unwise
assumptions of the language designers.

--dave (architecture X programming languages) c-b

ps: for people reading this in comp.lang.c, the discussion of the
strengths and weaknesses of lengths and terminating characters came
out of a discussion of what to put in hardware. It is worth reviewing,
as this is **not** the religious discussion it might appear at first
glance.
scottg@hpiacla.HP.COM (Scott Gulland) (12/29/87)
/ hpiacla:comp.arch / srg@quick.COM (Spencer Garrett) / 4:09 pm Dec 20, 1987 /

In response to article <164@sdeggo.UUCP>, srg@quick.COM (Spencer Garrett) writes:

>I can think of two major advantages of null-terminated strings over
>strings preceded by their lengths.
>
>1) You can pass substrings around without copying or altering the
>original string. (String tails, at least, but that's usually what
>you want. "process some ...
>
>2) Having a CHARACTER to mark the end of a string is ever so much
>more convenient and efficient than having to compare lengths all the
>time (assuming you're looking at the characters and not just copying
>them, and even that is an implementation issue and not truly fundamental).
>The general paradigm is:
>
>    while (the next character is within the interesting range)
>        do something interesting with it;
>    now look at what the uninteresting character was;

I would strongly disagree with the above statements. Operations such as
concatenation and assignment are extremely inefficient on null-terminated
strings when compared to strings which are preceded by their lengths.
This is because you must search through the string character by character
to find the end of the string before you can perform the desired
operation. I would also assert that string comparison, concatenation,
assignment, etc occur much more frequently than the operations given in
#1 above.

Another benefit of strings preceded by their length is that it is much
easier to use strings in these compilers (less code to be written and a
simpler syntax). This results in higher programmer productivity as well
as string operations that perform at twice the levels of compilers with
null-terminated strings.

As far as the example given above, it is quite easy to do this without
comparing length for non-null-terminated strings. For example:

    string[strlen(string)+1] = '\0'
    while (the next character is within the interesting range)
        do something interesting with it;
    now look at what the uninteresting character was;

Note: In the PASCALs I have worked in, it is legal to insert characters
past the current length of the string without affecting the string's
length.

----------
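The concatenation argument above can be sketched in C as follows. Both routines are illustrations under assumptions: the type struct lstr and the names lstr_append and cstr_append are hypothetical, and cstr_append assumes dst already holds a terminated string shorter than cap. No timing claim is made; the sketch only shows where the extra scan happens.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical counted string: length and capacity kept beside the data. */
struct lstr {
    size_t len;
    size_t cap;
    char *data;
};

/* Counted append: the end is known from len, so nothing is scanned.
 * Returns 0 on success, -1 if there is not enough room. */
int lstr_append(struct lstr *s, const char *bytes, size_t n)
{
    assert(s != NULL && bytes != NULL);
    if (n > s->cap - s->len)
        return -1;
    memcpy(s->data + s->len, bytes, n);
    s->len += n;
    return 0;
}

/* Null-terminated append: dst must be walked to find its end first.
 * Returns 0 on success, -1 if src plus the terminator does not fit. */
int cstr_append(char *dst, size_t cap, const char *src)
{
    size_t used, n;

    assert(dst != NULL && src != NULL);
    used = strlen(dst);          /* cost grows with the length of dst */
    n = strlen(src);
    if (n >= cap - used)
        return -1;
    memcpy(dst + used, src, n + 1);
    return 0;
}
```

Appending repeatedly to the same null-terminated buffer rescans everything written so far on each call, while the counted version pays only for the bytes being added.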
aglew@ccvaxa.UUCP (12/29/87)
..> Embedding nulls in strings

I have frequently used NULs embedded in strings for terminal control
applications. Eg.

    char ControlString[] = "escape string with a \0 inside";

    write(fd,ControlString,sizeof(ControlString)-1);

So there is an extra null at the end - I'll gladly trade that for the
convenience of expressing literals the way I want them.

Is this legal C? I thought so. I did this mainly on microcomputers,
CP/M and IBM PCs running MANX C.

Andy "Krazy" Glew. Gould CSD-Urbana. 1101 E. University, Urbana, IL 61801
aglew@mycroft.gould.com ihnp4!uiucdcs!ccvaxa!aglew aglew@gswd-vms.arpa

My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (01/04/88)
In article <2447@hall.cray.com> blu@hall.UUCP (Brian Utterback) writes:
| In article <14116@think.UUCP| barmar@sauron.think.com.UUCP (Barry Margolin) writes:
| | stuff about null-terminated strings
|
| Amen to that. I just spent hours trying to find out what was wrong with
| a rasterfile to laserprinter filter. It turned out that the problem is
| that fprintf cannot output a null.

This is just not so. Try fprintf(stdout, "%c", '\000') and it works fine.
fprintf works with standard C strings which are null terminated. Why
complain because it doesn't do what you want? I suggest using putc if
you want to force any character out.

| At least the compiler should issue
| a warning if it eats a null.

I have never seen a compiler eat a null. For example:

    #include <stdio.h>

    int main(void)
    {
        static char xx[] = {"abc\000def"};

        /* sizeof counts the embedded null and the terminating one */
        printf("%d bytes\n", (int) sizeof xx);
        return 0;
    }

returns a size of eight bytes. The null is not "eaten," it just doesn't
work the way you want it to.

| a warning if it eats a null. I mean, what is the use of being able to
| specify a character in a string (i.e. \000) if the compiler won't really
| use it? And it KNEW it, and didn't tell me. Sheesh.

See above. I'm not against length defined strings, but I think that a
posting like this indicates that the poster didn't understand the
problem. Forgive me if you understood but failed to express yourself
correctly.

--
bill davidsen (wedu@ge-crd.arpa)
{uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me
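The difference between "%s" and "%c" described above can be checked in memory with a short C sketch. As an assumption made for testability, it uses snprintf, which shares its conversion rules with fprintf, so the output can be inspected in a buffer; the helper names are hypothetical. For real output to a stream, putc (as the post suggests) or fwrite with an explicit count would be the usual choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Format src with "%s": conversion stops at the first zero byte.
 * Returns the number of characters produced (terminator not counted). */
int format_as_string(char *dst, size_t cap, const char *src)
{
    assert(dst != NULL && src != NULL);
    return snprintf(dst, cap, "%s", src);
}

/* Format n bytes one at a time with "%c": zero bytes are emitted too.
 * Returns the number of bytes produced, or -1 if dst is too small. */
int format_as_bytes(char *dst, size_t cap, const char *src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    if (n >= cap)
        return -1;
    for (i = 0; i < n; i++) {
        /* size 2 leaves room for one byte plus snprintf's terminator */
        if (snprintf(dst + i, 2, "%c", src[i]) != 1)
            return -1;
    }
    return (int)n;
}
```

Given the eight-byte array from the post, the "%s" path yields three characters while the "%c" path yields all seven data bytes, zero included, which matches the claim that the null is kept by the compiler and merely treated as a terminator by string-oriented routines.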
franka@mmintl.UUCP (01/09/88)
[I have directed followups to comp.lang.c only]

I believe it was a mistake in C to make string constants refer to null
terminated strings automatically. Better, I think, to make the constant
contain exactly what the programmer wrote. That way, other
string-representation schemes could be used easily and conveniently.
As it is, null-terminated strings are more convenient than they would be
if it were done this way, but anything else is much uglier.

It is, of course, much too late to change the language in this way.

--
Frank Adams ihnp4!philabs!pwa-b!mmintl!franka
Ashton-Tate 52 Oakland Ave North E. Hartford, CT 06108