dave@murphy.UUCP (12/16/86)
Summary: invent an 8-bit character set and then let some of them be negative
Line eater: fully conforming

I've been thinking about this business with long chars and short chars and trigraphs and international character sets and such, and I've got a proposal. The proposal is this: if someone can come up with an 8-bit character set that contains all of the necessary characters for the Western languages (and includes the existing USASCII set as a subset), then let's drop the requirement that a member of a machine's "natural" character set be represented as a positive number in a plain char.

This will have the following benefits:

1. Everyone can adopt a character set that will have all of the characters that they need, and not have to overload any of the USASCII set with other characters. Portability of programs and other text files will benefit greatly, and trigraphs will be unnecessary. (For many languages, there aren't enough punctuation characters to overmap; for example, I think that it takes 17 characters to represent all of the possible letter-and-accent combinations in French, and that's just for lower case.)

2. The character set will fit into almost everyone's byte size, meaning no dramatic increase in the size of text files. (Nearly everyone uses at least an 8-bit byte with UN*X; the only exceptions that I can think of are the PDP10/20's, which can use 7-bit bytes.)

3. It won't be necessary to raise sizeof(char) from 1. This means that programs that use chars for things other than text (yes, there are a *lot* of them) won't be disturbed.

4. Each implementation can continue using the signedness for char that best fits the architecture. It won't be necessary to force all plain chars to unsigned.

The disadvantages that I can see are these:

1. Since some of the char values may be negative, it will not be possible to collate chars by simply comparing their values; you have to call a collating routine defined for the particular implementation. (But some languages don't collate in strict alphabetic order, so you'll wind up doing this with any international character set.)

2. You will have to use functions to do things like converting a letter to upper or lower case; just masking off bits won't work anymore.

3. Some terminals already use the codes > 127 for other purposes. There is no easy answer to this problem.

4. The value 255 can't be used because it may look like EOF on some systems.

In short, it doesn't look to me like there is any good reason to require characters to be represented as positive values. Or have I overlooked something really basic?

---
"I used to be able to sing the blues, but now I have too much money."
-- Bruce Dickinson

Dave Cornutt, Gould Computer Systems, Ft. Lauderdale, FL
UUCP: ...!{sun,pur-ee,brl-bmd,bcopen}!gould!dcornutt
 or ...!{ucf-cs,allegra,codas}!novavax!houligan!dcornutt
ARPA: dcornutt@gswd-vms.arpa (I'm not sure how well this works)

"The opinions expressed herein are not necessarily those of my employer, not necessarily mine, and probably not necessary."
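The second disadvantage in the post above (case conversion needs a function, not a bit mask) can be sketched roughly as follows. This is an illustration only: it assumes an ASCII-based execution character set, and the names mask_upper and lib_upper are made up for the example, not taken from the thread.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdio.h>

/* Old trick: clear bit 5 to turn 'a'..'z' into 'A'..'Z'.
   Only correct for ASCII letters; it silently mangles everything else. */
int mask_upper(int c)
{
    return c & ~0x20;
}

/* Library route: toupper() consults the implementation's character tables.
   Its argument must be EOF or a value representable as unsigned char. */
int lib_upper(int c)
{
    assert(c == EOF || (c >= 0 && c <= UCHAR_MAX));
    return toupper(c);
}
```

On an ASCII system both functions map 'a' to 'A', but mask_upper also turns '{' into '[', while lib_upper leaves non-letters alone. Any code above 127 in a wider character set would have to be handled by tables like the ones behind toupper(), since no single bit need separate the two cases.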
greg@utcsri.UUCP (Gregory Smith) (12/19/86)
In article <39@houligan.UUCP> dave@murphy.UUCP writes:
>4. The value 255 can't be used because it may look like EOF on some systems.

    for(i=0; i<30000; ++i)
        printf( "AAAAARRRRRRRRRRGGGGGGHHHH!!!!!!!!!\n");

--
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...
rbutterworth@watmath.UUCP (Ray Butterworth) (12/23/86)
In article <39@houligan.UUCP>, dave@murphy.UUCP writes:
> In short, it doesn't look to me like there is any good reason to require
> characters to be represented as positive values. Or have I overlooked
> something really basic?

Consider the following:

    char str[5];
    int i;

    i=getchar();
    str[0]=i;
    if (i!=str[0])
        printf("Is this possible?");
    if (isupper(i) && !isupper(str[0]))
        printf("How about this?");

The answer is that yes, it is possible when getchar() is allowed to return characters that have the upper bit on, on compilers that sign extend. If the character is 0xF0 say, "i" will be assigned 0xF0, but str[0] will have the value 0xFFFFFFF0, and the comparison will fail. Similarly, many functions such as isupper() will behave incorrectly since they aren't defined to work on negative arguments.

If ANSI were to define getchar() to return a value that is sign extended on machines that sign-extend chars, and define functions such as isupper() to accept such arguments, I think it would solve most problems of sign-extension and 8-bit character sets. It probably wouldn't break any existing source code either (except for code that stupidly ignores the EOF manifest and uses an explicit value).

> 4. The value 255 can't be used because it may look like EOF on some systems.

The only requirement on EOF is that it be a negative int. If the implementors make it say, (-12345), it won't be confused with any character, with or without sign extension.
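The mismatch described in this post can be reproduced in isolation with a small sketch. The function names are hypothetical, and signed char is used explicitly so the behaviour does not depend on whether plain char happens to be signed on the machine at hand. Converting an out-of-range value to signed char is implementation-defined; the values in the comments assume an ordinary two's-complement machine.

```c
#include <assert.h>
#include <limits.h>

/* What getchar() hands back for a byte: the unsigned char value as an int. */
int as_returned_by_getchar(unsigned char byte)
{
    return byte;
}

/* What the same value looks like after being stored in a signed char and
   read back in an expression. On two's-complement machines 0xF0 comes
   back as -16, which is 0xFFFFFFF0 in a 32-bit int. */
int as_read_from_signed_char(int from_getchar)
{
    /* EOF must already have been handled by the caller. */
    assert(from_getchar >= 0 && from_getchar <= UCHAR_MAX);
    signed char stored = (signed char)from_getchar;
    return stored;
}
```

For 7-bit values such as 'A' the two functions agree; for 0xF0 the first gives 240 and the second gives -16, so a direct == comparison between them fails.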
karl@haddock.UUCP (Karl Heuer) (12/25/86)
In article <39@houligan.UUCP> dave@murphy.UUCP writes:
>Summary: invent an 8-bit character set and then let some of them be negative

Suppose I am using such a system, and one of the characters -- call it '@' -- has a negative value. The following program will not work:

    main() { int c; ... c = getchar(); ... if (c == '@') ... }

Note that getchar() returns an UNSIGNED char on success; this is to guarantee that none of them compare equal to EOF. Thus, any printing character that I want to enclose in single quotes had better be positive, or it becomes VERY awkward to use.

Please don't suggest that getchar() should return a signed char and that '\377' should be reserved. It won't work.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
ron@brl-sem.ARPA (Ron Natalie <ron>) (12/30/86)
In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
> In article <39@houligan.UUCP> dave@murphy.UUCP writes:
> >Summary: invent an 8-bit character set and then let some of them be negative
>
> Suppose I am using such a system, and one of the characters -- call it '@' --
> has a negative value. The following program will not work:
>     main() { int c; ... c = getchar(); ... if (c == '@') ... }

Getchar returns int. The int has a character in it. Before trying to use it as such, you ought to either place it back into a character variable explicitly or use a cast to char...

    main() { int c; ... c = getchar(); ... if ((char) c == '@') ... }
karl@haddock.UUCP (Karl Heuer) (01/09/87)
In article <548@brl-sem.ARPA> ron@brl-sem.ARPA (Ron Natalie <ron>) writes:
>In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
>> In article <39@houligan.UUCP> dave@murphy.UUCP writes:
>> >Summary: invent an 8-bit character set and let some of them be negative
>>
>> Suppose I am using such a system, and one of the characters -- call it '@'
>> -- has a negative value. The following program will not work:
>>     main() { int c; ... c = getchar(); ... if (c == '@') ... }
>
>Getchar returns int. The int has a character in it. Before trying to
>use it as such, you ought to either place it back into a character
>variable explicitly or use a cast to char...
>
>    main() { int c; ... c = getchar(); ... if ((char) c == '@') ... }

That's one way to "fix" the problem, but the construct I wrote is valid by current standards and is a common idiom. I don't think programmers would like having to cast* the result of getchar() back into char before using it! Your suggestion does make sense logically, though, and I think it supports my contention that making getchar() an int function was a mistake in the first place.**

Karl W. Z. Heuer

*Of course, the cast is done *after* testing for EOF.
**I do have what I think is a better idea, but I'm not going to describe it in this posting. Anyway, it's too late to change getchar() now.
mouse@mcgill-vision.UUCP (01/12/87)
In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
> Suppose I am using such a system, and one of the characters -- call
> it '@' -- has a negative value. The following program will not work:
>     main() { int c; ... c = getchar(); ... if (c == '@') ... }
> Note that getchar() returns an UNSIGNED char on success; this is to
> guarantee that none of them compare equal to EOF. Thus, any printing
> character that I want to enclose in single quotes had better be
> positive, or it becomes VERY awkward to use.

Well. Now, exactly what does it mean to say that @ is negative? Presumably it means that the test below will succeed:

    char c; /* note: not int */
    ....
    c = '@';
    if (c < 0)

Now, remember that everybody (K&R and H&S and I hope ANSI) agrees that 'x' is an int, not a char. Notice that you can't make '@' the same thing as what getchar() returns, because the following will fail:

    char string[something];
    ....
    if (string[subscript] == '@')

About the neatest solution I see is to make 'x' have type unsigned char rather than int, at least when there's only one character between quotes (is there any code out there *using* multi-char character constants?). Then we also have to arrange that char and unsigned char are not promoted to int in expressions not involving anything bigger than char. This should make both of these work.

Is there anything wrong with changing the type of 'x' literals (and fixing char-only expressions), that is, will it break anything?

der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu
mouse@mcgill-vision.UUCP (01/12/87)
In article <293@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
> [Your suggestion] supports my contention that making getchar() an int
> function was a mistake in the first place.**

> **I do have what I think is a better idea, but I'm not going to
> describe it in this posting.

How about in another posting then?

What I normally do is something more like

    char c; /* yes, char! */
    ....
    c = getc(stream); /* or getchar() if stdin */
    if (feof(stream) || ferror(stream))
    { ....
    }

i.e., *ignore* the EOF return and check explicitly. Is this better or worse than the

    int c;
    c=getc(stream);
    if (c==EOF)

approach? Why?

der Mouse
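For comparison, the two approaches in this post might be written out as complete functions along the following lines. This is a sketch only; count_with_eof and count_with_feof are hypothetical names, and each function simply counts the bytes it reads from a stream.

```c
#include <assert.h>
#include <stdio.h>

/* The usual idiom: keep the result in an int and compare against EOF. */
long count_with_eof(FILE *stream)
{
    assert(stream != NULL);
    long n = 0;
    int c;
    while ((c = getc(stream)) != EOF)
        n++;
    return n;
}

/* The variant from the post: store into a char, ignore the EOF return
   value, and ask the stream whether the read actually failed. */
long count_with_feof(FILE *stream)
{
    assert(stream != NULL);
    long n = 0;
    for (;;) {
        char c = (char)getc(stream);  /* a 0xFF byte and EOF may look alike here */
        if (feof(stream) || ferror(stream))
            break;
        (void)c;                      /* a real program would use c here */
        n++;
    }
    return n;
}
```

Both functions should report the same count for any stream, including one that contains 0xFF bytes, because neither decides end-of-file by looking at a value that has already been narrowed to char.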
dave@murphy.UUCP (01/14/87)
Summary: does EOF have to be -1?

In article <289@haddock.UUCP>, karl@haddock.ISC.COM.UUCP (Karl Heuer) types (in response to an earlier article that I wrote):
>Suppose I am using such a system, and one of the characters -- call it '@' --
>has a negative value. The following program will not work:
>    main() { int c; ... c = getchar(); ... if (c == '@') ... }
>Note that getchar() returns an UNSIGNED char on success; this is to guarantee
>that none of them compare equal to EOF. Thus, any printing character that I
>want to enclose in single quotes had better be positive, or it becomes VERY
>awkward to use.

Thanks for pointing this out, but I don't see why it should cause a major problem. Assuming that the character set in use doesn't take up the entire range of int values, all that is necessary is to pick a value for EOF that doesn't correspond to any character value. (Newcomers: keep in mind that the return value of getchar and getc is defined as being an int, not a char, even though it is often treated like a char.) This way, getchar can return a possibly negative value, and EOF won't collide with any legit character value.

Would defining EOF to be something other than -1 cause a problem? I don't think so. K&R says, on p. 144: "The standard library defines the symbolic constant EOF to be -1 (with a #define in the file 'stdio.h'), but tests should be written in terms of EOF, not -1, so as to be independent of the specific value." I don't think this is a situation like the one with NULL, where the actual value has a special meaning in the language definition, so I don't see why it couldn't be changed. People who are testing for -1 or for a negative value instead of using EOF deserve whatever they get.

If anyone knows of any reason why the value of EOF can't be implementation-specific, I'd like to hear about it.

Dave Cornutt
karl@haddock.UUCP (01/19/87)
In article <598@mcgill-vision.UUCP> mcgill-vision!mouse (der Mouse) writes:
>In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
>> Suppose I am using such a system, and one of the characters -- call
>> it '@' -- has a negative value. The following program will not work:
>>     main() { int c; ... c = getchar(); ... if (c == '@') ... }
>> ... Any printing character that I want to enclose in single quotes had
>> better be positive, or it becomes VERY awkward to use.
>
>Well. Now, exactly what does it mean to say that @ is negative?
>Presumably it means that the test below will succeed:
>    char c = '@'; if (c < 0) ...

Actually what I meant was simply that "if ('@' < 0) ..." would succeed. This is not the same thing since '@' has type int. Your test says only that char is implemented as a signed datatype, and that '@' has the high bit set.

>Notice that you can't make '@' the same thing as what getchar() returns,
>because [char s[N]; if (s[0] == '@') ...] will fail.

That's the flip side of the problem, which I overlooked in my posting. The problem is independent of single-quotes; any machine on which characters are signed will fail to handle the test (getchar() == s[0]). The only reason it "worked" so well on the pdp11 was that *in practice*, all the chars one has to deal with (I'm assuming text characters, not one-byte integers) were 7-bit, so it didn't matter whether they were sign-extended (as with s[0]) or unsigned (as with getchar()).

>About the neatest solution I see is to make 'x' have type unsigned char
>rather than int, at least when there's only one character between
>quotes. Then we also have to arrange that char and unsigned char
>are not promoted to int in expressions not involving anything bigger
>than char. This should make both of these work.

I dunno. A simpler solution is to assert that plain char is unsigned char. As I said before, I suspect the adopted solution will be that in an 8-bit environment plain char will be unsigned char; the only default-signed-char compilers will be on pdp11-like machines in 7-bit environments.

>(is there any code out there *using* multi-char character constants?)

If so, it's almost all nonportable. The only portable use I've seen was one I wrote for a program that dealt with the two-letter codes found in termcap, troff, etc:

    switch (s[0]*'\1\0' + s[1]*'\0\1') { case 'xy': ...; }

I ended up not using it anyway, since lint didn't like it. (But it is independent of byte size and byte ordering.)

[From article <600@mcgill-vision.UUCP>, same author, again quoting kwzh]
>> [Your suggestion] supports my contention that making getchar() an int
>> function was a mistake in the first place.**

I am now even more sure, btw, that making it (int)(unsigned char)c was wrong. (Perhaps, as someone else suggested, (int)c would have been better; provided EOF is defined as something out-of-band like 0x8000.)

>> **I do have what I think is a better idea, but I'm not going to
>> describe it in this posting.

(This was because I tend to do a lot of my posting in the wee hours of the morning, and I didn't trust myself to give any details.)

>How about in another posting then?

Stay tuned. I'll probably be posting it to comp.lang.misc (since "it isn't C anymore") sometime in February (not sooner; I have a big project due). Look for "Error handling".

>What I normally do is something more like [char c; /*!*/ ... c = getchar();
>if (feof(stdin)) ...] ie, *ignore* the EOF return and check explicitly.

I think that's a better model in that it doesn't rely on the ability to cast char into a larger type; the problem is that it's cumbersome. The common idiom

    while ((c = getchar()) != EOF) ...

has to be written with a comma

    while (c = getchar(), !feof(stdin)) ...

or a test-in-the-middle loop

    for (;;) { c = getchar(); if (feof(stdin)) break; ... }

Karl W. Z. Heuer
gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/20/87)
In article <44@houligan.UUCP> dave@murphy.UUCP writes:
>If anyone knows of any reason why the value of EOF can't be implementation-
>specific, I'd like to hear about it.

Of course EOF can be defined differently; its only constraint is that it must be distinct from any possible valid value returned by getc() (which, by the way, does NOT sign-extend input chars).
rbutterworth@watmath.UUCP (01/23/87)
In article <5541@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> Of course EOF can be defined differently; its only constraint is that
> it must be distinct from any possible valid value returned by getc()
> (which, by the way, does NOT sign-extend input chars).

I think this discussion is going around in circles again.

The reason for the question about EOF being allowed to have a value that does not correspond to the int value of any char, signed or unsigned, was to allow getchar() to be redefined to return a possibly sign-extended value. If getchar() returned such a value, it would simplify things and solve a number of problems. The only things that would be hurt are those programs that "know" what EOF looks like.

But if getchar() were to be so changed, then things like

    int c=getchar();
    if (c!=EOF){
        if (c==string[7])

would work correctly. Under the current definition, "c" is not sign extended but string[7] might be sign extended and the comparison will fail even if the two characters are in fact the same. Similarly the is*() and to*() functions could be defined to work on both string[7] and getchar() results.

I don't know of any advantage or purpose to the current getchar() behaviour. Making EOF out-of-bounds allows getchar() to be defined differently than it currently is, and thereby solves these problems.
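A rough sketch of what such a redefined getchar() might look like, layered on top of the existing getc(), is given here. Everything in it is hypothetical: sx_getc and SX_EOF are invented names, and -12345 is just the illustrative value mentioned earlier in the thread. The narrowing conversion to signed char is implementation-defined for values above SCHAR_MAX, so this is an illustration for ordinary two's-complement machines rather than portable code.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical out-of-band end-of-file marker; it lies outside the range
   of both signed char and unsigned char. */
#define SX_EOF (-12345)

/* Hypothetical getc() variant that returns the byte as a signed char
   widened to int, so that it compares equal to a sign-extended array
   element holding the same byte. */
int sx_getc(FILE *stream)
{
    assert(stream != NULL);
    int c = getc(stream);
    if (c == EOF)
        return SX_EOF;
    return (signed char)c;
}
```

With this definition a byte 0xF0 read from the stream compares equal to a signed char element holding the same byte, and end of file is reported as SX_EOF, which no character value can match.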
gwyn@brl-smoke.UUCP (01/28/87)
In article <4604@watmath.UUCP> rbutterworth@watmath.UUCP (Ray Butterworth) writes:
>But if getchar() were to be so changed, then things like
>    int c=getchar();
>    if (c!=EOF){
>        if (c==string[7])
>would work correctly. Under the current definition, "c" is not
>sign extended but string[7] might be sign extended and the
>comparison will fail even if the two characters are in fact the same.

This is actually a consequence of the sloppy-signedness of "plain" char. If string[] is an array of unsigned chars, or if one uses an explicit (unsigned char) cast on the right side (or a (char) cast on the left side), your example will work under the current rules.

>I don't know of any advantage or purpose to the current getchar()
>behaviour.

It had the "advantage" of returning a single value rather than two. This fit in with common style and supported things like

    while ( (c = getchar()) != EOF )
        putchar( c );

It is certainly too late to change the getchar() interface, even if one agrees that it "should" have been designed differently.
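The two casts suggested in this reply can be sketched as small comparison helpers. The function names are made up for illustration, and signed char stands in for a plain char that sign-extends. The narrowing cast in the last function is implementation-defined for values above SCHAR_MAX, though it behaves as expected on two's-complement machines.

```c
#include <assert.h>
#include <limits.h>

/* The failing comparison: c is in 0..UCHAR_MAX, stored may be negative. */
int equal_naive(int c, signed char stored)
{
    return c == stored;
}

/* First fix: cast the stored char to unsigned char on the right side. */
int equal_unsigned_rhs(int c, signed char stored)
{
    assert(c >= 0 && c <= UCHAR_MAX);  /* c has already been tested against EOF */
    return c == (unsigned char)stored;
}

/* Second fix: narrow c back down on the left side before comparing. */
int equal_narrowed_lhs(int c, signed char stored)
{
    assert(c >= 0 && c <= UCHAR_MAX);  /* c has already been tested against EOF */
    return (signed char)c == stored;
}
```

For a byte such as 0xF0, equal_naive reports a mismatch while both fixed versions report a match; for 7-bit characters all three agree.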