[comp.lang.c] is it really necessary for character values to be positive?

dave@murphy.UUCP (12/16/86)

Summary: invent an 8-bit character set and then let some of them be negative
Line eater: fully conforming

I've been thinking about this business with long chars and short chars
and trigraphs and international character sets and such, and I've got a
proposal.  The proposal is this: if someone can come up with an 8-bit
character set that contains all of the necessary characters for the
Western languages (and includes the existing USASCII set as a subset),
then let's drop the requirement that a member of a machine's "natural"
character set be represented as a positive number in a plain char.  This
will have the following benefits:

1. Everyone can adopt a character set that will have all of the characters
that they need, and not have to overload any of the USASCII set with
other characters.  Portability of programs and other text files will benefit
greatly, and trigraphs will be unnecessary.  (For many languages, there
aren't enough punctuation characters to overmap; for example, I think that
it takes 17 characters to represent all of the possible letter-and-accent
combinations in French, and that's just for lower case.)

2. The character set will fit into almost everyone's byte size, meaning no
dramatic increase in the size of text files.  (Nearly everyone uses at least
an 8-bit byte with UN*X; the only exceptions that I can think of are the
PDP10/20's, which can use 7-bit bytes.)

3. It won't be necessary to raise sizeof(char) from 1.  This means that
programs that use chars for things other than text (yes, there are a *lot*
of them) won't be disturbed.

4. Each implementation can continue using the signedness for char that best
fits the architecture.  It won't be necessary to force all plain chars to
unsigned.

The disadvantages that I can see are these:

1. Since some of the char values may be negative, it will not be possible
to collate chars by simply comparing their values; you have to call a
collating routine defined for the particular implementation.  (But, some
languages don't collate in strict alphabetic order, so you'll wind up
doing this with any international character set.)

2. You will have to use functions to do things like converting a letter
to upper or lower case; just masking off bits won't get it anymore.

3. Some terminals already use the codes > 127 for other purposes.  There is
no easy answer to this problem.

4. The value 255 can't be used because it may look like EOF on some systems.
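
A minimal sketch of disadvantage 2 above, assuming an ASCII execution
character set and the default "C" locale.  The function names are made
up for illustration.  Clearing bit 0x20 only works because of where
ASCII happens to put its letters, and it mangles anything that is not a
lower-case letter; toupper() from <ctype.h> leaves such values alone.

```c
#include <ctype.h>

/* Upper-case by masking off bit 0x20.  Correct only for the ASCII
 * letters 'a'..'z'; other values are silently corrupted. */
int upper_by_mask(int c)
{
    return c & ~0x20;
}

/* Upper-case through the library.  The argument is converted to
 * unsigned char first, because toupper() is only defined for values
 * in that range (and for EOF). */
int upper_by_function(int c)
{
    return toupper((unsigned char)c);
}
```

With an accented letter in the upper half of an 8-bit set, whether the
mask happens to work depends entirely on how that set is laid out,
which is why a function call is the safer habit.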

In short, it doesn't look to me like there is any good reason to require
characters to be represented as positive values.  Or have I overlooked
something really basic?
---
"I used to be able to sing the blues, but now I have too much money."
-- Bruce Dickinson

Dave Cornutt, Gould Computer Systems, Ft. Lauderdale, FL
UUCP:  ...!{sun,pur-ee,brl-bmd,bcopen}!gould!dcornutt
 or ...!{ucf-cs,allegra,codas}!novavax!houligan!dcornutt
ARPA: dcornutt@gswd-vms.arpa (I'm not sure how well this works)

"The opinions expressed herein are not necessarily those of my employer,
not necessarily mine, and probably not necessary."

greg@utcsri.UUCP (Gregory Smith) (12/19/86)

In article <39@houligan.UUCP> dave@murphy.UUCP writes:
>4. The value 255 can't be used because it may look like EOF on some systems.

for(i=0; i<30000; ++i) printf( "AAAAARRRRRRRRRRGGGGGGHHHH!!!!!!!!!\n");
-- 
----------------------------------------------------------------------
Greg Smith     University of Toronto      UUCP: ..utzoo!utcsri!greg
Have vAX, will hack...

rbutterworth@watmath.UUCP (Ray Butterworth) (12/23/86)

In article <39@houligan.UUCP>, dave@murphy.UUCP writes:
> In short, it doesn't look to me like there is any good reason to require
> characters to be represented as positive values.  Or have I overlooked
> something really basic?

Consider the following:
    char str[5];
    int i;
    i=getchar();
    str[0]=i;
    if (i!=str[0])
        printf("Is this possible?");
    if (isupper(i) && !isupper(str[0]))
        printf("How about this?");

The answer is that yes, it is possible when getchar() is allowed to
return characters that have the upper bit on, on compilers that
sign extend.  If the character is 0xF0 say, "i" will be assigned
0xF0, but str[0] will have the value 0xFFFFFFF0, and the comparison
will fail.
Similarly, many functions such as isupper() will behave incorrectly
since they aren't defined to work on negative arguments.
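
A small sketch of the mismatch just described, assuming 8-bit chars on
a two's-complement machine.  It uses signed char explicitly to stand in
for a sign-extending plain char, and the function names are
hypothetical.

```c
/* Store a getchar()-style value (0..255) in a signed char, the way
 * "str[0]=i" does on a sign-extending compiler.  Converting a value
 * above 127 is implementation-defined; on common two's-complement
 * machines 0xF0 becomes -16. */
signed char store_as_signed(int i)
{
    return (signed char)i;
}

/* Returns 1 if i still compares equal after the round trip through
 * signed char, 0 otherwise.  The stored char is promoted to int
 * (sign-extended) before the comparison. */
int survives_round_trip(int i)
{
    signed char stored = store_as_signed(i);
    return i == stored;
}

/* Same test, but the stored char is converted back through unsigned
 * char first, which matches the range getchar() returns. */
int survives_with_cast(int i)
{
    signed char stored = store_as_signed(i);
    return i == (unsigned char)stored;
}
```

Under those assumptions survives_round_trip('A') is 1 while
survives_round_trip(0xF0) is 0; survives_with_cast(0xF0) is 1.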

If ANSI were to define getchar() to return a value that is sign
extended on machines that sign-extend chars, and define functions
such as isupper() to accept such arguments, I think it would solve
most problems of sign-extension and 8-bit character sets.  It probably
wouldn't break any existing source code either (except for code that
stupidly ignores the EOF manifest and uses an explicit value).

>
> 4. The value 255 can't be used because it may look like EOF on some systems.
The only requirement on EOF is that it be a negative int.
If the implementors make it say, (-12345), it won't be confused with
any character, with or without sign extension.
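
To make the 255-versus-EOF point concrete, here is a sketch that
assumes 8-bit chars, a signed char type, and a <stdio.h> in which EOF
is -1 (the usual value, though only "negative int" is required).  The
function names are invented for the example.

```c
#include <stdio.h>

/* Mimics the faulty idiom "char c = getchar(); if (c == EOF)": the
 * int result is narrowed to a signed char before the test.  Returns 1
 * if the narrowed value would be mistaken for EOF. */
int looks_like_eof_in_char(int getchar_result)
{
    signed char c = (signed char)getchar_result;
    return c == EOF;
}

/* The conventional idiom keeps the result in an int, so every byte
 * value 0..255 stays distinct from the negative EOF. */
int looks_like_eof_in_int(int getchar_result)
{
    return getchar_result == EOF;
}
```

The collision only appears once the value has been squeezed into a
char; as long as the result stays in an int, byte 255 and EOF differ.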

karl@haddock.UUCP (Karl Heuer) (12/25/86)

In article <39@houligan.UUCP> dave@murphy.UUCP writes:
>Summary: invent an 8-bit character set and then let some of them be negative

Suppose I am using such a system, and one of the characters -- call it '@' --
has a negative value.  The following program will not work:
    main() { int c; ... c = getchar(); ... if (c == '@') ... }
Note that getchar() returns an UNSIGNED char on success; this is to guarantee
that none of them compare equal to EOF.  Thus, any printing character that I
want to enclose in single quotes had better be positive, or it becomes VERY
awkward to use.

Please don't suggest that getchar() should return a signed char and that
'\377' should be reserved.  It won't work.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

ron@brl-sem.ARPA (Ron Natalie <ron>) (12/30/86)

In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
> In article <39@houligan.UUCP> dave@murphy.UUCP writes:
> >Summary: invent an 8-bit character set and then let some of them be negative
> 
> Suppose I am using such a system, and one of the characters -- call it '@' --
> has a negative value.  The following program will not work:
>     main() { int c; ... c = getchar(); ... if (c == '@') ... }

Getchar returns int.  The int has a character in it.  Before trying to
use it as such, you ought to either place it back into a character
variable explicitly or use a cast to char...

     main() { int c; ... c = getchar(); ... if ((char) c == '@') ... }

karl@haddock.UUCP (Karl Heuer) (01/09/87)

In article <548@brl-sem.ARPA> ron@brl-sem.ARPA (Ron Natalie <ron>) writes:
>In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
>> In article <39@houligan.UUCP> dave@murphy.UUCP writes:
>> >Summary: invent an 8-bit character set and let some of them be negative
>> 
>> Suppose I am using such a system, and one of the characters -- call it '@'
>> -- has a negative value.  The following program will not work:
>>     main() { int c; ... c = getchar(); ... if (c == '@') ... }
>
>Getchar returns int.  The int has a character in it.  Before trying to
>use it as such, you ought to either place it back into a character
>variable explicitly or use a cast to char...
>
>     main() { int c; ... c = getchar(); ... if ((char) c == '@') ... }

That's one way to "fix" the problem, but the construct I wrote is valid by
current standards and is a common idiom.  I don't think programmers would like
having to cast* the result of getchar() back into char before using it!

Your suggestion does make sense logically, though, and I think it supports my
contention that making getchar() an int function was a mistake in the first
place.**

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
*Of course, the cast is done *after* testing for EOF.
**I do have what I think is a better idea, but I'm not going to describe it in
this posting.  Anyway, it's too late to change getchar() now.

mouse@mcgill-vision.UUCP (01/12/87)

In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
> Suppose I am using such a system, and one of the characters -- call
> it '@' -- has a negative value.  The following program will not work:
>     main() { int c; ... c = getchar(); ... if (c == '@') ... }
> Note that getchar() returns an UNSIGNED char on success; this is to
> guarantee that none of them compare equal to EOF.  Thus, any printing
> character that I want to enclose in single quotes had better be
> positive, or it becomes VERY awkward to use.

Well.  Now, exactly what does it mean to say that @ is negative?
Presumably it means that the test below will succeed:

	char c; /* note: not int */
	....
	c = '@';
	if (c < 0)

Now, remember that everybody (K&R and H&S and I hope ANSI) agrees that
'x' is an int, not a char.  Notice that you can't make '@' the same
thing as what getchar() returns, because the following will fail:

	char string[something];
	....
	if (string[subscript] == '@')

About the neatest solution I see is to make 'x' have type unsigned char
rather than int, at least when there's only one character between
quotes (is there any code out there *using* multi-char character
constants?).  Then we also have to arrange that char and unsigned char
are not promoted to int in expressions not involving anything bigger
than char.  This should make both of these work.

Is there anything wrong with changing the type of 'x' literals (and
fixing char-only expressions), that is, will it break anything?

					der Mouse

USA: {ihnp4,decvax,akgua,utzoo,etc}!utcsri!mcgill-vision!mouse
     think!mosart!mcgill-vision!mouse
Europe: mcvax!decvax!utcsri!mcgill-vision!mouse
ARPAnet: think!mosart!mcgill-vision!mouse@harvard.harvard.edu

mouse@mcgill-vision.UUCP (01/12/87)

In article <293@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
> [Your suggestion] supports my contention that making getchar() an int
> function was a mistake in the first place.**

> **I do have what I think is a better idea, but I'm not going to
> describe it in this posting.

How about in another posting then?

What I normally do is something more like

	char c;					/* yes, char! */
	....
	c = getc(stream);			/* or getchar() if stdin */
	if (feof(stream) || ferror(stream))
	 { ....
	 }

ie, *ignore* the EOF return and check explicitly.

Is this better? worse? than the int c; c=getc(stream); if (c==EOF)
approach?  Why?

					der Mouse

dave@murphy.UUCP (01/14/87)

Summary: does EOF have to be -1?

In article <289@haddock.UUCP>, karl@haddock.ISC.COM.UUCP (Karl Heuer) types
(in response to an earlier article that I wrote):

>Suppose I am using such a system, and one of the characters -- call it '@' --
>has a negative value.  The following program will not work:
>    main() { int c; ... c = getchar(); ... if (c == '@') ... }
>Note that getchar() returns an UNSIGNED char on success; this is to guarantee
>that none of them compare equal to EOF.  Thus, any printing character that I
>want to enclose in single quotes had better be positive, or it becomes VERY
>awkward to use.

Thanks for pointing this out, but I don't see where it should cause a major
problem.  Assuming that the character set in use doesn't take up the entire
range of int values, all that is necessary is to pick a value for EOF that
doesn't correspond to any character value.  (Newcomers: keep in mind that
the return value of getchar and getc is defined as being an int, not a
char, even though it is often treated like a char.)  This way, getchar can
return a possibly negative value, and EOF won't collide with any legit
character value.

Would defining EOF to be something other than -1 cause a problem?  I don't
think so.  K&R says, on p. 144: "The standard library defines the symbolic
constant EOF to be -1 (with a #define in the file 'stdio.h'), but tests
should be written in terms of EOF, not -1, so as to be independent of the
specific value."  I don't think this is a situation like with NULL where the
actual value has a special meaning in the language definition, so I don't
see why it couldn't be changed.  People who are testing for -1 or for a
negative value instead of using EOF deserve whatever they get.

If anyone knows of any reason why the value of EOF can't be implementation-
specific, I'd like to hear about it.

karl@haddock.UUCP (01/19/87)

In article <598@mcgill-vision.UUCP> mcgill-vision!mouse (der Mouse) writes:
>In article <289@haddock.UUCP>, karl@haddock.UUCP (Karl Heuer) writes:
>> Suppose I am using such a system, and one of the characters -- call
>> it '@' -- has a negative value.  The following program will not work:
>>     main() { int c; ... c = getchar(); ... if (c == '@') ... }
>> ... Any printing character that I want to enclose in single quotes had
>> better be positive, or it becomes VERY awkward to use.
>
>Well.  Now, exactly what does it mean to say that @ is negative?
>Presumably it means that the test below will succeed:
>	char c = '@'; if (c < 0) ...

Actually what I meant was simply that "if ('@' < 0) ..." would succeed.  This
is not the same thing since '@' has type int.  Your test says only that char
is implemented as a signed datatype, and that '@' has the high bit set.

>Notice that you can't make '@' the same thing as what getchar() returns,
>because [char s[N]; if (s[0] == '@') ...] will fail.

That's the flip side of the problem, which I overlooked in my posting.  The
problem is independent of single-quotes; any machine on which characters are
signed will fail to handle the test (getchar() == s[0]).  The only reason it
"worked" so well on the pdp11 was that *in practice*, all the chars one has to
deal with (I'm assuming text characters, not one-byte integers) were 7-bit, so
it didn't matter whether they were sign-extended (as with s[0]) or unsigned
(as with getchar()).

>About the neatest solution I see is to make 'x' have type unsigned char
>rather than int, at least when there's only one character between
>quotes.  Then we also have to arrange that char and unsigned char
>are not promoted to int in expressions not involving anything bigger
>than char.  This should make both of these work.

I dunno.  A simpler solution is to assert that plain char is unsigned char.
As I said before, I suspect the adopted solution will be that in an 8-bit
environment plain char will be unsigned char; the only default-signed-char
compilers will be on pdp11-like machines in 7-bit environments.

>(is there any code out there *using* multi-char character constants?)

If so, it's almost all nonportable.  The only portable use I've seen was one I
wrote for a program that dealt with the two-letter codes found in termcap,
troff, etc: "switch (s[0]*'\1\0' + s[1]*'\0\1') { case 'xy': ...; }".  I ended
up not using it anyway, since lint didn't like it.  (But it is independent of
byte size and byte ordering.)

[From article <600@mcgill-vision.UUCP>, same author, again quoting kwzh]
>> [Your suggestion] supports my contention that making getchar() an int
>> function was a mistake in the first place.**

I am now even more sure, btw, that making it (int)(unsigned char)c was wrong.
(Perhaps, as someone else suggested, (int)c would have been better; provided
EOF is defined as something out-of-band like 0x8000.)

>> **I do have what I think is a better idea, but I'm not going to
>> describe it in this posting.

(This was because I tend to do a lot of my posting in the wee hours of the
morning, and I didn't trust myself to give any details.)

>How about in another posting then?

Stay tuned.  I'll probably be posting it to comp.lang.misc (since "it isn't C
anymore") sometime in February (not sooner; I have a big project due).  Look
for "Error handling".

>What I normally do is something more like [char c; /*!*/ ... c = getchar();
>if (feof(stdin)) ...] ie, *ignore* the EOF return and check explicitly.

I think that's a better model in that it doesn't rely on the ability to cast
char into a larger type; the problem is that it's cumbersome.  The common
idiom "while ((c = getchar()) != EOF) ..." has to be written with a comma
("while (c = getchar(), !feof(stdin)) ...") or a test-in-the-middle loop
("for (;;) { c = getchar(); if (feof(stdin)) break; ... }").
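
The three loop shapes mentioned above, written out as a sketch.  The
function names are hypothetical, each loop merely counts bytes, and
ferror() is tested alongside feof() (an addition to the loops quoted
above) so that a read error cannot spin forever.

```c
#include <stdio.h>

/* Classic idiom: keep the result in an int and compare with EOF. */
long count_with_eof(FILE *fp)
{
    long n = 0;
    int c;
    while ((c = getc(fp)) != EOF)
        n++;
    return n;
}

/* Comma-operator form: store into a char and ask the stream whether
 * the read actually succeeded. */
long count_with_comma(FILE *fp)
{
    long n = 0;
    char c;
    while (c = (char)getc(fp), !feof(fp) && !ferror(fp))
        n++;
    (void)c;    /* a real loop body would use c here */
    return n;
}

/* Test-in-the-middle form of the same loop. */
long count_with_break(FILE *fp)
{
    long n = 0;
    char c;
    for (;;) {
        c = (char)getc(fp);
        if (feof(fp) || ferror(fp))
            break;
        n++;        /* a real loop body would use c here */
    }
    (void)c;
    return n;
}
```

All three should agree on any input, including one that contains the
byte 0xFF; the difference is purely in how much the caller has to
write.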

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

gwyn@brl-smoke.ARPA (Doug Gwyn ) (01/20/87)

In article <44@houligan.UUCP> dave@murphy.UUCP writes:
>If anyone knows of any reason why the value of EOF can't be implementation-
>specific, I'd like to hear about it.

Of course EOF can be defined differently; its only constraint is that
it must be distinct from any possible valid value returned by getc()
(which, by the way, does NOT sign-extend input chars).

rbutterworth@watmath.UUCP (01/23/87)

In article <5541@brl-smoke.ARPA>, gwyn@brl-smoke.ARPA (Doug Gwyn ) writes:
> Of course EOF can be defined differently; its only constraint is that
> it must be distinct from any possible valid value returned by getc()
> (which, by the way, does NOT sign-extend input chars).

I think this discussion is going around in circles again.
The reason for the question about EOF being allowed to have a
value that does not correspond to the int value of any char,
signed or unsigned, was to allow getchar() to be redefined to
return a possibly sign-extended value.

If getchar() returned such a value, it would simplify things and
solve a number of problems.  The only things that would be hurt
are those programs that "know" what EOF looks like.

But if getchar() were to be so changed, then things like
    int c=getchar();
    if (c!=EOF){
        if (c==string[7])
would work correctly.  Under the current definition, "c" is not
sign extended but string[7] might be sign extended and the
comparison will fail even if the two characters are in fact the same.

Similarly the is*() and to*() functions could be defined to work
on both string[7] and getchar() results.

I don't know of any advantage or purpose to the current getchar()
behaviour.

Making EOF out-of-bounds allows getchar() to be defined differently
than it currently is, and thereby solves these problems.
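
A rough sketch of what such a redefinition might look like, built on
top of the existing getc().  Both sx_getc() and SX_EOF are hypothetical
names, the choice of INT_MIN as the out-of-band value is only one
possibility, and the sketch assumes int is wider than char.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical out-of-band end-of-file marker: no char value, signed
 * or unsigned, promotes to INT_MIN when int is wider than char. */
#define SX_EOF INT_MIN

/* Hypothetical variant of getc() that hands back the byte as a
 * sign-extended signed char rather than as an unsigned char. */
int sx_getc(FILE *fp)
{
    int c = getc(fp);
    if (c == EOF)
        return SX_EOF;
    return (signed char)c;  /* implementation-defined above 127 */
}

/* Returns 1 if the next byte on fp equals string[index]; no cast is
 * needed because both sides are sign-extended the same way. */
int next_byte_matches(FILE *fp, const signed char *string, int index)
{
    int c = sx_getc(fp);
    if (c == SX_EOF)
        return 0;
    return c == string[index];
}
```

Code that tests against the EOF macro would need only a rename here;
code that tests for -1 or for "any negative value" would break, which
is the trade-off under discussion.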

gwyn@brl-smoke.UUCP (01/28/87)

In article <4604@watmath.UUCP> rbutterworth@watmath.UUCP (Ray Butterworth) writes:
>But if getchar() were to be so changed, then things like
>    int c=getchar();
>    if (c!=EOF){
>        if (c==string[7])
>would work correctly.  Under the current definition, "c" is not
>sign extended but string[7] might be sign extended and the
>comparison will fail even if the two characters are in fact the same.

This is actually a consequence of the sloppy-signedness of "plain"
char.  If string[] is an array of unsigned chars, or if one uses
an explicit (unsigned char) cast on the right side (or a (char)
cast on the left side), your example will work under the current
rules.
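
The two casts, written out as a sketch with invented function names.
Here c is assumed to hold a successful getchar() result (already
checked against EOF) and ch is an element of a plain char array, which
may or may not be signed.

```c
#include <ctype.h>

/* Cast on the right: bring the stored char into the 0..UCHAR_MAX
 * range that getchar() uses. */
int same_char_unsigned(int c, char ch)
{
    return c == (unsigned char)ch;
}

/* Cast on the left: narrow the int to plain char, then compare.
 * Only meaningful after c has been tested against EOF. */
int same_char_plain(int c, char ch)
{
    return (char)c == ch;
}

/* The same (unsigned char) cast keeps the <ctype.h> functions inside
 * the range of arguments they are defined for. */
int is_upper_char(char ch)
{
    return isupper((unsigned char)ch) != 0;
}
```

Either comparison gives the same answer whether plain char is signed
or unsigned on the machine at hand.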

>I don't know of any advantage or purpose to the current getchar()
>behaviour.

It had the "advantage" of returning a single value rather than two.
This fit in with common style and supported things like
	while ( (c = getchar()) != EOF )
		putchar( c );
It is certainly too late to change the getchar() interface, even if
one agrees that it "should" have been designed differently.