[comp.std.c] How to use toupper

karl@haddock.ima.isc.com (Karl Heuer) (01/12/89)

This has mostly reduced to an ANSI-C-specific issue, so I'm redirecting
followups to comp.std.c.

In article <1989Jan6.231955.7445@sq.uucp> msb@sq.com (Mark Brader) writes:
>So for now, the best compromise seems to be:
>#ifdef __STDC__	/* [corrected --kwzh] */
>	if (*p >= 0) *p = toupper(*p);			/* Version 2 */
>#else
>	if (isascii(*p) && islower(*p)) *p = toupper(*p);  /* Version 5 */
>#endif

As Mark already pointed out, version 2 can break in an international
environment.  My recommendation (in a parallel article) was
	*p = toupper((unsigned char)*p);		/* Version 6 */
which has the subtle flaw that, if plain chars are signed and the result of
toupper() doesn't fit, ANSI C does not guarantee the integrity of the value
(the conversion is implementation-defined).
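One way to narrow the damage is to classify first and convert only on
success.  Here is a sketch (`safe_toupper` is a name I've made up, not
part of any library): in the "C" locale no high-bit byte is lowercase,
so nothing is ever converted out of range; for locale-specific letters
the implementation-defined char conversion of Version 6 remains, though
it is benign on the usual two's-complement machines.

```c
#include <ctype.h>

/* Hypothetical helper, not from the thread: casts away possible sign
 * extension before the classification test, and only stores the
 * toupper() result when islower() says there is something to convert. */
static char safe_toupper(char c)
{
    unsigned char uc = (unsigned char)c;  /* avoid sign extension */
    if (islower(uc))
        return (char)toupper(uc);         /* a-z always fit back in char */
    return c;                             /* leave every other byte alone */
}
```

With this, the loop body becomes `*p = safe_toupper(*p);`.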

Mark further points out in e-mail:
>The trouble is that while Version 2 can break for some characters in the
>international environment, Version 6 can break for ALL characters in a
>vanilla environment ("C" locale)!

Well, not *all* characters; just those that appear negative (and hence don't
fit when converted back from unsigned char).  And this set is guaranteed to
exclude the minimal execution character set.  But the code as written could
still produce surprises on a sufficiently weird implementation which is still
within the letter of the Standard.

>The best you can do is to avoid "char" altogether and use "unsigned char".
>You probably have to do it throughout the program, in fact.

If the program has to be strictly conforming, you may be right.  (But then
string literals, and functions that expect `char *' arguments, may screw
things up; casting the pointers ought to be safe, though.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

rbutterworth@watmath.waterloo.edu (Ray Butterworth) (01/19/89)

getchar() presents yet another aspect of the problem.  Consider:
    switch (getchar()) {
        case EOF:
            ...
        case 'C':
            ...
    }
If 'C' is any character that sign extends, the switch won't work.
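The failure mode is concrete with a high-bit character: on a signed-char
implementation, '\377' is the integer -1, which collides with a typical
EOF.  One hedge (an illustrative sketch, not from the thread; `classify`
is an invented name) is to cast the case label through unsigned char:

```c
#include <stdio.h>

/* On a signed-char compiler, '\377' == -1, which may equal EOF.
 * Casting the label through unsigned char keeps it distinct (255). */
int classify(int c)                 /* c is a getchar() result */
{
    switch (c) {
    case EOF:
        return 0;
    case (unsigned char)'\377':     /* 255, never collides with EOF */
        return 1;
    default:
        return 2;
    }
}
```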

> karl@haddock.ima.isc.com (Karl Heuer)
> > msb@sq.com (Mark Brader)
> > The best you can do is to avoid "char" altogether and use "unsigned char".
> > You probably have to do it throughout the program, in fact.
> If the program has to be strictly conforming, you may be right.  (But then
> string literals, and functions that expect `char *' arguments, may screw
> things up; casting the pointers ought to be safe, though.)

i.e. you will have to say
    (unsigned char *)"string"
or
    (unsigned char)'C'
whenever you use any literal, and you'll have to cast all your (char*)
arguments to standard ANSI functions.  This is true for any application
that might be used in a locale with non-ASCII character sets and wants
to be portable to any conforming ANSI compiler that might have chosen to
treat chars as signed.
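Concretely, a program that standardizes on unsigned char ends up casting
in both directions at every boundary.  A small sketch (the function
names here are invented for illustration):

```c
#include <string.h>

/* Literals are char, the buffers are unsigned char: a cast at every
 * meeting point. */
static int begins_with_C(const unsigned char *s)
{
    return s[0] == (unsigned char)'C';
}

static void set_greeting(unsigned char *buf)
{
    /* strcpy() wants char *, so the buffer must be cast going in. */
    strcpy((char *)buf, "C programs");
}
```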

In general though, if the compiler is expected to produce programs that
can work on a local character set containing characters with the high
bit set, it is almost certain that the compiler will have to treat
(char) as (unsigned char).  Anyone who really wants to use chars to
perform signed arithmetic can now explicitly ask for (signed char).

The Standard should have explicitly stated that (char) is identical
with (unsigned char), and mentioned that compilers may, as an extension,
treat chars as signed for backward compatibility.  At least, this
should have been listed as a deprecated feature that will probably
be eliminated in future versions of the Standard.

In practice I'm sure that is the way it will eventually turn out.
I can't imagine any European ANSI compiler having (char) signed.
It would provide far too little benefit and far too many complications.


Much of this was mentioned to the Committee.
e.g. Letter P04 to the Second Public Review contained:

+  4.3 Character Handling:
+      Most of these functions don't work for signed char values if
+  the upper bit is on.  Is it unreasonable to expect that with
+      char c[10];
+      int i;
+      c[0] = i = getchar();
+  the function calls
+      isxxx(*c)
+  and
+      isxxx(i)
+  should behave the same way if "i" is not EOF?  This is not difficult
+  to do, and there certainly can't be any existing code that depends on
+  the described behavior.  Why not state that if the argument is not
+  EOF, the result will be the same as if the argument were cast to
+  unsigned char.  This would also remove the need for an equivalent to
+  the "isascii" function.

Perhaps I overestimated their abilities when I said "is not difficult".
Their response was:

+  This was considered a request for information, not an issue.
Well, it certainly looks like an issue to me.

+  It was never intended that they do so.  If you pass a signed char
+  argument and the sign is extended, the resulting value will not fit
+  in an unsigned char, as required.
Exactly.  I'm saying that you don't need to require it.
Drop that requirement and say that they are only defined to work on
values that can be returned by getchar().

+  Your suggestion would require the <ctype.h> functions to cast their
+  argument to unsigned char if it is non-EOF.
No it wouldn't.

+  This would require macro versions to evaluate their argument more
+  than once (once to test for EOF and once to cast them), rendering
+  them unsafe.
No, it would not require that macros evaluate their argument more
than once.  At worst it would require defining EOF as some negative
value other than -1, something that is explicitly allowed by the Standard.

rbutterworth@watmath.waterloo.edu (Ray Butterworth) (01/20/89)

In article <23156@watmath.waterloo.edu>, rbutterworth@watmath.waterloo.edu (Ray Butterworth) writes:
> Drop that requirement and say that they are only defined to work on
> values that can be returned by getchar().

Hmm.  That's not quite what I meant to say.

The functions should be well defined for all (int) input,
with the restriction that, except for EOF, only the low order byte
of the int will be examined.  The values returned will be those
that can be returned by getchar().

gwyn@smoke.BRL.MIL (Doug Gwyn ) (01/21/89)

In article <23156@watmath.waterloo.edu> rbutterworth@watmath.waterloo.edu (Ray Butterworth) writes:
>If 'C' is any character that sign extends, the switch won't work.

In Standard C it will work for all 'C' in the required source
character set, which are required to have nonnegative values as
"chars", which match the character-constant integer values.

Sign extension of "chars" has ALWAYS been an issue for portable
C programming, and there is no way to fix that in the Standard
other than by outlawing signed implementation of "char", which
the Committee has repeatedly declined to do.

>Much of this was mentioned to the Committee.
>e.g. Letter P04 to the Second Public Review contained: ...

And letter L181, responded to in the third formal public review,
repeated the point, which I summarized as:

	<ctype.h> macros should be made safe for "signed char"
	arguments.

The X3J11 response was:

	The Committee has voted against this idea.

	"char" arguments can always be cast to "unsigned char"
	type before calling a <ctype.h> function.  Indeed, since
	one cannot distinguish in classic signed "char"
	implementations between EOF (with value -1) and the
	8-bit char '377', whenever there might be a '377' value
	for a "char" the argument *must* be cast to an "unsigned
	char" for such implementations.

>+  ... there certainly can't be any existing code that depends on
>+  the described behavior.

But there is a LOT of code that does.  As you note, implementations
in environments with characters with high bits set are advised
to implement plain "char" as if it were unsigned.

>Drop that requirement and say that they are only defined to work on
>values that can be returned by getchar().

What's the point?  getchar() can return ANY byte value, or EOF.  The
proposed Standard already requires that the argument to a ctype-
function be representable as an "unsigned char" or equal the value EOF.
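For the record, the calling pattern that satisfies this requirement
looks like the following (a sketch, not committee text; `count_lower`
is an invented name):

```c
#include <ctype.h>

/* Count the lowercase letters in a string.  The cast at the call
 * site keeps every byte, including '\377', inside the required
 * domain: representable as unsigned char, or equal to EOF. */
static int count_lower(const char *s)
{
    int n = 0;
    for (; *s != '\0'; s++)
        if (islower((unsigned char)*s))
            n++;
    return n;
}
```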

You've tried at least twice now to convince X3J11 to change this part
of the specification, and you haven't convinced them.  Why carry on
about it here?

rbutterworth@watmath.waterloo.edu (Ray Butterworth) (01/26/89)

In article <9457@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:
>     <ctype.h> macros should be made safe for "signed char"
>     arguments.
> The X3J11 response was:
>     The Committee has voted against this idea.
>     "char" arguments can always be cast to "unsigned char"
>     ... the argument *must* be cast to an "unsigned
>     char" for such implementations.

i.e. to write portable code, the user must put in lots of casts
rather than having the library do the messy stuff.
I thought the purpose of the Standard was to make writing portable
code easier.  Remember, the major argument people use for not
writing portable code is that it is too much trouble.  I really
can't see most people putting in these casts in the interest of
portability to other compilers or character sets.

> What's the point?  getchar() can return ANY byte value, or EOF.  The
> proposed Standard already requires that the argument to a ctype-
> function be representable as an "unsigned char" or equal the value EOF.
> 
> You've tried at least twice now to convince X3J11 to change this part
> of the specification, and you haven't convinced them.  Why carry on
> about it here?

Why?  Would you believe it's because I think I'm right and you are wrong?

But yes, I realize it is too late to change the Standard now.
On the other hand, there is no reason that library implementors
can't do it the "right" way, and with any luck the next version
of the Standard might go along with them.


A minor change in the wording of the standard would force only a
slight change in the implementation, with no change in efficiency,
complete compatibility with existing user code, and a vast
improvement in the programming environment provided by the language.

The minor change (i.e. valid input is EOF or the low order byte of
the argument) would at worst require that the implementation
increase the size of the table used by macros by 50% (e.g. -128 to
255, plus EOF), but it would suddenly eliminate a lot of potential
bugs and non-portabilities in a lot of user code, and make it a lot
simpler for normal people to write portable programs without going
to any extra effort.  There would simply never be a need for anyone
ever to cast the arguments to the ctype functions for portability,
and the non-existence of isascii() would no longer be a problem.

Note that on most compilers EOF could still be defined as -1
(something that isn't required by the Standard anyway), since
none of the isxxx((unsigned char)'\377') functions are required
to return true for any of the existing compilers that use 7 bit
ASCII.
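Ray's extended-table idea can be sketched like this (`MY_EOF`,
`lower_tab`, and `my_islower` are invented names; a real library would
build the table statically).  EOF is chosen as -129, just outside the
char range as the Standard permits, so the macro indexes once and never
evaluates its argument twice:

```c
#define MY_EOF (-129)                      /* outside any char's range */

/* Table indexed by value + 129, covering MY_EOF (-129) through 255:
 * every signed-char value, every unsigned-char value, and EOF. */
static unsigned char lower_tab[129 + 256];

static void init_lower_tab(void)
{
    int c;
    for (c = 'a'; c <= 'z'; c++)
        lower_tab[c + 129] = 1;
    /* A locale with high-bit lowercase letters would also mark both
     * the unsigned value and its sign-extended alias here. */
}

/* Argument evaluated exactly once: safe even as my_islower(getchar()). */
#define my_islower(c) (lower_tab[(c) + 129])
```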

karl@haddock.ima.isc.com (Karl Heuer) (01/27/89)

In article <23261@watmath.waterloo.edu> rbutterworth@watmath.waterloo.edu (Ray Butterworth) writes:
>The minor change to the standard (i.e. valid input is EOF or the low order
>byte of the argument), would at worst require that the implementation
>increase the size of the table used by macros by 50% (e.g. -128 to 255, plus
>EOF)

In order to make that guarantee, and continue to implement the routines as
safe macros, I believe you'd have to change your wording.  You can have '\200'
represented as both 0x80 and 0xFFFFFF80, but I don't see any way to make
`islower(0x12345680)' also work (in this worst-case scenario).

(The following is from an article that I started to write but decided not to
post.  I think.  If I did post it, you may have already seen this.)

The real problem, as I see it, predates the very existence of unsigned char.
The getchar() function has a return type that logically should be `char', but
instead is `int'.  If not for this use of an OOB value to flag the error
condition, there would be no need for the macro EOF, and the isxxx() routines
could have their domain defined as simply `char'.  I suspect that there
wouldn't even have been any need for signed/unsigned char at all, except for
arithmetic, and in that case it could have used the less misleading name
`[unsigned] short short int'.

That would still leave the problem of how to detect the error condition.  I
won't go into that, because I don't want to get involved% in the flame war
that would inevitably result.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
________
% Not today, anyway.

diamond@csl.sony.JUNET (Norman Diamond) (01/27/89)

In article <9457@smoke.BRL.MIL>, gwyn@smoke.BRL.MIL (Doug Gwyn ) writes:
> >     <ctype.h> macros should be made safe for "signed char"
> >     arguments.
> > The X3J11 response was:
> >     The Committee has voted against this idea.
> >     "char" arguments can always be cast to "unsigned char"
> >     ... the argument *must* be cast to an "unsigned
> >     char" for such implementations.

In article <23261@watmath.waterloo.edu>, rbutterworth@watmath.waterloo.edu (Ray Butterworth) writes:
> i.e. to write portable code, the user must put in lots of casts
> rather than having the library do the messy stuff.

alias cc cc -Dchar="unsigned char"

alias X3J11 X3J10.95

alias inews news + fillers
-- 
Norman Diamond, Sony Computer Science Lab (diamond%csl.sony.jp@relay.cs.net)
  The above opinions are my own.   |  Why are programmers criticized for
  If they're also your opinions,   |  re-inventing the wheel, when car
  you're infringing my copyright.  |  manufacturers are praised for it?