[comp.lang.c] How to use toupper

msb@sq.uucp (Mark Brader) (01/07/89)

Well, so far I've seen four incorrect articles on this topic, so
I suppose I'd better say something.  Throughout this article, assume
that every code sample is preceded by the following lines:

	#include <ctype.h>
	char *p;

The subthread starts with Wayne Throop, who revises his original code:

>	*p = toupper(*p);				/* Version 1 */

by saying:

> Nearly as I can tell, dpANS says that ... ought to have been
>	if (*p >= 0) *p = toupper(*p);			/* Version 2 */

> because all functions defined in ctype.h take arguments that
> are integers, but give defined results only if those integers have
> values representable as unsigned integer (or are the constant EOF).

The corrected code is almost right, and the reason is also almost
right.  Doug Gwyn attempts to correct the latter by saying:

> I don't think EOF is supposed to be a valid argument for toupper() ...

which is wrong.  In fact, according to the dpANS (i.e. the draft standard),
specifically the beginning of section 4.3:

#  The header <ctype.h> declares several functions useful for testing
#  and mapping characters.  In all cases the argument is an "int", the
#  value of which shall be representable as an "unsigned char" or shall
#  equal the value of the macro EOF.

Note that that's "unsigned char", not "unsigned int".  Since *p is a
char, it is guaranteed to fit in an unsigned char if and only if its
value is nonnegative.  So Wayne's code will not call toupper() with
an incorrect argument (one that causes undefined behavior).
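
As a minimal sketch of the argument rule quoted from section 4.3, the
test can be written out as a function.  The name valid_ctype_arg is
hypothetical and is not part of <ctype.h>; it only restates the rule.

```c
#include <limits.h>  /* UCHAR_MAX */
#include <stdio.h>   /* EOF */

/* Hypothetical helper (not part of <ctype.h>): nonzero if c lies in
 * the argument domain quoted from section 4.3, i.e. it is
 * representable as an unsigned char or equals EOF. */
static int valid_ctype_arg(int c)
{
    return c == EOF || (c >= 0 && c <= UCHAR_MAX);
}
```

For a plain char the upper bound can never fail, so the check reduces
to the `*p >= 0` of Version 2.  A negative char that happens to equal
EOF would also pass this helper without being a character, which is a
reason to test the sign directly as Version 2 does.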

However, this doesn't necessarily mean that it works.  The reason is that
the dpANS provides support for languages other than English and character
sets other than ASCII.  Here's the description of toupper() [section 4.3.2.2]:

#  If the argument is a character for which "islower" is true and there is
#  a corresponding character for which "isupper" is true, the "toupper"
#  function returns the corresponding character; otherwise the result
#  is returned unchanged.

So now we have to know about islower() [section 4.3.1.6]:

#  "islower" tests for any character that is a lower-case letter or is one
#  of an implementation-defined set of characters for which none of "iscntrl",
#  "isdigit", "ispunct", or "isspace" is true.  In the "C" locale, "islower"
#  returns true only for the ... lower-case [English] letters ...

Now the value of a "char" is SURE to be nonnegative ONLY [section 3.1.2.5]:

#  [if] a member of the required source character set enumerated in section
#  2.2.1 is stored in [it].

Notice the term "required source character set".  This means that in a
non-English environment, i.e. with a locale other than the default "C",
it is possible for valid arguments of toupper() to have negative values
when stored in a "char".  So Wayne's Version 2, used in a loop, might
transform "francais" into, not "FRANCAIS", but "FRANcAIS"!
               '                    '               '
(The apostrophes below the line represent cedillas, of course.)

This turns out to be a real problem.  See, you have a "char" value coming
from *p, but you need an "unsigned char" value to convert to "int" to pass
to toupper().  Well, these conversions are well-defined (provided that "char"
is narrower than "int" or is an unsigned type), so that's not too bad.
But now the result has to be stored back in *p.  And section 3.2.1.2 says:

#  When an integer is demoted to a signed integer with smaller
#  size, or an unsigned integer is converted to its corresponding
#  signed integer, if the value cannot be represented the result is
#  implementation-defined.

That is, the transformation from "char" *p to "unsigned char" is NOT
necessarily reversible, even if toupper() doesn't change the value.
The truth is that the dpANS provides NO way to safely use toupper()
to do the desired transformation on a "char" object.  (I wish I'd
realized this while X3J11 was still taking comments!)  The best you
can do is to avoid "char" altogether and use "unsigned char".  You
probably have to do it throughout the program, in fact.

And once the program is changed in this fashion, Version 1 becomes correct
after all.
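
A sketch of what that looks like in a loop, assuming the string is
already held in unsigned char storage.  The function name upcase_uchar
is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>  /* NULL */

/* Hypothetical helper: upper-case a NUL-terminated string held in
 * unsigned char storage.  Every *p is already in the range
 * 0..UCHAR_MAX, so Version 1 needs no guard, and the result of
 * toupper() always fits back into *p. */
static void upcase_uchar(unsigned char *s)
{
    unsigned char *p;

    assert(s != NULL);
    for (p = s; *p != '\0'; ++p)
        *p = toupper(*p);               /* Version 1 */
}
```

Which characters beyond the English letters are actually changed
depends on the locale in effect; in the "C" locale only a-z are.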


(Of course, "char" might be an unsigned type; or even if not, an
implementation might take the approach common today and define the
conversion from "char" to "unsigned char" as being reversible.
The point is that this is not guaranteed, just as an ASCII character
set is common but not guaranteed.)


Bruce Becker attacks essentially the same problem by writing:
>	*p = toupper( *p&0xff );			/* Version 3 */
> This gets at possible 256-character sets in the environments where
> the compiler &/| hardware has sign-extended the negative byte value.

But this solution is much too specific.  Bruce himself points out, thinking
of a specific implementation of the ctype family functions/macros, that:

> Not all _ctype arrays have the same range - some are only
> 128 bytes.  In those cases the '0xff' above becomes '0x7f'.

This is only one problem.  The number of bits in a "char" need not be 8.
More important still, masking with & is not the right way to do the type
conversion: it operates on the representation rather than on the value,
so it gives the intended result only on 2's complement machines.
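
For contrast, a sketch of the conversion done by value.  As I read the
conversion rules, a cast to an unsigned type is defined arithmetically
(a negative value has UCHAR_MAX+1 added to it), so it does not depend
on the width of "char" or on how negative numbers are represented.
The name ctype_arg is hypothetical.

```c
#include <limits.h>  /* UCHAR_MAX */

/* Hypothetical helper: turn a plain char into a value in the range
 * 0..UCHAR_MAX by value conversion rather than by bit masking. */
static int ctype_arg(char c)
{
    return (unsigned char)c;
}
```

This only addresses the argument passed to the ctype function; storing
the result back into a plain "char" is a separate conversion.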


So far I have been talking about dpANSish environments, but in practice
we generally also have to consider older environments.  The behavior of
toupper() is one of the things that differs between C implementations.

Peter da Silva writes:

> Gee, I always do this:
>	if (islower(*p)) *p = toupper(*p);		/* Version 4 */
>
> While dpANS might have decided that toupper should bounds check, there
> are too many V7-oid compilers out there that it's better to put the
> bounds check in.

He's right.  I'm writing this article on a Sun running SunOS 3.2, which
resembles BSD 4.2 or 4.3, I'm told.  On this machine, toupper(x) just
returns (x)+'A'-'a', no matter what x is; this is quite different from
the dpANS definition above.
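
To make the difference concrete, here is a sketch with a stand-in for
the old-style definition.  OLD_TOUPPER and guarded_upper are
hypothetical names used for illustration, not the actual SunOS source.

```c
#include <ctype.h>

/* Hypothetical stand-in for the old-style macro described above: it
 * shifts every argument, whether or not it is a lower-case letter. */
#define OLD_TOUPPER(x) ((x) + 'A' - 'a')

/* Version 4 style guard around the old macro.  c is assumed to be a
 * value for which islower() is defined on the system at hand. */
static int guarded_upper(int c)
{
    return islower(c) ? OLD_TOUPPER(c) : c;
}
```

Without the guard, OLD_TOUPPER('A') yields some other character
entirely, since 'A' - 'a' is never zero.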

But Peter is wrong to use islower() alone as the check.  On this machine,
for instance, "man 3 ctype" states:

@  ... isascii is defined on all integer values; [islower and]
@  the rest are defined only where isascii(c) is true and on
@  the single non-ASCII value EOF (see stdio(3S)).

(This is an ancestor of the first-quoted section of the dpANS.)
So Peter's code, too, can fail if *p is negative and not EOF.
To conform with the requirements of our system, he could write:

	if (isascii(*p) && islower(*p)) *p = toupper(*p);  /* Version 5 */

And this is safe on all systems ... if they provide all of these functions.
Unfortunately, X3J11 got a little carried away in the direction of
internationalization, and refused to put in isascii().  [My suggestion
was that in a non-ASCII environment it should simply test for values
that are valid arguments for the other ctype family functions/macros.
They didn't want things to even look biased toward ASCII.  Despite which,
the dpANS does have a function called asctime().  Nobody's perfect.]


Now, I'm not aware of any existing systems that have toupper() and not
isascii().  But when the dpANS becomes a Standard, it's safe to assume
that there will be some.  So for now, the best compromise seems to be:

#ifdef ANSI_C
	if (*p >= 0) *p = toupper(*p);			/* Version 2 */
#else
	if (isascii(*p) && islower(*p)) *p = toupper(*p);  /* Version 5 */
#endif

If unsigned chars can be used instead of chars, Version 1 can be
substituted for Version 2 in the compromise code.  It shouldn't hurt
to use unsigned chars on older implementations either -- except those so
old they don't HAVE unsigned chars.
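
Wrapped in a loop, the compromise might look like the sketch below.
It assumes the predefined macro __STDC__ as the test for a conforming
compiler, in place of the ANSI_C name used above, and the function
name upcase_compromise is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>  /* NULL */

/* Hypothetical wrapper around the compromise: Version 2 on a
 * standard-conforming compiler, Version 5 elsewhere. */
static void upcase_compromise(char *s)
{
    char *p;

    assert(s != NULL);
    for (p = s; *p != '\0'; ++p) {
#ifdef __STDC__
        if (*p >= 0)
            *p = toupper(*p);                   /* Version 2 */
#else
        if (isascii(*p) && islower(*p))
            *p = toupper(*p);                   /* Version 5 */
#endif
    }
}
```

Either branch leaves characters with negative values untouched, so the
limitation for non-English locales described earlier still applies.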


One more note.  Wayne's article continued as follows:

> Unless, of course, you are willing to otherwise ensure that s only points
> at strings of vanilla characters... then the loop is OK as it was.

If we take "vanilla characters" to mean those that are required to be
in the source character set, then he's right in a dpANS environment.
In an old-fashioned environment, however, toupper() may be safe ONLY
if the characters processed are known to be lower-case letters.


I'll probably be posting something to comp.std.c about the problems
with the dpANS and "char" and foreign character sets.  This article,
however, is quite long enough as it is.


Mark Brader	   The "I didn't think of that" type of failure occurs because
Toronto		   I didn't think of that, and the reason I didn't think of it
utzoo!sq!msb	   is because it never occurred to me.  If we'd been able to
msb@sq.com	   think of 'em, we would have.		-- John W. Campbell

karl@haddock.ima.isc.com (Karl Heuer) (01/12/89)

This has mostly reduced to an ANSI-C-specific issue, so I'm redirecting
followups to comp.std.c.

In article <1989Jan6.231955.7445@sq.uucp> msb@sq.com (Mark Brader) writes:
>So for now, the best compromise seems to be:
>#ifdef __STDC__	/* [corrected --kwzh] */
>	if (*p >= 0) *p = toupper(*p);			/* Version 2 */
>#else
>	if (isascii(*p) && islower(*p)) *p = toupper(*p);  /* Version 5 */
>#endif

As Mark already pointed out, version 2 can break in an international
environment.  My recommendation (in a parallel article) was
	*p = toupper((unsigned char)*p);		/* Version 6 */
which has the subtle flaw that, if plain chars are signed and the result of
toupper() doesn't fit, ANSI C does not guarantee the integrity of the value
(the conversion is implementation-defined).
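
Spelled out as a loop, with the two conversions separated so the weak
spot is visible (a sketch; the function name upcase_v6 is
hypothetical):

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>  /* NULL */

/* Hypothetical helper applying Version 6 to a whole string. */
static void upcase_v6(char *s)
{
    char *p;

    assert(s != NULL);
    for (p = s; *p != '\0'; ++p) {
        /* The cast keeps the argument inside 0..UCHAR_MAX. */
        int up = toupper((unsigned char)*p);

        /* If plain char is signed and up exceeds CHAR_MAX, this
         * conversion back to char is implementation-defined. */
        *p = (char)up;
    }
}
```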

Mark further points out in e-mail:
>The trouble is that while Version 2 can break for some characters in the
>international environment, Version 6 can break for ALL characters in a
>vanilla environment ("C" locale)!

Well, not *all* characters; just those that appear negative (and hence don't
fit when converted back from unsigned char).  And this set is guaranteed to
exclude the minimal execution character set.  But the code as written could
still produce surprises on a sufficiently weird implementation which is still
within the letter of the Standard.

>The best you can do is to avoid "char" altogether and use "unsigned char".
>You probably have to do it throughout the program, in fact.

If the program has to be strictly conforming, you may be right.  (But then
string literals, and functions that expect `char *' arguments, may screw
things up; casting the pointers ought to be safe, though.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint

throopw@xyzzy.UUCP (Wayne A. Throop) (01/13/89)

> karl@haddock.ima.isc.com (Karl Heuer)
>> msb@sq.com (Mark Brader)
>>The best you can do is to avoid "char" altogether and use "unsigned char".
>>You probably have to do it throughout the program, in fact.
> If the program has to be strictly conforming, you may be right.  (But then
> string literals, and functions that expect `char *' arguments, may screw
> things up; casting the pointers ought to be safe, though.)

If (ah say *IF*) it ought to be safe to cast pointers between (char *)
and (unsigned char *) types, why can't the problematical case conversion
be done like so:

        unsigned char *p;
        char *s;
        ...
        for( p=(unsigned char *)s; *p; ++p )
            *p = toupper( *p );

But on the other hand... if the above code is unsafe (and I see nothing
in dpANS which makes it safe), why would it be safe to use unsigned
characters hither and thither and simply cast pointers to these to
apply standard signed-character-expecting library routines to them?

(Gad, don't the simplest issues turn out to be cans of worms at times?)

--
"I really ought to do better next year."
"It's the 'ought' that counts."
                                        --- paraphrase of Bloom County
-- 
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw