msb@sq.uucp (Mark Brader) (01/07/89)
Well, so far I've seen four incorrect articles on this topic, so I
suppose I'd better say something.

In this article assume before every code sample the following lines:

        #include <ctype.h>
        char *p;

The subthread starts with Wayne Throop, who revises his original code:

>       *p = toupper(*p);                       /* Version 1 */

by saying:

> Nearly as I can tell, dpANS says that ... ought to have been
>       if (*p >= 0) *p = toupper(*p);          /* Version 2 */
> because all functions defined in ctype.h take arguments that
> are integers, but give defined results only if those integers have
> values representable as unsigned integer (or are the constant EOF).

The corrected code is almost right, and the reason is also almost right.
Doug Gwyn attempts to correct the latter by saying:

> I don't think EOF is supposed to be a valid argument for toupper() ...

which is wrong. In fact, according to the dpANS (i.e. the draft standard),
specifically the beginning of section 4.3:

# The header <ctype.h> declares several functions useful for testing
# and mapping characters. In all cases the argument is an "int", the
# value of which shall be representable as an "unsigned char" or shall
# equal the value of the macro EOF.

Note that that's "unsigned char", not "unsigned int". Since *p is a char,
it is guaranteed to fit in an unsigned char if and only if its value is
nonnegative. So Wayne's code will not call toupper() with an incorrect
argument (one that causes undefined behavior).

However, this doesn't necessarily mean that it works. The reason is that
the dpANS provides support for languages other than English and character
sets other than ASCII. Here's the description of toupper() [section 4.3.2.2]:

# If the argument is a character for which "islower" is true and there is
# a corresponding character for which "isupper" is true, the "toupper"
# function returns the corresponding character; otherwise the result
# is returned unchanged.

So now we have to know about islower() [section 4.3.1.6]:

# "islower" tests for any character that is a lower-case letter or is one
# of an implementation-defined set of characters for which none of "iscntrl",
# "isdigit", "ispunct", or "isspace" is true. In the "C" locale, "islower"
# returns true only for the ... lower-case [English] letters ...

Now the value of a "char" is SURE to be nonnegative ONLY [section 3.1.2.5]:

# [if] a member of the required source character set enumerated in section
# 2.2.1 is stored in [it].

Notice the term "required source character set". This means that in a
non-English environment, i.e. with a locale other than the default "C",
it is possible for valid arguments of toupper() to have negative values
when stored in a "char". So Wayne's Version 2, used in a loop, might
transform

    "francais" into, not "FRANCAIS", but "FRANcAIS"!
         '                    '               '

(The apostrophes below the line represent cedillas, of course.)

This turns out to be a real problem. See, you have a "char" value coming
from *p, but you need an "unsigned char" value to convert to "int" to pass
to toupper(). Well, these conversions are well-defined (provided that
"char" is narrower than "int" or is an unsigned type), so that's not too
bad. But now the result has to be stored back in *p. And section 3.2.1.2
says:

# When an integer is demoted to a signed integer with smaller
# size, or an unsigned integer is converted to its corresponding
# signed integer, if the value cannot be represented the result is
# implementation-defined.

That is, the transformation from "char" *p to "unsigned char" is NOT
necessarily reversible, even if toupper() doesn't change the value.

The truth is that the dpANS provides NO way to safely use toupper() to
do the desired transformation on a "char" object. (I wish I'd realized
this while X3J11 was still taking comments!)

The best you can do is to avoid "char" altogether and use "unsigned char".
You probably have to do it throughout the program, in fact.
And once the program is changed in this fashion, Version 1 becomes correct
after all.

(Of course, "char" might be an unsigned type; or even if not, an
implementation might take the approach common today and define the
conversion from "char" to "unsigned char" as being reversible. The point
is that this is not guaranteed, just as an ASCII character set is common
but not guaranteed.)

Bruce Becker attacks essentially the same problem by writing:

>       *p = toupper( *p&0xff );                /* Version 3 */
> This gets at possible 256-character sets in the environments where
> the compiler &/| hardware has sign-extended the negative byte value.

But this solution is much too specific. Bruce himself points out, thinking
of a specific implementation of the ctype family functions/macros, that:

> Not all _ctype arrays have the same range - some are only
> 128 bytes. In those cases the '0xff' above becomes '0x7f'.

This is only one problem. The number of bits in a "char" may not be 8.
And still more important, masking with & is not the right way to do the
type conversion. That only works on 2's complement machines.

So far I have been talking about dpANSish environments, but in practice
we generally also have to consider older environments. The behavior of
toupper() is one of the things that differs between C implementations.
Peter da Silva writes:

> Gee, I always do this:
>       if (islower(*p)) *p = toupper(*p);      /* Version 4 */
>
> While dpANS might have decided that toupper should bounds check, there
> are too many V7-oid compilers out there that it's better to put the
> bounds check in.

He's right. I'm writing this article on a Sun running SunOS 3.2, which
resembles BSD 4.2 or 4.3, I'm told. On this machine, toupper(x) just
returns (x)+'A'-'a', no matter what x is; this is quite different from
the dpANS definition above.

But Peter is wrong to use islower() alone as the check. On this machine,
for instance, "man 3 ctype" states:

@ ... isascii is defined on all integer values; [islower and]
@ the rest are defined only where isascii(c) is true and on
@ the single non-ASCII value EOF (see stdio(3S)).

(This is an ancestor of the first-quoted section of the dpANS.)

So Peter's code, too, can fail if *p is negative and not EOF. To conform
with the requirements of our system, he could write:

        if (isascii(*p) && islower(*p)) *p = toupper(*p);  /* Version 5 */

And this is safe on all systems ... if they provide all of these functions.

Unfortunately, X3J11 got a little carried away in the direction of
internationalization, and refused to put in isascii(). [My suggestion was
that in a non-ASCII environment it should simply test for values that are
valid arguments for the other ctype family functions/macros. They didn't
want things to even look biased toward ASCII. Despite which, the dpANS
does have a function called asctime(). Nobody's perfect.]

Now, I'm not aware of any existing systems that have toupper() and not
isascii(). But when the dpANS becomes a Standard, it's safe to assume
that there will be some. So for now, the best compromise seems to be:

#ifdef ANSI_C
        if (*p >= 0) *p = toupper(*p);                     /* Version 2 */
#else
        if (isascii(*p) && islower(*p)) *p = toupper(*p);  /* Version 5 */
#endif

If unsigned chars can be used instead of chars, Version 1 can be
substituted for Version 2 in the compromise code. It shouldn't hurt to
use unsigned chars on older implementations either -- except those so old
they don't HAVE unsigned chars.

One more note. Wayne's article continued as follows:

> Unless, of course, you are willing to otherwise ensure that s only points
> at strings of vanilla characters... then the loop is OK as it was.

If we take "vanilla characters" to mean those that are required to be in
the source character set, then he's right in a dpANS environment. In an
old-fashioned environment, however, toupper() may be safe ONLY if the
characters processed are known to be lower-case letters.

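To make the Version 2 branch of the compromise concrete, here is a minimal sketch of it applied to a whole string. The function name upcase_v2 and the NULL check are assumptions added for illustration; they are not from the article. It only shows the nonnegative-value guard, not the pre-ANSI Version 5 branch.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (name not from the article): apply Version 2
 * to every character of a NUL-terminated string, in place. */
void upcase_v2(char *s)
{
    size_t i, n;

    assert(s != NULL);
    n = strlen(s);
    for (i = 0; i < n; i++) {
        /* Only nonnegative char values are certain to be valid
         * toupper() arguments; negative values are left untouched. */
        if (s[i] >= 0)
            s[i] = (char)toupper(s[i]);
    }
}
```

In the default "C" locale, a buffer holding "fran\347ais" (octal 347 standing in for a c-cedilla in some 8-bit character set, which is an assumption) comes back as "FRAN\347AIS". Where plain char is signed, that byte is skipped whatever the locale, which is the "FRANcAIS" outcome described earlier.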
I'll probably be posting something to comp.std.c about the problems with
the dpANS and "char" and foreign character sets. This article, however,
is quite long enough as it is.

Mark Brader     The "I didn't think of that" type of failure occurs because
Toronto         I didn't think of that, and the reason I didn't think of it
utzoo!sq!msb    is because it never occurred to me. If we'd been able to
msb@sq.com      think of 'em, we would have.    -- John W. Campbell
karl@haddock.ima.isc.com (Karl Heuer) (01/12/89)
This has mostly reduced to an ANSI-C-specific issue, so I'm redirecting
followups to comp.std.c.

In article <1989Jan6.231955.7445@sq.uucp> msb@sq.com (Mark Brader) writes:
>So for now, the best compromise seems to be:
>#ifdef __STDC__ /* [corrected --kwzh] */
>       if (*p >= 0) *p = toupper(*p);                     /* Version 2 */
>#else
>       if (isascii(*p) && islower(*p)) *p = toupper(*p);  /* Version 5 */
>#endif

As Mark already pointed out, version 2 can break in an international
environment. My recommendation (in a parallel article) was

        *p = toupper((unsigned char)*p);                   /* Version 6 */

which has the subtle flaw that, if plain chars are signed and the result
of toupper() doesn't fit, ANSI C does not guarantee the integrity of the
value (the conversion is implementation-defined).

Mark further points out in e-mail:
>The trouble is that while Version 2 can break for some characters in the
>international environment, Version 6 can break for ALL characters in a
>vanilla environment ("C" locale)!

Well, not *all* characters; just those that appear negative (and hence
don't fit when converted back from unsigned char). And this set is
guaranteed to exclude the minimal execution character set. But the code
as written could still produce surprises on a sufficiently weird
implementation which is still within the letter of the Standard.

>The best you can do is to avoid "char" altogether and use "unsigned char".
>You probably have to do it throughout the program, in fact.

If the program has to be strictly conforming, you may be right. (But then
string literals, and functions that expect `char *' arguments, may screw
things up; casting the pointers ought to be safe, though.)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
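For reference, a minimal sketch of Version 6 written out as a loop over a string. The function name upcase_v6 and the NULL check are assumptions added for illustration and do not come from the thread. The store back into a plain char is the step the article flags as implementation-defined when char is signed and the value doesn't fit; the sketch does not remove that caveat.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Hypothetical helper (name not from the thread): apply Version 6
 * to every character of a NUL-terminated string, in place. */
void upcase_v6(char *s)
{
    size_t i, n;

    assert(s != NULL);
    n = strlen(s);
    for (i = 0; i < n; i++) {
        /* The cast makes the argument representable as unsigned char,
         * as <ctype.h> requires.  Converting the int result back to
         * plain char is the implementation-defined part when char is
         * signed and the result is too large for it. */
        s[i] = (char)toupper((unsigned char)s[i]);
    }
}
```

For strings drawn from the minimal execution character set, such as "hello, world 42", every value fits in a plain char, so the loop yields "HELLO, WORLD 42" without touching the implementation-defined case.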
throopw@xyzzy.UUCP (Wayne A. Throop) (01/13/89)
> karl@haddock.ima.isc.com (Karl Heuer)
>> msb@sq.com (Mark Brader)

>>The best you can do is to avoid "char" altogether and use "unsigned char".
>>You probably have to do it throughout the program, in fact.
> If the program has to be strictly conforming, you may be right. (But then
> string literals, and functions that expect `char *' arguments, may screw
> things up; casting the pointers ought to be safe, though.)

If (ah say *IF*) it ought to be safe to cast pointers between (char *)
and (unsigned char *) types, why can't the problematical case conversion
be done like so:

        unsigned char *p;
        char *s;
        ...
        for( p=(unsigned char *)s; *p; ++p ) *p = toupper( *p );

But on the other hand... if the above code is unsafe (and I see nothing
in dpANS which makes it safe), why would it be safe to use unsigned
characters hither and thither and simply cast pointers to these to apply
standard signed-character-expecting library routines to them?

(Gad, don't the simplest issues turn out to be cans of worms at times?)

--
"I really ought to do better next year."
"It's the 'ought' that counts."
                        --- paraphrase of Bloom County
--
Wayne Throop      <the-known-world>!mcnc!rti!xyzzy!throopw
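The pointer-cast loop above can be packaged as a complete function; what follows is only a sketch. The name upcase_cast and the NULL check are assumptions added for illustration. Whether the dpANS actually guarantees that the cast is safe is the open question of the post, and the sketch does not settle it.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>   /* for NULL */

/* Hypothetical helper (name not from the thread): walk a char string
 * through an unsigned char pointer and upper-case it in place. */
void upcase_cast(char *s)
{
    unsigned char *p;

    assert(s != NULL);
    /* Every value read through p is representable as unsigned char,
     * so it is a valid toupper() argument, and the result is stored
     * back through the same unsigned char lvalue. */
    for (p = (unsigned char *)s; *p; ++p)
        *p = (unsigned char)toupper(*p);
}
```

Unlike the versions that store into a plain char, nothing here converts an out-of-range value to a signed type; the remaining doubt is confined to the cast itself.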