jwg@duke.UUCP (Jeffrey William Gillette) (10/16/86)
[] OK, I've been bitten. I admit it. MSC 4.0 defaults 'char' to 'signed
char'. For standard ASCII there is no difference between 'signed char'
and 'unsigned char'. When I get to IBM's extensions to ASCII the
situation is much different!

<ctype.h> makes the following #define:

	#define isupper(c)	( (_ctype+1)[c] & _UPPER )

where '_UPPER' is a bit mask used in a table of character definitions
('_ctype'). This works great when c = 65 ('A'), but when c = 154
('U'umlaut) the macro produces the following:

	( (_ctype+1)[-102] & _UPPER )

an obviously foolish bug.

The problem here lies with Microsoft. The #defines in <ctype.h> are
sloppy. The example above should have been

	#define isupper(c)	( (_ctype+1)[(unsigned char)(c)] & _UPPER )

Beyond this particular and annoying consequence of MS's decision to make
'char' = 'signed char', I have two more general questions (thus the post
to net.lang.c).

1) Do other C compilers make 'char' a signed quantity by default?

2) What possible justification is there for this default? Is not 'char'
primarily a logical (as opposed to mathematical) quantity? What I mean
is, what is the definition of a negative 'a'? I can understand the
desirability of allowing 'signed char' for gonzo programmers who won't
use 'short', or who want to risk future compatibility of their code on
the bet that useful characters will always remain 7-bit entities.

Peace,

Jeffrey William Gillette
Humanities Computing Facility
Duke University
duke!jwg
--
Jeffrey William Gillette	uucp:  mcnc!duke!jwg
Humanities Computing Project	bitnet: DYBBUK @ TUCCVM
Duke University
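To make the failure mode above concrete, here is a minimal sketch; the
table and mask names are illustrative stand-ins for Microsoft's private
_ctype machinery, and it assumes an 8-bit signed char on a
two's-complement machine:

	#include <stdio.h>

	#define MY_UPPER 0x01			/* illustrative bit mask */
	static unsigned char my_ctype[256];	/* illustrative table */

	int main(void)
	{
		char c = (char)154;	/* U-umlaut in the IBM PC set */

		my_ctype['A'] |= MY_UPPER;
		my_ctype[154] |= MY_UPPER;

		/* With signed chars, c promotes to the int -102,
		 * an out-of-bounds index. */
		printf("raw index:  %d\n", (int)c);

		/* Casting through unsigned char recovers index 154. */
		printf("safe index: %d\n", (int)(unsigned char)c);
		printf("isupper?    %d\n",
		    (my_ctype[(unsigned char)c] & MY_UPPER) != 0);
		return 0;
	}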
guy@sun.UUCP (10/17/86)
> 1) Do other C compilers make 'char' a signed quantity by default?

Yes. Lots and lots of them, including the very first C compiler ever
written (if there was an earlier one, Dennis, let me know...) - the
PDP-11 C compiler.

> 2) What possible justification is there for this default?

1) When the PDP-11 C compiler was written, ASCII characters *were* 7-bit
characters, and there was no general use of 8-bit characters, and 2) the
PDP-11 treated bytes as signed, rather than unsigned, so referencing
ASCII characters as unsigned rather than signed cost some time and bought
you nothing. I suspect Microsoft did this to make less-than-portable code
written for PDP-11s and VAXes work on 8086-family machines without
change.

> Is not 'char' primarily a logical (as opposed to mathematical) quantity?

Yes, but the people to complain to here are ultimately the designers of
the PDP-11 (although a lot of string manipulation on PDP-11s could be
done using unsigned characters without much penalty).

> I can understand the desirability of allowing 'signed char' for gonzo
> programmers who won't use 'short',

It's not a question of "gonzo programmers who won't use 'short'". There
are times when you absolutely *must* have a one-byte number in a
structure; "short" just won't cut it here. (Bit fields would, perhaps,
except that you can't take the address of a bit field.) Structures
representing device registers, or representing fields in other
externally-specified data, are an example of this. Also, if you have a
*huge* array of integers in the range -128 to 127, you may take a
significant performance hit by using "short" rather than "char"
(remember, "short" takes twice the amount of memory that "char" does on
most implementations).

> or who want to risk future compatibility of their code on the bet that
> useful characters will always remain 7-bit entities.

They're risking nothing. "signed char" is a gross way of saying "short
short int", not a way of saying "signed character" (which, as you say, is
meaningless). Unfortunately, C originally didn't have "short" or "long",
and when they were added they did not cascade.

I presume, by the way, that "isupper(<u-umlaut>)" is intended to return 0
and "isupper(<U-umlaut>)" is intended to return 1. If Microsoft didn't
put the extended character set into the "ctype" tables, the way that the
indexing is done is irrelevant.
--
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)
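A small sketch of Guy's point about one-byte structure members; the
device and its field names are invented for illustration:

	/* Hypothetical memory-mapped device: each register is exactly
	 * one byte wide, so the members must be char-sized.  A "short"
	 * member would change the layout. */
	struct dev_regs {
		unsigned char	data;	/* received/transmitted byte */
		unsigned char	status;	/* ready/error bits */
		signed char	level;	/* a genuinely signed one-byte count */
	};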
thomps@gitpyr.gatech.EDU (Ken Thompson) (10/18/86)
In article <8719@duke.duke.UUCP>, jwg@duke.UUCP (Jeffrey William Gillette) writes:
> []
> 1) Do other C compilers make 'char' a signed quantity by default?

I use a Masscomp system compatible with System V and BSD 4.2 which has
signed characters by default. I find this very annoying when porting
software from machines with the opposite convention, but it causes no
problem with code written for this machine.

> 2) What possible justification is there for this default?  Is not
> 'char' primarily a logical (as opposed to mathematical) quantity?  What
> I mean is, what is the definition of a negative 'a'?  I can understand
> the desirability of allowing 'signed char' for gonzo programmers who
> won't use 'short', or who want to risk future compatibility of their
> code on the bet that useful characters will always remain 7-bit entities.

I see no justification for signed characters, and the concept of a signed
character is somewhat strange. The problem arises because chars used in
an expression in C are automatically converted to type int. Signed
characters come about when the conversion is made from a char, which is
an 8-bit quantity, to an int, which is 16 bits or larger. I do not know
of any C compilers which actually view a char as an 8-bit signed entity.
Instead, the char becomes negative due to sign extension during
conversion to int.
--
Ken Thompson  Phone : (404) 894-7089
Georgia Tech Research Institute
Georgia Institute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!thomps
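A two-line illustration of the sign extension Ken describes, assuming an
8-bit two's-complement char that the compiler treats as signed:

	#include <stdio.h>

	int main(void)
	{
		char c = (char)154;	/* bit pattern 0x9A */
		int  i = c;		/* sign-extended during conversion */

		printf("%d\n", i);	/* prints -102, not 154 */
		return 0;
	}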
plocher@puff.wisc.edu (John Plocher) (10/19/86)
guy@sun.UUCP responds to another poster about ctype macros and characters
on the IBM PC with the 8th bit set:

>> 1) Do other C compilers make 'char' a signed quantity by default?
>
>I presume, by the way, that "isupper(<u-umlaut>)" is intended to return 0
>and "isupper(<U-umlaut>)" is intended to return 1.  If Microsoft didn't put
>the extended character set into the "ctype" tables, the way that the
>indexing is done is irrelevant.

I hope you both remember that isANYTHING(x) is only defined to work if
isascii(x) is true!

isascii(u-umlaut) is FALSE!

Thus, isupper(u-umlaut) does not NEED to work.
--
 harvard-\	/- uwmacc!uwhsms!plocher  (work)
  seismo-->!uwvax!<				John Plocher
   topaz-/	\- puff!plocher           (school)
	"Never trust an idea you get sitting down" - Nietzsche
chapman@cory.Berkeley.EDU (Brent Chapman) (10/19/86)
In article <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>MSC 4.0 defaults 'char' to 'signed char'.

[ it defaulted to 'unsigned char' in previous versions of MSC -- Brent]

[ details relating to a gotcha in header files, because Microsoft didn't
cast a (possibly) negative char value into an unsigned value when using
it to index an array, deleted ]

>What possible justification is there for this default?  Is not
>'char' primarily a logical (as opposed to mathematical) quantity?  What
>I mean is, what is the definition of a negative 'a'?  I can understand
>the desirability of allowing 'signed char' for gonzo programmers who
>won't use 'short', or who want to risk future compatibility of their
>code on the bet that useful characters will always remain 7-bit entities.

This brings up some interesting questions and ambiguities concerning
K&R's definition of C. I haven't seen the proposed ANSI standard, so I
can't comment on it. But K&R will do to illustrate the ambiguities;
perhaps someone else can point out if and how the proposed standard deals
with them.

On page 34, K&R define a 'char' to be "a single byte, capable of holding
one character in the local character set." On page 40, they say "The
language does not specify whether variables of type char are signed or
unsigned quantities." This seems to imply that the implementor is free to
choose the default that he feels best suits his implementation.

On most machines, this is a moot point, since most machines only use the
0 to 127 range for character values, which is available regardless of
whether the char is signed or unsigned. On the PC, however, it _does_
make a difference, because the upper 128 characters of the PC's character
set _are_ printable, and are numbered from 128 through 255. Logic would
seem to indicate that 'unsigned char' is the reasonable choice for the
default on a C compiler for the PC.

Unfortunately, most other C implementations, especially UNIX C
implementations, seem to default char to 'signed'. (Note that I've been
assured of this by knowledgeable sources, but don't have any first-hand
knowledge, so I could be wrong.) This is a reasonable choice because, in
the original K&R C definition, there is no 'signed' keyword. Therefore,
everything should default signed, because if it defaults unsigned,
there's no way to change it to 'signed'. Many implementations now include
the 'signed' keyword, however. I don't know if it is a part of the
proposed ANSI standard, but I think that it probably is.

Now, Microsoft apparently decided to change their default for chars from
'unsigned', which is what it was in versions of the compiler previous to
Ver 4.0, and which makes sense for a PC, to 'signed', which makes sense
because of K&R's lack of a 'signed' keyword, and because most other
implementations are that way. The original poster got bitten because
Microsoft used a 'char' (which could be negative) as an array index,
instead of casting it to 'unsigned char', in one of their library header
files.

Perhaps the most general, portable solution is not to use char variables
for counting or array indexing. If you need a counter, use a short, which
will default signed unless you say otherwise. If you need an array index,
cast to an 'unsigned char' or an 'unsigned short'. Unfortunately, there
is no guarantee that a short is as small as a char, so you may be wasting
some space. Worse, there is no guarantee that a short is as _long_ as a
char, although I doubt there is any implementation where this is true.

You currently can't count on whether a char will be signed or unsigned.
Does the proposed ANSI standard address this?

Fortunately, with MSC Ver 4.0, you can have your cake and eat it too.
There is a command-line option to the compiler that will change the
default from 'signed' to 'unsigned'. I think it's '-J', but I'm not
certain, since I'm at home and my manuals are at work.

Brent
--
Brent Chapman

chapman@cory.berkeley.edu	or	ucbvax!cory!chapman
gwyn@brl-smoke.ARPA (Doug Gwyn ) (10/19/86)
In article <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>	#define isupper(c)	( (_ctype+1)[c] & _UPPER )
>
>The problem here lies with Microsoft.

No, the problem lies with the programmer. The is*() functions have (int),
not (char), arguments. When you feed one a (char), it will be promoted to
an (int) by the usual rules, including possible sign extension. The macro
definition acts the same as a function in this regard, since array
indices are (int), not (char), also. Microsoft's definition is correct.

>1) Do other C compilers make 'char' a signed quantity by default?

Dennis Ritchie's original (PDP-11) C compiler did.

>2) What possible justification is there for this default?

(a) less cost on machines like the PDP-11
(b) the programmer can, using suitable code, force whatever behavior he
    wants

>I mean is, what is the definition of a negative 'a'?

It might surprise you to learn that 'a' represents an (int) constant, not
a (char). C (char)s are just short integral types whose signedness
depends on the implementation (however, (signed char) and (unsigned char)
have definite signedness). Dennis intended that sizeof(char)==1, but I
can make a strong argument that that isn't necessary.

P.S. I suggest people learn what is going on before raving about it.
That would sure reduce the noise level of net.lang.c.
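Doug's point that 'a' is an (int) constant is easy to check; a one-line
test like this prints sizeof(int), not 1, for the character constant:

	#include <stdio.h>

	int main(void)
	{
		/* 'a' is an int constant, so sizeof 'a' == sizeof(int). */
		printf("sizeof 'a' = %d, sizeof(char) = %d\n",
		    (int)sizeof 'a', (int)sizeof(char));
		return 0;
	}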
mark@ems.UUCP (Mark Colburn) (10/20/86)
In article <8273@sun.uucp>, guy@sun.UUCP writes:
> > I can understand the desirability of allowing 'signed char' for gonzo
> > programmers who won't use 'short',
>
> It's not a question of "gonzo programmers who won't use 'short'".

It is important to note that K&R define that:

	char	8 or more bits
	short	16 or more bits

Although these values may be implementation specific. On my 68020-based
machine, shorts are 16 bits. When I need an 8-bit unsigned value (e.g. a
byte) in my code (which happens quite frequently when you are writing
software to support 8-bit CPU's) I use 'unsigned char'.

I got myself into all sorts of trouble when I was first using C because I
assumed that if an int is 16 bits, then a short must be 8. Right? Wrong!
On the compiler that I was using, int was 16 bits and so was short. This
is consistent with K&R (and, I believe, the proposed ANSI standard).

Therefore, the only portable way to express a true byte (8-bit) value is
with an 'unsigned int' declaration. This may still get you into trouble
when you are working on a compiler that uses characters that are more
than 8 bits. Don't laugh, there are some out there. It is also allowed
for in the language definition. Notice that a character may be 8 or more
bits. Since machines that use chars that are larger than 8 bits are
relatively infrequent, I callously disregard their existence in my code.
(I am sure that I will get bitten by it one of these days, but hey, gives
a guy some kinda job security.)
--
Mark H. Colburn		UUCP: ihnp4!rosevax!ems!mark
EMS/McGraw-Hill		ATT: (612) 829-8200
9855 West 78th Street
Eden Prairie, MN  55344
henry@utzoo.UUCP (Henry Spencer) (10/20/86)
> 1) Do other C compilers make 'char' a signed quantity by default?

Yes. Almost any C compiler for machines like the PDP11, the VAX, the
8088, and so forth, will.

> 2) What possible justification is there for this default?  Is not
> 'char' primarily a logical (as opposed to mathematical) quantity? ...

The problem started with the PDP11, the first machine C was implemented
on. A minor quirk of the 11 made it substantially more efficient to
manipulate characters as signed entities. This hardware quirk has been
carried over, unfortunately, into a good many newer machines that have
imitated the 11 to some degree. Compilers for these machines have a
choice of generating inefficient code or using signed characters. Since
any decent C documentation warns you that the signedness or lack thereof
of characters is not portable, this is considered legitimate.

I believe Dennis is on record as mildly regretting the original decision
to go along with the hardware's prejudices, but it's a bit late now.
--
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
chapman@cory.Berkeley.EDU (Brent Chapman) (10/21/86)
In article <14@ems.UUCP> mark@ems.UUCP (Mark Colburn) writes:
>It is important to note that K&R define that:
>
>	char	8 or more bits
>	short	16 or more bits

Just _WHERE_ does K&R say this? No place that I've ever seen... The only
thing that I can figure is that you are inferring these "minimum" values
from the table of _sample_ type sizes on p. 34; this is not a good thing
to do.

Note to everyone: If you're going to quote from something, especially
K&R, _please_ check to make sure it says what you _think_ it says, and
then include the page number of the info which supports your posting.

>Although these values may be implementation specific.  On my 68020-based
>machine, shorts are 16 bits.  When I need an 8-bit unsigned value (e.g. a
>byte) in my code (which happens quite frequently when you are writing
>software to support 8-bit CPU's) I use 'unsigned char'.
>
>I got myself into all sorts of trouble when I was first using C because I
>assumed that if an int is 16 bits, then a short must be 8.  Right?  Wrong!

Definitely wrong. On p. 34, K&R say "The intent is that short and long
should provide different lengths of integers _where practical_ [emphasis
mine -- Brent]; int will normally reflect the most 'natural' size for a
particular machine." As you can see, each compiler is free to interpret
short and long as appropriate for its own hardware. About all you should
count on is that short is no longer than long.

Nowhere (that I'm aware of, anyway, and I looked carefully for it) does
K&R say that ints must be at least 16 bits, nor that chars must be at
least 8 bits. I seem to recall hearing about some screwy machine whose
"character size" and "most natural integer size" were both 12 bits; for
that machine, types 'char', 'int', and 'short' were all 12-bit
quantities.

>Therefore, the only portable way to express a true byte (8-bit) value is
>with an 'unsigned int' declaration.  This may still get you into trouble
>when you are working on a compiler that uses characters that are more
>than 8 bits.

'unsigned int'? Are you sure you don't mean 'unsigned char'? But even if
you do, there's no guarantee that you get what you call a "true byte";
there's nothing in K&R that outlaws a 7-bit char, for instance. The
definition of char (again, on p. 34) is "a single byte, capable of
holding one character in the local character set". Note that "byte"
doesn't automatically mean "8 bits".

Brent
--
Brent Chapman

chapman@cory.berkeley.edu	or	ucbvax!cory!chapman
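Since the sizes are implementation-defined, the portable move is to check
rather than assume; a throwaway program like this (a sketch, not from
K&R) reports what a given compiler actually does:

	#include <stdio.h>

	int main(void)
	{
		/* sizeof counts in chars; sizeof(char) is 1 by definition. */
		printf("char  %d\n", (int)sizeof(char));
		printf("short %d\n", (int)sizeof(short));
		printf("int   %d\n", (int)sizeof(int));
		printf("long  %d\n", (int)sizeof(long));
		return 0;
	}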
mikes@apple.UUCP (Mike Shannon) (10/22/86)
In <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>
>MSC 4.0 defaults 'char' to 'signed char'. ...
> This works great when c = 65 ('A'), but when c = 154
>('U'umlaut) the macro produces the following: ( (_ctype+1)[-102] & _UPPER )
> ....

The problem is that the u-umlaut char is being treated as negative. K&R,
page 183, section 6.1 says "... but it is guaranteed that a member of the
standard character set is non-negative."

Apple experienced the same problem with an extended character set. I
believe that u-umlaut is part of your machine's standard character set,
and so I would argue that MSC does not conform to K&R in this respect.
--
			Michael Shannon {apple!mikes}
james@reality1.UUCP (james) (10/22/86)
In article <8719@duke.duke.UUCP>, jwg@duke.UUCP (Jeffrey William Gillette) writes:
> MSC 4.0 defaults 'char' to 'signed char'. ...
> 2) What possible justification is there for this default? ...

I hate to bring up the manual, but there is a /J option now that makes
the char default to an unsigned value. There is also a keyword "signed"
to override that default when /J is used.

Which also brings up keeping your software up to date, if you haven't
upgraded to version 4.0 yet...
--
James R. Van Artsdalen		...!ut-ngp!utastro!osi3b2!james
"Live Free or Die"
tim@ism780c.UUCP (Tim Smith) (10/25/86)
In article <14@ems.UUCP> mark@ems.UUCP (Mark Colburn) writes:
>
>It is important to note that K&R define that:
>
>	char	8 or more bits
>	short	16 or more bits
>

Where do K&R say this?
--
member, all HASA divisions

POELOD
ECBOMB
--------------
 ^-- Secret Satanic Message

Tim Smith	USENET: sdcrdcf!ism780c!tim	Compuserve: 72257,3706
		Delphi or GEnie: mnementh
chris@umcp-cs.UUCP (Chris Torek) (10/29/86)
>In article <14@ems.UUCP> mark@ems.UUCP (Mark Colburn) writes:
>>It is important to note that K&R define that:
>>	char	8 or more bits
>>	short	16 or more bits

In article <661@zen.BERKELEY.EDU> chapman@cory.Berkeley.EDU.UUCP (Brent Chapman) writes:
>Just _WHERE_ does K&R say this?

K&R do not define minimum sizes. They do provide a table listing some
existing implementations (pp. 34 and 182), and they say also this:

	A character or a short integer may be used wherever an
	integer may be used.  In all cases the value is converted
	to an integer.  Conversion of a shorter integer to a longer
	always involves sign extension; integers are signed
	quantities.  Whether or not sign-extension occurs for
	characters is machine dependent, but it is guaranteed that
	a member of the standard character set is non-negative.  Of
	the machines treated by this manual, only the PDP-11
	sign-extends.

The most recent X3J11 (`ANSI C') draft I saw, however, *did* define
minimum sizes in bits for `char', `short', `int', and `long'. It still
left `char' sign extension up to the compiler, but added the types
`signed char' and `unsigned char'.

>>Although these values may be implementation specific.  On my 68020
>>based machine, shorts are 16 bits.  When I need an 8 bit unsigned
>>value (e.g. a byte) in my code (which happens quite frequently when
>>you are writing software to support 8 bit CPU's) I use 'unsigned
>>char'.

Note that `unsigned char' is not valid according to K&R (although most C
compilers have such a type).

In a particular set of `portable' programs I wrote (and am still
writing), I needed 8, 16, 24, and 32 bit integers, with both signed and
unsigned varieties for 8, 16, and 24 bits. Toward this end I have one
machine-dependent `#include' file called `types.h'; in it I define the
following:

	UnSign8(n)	produce an unsigned 8 bit integer value given
			the possibly-signed integer value n
	Sign8(n)	produce a sign extended 8 bit integer value
			(i.e., 128 -> -128; 255 -> -1)
	UnSign16(n)	produce an unsigned 16 bit value
	Sign16(n)	sign extend a 16 bit value
	UnSign24(n)	produce an unsigned 24 bit value
	Sign24(n)	sign extend a 24 bit value
	i32		a 32 (minimum) bit integer type

Instead of trying to find types of the proper sizes, I have one that is
large enough for all, and a set of macros to coerce it so as to properly
represent the smaller values. I believe this can be implemented on any
machine on which the software could ever run. The macros themselves are
machine dependent, but well-isolated.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu
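One plausible way to write such a `types.h' (a guess at the idea, not
Chris's actual file; it assumes a two's-complement machine whose `long'
is at least 32 bits, and note that each macro evaluates its argument
more than once):

	/* types.h -- machine-dependent integer-width helpers (sketch) */
	typedef long i32;		/* at least 32 bits here */

	/* Keep only the low N bits, treated as unsigned. */
	#define UnSign8(n)	((i32)(n) & 0xFF)
	#define UnSign16(n)	((i32)(n) & 0xFFFF)
	#define UnSign24(n)	((i32)(n) & 0xFFFFFF)

	/* Sign extend the low N bits: Sign8(255) == -1, Sign8(128) == -128. */
	#define Sign8(n)	(UnSign8(n) > 0x7F ? \
				    UnSign8(n) - 0x100 : UnSign8(n))
	#define Sign16(n)	(UnSign16(n) > 0x7FFF ? \
				    UnSign16(n) - 0x10000L : UnSign16(n))
	#define Sign24(n)	(UnSign24(n) > 0x7FFFFF ? \
				    UnSign24(n) - 0x1000000L : UnSign24(n))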
bet@ecsvax.UUCP (Bennett E. Todd III) (10/31/86)
In article <228@apple.UUCP> mikes@apple.UUCP (Mike Shannon) writes:
>In <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>>
>>MSC 4.0 defaults 'char' to 'signed char'. ...
>> This works great when c = 65 ('A'), but when c = 154
>>('U'umlaut) the macro produces the following: ( (_ctype+1)[-102] & _UPPER )
>> ....
>	The problem is that the u-umlaut char is being treated as negative.
>K&R, page 183, section 6.1 says "... but it is guaranteed that a member of
>the standard character set is non-negative."
>	Apple experienced the same problem with an extended character set.
>I believe that u-umlaut is part of your machine's standard character set,
>and so I would argue that MSC does not conform to K&R in this respect.

I haven't looked at MSC 4.0 yet, but I've been using MSC 3.0 for a while,
and it sounds like this hasn't changed. In the documentation, it is
exceedingly clear about this (and agrees with proper portable programming
practice for UNIX systems):

	int isascii(c);	/* test for ASCII character (0x00-0x7F) */
	[...]
	"The isascii routine produces a meaningful result for all
	integer values. However, the remaining routines produce a
	defined result only for integer values corresponding to the
	ASCII character set (that is, only where isascii holds true)
	or for the non-ASCII value EOF (defined in stdio.h)."

I'd say MSC is completely in the right on this one; C is a portable
programming language, and MSC is the best implementation I've seen for
porting code to the PC. The documentation is clear; they implement a C
environment with an ASCII, not "Extended ASCII", character set. If
portability is not of interest, then you can hack up the ctype macros, or
whatever. However, portable programs should always use an idiom like

	if (isascii(c) && isupper(c)) {

or whatever. Program defensively! Write programs that run *anywhere*, and
check for anything that might possibly go wrong. Then give them away for
free.

-Bennett
--
Bennett Todd -- Duke Computation Center, Durham, NC  27706-7756; (919) 684-3695
UUCP: ...{decvax,seismo,philabs,ihnp4,akgua}!mcnc!ecsvax!duccpc!bet
BITNET: DBTODD@TUCC.BITNET -or- DBTODD@TUCCVM.BITNET -or- bet@ECSVAX.BITNET
jsdy@hadron.UUCP (Joseph S. D. Yao) (11/07/86)
In article <228@apple.UUCP> mikes@apple.UUCP (Mike Shannon) writes:
>In <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>>MSC 4.0 defaults 'char' to 'signed char'. ...
>> This works great when c = 65 ('A'), but when c = 154
>>('U'umlaut) the macro produces the following: ( (_ctype+1)[-102] & _UPPER )
>> ....

The standard (K&R, and now X3J11) has always held that 'char' is signed
or unsigned, depending on the machine. For good, usable, portable code
I've tended to do arithmetic ops (such as indexing an array) in shorts or
ints, masking input chars. This is also great for testing EOF ...

Also, the ctype macros should NOT be used with any char > 0177, simply
because many or most current implementations only use a table that large.
--
	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}
			jsdy@hadron.COM (not yet domainised)
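A sketch of the defensive idiom Joe and Bennett describe: do the
classification on a masked, non-negative int, and guard the ctype macros
with isascii():

	#include <stdio.h>
	#include <ctype.h>

	/* Count the upper-case ASCII letters in a string whose chars
	 * may have the high bit set (e.g. IBM PC extended characters). */
	int count_upper(char *s)
	{
		int c, n = 0;

		while (*s != '\0') {
			c = *s++ & 0377;	/* mask into 0..255 */
			if (isascii(c) && isupper(c))
				n++;
		}
		return n;
	}

	int main(void)
	{
		printf("%d\n", count_upper("Hello, World"));	/* 2 */
		return 0;
	}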