ra@is.uu.no (Robert Andersson) (02/10/90)
An important subject for us Europeans with all our strange character sets.
But what is the best way to do it?

Of course, using the eighth bit as a flag, assuming that letters range from
'A' to 'Z', etc. are well-known things to avoid, and that is not what
I would like to discuss. What I would like is to hear other people's opinions
on the signed/unsigned char issue.

The type 'char' as defined in C ranges on some implementations from
-128 to 127, on others from 0 to 255. The rest of this message assumes
the compiler has signed chars as a default.

In an 8-bit set some characters will have values > 128.
Is it kosher to store these values in char variables?

Suppose you do it; then as long as you simply use char variables in
simple assignments/tests or as buffers, all is probably OK. As soon as
you use the variable in arithmetic expressions or assignments to other
types, things become more muddy.

So, the better way to do it might be to change all char variables in
your program to unsigned char? But that opens another can of worms.
Many compilers spit out warnings for expressions like:

    unsigned char *junk = "morejunk";

lint dislikes things like:

    unsigned char buff[100];
    write(fd, buff, 100);

Of course you could add casts in all those places, but it sort of doesn't
'feel' right to do that.

It seems you lose in either case.
Opinions?
-- 
Robert Andersson, International Systems A/S, Oslo, Norway.
Internet: ra@is.uu.no
UUCP: ...!{uunet,mcsun,ifi}!is.uu.no!ra
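
A minimal C sketch of the arithmetic problem described in this post, assuming
8-bit chars. The helper names widen_plain, widen_unsigned, and
plain_char_is_signed are made up for illustration; they only show what happens
when a char holding an 8-bit character is used as an integer.

```c
#include <limits.h>

/* Hypothetical helper: widen a plain char to int, as happens when the
   char appears in an arithmetic expression. If plain char is signed,
   characters stored from values above 127 come out negative. */
int widen_plain(char c)
{
    return c;
}

/* Hypothetical helper: widen through unsigned char first.
   The result is always in the range 0..UCHAR_MAX. */
int widen_unsigned(char c)
{
    return (unsigned char)c;
}

/* Returns 1 if plain char is signed on this implementation, else 0. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

With a compiler whose plain char is signed, widen_plain((char)0xE5) is
typically -27, while widen_unsigned((char)0xE5) gives 229 with either kind
of compiler.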
henry@utzoo.uucp (Henry Spencer) (02/11/90)
In article <1990Feb10.151053.16702@is.uu.no> ra@is.uu.no (Robert Andersson) writes:
>In an 8-bit set some characters will have values > 128.
>Is it kosher to store these values in char variables?

Well, there is of course the question of whether the variable is big enough
for the value, disregarding the sign issue. However, that aside, the real
issue here is assigning a value which would fit if chars were unsigned but
isn't in the value space of a char which is signed. This conversion has
implementation-defined results (section 3.2.1.2, Oct 88 draft). It might
cause an overflow if somebody is being paranoid. One would hope that
compilers which do that will quickly be fixed not to.

>Suppose you do it, then as long as you simply use char variables in
>simple assignments/test or as buffers, all is probably OK. As soon as
>you use the variable in arithmetic expressions or assignments to other
>types things become more muddy.

By definition, expression/conversion code which cares whether char is
signed or not is unportable. The correct approach is to fix it, in one way
or another.

>So, the better way to do it might be to change all char variables in
>your program to unsigned char? But that opens another can of worms.
>Many compilers spit out warnings...

Moreover, on some machines you will take a serious efficiency hit for this.

The right thing to do is to use "signed char" or "unsigned char" in places
where the code cares, and use "char" when it does not care.

Most of the time, cleanly-written code does not care. Using chars as
characters -- rather than small integers -- seldom runs into sign issues
(although there are occasional hassles in table lookup and function calls).
Code which uses chars as small integers almost certainly should be using
"signed char" or "unsigned char" to explicitly indicate its needs.
-- 
SVR4: every feature you ever | Henry Spencer at U of Toronto Zoology
wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu
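
The table-lookup hassle mentioned in this post can be sketched as follows.
This is only an illustration, assuming 8-bit chars; count_bytes is a
hypothetical function, not something from the thread. The text stays in plain
"char", and "unsigned char" appears only at the one place where the code
cares about the sign, namely the array index.

```c
#include <stddef.h>
#include <limits.h>

/* Hypothetical example: count how often each character value occurs in
   s[0..n-1]. The counts array must have UCHAR_MAX + 1 elements. */
void count_bytes(const char *s, size_t n, unsigned long counts[])
{
    size_t i;

    for (i = 0; i <= UCHAR_MAX; i++)
        counts[i] = 0;

    for (i = 0; i < n; i++) {
        /* Without the cast, a signed char above 127 would produce a
           negative index and access memory before the table. */
        counts[(unsigned char)s[i]]++;
    }
}
```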
minow@mountn.dec.com (Martin Minow) (02/11/90)
In article <1990Feb10.151053.16702@is.uu.no> ra@is.uu.no (Robert Andersson)
asks for ideas regarding 8-bit character sets.

>It seems you lose in either case.
>Opinions?

Unfortunately, about all you can do is to sprinkle casts from (char *) to
(unsigned char *) or (void *) throughout your program. You can make
this slightly more palatable with typedefs, such as

    typedef unsigned char byte;
    typedef byte *string;

but you still have to tiptoe around lint. If you have an ANSI-compliant
compiler, you can cheat by defining function arguments (and string pointers)
as "void *".

A more subtle problem is that you must make sure that comparison routines
such as strcmp and the more complicated regular-expression routines work
correctly for 8-bit characters.

Martin Minow
minow@thundr.enet.dec.com
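
As a rough sketch of both points in this post, the comparison routine below
takes its arguments as "void *" and compares the strings as unsigned bytes.
The name byte_strcmp is invented for this example; the typedef is the one
suggested above. An ANSI C strcmp is specified to compare characters as
unsigned char, so a routine like this is mainly of interest where the library
cannot be trusted to do so.

```c
typedef unsigned char byte;   /* typedef suggested in the post */

/* Hypothetical helper: compare two NUL-terminated strings byte by byte,
   treating every byte as unsigned. Returns a value less than, equal to,
   or greater than zero, in the manner of strcmp. */
int byte_strcmp(const void *a, const void *b)
{
    const byte *p = a;
    const byte *q = b;

    /* Advance while the strings agree and neither has ended. */
    while (*p != 0 && *p == *q) {
        p++;
        q++;
    }
    return (int)*p - (int)*q;
}
```

Because the parameters are "void *", both char and unsigned char strings can
be passed without a cast, and a character such as '\xE5' sorts after 'z'
regardless of whether plain char is signed.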
karl@haddock.ima.isc.com (Karl Heuer) (02/15/90)
In article <1990Feb11.012110.2338@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Most of the time, cleanly-written code does not care. Using chars as
>characters -- rather than small integers -- seldom runs into sign issues

The problem is that getc() and the <ctype.h> functions deal with a type which
should have been "char" but in fact is a certain subrange of "int", namely the
union of {EOF} and the values of "unsigned char". Thus, even cleanly-written
code has to be sprinkled with casts to convert "plain char" to "unsigned char"
before handing it to "isprint()". (I'm assuming ANSI C semantics here, so
there's no isascii() nonsense.)

Fixing this (by which I mean doing it *right*, not slapping on a
backward-compatible patch) would involve getting rid of the constant EOF
entirely. Of course, it's too late to change it now.

Karl W. Z. Heuer (karl@ima.ima.isc.com or harvard!ima!karl), The Walking Lint
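
The two cases described in this post might look like this in practice. The
function names count_printable and count_upper are hypothetical. The first
takes characters from a "plain char" string and therefore needs the cast
before calling isprint(); the second takes them from getc(), which already
returns either EOF or an "unsigned char" value as an int, so no cast is
needed as long as the result is kept in an int.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: count the printable characters in a
   NUL-terminated string of plain chars. */
size_t count_printable(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++) {
        /* The cast maps a possibly negative plain char into the
           0..UCHAR_MAX range that the <ctype.h> functions accept. */
        if (isprint((unsigned char)*s))
            n++;
    }
    return n;
}

/* Hypothetical helper: count the uppercase letters read from a stream. */
long count_upper(FILE *fp)
{
    int c;      /* int, not char, so that EOF stays distinct from data */
    long n = 0;

    while ((c = getc(fp)) != EOF) {
        if (isupper(c))
            n++;
    }
    return n;
}
```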
jba@harald.ruc.dk (Jan B. Andersen) (02/15/90)
ra@is.uu.no (Robert Andersson) writes:

>I would like to discuss. What I would like is to hear other people's opinions
>on the signed/unsigned char issue.
>The type 'char' as defined in C ranges on some implementations from
>-128 to 127, on others from 0 to 255. The rest of this message assumes
>the compiler has signed chars as a default.
>In an 8-bit set some characters will have values > 128.

Formally speaking, no! As you point out yourself, we're talking about 8-bit
character sets. This simply means that all 8 bits are (potentially) used
to represent a character. Of course, one may look upon those bits as an
unsigned integer with values from 0 to 255.

>Is it kosher to store these values in char variables?

I should think so. As long as |char| is guaranteed to hold at least 8 bits,
the compiler should be able to map any character constant into a unique
value before storing it.

>Suppose you do it, then as long as you simply use char variables in
>simple assignments/test or as buffers, all is probably OK. As soon as
>you use the variable in arithmetic expressions or assignments to other
>types things become more muddy.
>So, the better way to do it might be to change all char variables in
>your program to unsigned char? But that opens another can of worms.
>Many compilers spit out warnings for expressions like:
>    unsigned char *junk = "morejunk";

I'll take your word for it, although I don't see why it should complain.

>lint dislikes things like:
>    unsigned char buff[100];
>    write(fd, buff, 100);

Probably because |buff| was declared as |char *buff| in the header. It
might be more correct to declare it as |void *buff|.

>Opinions?

Just my 0.25 kr.
-- 
Jan B. Andersen <jba@dat.ruc.dk> ("SIMULA does it with CLASS")
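
To illustrate the |void *| suggestion, here is a small sketch of a routine
whose buffer parameter is declared the way the post proposes. The function
sum_bytes is invented for this example and merely stands in for a
write()-like interface; it assumes an ANSI C compiler, where any object
pointer converts to "const void *" without a cast.

```c
#include <stddef.h>

/* Hypothetical routine with a write()-like buffer parameter. Because the
   parameter is const void *, callers can pass char or unsigned char
   buffers without a cast. Returns the sum of the n bytes, each taken
   as an unsigned value. */
unsigned long sum_bytes(const void *buf, size_t n)
{
    const unsigned char *p = buf;
    unsigned long sum = 0;

    while (n-- > 0)
        sum += *p++;
    return sum;
}
```

A call such as sum_bytes(buff, 100) then compiles cleanly whether buff is
declared as "char buff[100]" or "unsigned char buff[100]".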