greg@sdcsvax.UUCP (09/06/84)
Recently there was a discussion as to whether "char" variables should be signed or unsigned. After some debate on the subject, the consensus was that the choice should be whichever is "most optimal" for the architecture involved. I don't want to add fuel to that controversy, but I have an interesting variant on that topic: suppose neither choice is optimal all the time?

In particular, we have an architecture which, for reasons not relevant here, is asymmetric with respect to sign-extension on characters. If the character value is referenced directly via a pointer (or actually, any calculation that produces a pointer directly to the variable) it is more efficient to access the variable as signed. At all other times, it is more efficient to access the variable as unsigned.

OK, what's the choice? Should our characters be signed or unsigned? Or should we use the "most optimal" rule and have the characters be signed or unsigned depending upon the type of access? We are all aware of the portability problems associated with assuming one or the other; what will happen if char variables are randomly sign-extended? In other words, does a portable program assume that char variables are consistent in their sign-extension?

Note that if consistency is desired, the "most optimal" choice will vary with the application. If lots of references are made to char variables via pointers, the choice will be sign-extended chars; if lots of references are made to ordinary variables (or anything requiring an offset from a pointer), the choice will be unsigned chars. Which access type predominates?

I'd appreciate any thoughts you might have on this issue. Are there any C gurus out there who can offer any enlightenment? What does the emerging C standard have to say on the subject?

-- 
Greg Noel, NCR Torrey Pines
Greg@sdcsvax.UUCP or Greg@nosc.ARPA
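To make the question concrete, here is a small sketch of what "sign-extended" versus "unsigned" access means for a single byte whose top bit is set. It assumes 8-bit bytes and two's-complement arithmetic, and it uses CHAR_MIN from <limits.h> as later standardized; the function names are hypothetical and chosen only for illustration.

```c
#include <limits.h>
#include <string.h>

/* Widen a raw byte the way a sign-extending access would.
 * Assumption: 8-bit bytes, two's complement. */
int widen_as_signed(unsigned char raw)
{
    return (raw & 0x80) ? (int)raw - 256 : (int)raw;
}

/* Widen a raw byte the way a zero-extending (unsigned) access would. */
int widen_as_unsigned(unsigned char raw)
{
    return (int)raw;
}

/* Nonzero if plain char behaves as a signed type on this implementation. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* Widen the same bit pattern through a plain char object, so the
 * result follows whatever choice this implementation made. */
int widen_as_plain(unsigned char raw)
{
    char c;
    memcpy(&c, &raw, 1);   /* copy the bit pattern unchanged */
    return c;              /* implementation's own char-to-int widening */
}
```

For the byte 0xE9, the signed reading gives -23 and the unsigned reading gives 233; for any byte below 0x80 the two agree. An implementation that mixed the two readings depending on access path would return either value for the same stored byte.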
jejones@ea.UUCP (09/10/84)
#R:sdcsvax:-3000:ea:5700015:000:459
ea!jejones    Sep 10 13:36:00 1984

Sigh. "Characters" are glyphs, which are stored in some (machine-dependent) form internally. What C actually provides is a byte type. Programmers tend to use it as short short int, since on some machines it is treated as an eight-bit two's-complement number, and since one can't have arrays of bit fields.

I would think that the standard should say that the behavior of "char" to int coercion is undefined, other than being a 1-1 mapping.

James Jones
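One possible reading of the "1-1 mapping" property is that distinct byte patterns stored in a char must widen to distinct int values, whichever way the implementation extends them. The sketch below checks that property by brute force. It assumes two's-complement chars; the function name is hypothetical.

```c
#include <limits.h>
#include <string.h>

/* Returns 1 if widening char to int maps every distinct byte pattern
 * to a distinct int value on this implementation, 0 otherwise.
 * Assumption: two's-complement char, so CHAR_MAX - CHAR_MIN == UCHAR_MAX. */
int widening_is_one_to_one(void)
{
    int seen[UCHAR_MAX + 1] = {0};
    unsigned b;

    for (b = 0; b <= UCHAR_MAX; b++) {
        unsigned char raw = (unsigned char)b;
        char c;
        int wide, slot;

        memcpy(&c, &raw, 1);     /* store the bit pattern in a plain char */
        wide = c;                /* the coercion under discussion */
        slot = wide - CHAR_MIN;  /* shift into the range 0..UCHAR_MAX */

        if (slot < 0 || slot > UCHAR_MAX || seen[slot])
            return 0;            /* out of range or a collision */
        seen[slot] = 1;
    }
    return 1;
}
```

Both a consistently signed and a consistently unsigned char pass this check. A char whose extension varies with the access path still widens each pattern to one of two values, so the mapping stops being a function of the stored byte alone.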
henry@utzoo.UUCP (Henry Spencer) (09/12/84)
> ... what will
> happen if char variables are randomly sign-extended? In other words, does
> a portable program assume that char variables are consistent in their
> sign-extension?

Interesting question. One can argue (I have been heard to do so) that if a program is to be portable, it can use char variables for only two things: (1) characters, which are guaranteed non-negative by C, and (2) small non-negative integers. If a program is portable in this fairly-strong sense, there's no problem because the top bit is never on and the sign-extension behavior is irrelevant.

One place where I would foresee problems is in things like hashing and checksums. I have been known to write code which stated, in a comment, "doesn't matter whether chars are signed or not, but it better be consistent!". I never analyzed the programs deeply to determine whether there really would be problems, but there was obviously enough rope there to hang oneself with.

I guess my overall reaction is that there's a good chance that inconsistent sign extension wouldn't foul up too many things, but I would hate to have to bet money on it.

The current draft of the ANSI standard says:

	... If [things other than `ordinary' characters] are stored in
	a char object, the behavior is implementation-defined: the values
	may be treated as either signed or non-negative integers.
	[Section 2.2.5, draft of 21 Aug 1984]

	Implementation-defined behavior -- behavior that depends on the
	characteristics of the implementation and that must be documented
	for each implementation.
	[Section 1.1, draft of 21 Aug 1984]

The wording could probably be improved, but the current version seems to say that you had better be able to document just how your chars behave, rather than just saying that sign extension occurs or doesn't occur at random. (Note that compiler optimizations etc. may alter the exact form used to access a character variable, so the source code isn't a reliable guide unless the compiler is very careful.)
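The checksum worry above can be illustrated with a short sketch. It is not taken from any of the programs mentioned; the function names are hypothetical, and the signed variant assumes 8-bit two's-complement chars. The first two functions model what a sign-extending and a zero-extending implementation would compute for the same buffer; the third shows one common way to make the result independent of the implementation's choice.

```c
#include <stddef.h>

/* Sum of bytes as a sign-extending implementation would see them.
 * Assumption: 8-bit bytes, two's complement. */
long checksum_signed(const char *buf, size_t len)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        int v = (unsigned char)buf[i];
        if (v & 0x80)
            v -= 256;            /* emulate sign extension */
        sum += v;
    }
    return sum;
}

/* Sum of bytes as a zero-extending implementation would see them. */
long checksum_unsigned(const char *buf, size_t len)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += (unsigned char)buf[i];
    return sum;
}

/* Portable form: force every byte through unsigned char before
 * widening, so plain char's signedness no longer matters. */
long checksum_portable(const char *buf, size_t len)
{
    long sum = 0;
    const char *cp;

    for (cp = buf; cp < buf + len; cp++)
        sum += (unsigned char)*cp;
    return sum;
}
```

For the three bytes 0x41, 0xE9, 0x80 the signed sum is -86 and the unsigned sum is 426, a difference of 256 for each byte with the top bit set. For pure 7-bit data the two sums agree, which matches the observation that programs restricted to ordinary characters and small non-negative integers are unaffected.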
> Note that if consistency is desired, the "most optimal" choice will vary
> with the application. If lots of references are made to char variables
> via pointers, the choice will be sign-extended chars; if lots of references
> are made to ordinary variables (or anything requiring an offset from a
> pointer), the choice will be unsigned chars. Which access type predominates?

Chars are accessed an awful lot via pointers, since that's how all string manipulation is done in C. I would think that simple char variables and offset references would be rather less common than just "*cp".
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry