[net.lang.c] Randomly-Signed Character Variables

greg@sdcsvax.UUCP (09/06/84)

Recently there was a discussion as to whether "char" variables should be
signed or unsigned.  After some debate on the subject, the consensus was
that the choice should be whichever is "most optimal" for the architecture
involved.

I don't want to add fuel to that controversy, but I have an interesting
variant on that topic: suppose neither choice is optimal all the time?
In particular, we have an architecture which, for reasons not relevant
here, is asymmetric with respect to sign-extension on characters.  If the
character value is referenced directly via a pointer (or actually, any
calculation that produces a pointer directly to the variable) it is more
efficient to access the variable as signed.  At all other times, it is
more efficient to access the variable as unsigned.

OK, what's the choice?  Should our characters be signed or unsigned?  Or
should we use the "most optimal" rule and have the characters be signed
or unsigned depending upon the type of access?  We are all aware of the
portability problems associated with assuming one or the other; what will
happen if char variables are randomly sign-extended?  In other words, does
a portable program assume that char variables are consistent in their
sign-extension?
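
To make the question concrete, here is a minimal sketch of what a program could observe. The helper names are made up for illustration, the two functions merely stand in for the two access paths, and an 8-bit two's-complement char is assumed.

```c
#include <limits.h>

/* Hypothetical stand-in for the access path that sign-extends.
 * The arithmetic avoids relying on any implementation-defined cast. */
int widen_signed(unsigned char byte)
{
    return byte > SCHAR_MAX ? (int)byte - (UCHAR_MAX + 1) : (int)byte;
}

/* Hypothetical stand-in for the access path that zero-extends. */
int widen_unsigned(unsigned char byte)
{
    return (int)byte;
}

/* Returns 1 when both access paths would yield the same int. */
int widenings_agree(unsigned char byte)
{
    return widen_signed(byte) == widen_unsigned(byte);
}
```

Under these assumptions the two paths agree for every byte whose top bit is off and disagree for every byte whose top bit is on, so a test such as `c == 0300` could succeed through one path and fail through the other.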

Note that if consistency is desired, the "most optimal" choice will vary
with the application.  If lots of references are made to char variables
via pointers, the choice will be sign-extended chars; if lots of references
are made to ordinary variables (or anything requiring an offset from a
pointer), the choice will be unsigned chars.  Which access type predominates?

I'd appreciate any thoughts you might have on this issue.  Are there any
C gurus out there who can offer any enlightenment?  What does the emerging
C standard have to say on the subject?
-- 
-- Greg Noel, NCR Torrey Pines       Greg@sdcsvax.UUCP or Greg@nosc.ARPA

jejones@ea.UUCP (09/10/84)

Sigh. "Characters" are glyphs, which are stored in some (machine-dependent)
form internally. What C actually provides is a byte type. Programmers tend
to use it as a "short short int", since on some machines it is treated as an
eight-bit two's-complement number, and since one can't have arrays of
bit fields.

I would think that the standard should say that the behavior of "char" to
int coercion is undefined, other than being a 1-1 mapping.

						James Jones

henry@utzoo.UUCP (Henry Spencer) (09/12/84)

> ...  what will
> happen if char variables are randomly sign-extended?  In other words, does
> a portable program assume that char variables are consistent in their
> sign-extension?

Interesting question.  One can argue (I have been heard to do so) that
if a program is to be portable, it can use char variables for only two
things:  (1) characters from the machine's standard character set, which
C guarantees to be non-negative, and (2) small non-negative integers.  If
a program is portable in this fairly strong sense, there's no problem
because the top bit is never on and the sign-extension behavior is
irrelevant.
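
A rough sketch of that discipline follows. Both helper names are invented for illustration, and the only assumption is that SCHAR_MAX from <limits.h> marks the largest value with the top bit off.

```c
#include <limits.h>

/* Hypothetical helper: 1 if v reads back unchanged from a plain char
 * whether the implementation sign-extends or not. */
int fits_any_char(int v)
{
    return v >= 0 && v <= SCHAR_MAX;
}

/* Hypothetical helper: store a small count in a char, refusing any
 * value that would turn the top bit on.  Returns 0 on success and
 * -1 if the value was rejected. */
int store_small(char *slot, int v)
{
    if (!fits_any_char(v))
        return -1;
    *slot = (char)v;
    return 0;
}
```

Anything stored this way compares and widens the same under either convention, which is the sense in which the sign-extension question becomes irrelevant.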

One place where I would foresee problems is in things like hashing and
checksums.  I have been known to write code which stated, in a comment,
"doesn't matter whether chars are signed or not, but it better be
consistent!".  I never analyzed the programs deeply to determine whether
there really would be problems, but there was obviously enough rope there
to hang oneself with.
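
As an illustration of the kind of rope involved, here is a minimal sketch of a byte-sum checksum. The function names are hypothetical, and an 8-bit two's-complement char is assumed for the signed variant.

```c
#include <limits.h>
#include <stddef.h>

/* Every byte is forced non-negative before it is added, so the
 * result does not depend on how plain char is widened. */
unsigned long checksum_consistent(const char *buf, size_t len)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < len; i++)
        sum += (unsigned char)buf[i];
    return sum;
}

/* For contrast: the sum a sign-extending access would produce. */
long checksum_sign_extended(const char *buf, size_t len)
{
    long sum = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        int b = (unsigned char)buf[i];
        if (b > SCHAR_MAX)
            b -= UCHAR_MAX + 1;   /* emulate sign extension */
        sum += b;
    }
    return sum;
}
```

The two sums match on pure 7-bit text and diverge as soon as one byte has its top bit on. If each access picked one behavior or the other depending on how the byte was reached, the result could land anywhere between them.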

I guess my overall reaction is that there's a good chance that inconsistent
sign extension wouldn't foul up too many things, but I would hate to have
to bet money on it.

The current draft of the ANSI standard says:

	... If [things other than `ordinary' characters] are stored
	in a char object, the behavior is implementation-defined: the
	values may be treated as either signed or non-negative integers.
	[Section 2.2.5, draft of 21 Aug 1984]

	Implementation-defined behavior -- behavior that depends on the
	characteristics of the implementation and that must be documented
	for each implementation.  [Section 1.1, draft of 21 Aug 1984]

The wording could probably be improved, but the current version seems to
say that you had better be able to document just how your chars behave,
rather than just saying that sign extension occurs or doesn't occur at
random.  (Note that compiler optimizations etc. may alter the exact form
used to access a character variable, so the source code isn't a reliable
guide unless the compiler is very careful.)
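
One form such documentation takes in C as later standardized is the CHAR_MIN macro in <limits.h>, which is negative where plain char is signed and zero where it is not. The sketch below assumes a compiler that provides that header; the helper name is made up.

```c
#include <limits.h>

/* Hypothetical helper: reports which way plain char goes on this
 * implementation.  A single answer only makes sense if the
 * implementation treats every char access the same way. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```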

> Note that if consistency is desired, the "most optimal" choice will vary
> with the application.  If lots of references are made to char variables
> via pointers, the choice will be sign-extended chars; if lots of references
> are made to ordinary variables (or anything requiring an offset from a
> pointer), the choice will be unsigned chars.  Which access type predominates?

Chars are accessed an awful lot via pointers, since that's how all string
manipulation is done in C.  I would think that simple char variables and
offset references would be rather less common than just "*cp".
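
For reference, the two access forms look like this in source. This is only a sketch with invented function names, and what a given compiler actually emits for each loop is a separate matter.

```c
#include <stddef.h>

/* Direct access: every read goes through "*cp". */
size_t length_by_pointer(const char *s)
{
    const char *cp = s;

    while (*cp != '\0')
        cp++;
    return (size_t)(cp - s);
}

/* Offset access: every read is a base pointer plus an index. */
size_t length_by_index(const char *s)
{
    size_t i = 0;

    while (s[i] != '\0')
        i++;
    return i;
}
```

Both return the same length for any string; they differ only in which kind of char reference the inner loop makes.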
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry