jwg@duke.UUCP (Jeffrey William Gillette) (10/16/86)
[] OK, I've been bitten. I admit it. MSC 4.0 defaults 'char' to 'signed
char'. For standard ASCII there is no difference between 'signed char'
and 'unsigned char'. When I get to IBM's extensions to ASCII the
situation is much different!

<ctype.h> makes the following #define:

	#define isupper(c)	( (_ctype+1)[c] & _UPPER )

where '_UPPER' is a bit mask used in a table of character definitions
('_ctype'). This works great when c = 65 ('A'), but when c = 154
('U'umlaut) the macro produces the following:

	( (_ctype+1)[-102] & _UPPER )

an obviously foolish bug.

The problem here lies with Microsoft. The #defines in <ctype.h> are
sloppy. The example above should have been

	#define isupper(c)	( (_ctype+1)[(unsigned char)(c)] & _UPPER )

Beyond this particular and annoying consequence of MS's decision to make
'char' = 'signed char', I have two more general questions (thus the post
to net.lang.c).

1) Do other C compilers make 'char' a signed quantity by default?

2) What possible justification is there for this default? Is not 'char'
primarily a logical (as opposed to mathematical) quantity? What I mean
is, what is the definition of a negative 'a'? I can understand the
desirability of allowing 'signed char' for gonzo programmers who won't
use 'short', or who want to risk future compatibility of their code on
the bet that useful characters will always remain 7-bit entities.

Peace,

Jeffrey William Gillette
Humanities Computing Facility
Duke University
duke!jwg
--
Jeffrey William Gillette	uucp:  mcnc!duke!jwg
Humanities Computing Project	bitnet: DYBBUK @ TUCCVM
Duke University
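To make the failure mode above concrete, here is a minimal sketch; the
table and mask names are illustrative stand-ins for Microsoft's private
_ctype machinery, and it assumes an 8-bit signed char on a
two's-complement machine:

	#include <stdio.h>

	#define MY_UPPER 0x01			/* illustrative bit mask */
	static unsigned char my_ctype[256];	/* illustrative table */

	int main(void)
	{
		char c = (char)154;	/* U-umlaut in the IBM PC set */

		my_ctype['A'] |= MY_UPPER;
		my_ctype[154] |= MY_UPPER;

		/* With signed chars, c promotes to the int -102,
		 * an out-of-bounds index. */
		printf("raw index:  %d\n", (int)c);

		/* Casting through unsigned char recovers index 154. */
		printf("safe index: %d\n", (int)(unsigned char)c);
		printf("isupper?    %d\n",
		    (my_ctype[(unsigned char)c] & MY_UPPER) != 0);
		return 0;
	}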
guy@sun.UUCP (10/17/86)
> 1) Do other C compilers make 'char' a signed quantity by default?

Yes. Lots and lots of them, including the very first C compiler ever
written (if there was an earlier one, Dennis, let me know...) - the
PDP-11 C compiler.

> 2) What possible justification is there for this default?

1) When the PDP-11 C compiler was written, ASCII characters *were* 7-bit
characters, and there was no general use of 8-bit characters, and 2) the
PDP-11 treated bytes as signed, rather than unsigned, so referencing
ASCII characters as unsigned rather than signed cost some time and bought
you nothing. I suspect Microsoft did this to make less-than-portable code
written for PDP-11s and VAXes work on 8086-family machines without
change.

> Is not 'char' primarily a logical (as opposed to mathematical) quantity?

Yes, but the people to complain to here are ultimately the designers of
the PDP-11 (although a lot of string manipulation on PDP-11s could be
done using unsigned characters without much penalty).

> I can understand the desirability of allowing 'signed char' for gonzo
> programmers who won't use 'short',

It's not a question of "gonzo programmers who won't use 'short'". There
are times when you absolutely *must* have a one-byte number in a
structure; "short" just won't cut it here. (Bit fields would, perhaps,
except that you can't take the address of a bit field.) Structures
representing device registers, or representing fields in other
externally-specified data, are an example of this. Also, if you have a
*huge* array of integers in the range -128 to 127, you may take a
significant performance hit by using "short" rather than "char"
(remember, "short" takes twice the amount of memory that "char" does on
most implementations).

> or who want to risk future compatibility of their code on the bet that
> useful characters will always remain 7-bit entities.

They're risking nothing. "signed char" is a gross way of saying "short
short int", not a way of saying "signed character" (which, as you say, is
meaningless). Unfortunately, C originally didn't have "short" or "long",
and when they were added they did not cascade.

I presume, by the way, that "isupper(<u-umlaut>)" is intended to return 0
and "isupper(<U-umlaut>)" is intended to return 1. If Microsoft didn't
put the extended character set into the "ctype" tables, the way that the
indexing is done is irrelevant.
--
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)
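A small sketch of Guy's point about one-byte structure members; the
device and its field names are invented for illustration:

	/* Hypothetical memory-mapped device: each register is exactly
	 * one byte wide, so the members must be char-sized.  A "short"
	 * member would change the layout. */
	struct dev_regs {
		unsigned char	data;	/* received/transmitted byte */
		unsigned char	status;	/* ready/error bits */
		signed char	level;	/* a genuinely signed one-byte count */
	};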
thomps@gitpyr.gatech.EDU (Ken Thompson) (10/18/86)
In article <8719@duke.duke.UUCP>, jwg@duke.UUCP (Jeffrey William Gillette) writes:
> []
> 1) Do other C compilers make 'char' a signed quantity by default?

I use a Masscomp system compatible with System V and BSD 4.2 which has
signed characters by default. I find this very annoying when porting
software from machines with the opposite convention, but it causes no
problem with code written for this machine.

> 2) What possible justification is there for this default?  Is not
> 'char' primarily a logical (as opposed to mathematical) quantity?  What
> I mean is, what is the definition of a negative 'a'?  I can understand
> the desirability of allowing 'signed char' for gonzo programmers who
> won't use 'short', or who want to risk future compatibility of their
> code on the bet that useful characters will always remain 7-bit entities.

I see no justification for signed characters, and the concept of a signed
character is somewhat strange. The problem arises because chars used in
an expression in C are automatically converted to type int. Signed
characters come about when the conversion is made from a char, which is
an 8-bit quantity, to an int, which is 16 bits or larger. I do not know
of any C compilers which actually view a char as an 8-bit signed entity.
Instead, the char becomes negative due to sign extension during
conversion to int.
--
Ken Thompson  Phone : (404) 894-7089
Georgia Tech Research Institute
Georgia Institute of Technology, Atlanta Georgia, 30332
...!{akgua,allegra,amd,hplabs,ihnp4,seismo,ut-ngp}!gatech!gitpyr!thomps
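A two-line illustration of the sign extension Ken describes, assuming an
8-bit two's-complement char that the compiler treats as signed:

	#include <stdio.h>

	int main(void)
	{
		char c = (char)154;	/* bit pattern 0x9A */
		int  i = c;		/* sign-extended during conversion */

		printf("%d\n", i);	/* prints -102, not 154 */
		return 0;
	}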
plocher@puff.wisc.edu (John Plocher) (10/19/86)
guy@sun.UUCP responds to another poster about ctype macros and characters
on the IBM PC with the 8th bit set:

>> 1) Do other C compilers make 'char' a signed quantity by default?
>
>I presume, by the way, that "isupper(<u-umlaut>)" is intended to return 0
>and "isupper(<U-umlaut>)" is intended to return 1.  If Microsoft didn't put
>the extended character set into the "ctype" tables, the way that the
>indexing is done is irrelevant.

I hope you both remember that isANYTHING(x) is only defined to work if
isascii(x) is true!

isascii(u-umlaut) is FALSE!

Thus, isupper(u-umlaut) does not NEED to work.
--
 harvard-\	/- uwmacc!uwhsms!plocher  (work)
  seismo-->!uwvax!<				John Plocher
   topaz-/	\- puff!plocher           (school)
	"Never trust an idea you get sitting down" - Nietzsche
chapman@cory.Berkeley.EDU (Brent Chapman) (10/19/86)
In article <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>MSC 4.0 defaults 'char' to 'signed char'.

[ it defaulted to 'unsigned char' in previous versions of MSC -- Brent]

[ details relating to a gotcha in header files, because Microsoft didn't
cast a (possibly) negative char value into an unsigned value when using
it to index an array, deleted ]

>What possible justification is there for this default?  Is not
>'char' primarily a logical (as opposed to mathematical) quantity?  What
>I mean is, what is the definition of a negative 'a'?  I can understand
>the desirability of allowing 'signed char' for gonzo programmers who
>won't use 'short', or who want to risk future compatibility of their
>code on the bet that useful characters will always remain 7-bit entities.

This brings up some interesting questions and ambiguities concerning
K&R's definition of C. I haven't seen the proposed ANSI standard, so I
can't comment on it. But K&R will do to illustrate the ambiguities;
perhaps someone else can point out if and how the proposed standard deals
with them.

On page 34, K&R define a 'char' to be "a single byte, capable of holding
one character in the local character set." On page 40, they say "The
language does not specify whether variables of type char are signed or
unsigned quantities." This seems to imply that the implementor is free to
choose the default that he feels best suits his implementation.

On most machines, this is a moot point, since most machines only use the
0 to 127 range for character values, which is available regardless of
whether the char is signed or unsigned. On the PC, however, it _does_
make a difference, because the upper 128 characters of the PC's character
set _are_ printable, and are numbered from 128 through 255. Logic would
seem to indicate that 'unsigned char' is the reasonable choice for the
default on a C compiler for the PC.

Unfortunately, most other C implementations, especially UNIX C
implementations, seem to default char to 'signed'. (Note that I've been
assured of this by knowledgeable sources, but don't have any first-hand
knowledge, so I could be wrong.) This is a reasonable choice because, in
the original K&R C definition, there is no 'signed' keyword. Therefore,
everything should default signed, because if it defaults unsigned,
there's no way to change it to 'signed'. Many implementations now include
the 'signed' keyword, however. I don't know if it is a part of the
proposed ANSI standard, but I think that it probably is.

Now, Microsoft apparently decided to change their default for chars from
'unsigned', which is what it was in versions of the compiler previous to
Ver 4.0, and which makes sense for a PC, to 'signed', which makes sense
because of K&R's lack of a 'signed' keyword, and because most other
implementations are that way. The original poster got bitten because
Microsoft used a 'char' (which could be negative) as an array index,
instead of casting it to 'unsigned char', in one of their library header
files.

Perhaps the most general, portable solution is not to use char variables
for counting or array indexing. If you need a counter, use a short, which
will default signed unless you say otherwise. If you need an array index,
cast to an 'unsigned char' or an 'unsigned short'. Unfortunately, there
is no guarantee that a short is as small as a char, so you may be wasting
some space. Worse, there is no guarantee that a short is as _long_ as a
char, although I doubt there is any implementation where this is true.

You currently can't count on whether a char will be signed or unsigned.
Does the proposed ANSI standard address this?

Fortunately, with MSC Ver 4.0, you can have your cake and eat it too.
There is a command-line option to the compiler that will change the
default from 'signed' to 'unsigned'. I think it's '-J', but I'm not
certain, since I'm at home and my manuals are at work.

Brent
--
Brent Chapman

chapman@cory.berkeley.edu	or	ucbvax!cory!chapman
gwyn@brl-smoke.ARPA (Doug Gwyn ) (10/19/86)
In article <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>	#define isupper(c)	( (_ctype+1)[c] & _UPPER )
>
>The problem here lies with Microsoft.

No, the problem lies with the programmer. The is*() functions have (int),
not (char), arguments. When you feed one a (char), it will be promoted to
an (int) by the usual rules, including possible sign extension. The macro
definition acts the same as a function in this regard, since array
indices are (int), not (char), also. Microsoft's definition is correct.

>1) Do other C compilers make 'char' a signed quantity by default?

Dennis Ritchie's original (PDP-11) C compiler did.

>2) What possible justification is there for this default?

(a) less cost on machines like the PDP-11
(b) the programmer can, using suitable code, force whatever behavior he
    wants

>I mean is, what is the definition of a negative 'a'?

It might surprise you to learn that 'a' represents an (int) constant, not
a (char). C (char)s are just short integral types whose signedness
depends on the implementation (however, (signed char) and (unsigned char)
have definite signedness). Dennis intended that sizeof(char)==1, but I
can make a strong argument that that isn't necessary.

P.S. I suggest people learn what is going on before raving about it.
That would sure reduce the noise level of net.lang.c.
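Doug's point that 'a' is an (int) constant is easy to check; a one-line
test like this prints sizeof(int), not 1, for the character constant:

	#include <stdio.h>

	int main(void)
	{
		/* 'a' is an int constant, so sizeof 'a' == sizeof(int). */
		printf("sizeof 'a' = %d, sizeof(char) = %d\n",
		    (int)sizeof 'a', (int)sizeof(char));
		return 0;
	}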
mark@ems.UUCP (Mark Colburn) (10/20/86)
In article <8273@sun.uucp>, guy@sun.UUCP writes:
> > I can understand the desirability of allowing 'signed char' for gonzo
> > programmers who won't use 'short',
>
> It's not a question of "gonzo programmers who won't use 'short'".

It is important to note that K&R define that:

	char	8 or more bits
	short	16 or more bits

Although these values may be implementation specific. On my 68020-based
machine, shorts are 16 bits. When I need an 8-bit unsigned value (e.g. a
byte) in my code (which happens quite frequently when you are writing
software to support 8-bit CPU's) I use 'unsigned char'.

I got myself into all sorts of trouble when I was first using C because I
assumed that if an int is 16 bits, then a short must be 8. Right? Wrong!
On the compiler that I was using, int was 16 bits and so was short. This
is consistent with K&R (and, I believe, the proposed ANSI standard).

Therefore, the only portable way to express a true byte (8-bit) value is
with an 'unsigned int' declaration. This may still get you into trouble
when you are working on a compiler that uses characters that are more
than 8 bits. Don't laugh, there are some out there. It is also allowed
for in the language definition. Notice that a character may be 8 or more
bits. Since machines that use chars that are larger than 8 bits are
relatively infrequent, I callously disregard their existence in my code.
(I am sure that I will get bitten by it one of these days, but hey, gives
a guy some kinda job security.)
--
Mark H. Colburn		UUCP: ihnp4!rosevax!ems!mark
EMS/McGraw-Hill		ATT: (612) 829-8200
9855 West 78th Street
Eden Prairie, MN  55344
henry@utzoo.UUCP (Henry Spencer) (10/20/86)
> 1) Do other C compilers make 'char' a signed quantity by default?

Yes. Almost any C compiler for machines like the PDP11, the VAX, the
8088, and so forth, will.

> 2) What possible justification is there for this default?  Is not
> 'char' primarily a logical (as opposed to mathematical) quantity? ...

The problem started with the PDP11, the first machine C was implemented
on. A minor quirk of the 11 made it substantially more efficient to
manipulate characters as signed entities. This hardware quirk has been
carried over, unfortunately, into a good many newer machines that have
imitated the 11 to some degree. Compilers for these machines have a
choice of generating inefficient code or using signed characters. Since
any decent C documentation warns you that the signedness or lack thereof
of characters is not portable, this is considered legitimate.

I believe Dennis is on record as mildly regretting the original decision
to go along with the hardware's prejudices, but it's a bit late now.
--
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry
chapman@cory.Berkeley.EDU (Brent Chapman) (10/21/86)
In article <14@ems.UUCP> mark@ems.UUCP (Mark Colburn) writes:
>It is important to note that K&R define that:
>
>	char	8 or more bits
>	short	16 or more bits

Just _WHERE_ does K&R say this? No place that I've ever seen... The only
thing that I can figure is that you are inferring these "minimum" values
from the table of _sample_ type sizes on p. 34; this is not a good thing
to do.

Note to everyone: If you're going to quote from something, especially
K&R, _please_ check to make sure it says what you _think_ it says, and
then include the page number of the info which supports your posting.

>Although these values may be implementation specific.  On my 68020-based
>machine, shorts are 16 bits.  When I need an 8-bit unsigned value (e.g. a
>byte) in my code (which happens quite frequently when you are writing
>software to support 8-bit CPU's) I use 'unsigned char'.
>
>I got myself into all sorts of trouble when I was first using C because I
>assumed that if an int is 16 bits, then a short must be 8.  Right?  Wrong!

Definitely wrong. On p. 34, K&R say "The intent is that short and long
should provide different lengths of integers _where practical_ [emphasis
mine -- Brent]; int will normally reflect the most 'natural' size for a
particular machine." As you can see, each compiler is free to interpret
short and long as appropriate for its own hardware. About all you should
count on is that short is no longer than long.

Nowhere (that I'm aware of, anyway, and I looked carefully for it) does
K&R say that ints must be at least 16 bits, nor that chars must be at
least 8 bits. I seem to recall hearing about some screwy machine whose
"character size" and "most natural integer size" were both 12 bits; for
that machine, types 'char', 'int', and 'short' were all 12-bit
quantities.

>Therefore, the only portable way to express a true byte (8-bit) value is
>with an 'unsigned int' declaration.  This may still get you into trouble
>when you are working on a compiler that uses characters that are more
>than 8 bits.

'unsigned int'? Are you sure you don't mean 'unsigned char'? But even if
you do, there's no guarantee that you get what you call a "true byte";
there's nothing in K&R that outlaws a 7-bit char, for instance. The
definition of char (again, on p. 34) is "a single byte, capable of
holding one character in the local character set". Note that "byte"
doesn't automatically mean "8 bits".

Brent
--
Brent Chapman

chapman@cory.berkeley.edu	or	ucbvax!cory!chapman
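Since the sizes are implementation-defined, the portable move is to check
rather than assume; a throwaway program like this (a sketch, not from
K&R) reports what a given compiler actually does:

	#include <stdio.h>

	int main(void)
	{
		/* sizeof counts in chars; sizeof(char) is 1 by definition. */
		printf("char  %d\n", (int)sizeof(char));
		printf("short %d\n", (int)sizeof(short));
		printf("int   %d\n", (int)sizeof(int));
		printf("long  %d\n", (int)sizeof(long));
		return 0;
	}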
mikes@apple.UUCP (Mike Shannon) (10/22/86)
In <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>
>MSC 4.0 defaults 'char' to 'signed char'. ...
> This works great when c = 65 ('A'), but when c = 154
>('U'umlaut) the macro produces the following: ( (_ctype+1)[-102] & _UPPER )
> ....

The problem is that the u-umlaut char is being treated as negative. K&R,
page 183, section 6.1 says "... but it is guaranteed that a member of the
standard character set is non-negative."

Apple experienced the same problem with an extended character set. I
believe that u-umlaut is part of your machine's standard character set,
and so I would argue that MSC does not conform to K&R in this respect.
--
			Michael Shannon {apple!mikes}
james@reality1.UUCP (james) (10/22/86)
In article <8719@duke.duke.UUCP>, jwg@duke.UUCP (Jeffrey William Gillette) writes:
> MSC 4.0 defaults 'char' to 'signed char'. ...
> 2) What possible justification is there for this default? ...

I hate to bring up the manual, but there is a /J option now that makes
the char default to an unsigned value. There is also a keyword "signed"
to override that default when /J is used.

Which also brings up keeping your software up to date, if you haven't
upgraded to version 4.0 yet...
--
James R. Van Artsdalen		...!ut-ngp!utastro!osi3b2!james
"Live Free or Die"
tim@ism780c.UUCP (Tim Smith) (10/25/86)
In article <14@ems.UUCP> mark@ems.UUCP (Mark Colburn) writes:
>
>It is important to note that K&R define that:
>
>	char	8 or more bits
>	short	16 or more bits
>

Where do K&R say this?
--
member, all HASA divisions

POELOD
ECBOMB
--------------
 ^-- Secret Satanic Message

Tim Smith	USENET: sdcrdcf!ism780c!tim	Compuserve: 72257,3706
		Delphi or GEnie: mnementh
chris@umcp-cs.UUCP (Chris Torek) (10/29/86)
>In article <14@ems.UUCP> mark@ems.UUCP (Mark Colburn) writes:
>>It is important to note that K&R define that:
>>	char	8 or more bits
>>	short	16 or more bits

In article <661@zen.BERKELEY.EDU> chapman@cory.Berkeley.EDU.UUCP (Brent Chapman) writes:
>Just _WHERE_ does K&R say this?

K&R do not define minimum sizes. They do provide a table listing some
existing implementations (pp. 34 and 182), and they say also this:

	A character or a short integer may be used wherever an
	integer may be used.  In all cases the value is converted
	to an integer.  Conversion of a shorter integer to a longer
	always involves sign extension; integers are signed
	quantities.  Whether or not sign-extension occurs for
	characters is machine dependent, but it is guaranteed that
	a member of the standard character set is non-negative.  Of
	the machines treated by this manual, only the PDP-11
	sign-extends.

The most recent X3J11 (`ANSI C') draft I saw, however, *did* define
minimum sizes in bits for `char', `short', `int', and `long'. It still
left `char' sign extension up to the compiler, but added the types
`signed char' and `unsigned char'.

>>Although these values may be implementation specific.  On my 68020
>>based machine, shorts are 16 bits.  When I need an 8 bit unsigned
>>value (e.g. a byte) in my code (which happens quite frequently when
>>you are writing software to support 8 bit CPU's) I use 'unsigned
>>char'.

Note that `unsigned char' is not valid according to K&R (although most C
compilers have such a type).

In a particular set of `portable' programs I wrote (and am still
writing), I needed 8, 16, 24, and 32 bit integers, with both signed and
unsigned varieties for 8, 16, and 24 bits. Toward this end I have one
machine-dependent `#include' file called `types.h'; in it I define the
following:

	UnSign8(n)	produce an unsigned 8 bit integer value given
			the possibly-signed integer value n
	Sign8(n)	produce a sign extended 8 bit integer value
			(i.e., 128 -> -128; 255 -> -1)
	UnSign16(n)	produce an unsigned 16 bit value
	Sign16(n)	sign extend a 16 bit value
	UnSign24(n)	produce an unsigned 24 bit value
	Sign24(n)	sign extend a 24 bit value
	i32		a 32 (minimum) bit integer type

Instead of trying to find types of the proper sizes, I have one that is
large enough for all, and a set of macros to coerce it so as to properly
represent the smaller values. I believe this can be implemented on any
machine on which the software could ever run. The macros themselves are
machine dependent, but well-isolated.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7690)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu
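One plausible way to write such a `types.h' (a guess at the idea, not
Chris's actual file; it assumes a two's-complement machine whose `long'
is at least 32 bits, and note that each macro evaluates its argument
more than once):

	/* types.h -- machine-dependent integer-width helpers (sketch) */
	typedef long i32;		/* at least 32 bits here */

	/* Keep only the low N bits, treated as unsigned. */
	#define UnSign8(n)	((i32)(n) & 0xFF)
	#define UnSign16(n)	((i32)(n) & 0xFFFF)
	#define UnSign24(n)	((i32)(n) & 0xFFFFFF)

	/* Sign extend the low N bits: Sign8(255) == -1, Sign8(128) == -128. */
	#define Sign8(n)	(UnSign8(n) > 0x7F ? \
				    UnSign8(n) - 0x100 : UnSign8(n))
	#define Sign16(n)	(UnSign16(n) > 0x7FFF ? \
				    UnSign16(n) - 0x10000L : UnSign16(n))
	#define Sign24(n)	(UnSign24(n) > 0x7FFFFF ? \
				    UnSign24(n) - 0x1000000L : UnSign24(n))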
bet@ecsvax.UUCP (Bennett E. Todd III) (10/31/86)
In article <228@apple.UUCP> mikes@apple.UUCP (Mike Shannon) writes:
>In <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>>
>>MSC 4.0 defaults 'char' to 'signed char'. ...
>> This works great when c = 65 ('A'), but when c = 154
>>('U'umlaut) the macro produces the following: ( (_ctype+1)[-102] & _UPPER )
>> ....
>	The problem is that the u-umlaut char is being treated as negative.
>K&R, page 183, section 6.1 says "... but it is guaranteed that a member of
>the standard character set is non-negative."
>	Apple experienced the same problem with an extended character set.
>I believe that u-umlaut is part of your machine's standard character set,
>and so I would argue that MSC does not conform to K&R in this respect.

I haven't looked at MSC 4.0 yet, but I've been using MSC 3.0 for a while,
and it sounds like this hasn't changed. In the documentation, it is
exceedingly clear about this (and agrees with proper portable programming
practice for UNIX systems):

	int isascii(c);	/* test for ASCII character (0x00-0x7F) */
	[...]
	"The isascii routine produces a meaningful result for all
	integer values. However, the remaining routines produce a
	defined result only for integer values corresponding to the
	ASCII character set (that is, only where isascii holds true)
	or for the non-ASCII value EOF (defined in stdio.h)."

I'd say MSC is completely in the right on this one; C is a portable
programming language, and MSC is the best implementation I've seen for
porting code to the PC. The documentation is clear; they implement a C
environment with an ASCII, not "Extended ASCII", character set. If
portability is not of interest, then you can hack up the ctype macros, or
whatever. However, portable programs should always use an idiom like

	if (isascii(c) && isupper(c)) {

or whatever. Program defensively! Write programs that run *anywhere*, and
check for anything that might possibly go wrong. Then give them away for
free.

-Bennett
--
Bennett Todd -- Duke Computation Center, Durham, NC  27706-7756; (919) 684-3695
UUCP: ...{decvax,seismo,philabs,ihnp4,akgua}!mcnc!ecsvax!duccpc!bet
BITNET: DBTODD@TUCC.BITNET -or- DBTODD@TUCCVM.BITNET -or- bet@ECSVAX.BITNET
jsdy@hadron.UUCP (Joseph S. D. Yao) (11/07/86)
In article <228@apple.UUCP> mikes@apple.UUCP (Mike Shannon) writes:
>In <8719@duke.duke.UUCP> jwg@duke.UUCP (Jeffrey William Gillette) writes:
>>MSC 4.0 defaults 'char' to 'signed char'. ...
>> This works great when c = 65 ('A'), but when c = 154
>>('U'umlaut) the macro produces the following: ( (_ctype+1)[-102] & _UPPER )
>> ....

The standard (K&R, and now X3J11) has always held that 'char' is signed
or unsigned, depending on the machine. For good, usable, portable code
I've tended to do arithmetic ops (such as indexing an array) in shorts or
ints, masking input chars. This is also great for testing EOF ...

Also, the ctype macros should NOT be used with any char > 0177, simply
because many or most current implementations only use a table that large.
--
	Joe Yao		hadron!jsdy@seismo.{CSS.GOV,ARPA,UUCP}
			jsdy@hadron.COM (not yet domainised)
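A sketch of the defensive idiom Joe and Bennett describe: do the
classification on a masked, non-negative int, and guard the ctype macros
with isascii():

	#include <stdio.h>
	#include <ctype.h>

	/* Count the upper-case ASCII letters in a string whose chars
	 * may have the high bit set (e.g. IBM PC extended characters). */
	int count_upper(char *s)
	{
		int c, n = 0;

		while (*s != '\0') {
			c = *s++ & 0377;	/* mask into 0..255 */
			if (isascii(c) && isupper(c))
				n++;
		}
		return n;
	}

	int main(void)
	{
		printf("%d\n", count_upper("Hello, World"));	/* 2 */
		return 0;
	}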