[comp.lang.c] Why unsigned chars not default?

mendozag@pur-ee.UUCP (Grado) (10/21/88)

  A guy around here is trying to port to several machines a program he
 hacked together on a PC using Lattice C.  For some obscure reason he
 decided to use only low-level I/O in his original program.  That
 forced him to "split" integers and save them as 2 bytes, then
 reassemble the integers when the file is read back(!).

  However, much to his dismay, other compilers (LSC, MSC, and Unix)
 require him to declare the I/O buffer (which he also uses for
 arithmetic operations) as unsigned char, else the chars are negative
 numbers when their contents represent values > 127.  (He does a lot
 of arithmetic with characters representing integers.)

   He claims the compilers are at fault and that all the compilers
 should have 'unsigned char' as default for characters so you
 can do all sorts of arithmetic with them.
 Any comments and/or suggestions I can pass along?
 [He basically learned C while developing this program and now has
  the chance to port it to other machines, with copyright and all!]
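
 Roughly, what he ends up needing is something like this (just a
 sketch; the helper names are made up):

	/* Pack a 16-bit int into two bytes and back.  Masking with
	   0xFF keeps each byte in 0..255, so it does not matter
	   whether plain char is signed on the machine. */
	void put2(unsigned char *buf, int n)
	{
	    buf[0] = n & 0xFF;           /* low byte  */
	    buf[1] = (n >> 8) & 0xFF;    /* high byte */
	}

	int get2(unsigned char *buf)
	{
	    return buf[0] | (buf[1] << 8);   /* no sign extension */
	}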


 mendozag@ecn.purdue.edu

friedl@vsi.COM (Stephen J. Friedl) (10/22/88)

In article <9563@pur-ee.UUCP>, mendozag@pur-ee.UUCP (Grado) writes:
> 
>    He claims the compilers are at fault and that all the compilers
>  should have 'unsigned char' as default for characters so you
>  can do all sorts of arithmetic with them.
>  Any comments and/or suggestions I can pass along?

There are very good reasons why compilers do not.

The large (overwhelming?) majority of char variables are used for
characters, where sign is not an issue.  While most modern
architectures can handle all data types in both signed and
unsigned manners, older machines had a "natural" method for byte
handling with a substantial penalty for doing it "the other way".
Apparently it was felt that this penalty was too high for what
was seen as limited utility.

If a machine supports both signed and unsigned byte operations, it
is up to the compiler writer to select whichever one she likes
the most.

The dpANS will allow the /signed/ keyword to do the obvious
thing to chars, but for now it is unwise to rely on anything
other than an explicit unsigned char.
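
That is, under the dpANS you would be able to write all three
flavors explicitly (a sketch, assuming a conforming compiler):

	char          c;    /* signedness up to the implementation */
	signed char   sc;   /* at least -127..127                  */
	unsigned char uc;   /* at least 0..255                     */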

-- 
Steve Friedl    V-Systems, Inc.  +1 714 545 6442    3B2-kind-of-guy
friedl@vsi.com     {backbones}!vsi.com!friedl    attmail!vsi!friedl
---------Nancy Reagan on the Three Stooges: "Just say Moe"---------

gwyn@smoke.BRL.MIL (Doug Gwyn ) (10/23/88)

In article <9563@pur-ee.UUCP> mendozag@ee.ecn.purdue.edu (Victor M Grado) writes:
-   He claims the compilers are at fault and that all the compilers
- should have 'unsigned char' as default for characters so you
- can do all sorts of arithmetic with them.
- Any comments and/or suggestions I can pass along?

Perhaps you could tell him to learn C instead of guessing about it.

henry@utzoo.uucp (Henry Spencer) (10/23/88)

In article <9563@pur-ee.UUCP> mendozag@ee.ecn.purdue.edu (Victor M Grado) writes:
>   He claims the compilers are at fault and that all the compilers
> should have 'unsigned char' as default for characters...

[Possibly we ought to have a "frequently-asked questions" posting in this
group.  Here, slightly modified, is something I posted two years ago,
when a debate raged on this issue.]

Would he still feel this way if all manipulations of unsigned char took
three times as long as those of signed char?  It can happen.

All potential participants in this debate please attend to the following.

- There exist machines (e.g. pdp11) on which unsigned chars are a lot less
	efficient than signed chars.

- There exist machines (e.g. ibm370) on which signed chars are a lot less
	efficient than unsigned chars.

- Many applications do not care whether the chars are signed or unsigned,
	so long as they can be twiddled efficiently.

- For this reason, char is intended to be the more efficient of the two.

- Many old programs assume that char is signed; this does not make it so.
	Those programs are wrong, and have been all along.  Alas, this is
	not a comfort if you have to run them.

- The Father, the Son, and the Holy Ghost (K&R1, H&S, and X3J11 resp.) all
	agree that characters in the "source character set" (roughly, those
	one uses to write C) must look positive.  Actually, the Father and
	the Son gave considerably broader guarantees, but the Holy Ghost
	had to water them down a bit.

- The "unsigned char" type exists (in most newer compilers) because there
	are a number of situations where sign extension is very awkward.
	For example, getchar() wants to do a non-sign-extended conversion
	from char to int.

- X3J11, in its semi-infinite wisdom, has decided that it would be nice to
	have a signed counterpart to "unsigned char", to wit "signed char".
	Therefore it is reasonable to expect that most new compilers, and
	old ones brought into conformance with the yet-to-be-issued standard,
	will give you the full choice:  signed char if you need signs,
	unsigned char if you need everything positive, and char if you don't
	care but want it to run fast.

- Given that many compilers have not yet been upgraded to match even the
	current X3J11 drafts, much less the final end product (which doesn't
	exist yet), any application which cares about signedness should use
	typedefs or macros for its char types, so that the definitions can
	be revised later (see the sketch after this list).

- The only things you can safely put into a char variable, and depend on
	having them come out unchanged, are characters from the native
	character set and small *positive* integers.

- Dennis Ritchie is on record, as I recall, as saying that if he had to do
	it all over again, he would consider changing his mind about making
	chars signed on the pdp11 (which is how this mess got started).
	The pdp11 hardware strongly encouraged this, but it *has* caused a
	lot of trouble since.  It is, however, much too late to make such
	a change to C.
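
As a concrete version of the typedef suggestion above (the names and
the feature-test macro are invented, not from any standard):

	/* Portability typedefs: adjust the definitions per compiler. */
	#ifdef HAVE_SIGNED_CHAR             /* compiler follows X3J11 */
	typedef signed char   schar;        /* when you need signs    */
	#else
	typedef short         schar;        /* bigger, but signed     */
	#endif
	typedef unsigned char uchar;        /* when you need 0..255   */
	/* plain char: when you don't care, but want speed */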
-- 
The meek can have the Earth;    |    Henry Spencer at U of Toronto Zoology
the rest of us have other plans.|uunet!attcan!utzoo!henry henry@zoo.toronto.edu

carroll@s.cs.uiuc.edu (10/23/88)

In article <9563@pur-ee.UUCP> mendozag@ee.ecn.purdue.edu (Victor M Grado) writes:
- (...)  He claims the compilers are at fault and that all the compilers
- should have 'unsigned char' as default for characters (...)

Absolutely not!  The reason is that 'unsigned' is a keyword in C, and
'signed' is not.  I got screwed by this porting stuff to the 3B
systems, where unsigned is the default but the code thought signed
was the default.  There is no way to fix that.  Whereas, if you had
assumed unsigned, you would merely have to put the 'unsigned' keyword
in front of your chars.
FLAME ON -
This bug shows up in 'units', where the exponents are stored in chars,
*signed* chars. On a 3b, this means that units can't deal with negative
powers of dimensions, which is somewhat of a fatal flaw. Although there is
a simple fix (change 'char' to 'short int'), AT&T, through several releases,
*still* hasn't gotten it to work. Who knows what other bugs are floating around
because of something like this?
FLAME OFF
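
For the curious, the failure mode is roughly this (a reconstruction,
not the actual units source):

	#include <stdio.h>

	int main()
	{
	    char expon = -2;    /* a dimension's exponent, e.g. sec^-2 */

	    /* Where plain char is unsigned (e.g. a 3B), expon holds
	       254, the test fails, and negative powers are lost. */
	    if (expon < 0)
	        printf("negative power\n");
	    else
	        printf("non-negative power\n");
	    return 0;
	}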

Alan M. Carroll          "How many danger signs did you ignore?
carroll@s.cs.uiuc.edu     How many times had you heard it all before?" - AP&EW
CS Grad / U of Ill @ Urbana    ...{ucbvax,pur-ee,convex}!s.cs.uiuc.edu!carroll

knudsen@ihlpl.ATT.COM (Knudsen) (10/25/88)

In article <9563@pur-ee.UUCP>, mendozag@pur-ee.UUCP (Grado) writes:
>   A guy around here is trying to port to several machines a program he
>  hacked together on a PC using Lattice C.  For some obscure reason he
>  decided to use only low-level I/O in his original program.  That
>  forced him to "split" integers and save them as 2 bytes, then
>  reassemble the integers when the file is read back(!).

At least on a Motorola micro (6809 or 680x0) you can say
write(chan, int, 2) and put out the whole integer at once.

>  require him to declare the I/O buffer (which he also uses for
>  arithmetic operations) as unsigned char, else the chars are
>  negative numbers when their contents represent values > 127.  (He
>  does a lot of arithmetic with characters representing integers.)

This is often a problem.  If he doesn't want to declare the buffer
unsigned (or his compiler, like mine, doesn't support unsigned char),
he can replace c with (c & 255) whenever c is used as an int.
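
Concretely (a sketch; two's-complement assumed, as on all the
machines mentioned here):

	/* Reassemble a 16-bit value from a buffer of plain char,
	   masking each byte so any sign extension is undone before
	   the arithmetic. */
	int get2val(char *buf)
	{
	    return (buf[0] & 255) | ((buf[1] & 255) << 8);
	}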

>    He claims the compilers are at fault and that all the compilers
>  should have 'unsigned char' as default for characters so you
>  can do all sorts of arithmetic with them.

All compilers should have unsigned char, but why as the default?
Half the time you want short-range *signed* variables, -128 to +127.
And if no unsigned char type is supported, (c & 255) fixes it
relatively cheaply; the reverse fix (unsigned to signed) is harder.
Also the (c & 255) fix protects you against unknown compilers,
by guaranteeing unsigned no matter what the default is.

I DO wish compilers would tell you somehow what the default is;
the 3B2 compilers seem to default to unsigned char, which breaks
a lot of old EOF loops.
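
The classic broken loop, for reference:

	#include <stdio.h>

	int main()
	{
	    char c;    /* WRONG: should be int */

	    /* If char is unsigned, c can never compare equal to EOF
	       (usually -1), so the loop never ends.  If char is
	       signed, a legitimate '\377' byte looks like EOF. */
	    while ((c = getchar()) != EOF)
	        putchar(c);
	    return 0;
	}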

Finally, your friend should minimize char->int conversions as much
as possible; read the stuff in, transfer to an int variable,
and work on that exclusively.
Since he learned C on it, he should now learn some more C by
thoroughly re-working the code for style and efficiency anyway.
I can't stomach some of the stuff I wrote a few years back.
-- 
Mike Knudsen  Bell Labs(AT&T)   att!ihlpl!knudsen
"Lawyers are like handguns and nuclear bombs.  Nobody likes them,
but the other guy's got one, so I better get one too."

crossgl@ingr.UUCP (Gordon Cross) (10/25/88)

In article <9563@pur-ee.UUCP>, mendozag@pur-ee.UUCP (Grado) writes:
> 
>   However, much to his dismay, other compilers (LSC, MSC, and Unix)
>  require him to declare the I/O buffer (which he also uses for
>  arithmetic operations) as unsigned char, else the chars are
>  negative numbers when their contents represent values > 127.  (He
>  does a lot of arithmetic with characters representing integers.)
> 

The proposed ANSI C standard states (I am quoting directly from the document):

"   An object declared as a character (char) is large enough to store any
 member of the required source charcater set [ .. ].  If such a character is
 stored in a char object, its value is guaranteed to be non-negative.  If other
 quantities are stored in a char object, the behavior is implementation
 defined:  the values are treated as either signed or non-negative integers."

Basically, this allows each compiler writer to explore his whims.  Hope it
helps!

                               Gordon Cross

dzoey@umd5.umd.edu (Joe Herman) (10/25/88)

From article <7354@ihlpl.ATT.COM>, by knudsen@ihlpl.ATT.COM (Knudsen):
> In article <9563@pur-ee.UUCP>, mendozag@pur-ee.UUCP (Grado) writes:
>>   A guy around here is trying to port to several machines a program he
>>  hacked together on a PC using Lattice C.  For some obscure reason he
>>  decided to use only low-level I/O in his original program.  That
>>  forced him to "split" integers and save them as 2 bytes, then
>>  reassemble the integers when the file is read back(!).
Try this:
	union intaschar {
	    char hilo[sizeof (int)];   /* the int viewed as raw bytes */
	    int  val;
	} foo;

	foo.val = somenumber;

	/* note: byte order is machine-dependent, so a file written
	   this way is only portable between like machines */
	write (fh, foo.hilo, sizeof (int));

if for some reason he can't just write the integer out like below.

> 
> At least on a Motorola micro (6809 or 680x0) you can say
> write(chan, int, 2) and put out the whole integer at once.

Ick, I assume you mean:
  write (chan, &int, sizeof (int));   /* excuse the overloading of 'int' */

otherwise you're passing the *value* of 'int' as the buffer address
and writing out two bytes from wherever that points.

Also, for PC's (at least with microsoft) make sure you open the file in
binary mode if you're going to do binary I/O.
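
With Microsoft C, for example (O_BINARY is a Microsoft extension;
the function name is made up):

	#include <fcntl.h>
	#include <io.h>      /* open() in Microsoft C */

	int open_binary(char *path)
	{
	    /* Without O_BINARY the MS-DOS runtime translates CR/LF
	       pairs and treats control-Z as end-of-file, corrupting
	       binary data. */
	    return open(path, O_RDONLY | O_BINARY);
	}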


> I DO wish compilers would tell you somehow what the default is;
> the 3B2 compilers seem to default to unsigned char, which breaks
> a lot of old EOF loops.

Remember, functions like getchar, getc, &c. return an int, not a char,
which gets you around the problem of '\377' being confused with EOF.
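
That is, the portable idiom is:

	#include <stdio.h>

	/* Copy stdin to stdout.  c is an int so that EOF, an
	   out-of-band value (usually -1), can never be confused
	   with a real '\377' byte. */
	void copy()
	{
	    int c;

	    while ((c = getchar()) != EOF)
	        putchar(c);
	}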


> -- 
> Mike Knudsen  Bell Labs(AT&T)   att!ihlpl!knudsen
> "Lawyers are like handguns and nuclear bombs.  Nobody likes them,
> but the other guy's got one, so I better get one too."

Nice quote.

			Joe Herman
			The University of Maryland

dzoey@terminus.umd.edu


-- 
"Everything is wonderful until you know something about it."

gwyn@smoke.BRL.MIL (Doug Gwyn ) (10/26/88)

In article <2723@ingr.UUCP> crossgl@ingr.UUCP (Gordon Cross) writes:
>The proposed ANSI C standard states (I am quoting directly from the document):

The definition of "character" has been changed.  However, whether a plain
char acts like signed char or unsigned char is up to the implementation,
as it has been since the early days of C.

gwyn@smoke.BRL.MIL (Doug Gwyn ) (10/26/88)

In article <207600005@s.cs.uiuc.edu> carroll@s.cs.uiuc.edu writes:
>Although there is a simple fix (change 'char' to 'short int'), AT&T,
>through several releases, *still* hasn't gotten it to work.
>Who knows what other bugs are floating around because of something
>like this?

Apparently nobody is paid to go around cleaning up old (yet still
important) code.  My favorite was the "#if u3b5|u3b2"s scattered
around in the PWB/Graphics sources, where fixing the original bug
("char c = getchar()" etc.) properly would have been much simpler.

One would hope that by now all the bugs that Guy Harris, I, and
others tracked down have been fixed in the AT&T master sources,
but somehow I doubt it.