[comp.lang.c] How to write an 8-bit clean program

ra@is.uu.no (Robert Andersson) (02/10/90)

An important subject for us Europeans with all our strange character sets.
But, what is the best way to do it?

Of course, using the eighth bit as a flag, or assuming that letters range from
'A' to 'Z', etc. are well-known things to avoid, and that is not what
I would like to discuss. What I would like is to hear other people's opinions
on the signed/unsigned char issue.

The type 'char' as defined in C ranges on some implementations from
-128 to 127, on others from 0 to 255. The rest of this message assumes
the compiler has signed chars as a default.

In an 8-bit set some characters will have values > 127.
Is it kosher to store these values in char variables?

Suppose you do it, then as long as you simply use char variables in
simple assignments/tests or as buffers, all is probably OK. As soon as
you use the variable in arithmetic expressions or assignments to other
types things become more muddy.
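
To illustrate the muddy part, here is a minimal sketch, assuming 8-bit chars, of what happens when a character with the top bit set is widened to int. The function names are made up for this example.

```c
/* Hypothetical helper: widen a character to int without sign extension. */
int char_to_index(char c)
{
    return (unsigned char)c;    /* always in the range 0..255 */
}

/* What a signed-char compiler does implicitly when a char becomes an int. */
int widen_signed(signed char c)
{
    return c;                   /* negative when the top bit is set */
}
```

On a compiler where plain char is signed, using such a character directly as an array index or small integer gives a negative value; going through unsigned char first avoids that.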

So, the better way to do it might be to change all char variables in
your program to unsigned char? But that opens another can of worms.
Many compilers spit out warnings for expressions like:
	unsigned char *junk = "morejunk";
lint dislikes things like:
	unsigned char buff[100];
	write(fd, buff, 100);
Of course you could add casts in all those places, but it sort of doesn't
'feel' right to do that.

It seems you lose in either case.
Opinions?
-- 
Robert Andersson, International Systems A/S, Oslo, Norway.
Internet:         ra@is.uu.no
UUCP:             ...!{uunet,mcsun,ifi}!is.uu.no!ra

henry@utzoo.uucp (Henry Spencer) (02/11/90)

In article <1990Feb10.151053.16702@is.uu.no> ra@is.uu.no (Robert Andersson) writes:
>In an 8-bit set some characters will have values > 127.
>Is it kosher to store these values in char variables?

Well, there is of course the question of whether the variable is big enough
for the value, disregarding the sign issue.  However, that aside, the real
issue here is assigning a value which would fit if chars were unsigned but
isn't in the value space of a char which is signed.  This conversion has
implementation-defined results (section 3.2.1.2, Oct 88 draft).  It might
cause an overflow if somebody is being paranoid.  One would hope that
compilers which do that will quickly be fixed not to.

>Suppose you do it, then as long as you simply use char variables in
>simple assignments/tests or as buffers, all is probably OK. As soon as
>you use the variable in arithmetic expressions or assignments to other
>types things become more muddy.

By definition, expression/conversion code which cares whether char is
signed or not is unportable.  The correct approach is to fix it, in
one way or another.

>So, the better way to do it might be to change all char variables in
>your program to unsigned char? But that opens another can of worms.
>Many compilers spit out warnings...

Moreover, on some machines you will take a serious efficiency hit for
this.  The right thing to do is to use "signed char" or "unsigned char"
in places where the code cares, and use "char" when it does not care.
Most of the time, cleanly-written code does not care.  Using chars as
characters -- rather than small integers -- seldom runs into sign issues
(although there are occasional hassles in table lookup and function calls).
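
The table-lookup hassle can be sketched as follows, assuming a table with one entry per possible character value. The table and function names are hypothetical; the point is only the cast at the indexing step.

```c
#include <limits.h>

/* Hypothetical table: one flag per possible character value. */
static unsigned char is_vowel_tab[UCHAR_MAX + 1];

/* Mark every character in the string as a vowel. */
void init_vowels(const char *vowels)
{
    while (*vowels != '\0')
        is_vowel_tab[(unsigned char)*vowels++] = 1;
}

/* Index through unsigned char so a signed char never yields a negative index. */
int is_vowel(char c)
{
    return is_vowel_tab[(unsigned char)c];
}
```

The rest of the program can keep using plain char; only the code that cares about the numeric value says so explicitly.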
Code which uses chars as small integers almost certainly should be using
"signed char" or "unsigned char" to explicitly indicate its needs.
-- 
SVR4:  every feature you ever |     Henry Spencer at U of Toronto Zoology
wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu

minow@mountn.dec.com (Martin Minow) (02/11/90)

In article <1990Feb10.151053.16702@is.uu.no> ra@is.uu.no (Robert Andersson)
asks for ideas regarding 8-bit character sets.
>
>It seems you lose in either case.
>Opinions?
>-- 
Unfortunately, about all you can do is to sprinkle casts (char *) to
(unsigned char *) or (void *) throughout your program.  You can make
this slightly more palatable by typedefs, such as
	typedef unsigned char	byte;
	typedef byte		*string;
but you still have to tiptoe around lint.  If you have an ANSI-compliant
compiler, you can cheat by defining function arguments (and string
pointers) as "void *".
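
A minimal sketch of the "void *" approach, assuming an ANSI compiler. The routine byte_sum is invented for illustration; it accepts both char and unsigned char buffers without a cast at the call site.

```c
#include <stddef.h>

/* Hypothetical routine: add up the bytes of any buffer. */
unsigned long byte_sum(const void *buf, size_t len)
{
    const unsigned char *p = buf;   /* void * converts without a cast in C */
    unsigned long sum = 0;

    while (len-- > 0)
        sum += *p++;
    return sum;
}
```

The cast-free conversion happens once, inside the function, instead of at every caller.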

A more subtle problem is that you must make sure that comparison
routines such as strcmp and the more complicated regular-expression
routines work correctly for 8-bit characters.
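
As a sketch of what "work correctly" means for comparison, the routine below compares strings byte by byte as unsigned char, so characters above 127 sort after the ASCII range regardless of whether plain char is signed. The name bytecmp is hypothetical. ANSI C specifies this unsigned interpretation for strcmp, but older libraries may differ.

```c
/* Hypothetical strcmp-like routine that compares bytes as unsigned char. */
int bytecmp(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    while (*p != '\0' && *p == *q) {
        p++;
        q++;
    }
    /* Yields -1, 0 or 1 without relying on subtraction of chars. */
    return (*p > *q) - (*p < *q);
}
```

Note that this gives a consistent byte order only; it does not produce the alphabetical order of any particular national character set.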

Martin Minow
minow@thundr.enet.dec.com

karl@haddock.ima.isc.com (Karl Heuer) (02/15/90)

In article <1990Feb11.012110.2338@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>Most of the time, cleanly-written code does not care.  Using chars as
>characters -- rather than small integers -- seldom runs into sign issues

The problem is that getc() and the <ctype.h> functions deal with a type which
should have been "char" but in fact is a certain subrange of "int", namely the
union of {EOF} and the values of "unsigned char".  Thus, even cleanly-written
code has to be sprinkled with casts to convert "plain char" to "unsigned char"
before handing it to "isprint()".  (I'm assuming ANSI C semantics here, so
there's no isascii() nonsense.)
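
A minimal sketch of the cast in question, assuming ANSI C semantics for <ctype.h>. The helper name count_printable is made up for this example.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper: count printable characters in a plain-char string. */
size_t count_printable(const char *s)
{
    size_t n = 0;

    for (; *s != '\0'; s++)
        if (isprint((unsigned char)*s))   /* never pass a negative value */
            n++;
    return n;
}
```

Without the cast, a character above 127 stored in a signed char would reach isprint() as a negative int, which is outside the range the function is defined for (unless it happens to equal EOF).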

Fixing this (by which I mean doing it *right*, not slapping on a backward-
compatible patch) would involve getting rid of the constant EOF entirely.  Of
course, it's too late to change it now.

Karl W. Z. Heuer (karl@ima.ima.isc.com or harvard!ima!karl), The Walking Lint

jba@harald.ruc.dk (Jan B. Andersen) (02/15/90)

ra@is.uu.no (Robert Andersson) writes:

>I would like to discuss. What I would like is to hear other people's opinions
>on the signed/unsigned char issue.

>The type 'char' as defined in C ranges on some implementations from
>-128 to 127, on others from 0 to 255. The rest of this message assumes
>the compiler has signed chars as a default.

>In an 8-bit set some characters will have values > 127.

Formally speaking, no! As you point out yourself, we're talking about
8-bit character sets. This simply means that all 8 bits are (potentially)
used to represent a character. Of course, one may look upon those bits
as an unsigned integer with values from 0 to 255.

>Is it kosher to store these values in char variables?

I should think so. As long as |char| is guaranteed to hold at least 8 bits,
the compiler should be able to map any character constant into a unique
value before storing it.

>Suppose you do it, then as long as you simply use char variables in
>simple assignments/tests or as buffers, all is probably OK. As soon as
>you use the variable in arithmetic expressions or assignments to other
>types things become more muddy.

>So, the better way to do it might be to change all char variables in
>your program to unsigned char? But that opens another can of worms.
>Many compilers spit out warnings for expressions like:
>	unsigned char *junk = "morejunk";

I'll take your word for it, although I don't see why it should complain.

>lint dislikes things like:
>	unsigned char buff[100];
>	write(fd, buff, 100);

Probably because |buff| was declared as |char *buff| in the header. It
might be more correct to declare it as |void *buff|.

>Opinions?

Just my 0.25 kr.
--
Jan B. Andersen <jba@dat.ruc.dk>             ("SIMULA does it with CLASS")