[net.lang.c] type casting problem

betz@decvax.UUCP (06/09/83)

I have been trying to write a CRC routine in C that is transportable
between machines.  In the process of doing this I have encountered the
need to fetch unsigned characters through a char pointer.  This has
revealed a difference of opinion among the various C compilers that I
have been using.

Example:

	register char *bp;
	register unsigned int csum;

	csum ^= (unsigned) *bp++;

Under VAX-11 C and under Berkeley C, the character fetched through
bp ends up being sign extended in spite of the unsigned type cast.
Under DECUS C and V7 unix C, the character is not sign extended.

Which is correct?  If the character is supposed to be sign extended,
how do I fetch an unsigned character from a char pointer?  (short of
anding the resulting character with 0xFF)  Under some compilers it
is possible to declare a variable as unsigned char, but this is not
a universal feature.  Does anyone have any suggestions on how to make
this code machine independent?

Thanks in advance,

David Betz
decvax!betz

leichter@yale-com.UUCP (06/09/83)

The only guaranteed way I know of to get a REALLY "unsigned" character is not
to do it at all - rather, step back and look at what you are trying to do and
do it directly.  To look at it simply:

	You want to do:

		a = *cp;

	where cp is a (char *) and a is "unsigned".

	What you REALLY mean is:  Set the bottom 8 bits of a to the 8 bits at
	the location pointed to by cp, and clear the rest of a.  The "right"
	code is then:

		a = (*cp) & 0377;

(I went through this while doing something similar to calculating CRC's - I
was computing a hash function.  Various combinations of unsigned's failed to
get the right result, although I eventually found a way to write it using a
temporary that did the job; I don't remember how, though.  Note that, on an
11 or VAX, a good peephole optimizer OUGHT to recognize that it can avoid the
AND; on a VAX, it can do this assignment directly (convert byte to word, or
whatever); on an 11, it's a CLR followed by a BISB.  I don't know which
compilers are likely to find this; DECUS C, unfortunately, will not (no
peephole!).)

BTW:  If you want REAL portability, you are in trouble, since you shouldn't
assume 8-bit bytes.  There is a (small but significant) class of problems, of
which CRC and hashing functions are typical, in which knowing the actual size
of various data objects is essential.  I'd like to see a small but
comprehensive set of "machine constants" defined in some known include file
for just such cases.  Things that should be there are:  bits per byte, int,
etc.; masks for each (tough to compute at compile time since there is no
exponentiation operator); maximum/minimum values in each size object; not to
mention the basic floating point quantities (max precision, etc.).  Of course,
you'd still lose out on non-binary machines, but what can you do...
							-- Jerry
						decvax!yale-comix!leichter
							leichter@yale

mp@mit-eddi.UUCP (Mark Plotnick) (06/10/83)

Well, the compiler SHOULD generate code to widen the char to an int
before converting to an unsigned (according to the K&R book).  The
conversion from char to int is done with a cvtbl on the vax and a movb
on the 11, both of which do sign extension (the vax, pdp11, and most
other machines treat chars as signed).  The conversion from int to
unsigned doesn't require any instructions at all.

So it would appear that the VAX compilers are doing the right thing,
but the V7 compiler isn't. 

Notice that I said "SHOULD" in my first sentence.  The V7 compiler
appears to cheat:  if you cast a char into an int and then cast that
into an unsigned, the value is that of the unsigned char (after the
movb, it'll do a bic of the upper 8 bits of the word). That's right, it
ignores the int cast!  This interesting phenomenon is illustrated by
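the following program:

#include <stdio.h>

main() {
	register char c = 0377;		/* all eight bits set */
	register unsigned int csum;
	int i;

	csum = (unsigned) c;		/* direct cast */
	printf("%u\n", csum);
	csum = (unsigned)((int) c);	/* explicit int cast first */
	printf("%u\n", csum);
	csum = (unsigned)(i=(int)c);	/* int cast forced through a variable */
	printf("%u\n", csum);
}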

You'd THINK that the second and third printf's would both print out the
same thing (actually, you'd expect them ALL to print out the same
thing, and under VAX pcc and VMS C, they do).  With the V7 C compiler
on pdp11's, the output of the program is 255 255 65535.  Sigh.

I don't see a portable solution to your problem. As you said, the V7 C
compiler doesn't know about the unsigned char datatype, and the
conversion rules in K&R don't address how unsigned chars are treated,
anyway.  If anyone has a more up-to-date C reference (book, manual, or
human), please consult them and let us know.

Looks like you'll have to use masking.

	Mark

bukys@rochester.UUCP (06/10/83)

As Dennis Ritchie points out, if the goal is for the code to be
portable to compilers which don't support unsigned chars, use a mask.

If, on the other hand, you just want to write it right, try

	char *cp;
	...
	checksum ^= *(unsigned char *)cp;

which does the cast before it's too late (before the dereference).
So, if the compiler supports unsigned characters, you're in.

Liudvikas Bukys

pdl@root44.UUCP (06/10/83)

The reason you get sign extension is that your `char's should be `unsigned
char'; declaring them that way solves the problem.  The code shown converts a
signed char to a larger unsigned integer, so sign extension MAY occur (note
that `char' may be signed or unsigned at the whim of the compiler writer).

I know of no compiler that disallows `unsigned char' these days,
so why not keep it simple?  (You don't need ANY casts, then!)

		Per ardua ad portability
			Dave Lukes

			...!vax135!ukc!root44!pdl

johnl@ima.UUCP (06/11/83)

ima!johnl    Jun 10 12:28:00 1983

Some more recent C compilers implement an "unsigned char" data type.
Characters are never required to be signed; the manual explicitly says
that their signedness is machine dependent.

So, in any case, the morally correct thing to do to ensure unsigned
characters is to declare your pointer to them to be "unsigned char *".  Of
course, this won't work on half of the C compilers that really exist.  If
you want to be sure, you have to write something like:

	i = *cp++ & 0377;

This is clearly an artifact of C being used mostly on PDP-11s in its
early years.  The 11 is the only machine around (that I know of) that
prefers signed chars, due to sign extension in the "mov" instruction.
Some machines prefer unsigned chars, e.g. the IBM 370 series.  Some swing
either way, e.g. the Vax and the 8086.  I expect that as Unix runs
predominantly on other than 11's and Vaxen, signed characters will wither
away.

John Levine, decvax!yale-co!jrl, ucbvax!cbosgd!ima!johnl,
{research|alice|allegra|floyd|amd70}!ima!johnl

guy@rlgvax.UUCP (06/11/83)

Unfortunately, the V7 PDP-11 C compiler does not support "unsigned char".
The System III PDP-11 C compiler, and the C compiler on all later versions
of USG UNIX (USG UNIX n, for n >= 3.0), do support it.

If you take the term "char" at face value, there is no such thing as a "signed"
or "unsigned" char.  What is the sign of 'q'?  But since C doesn't have a
"veryshort int" (or "veryshort unsigned int") datatype, "char" is overloaded
to mean "one byte int" as well as "character".  I suspect "unsigned char"
was added because on all 11-family machines, "char" is signed, and somebody
wanted a more convenient way of getting an unsigned one byte integer.  On
some machines, the hardware supports both kinds, and can handle unsigned
characters more efficiently than by masking with 0377.  "unsigned char" is
guaranteed not to be signed, but "char" is not guaranteed to be either signed
or unsigned; on the Western Electric 3B machines, "char" is unsigned.

		Guy Harris
		RLG Corporation
		{seismo,mcnc,we13,brl-bmd,allegra}!rlgvax!guy

chris@umcp-cs.UUCP (06/11/83)

If you're willing to pick up a bunch of other cruft you can
#include <sys/param.h> (at least on x.yBSD) and get the constant
NBBY, number of bits per byte.  You also get NBPW, number of bytes
per word.  You unfortunately get all sorts of uninteresting stuff
as well.

As for generating extract-macros, getting masks is easy:

#define	UNSIGNEDCHAR(c)		((c)&((1<<NBBY)-1))

The shift factor is just NBBY:

#define	MIDDLEBYTE(w)		UNSIGNEDCHAR((w)>>NBBY)

How about an include file called <machdep.h>, containing lots of
things like the above, and/or the version of C being run (say
perhaps #define MANY_UNSIGNEDS means the compiler understands
unsigned long and unsigned char)?  Possibly these might just be
macros, or even typedefs as in <sys/types.h>:

	/* <machdep.h> on a Vax running 4BSD */
	typedef unsigned char u_char;
	#define U_CHAR(c)	(c)

	/* <machdep.h> on an 11/45 running V7 */
	typedef char u_char;
	#define U_CHAR(c)	((c)&0377)

and so on.  At least the ``fundamental constants'' (NBBY, NBPW)
should be available someplace other than <sys/param.h>.

			- Chris ({allegra,seismo}!umcp-cs!chris)

ka@spanky.UUCP (06/13/83)

USG UNIX has a <values.h> file which defines things such as the number
of bits per byte.  Of course, that doesn't help you write code which is
supposed to be portable to BSD.
					Kenneth Almquist