betz@decvax.UUCP (06/09/83)
I have been trying to write a CRC routine in C that is transportable between machines. In the process of doing this I have encountered the need to fetch unsigned characters through a char pointer. This has revealed a difference of opinion among the various C compilers that I have been using. Example:

    register char *bp;
    register unsigned int csum;
    csum ^= (unsigned) *bp++;

Under VAX-11 C and under Berkeley C, the character fetched through bp ends up being sign extended in spite of the unsigned type cast. Under DECUS C and V7 unix C, the character is not sign extended. Which is correct? If the character is supposed to be sign extended, how do I fetch an unsigned character from a char pointer (short of anding the resulting character with 0xFF)? Under some compilers it is possible to declare a variable as unsigned char, but this is not a universal feature. Does anyone have any suggestions on how to make this code machine independent?

Thanks in advance,
David Betz
decvax!betz
leichter@yale-com.UUCP (06/09/83)
The only guaranteed way I know of to get a REALLY "unsigned" character is not to do it at all - rather, step back and look at what you are trying to do and do it directly. To look at it simply: You want to do: a = *cp; where cp is a (char *) and a is "unsigned". What you REALLY mean is: Set the bottom 8 bits of a to the 8 bits at the location pointed to by cp, and clear the rest of a. The "right" code is then: a = (*cp) & 0377.

(I went through this while doing something similar to calculating CRC's - I was computing a hash function. Various combinations of unsigned's failed to get the right result, although I eventually found a way to write it using a temporary that did the job; I don't remember how, though. Note that, on an 11 or VAX, a good peephole optimizer OUGHT to recognize that it can avoid the AND; on a VAX, it can do this assignment directly (convert byte to word, or whatever); on an 11, it's a CLR followed by a BISB). I don't know which compilers are likely to find this; DECUS C, unfortunately, will not (no peephole!).)

BTW: If you want REAL portability, you are in trouble, since you shouldn't assume 8-bit bytes. There is a (small but significant) class of problems, of which CRC and hashing functions are typical, in which knowing the actual size of various data objects is essential. I'd like to see a small but comprehensive set of "machine constants" defined in some known include file for just such cases. Things that should be there are: bits per: byte, int, etc.; masks for each (tough to compute at compile time since there is no exponentiation operator); maximum/minimum values in each size object; not to mention the basic floating point quantities (max precision, etc.). Of course, you'd still lose out on non-binary machines, but what can you do...

-- Jerry
decvax!yale-comix!leichter leichter@yale
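The masking approach described in this post can be sketched roughly as below. This is an illustration only: the name fetch_low8 is made up here, `const` assumes a later (ANSI) compiler than the ones under discussion, and the 0377 mask assumes 8-bit bytes, which the post itself warns is not guaranteed.

```c
/* Hypothetical helper (not from the thread): return the byte at cp
 * as a value in 0..255, whether or not plain char is signed on this
 * compiler. Assumes 8-bit bytes, hence the 0377 mask. */
unsigned int fetch_low8(const char *cp)
{
    /* *cp is widened to int first (possibly sign extended);
     * the AND then clears everything above the low 8 bits. */
    return (unsigned int)(*cp & 0377);
}
```

A caller would write something like csum ^= fetch_low8(bp++); in place of the cast in the original question.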
mp@mit-eddi.UUCP (Mark Plotnick) (06/10/83)
Well, the compiler SHOULD generate code to widen the char to an int
before converting to an unsigned (according to the K&R book). The
conversion from char to int is done with a cvtbl on the vax and a movb
on the 11, both of which do sign extension (the vax, pdp11, and most
other machines treat chars as signed). The conversion from int to
unsigned doesn't require any instructions at all.
So it would appear that the VAX compilers are doing the right thing,
but the V7 compiler isn't.
Notice that I said "SHOULD" in my first sentence. The V7 compiler
appears to cheat: if you cast a char into an int and then cast that
into an unsigned, the value is that of the unsigned char (after the
movb, it'll do a bic of the upper 8 bits of the word). That's right, it
ignores the int cast! This interesting phenomenon is illustrated by
the following program:
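    #include <stdio.h>

    int main() {
        register char c = 0377;          /* all 8 bits set */
        register unsigned int csum;
        int i;
        csum = (unsigned) c;             /* char cast straight to unsigned */
        printf("%u\n", csum);
        csum = (unsigned)((int) c);      /* explicit int cast first */
        printf("%u\n", csum);
        csum = (unsigned)(i=(int)c);     /* int cast forced through a variable */
        printf("%u\n", csum);
        return 0;
    }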
You'd THINK that the second and third printf's would both print out the
same thing (actually, you'd expect them ALL to print out the same
thing, and under VAX pcc and VMS C, they do). With the V7 C compiler
on pdp11's, the output of the program is 255 255 65535. Sigh.
I don't see a portable solution to your problem. As you said, the V7 C
compiler doesn't know about the unsigned char datatype, and the
conversion rules in K&R don't address how unsigned chars are treated,
anyway. If anyone has a more up-to-date C reference (book, manual, or
human), please consult them and let us know.
Looks like you'll have to use masking.
Mark
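The conversion chain described in this post (char widened to int, then int converted to unsigned) can be sketched as below. The function names are hypothetical, and `signed char` is used only so that the behaviour does not depend on whether plain char happens to be signed; that keyword comes from later compilers, not the ones discussed here.

```c
/* Hypothetical illustration (not from the thread).
 * Widening a signed char: the char -> int step sign-extends, and the
 * int -> unsigned step keeps the resulting bit pattern, so a negative
 * char becomes a large unsigned value. */
unsigned int widen_signed(signed char c)
{
    return (unsigned int)c;
}

/* The same widening followed by a mask, which throws away the
 * sign-extended upper bits and leaves only the low 8. */
unsigned int widen_masked(signed char c)
{
    return (unsigned int)c & 0377u;
}
```

For a char with all bits set, widen_signed yields the all-ones unsigned value, while widen_masked yields 255, which matches the masking advice at the end of the post.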
bukys@rochester.UUCP (06/10/83)
As Dennis Ritchie points out, if the goal is for the code to be portable to compilers which don't support unsigned chars, use a mask. If, on the other hand, you just want to write it right, try

    char *cp;
    ...
    checksum ^= *(unsigned char *)cp;

which does the cast before it's too late (before the dereference). So, if the compiler supports unsigned characters, you're in.

Liudvikas Bukys
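A minimal sketch of the pointer-cast idea, assuming a compiler that does support unsigned char. The function xor_bytes is hypothetical, and it computes a plain XOR checksum rather than a real CRC, purely to show where the cast goes.

```c
#include <stddef.h>

/* Hypothetical example (not from the thread): XOR together n bytes of
 * buf. The pointer is cast to unsigned char * before the dereference,
 * so each byte is read as 0..UCHAR_MAX and never sign-extended. */
unsigned int xor_bytes(const char *buf, size_t n)
{
    const unsigned char *p = (const unsigned char *)buf;
    unsigned int sum = 0;

    while (n-- > 0)
        sum ^= *p++;
    return sum;
}
```

Casting the pointer once, outside the loop, keeps the loop body free of casts and masks.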
pdl@root44.UUCP (06/10/83)
The reason you get sign extension is that your `char's should be `unsigned char'; this solves the problem. (The code shown converts a signed char to an unsigned larger integer, so sign extension MAY occur; note that `char' may be signed or unsigned at the whim of the compiler writer.) I know of no compiler that disallows `unsigned char' these days, so why not keep it simple (you don't need ANY casts, then!)

Per ardua ad portability
Dave Lukes
...!vax135!ukc!root44!pdl
johnl@ima.UUCP (06/11/83)
#R:decvax:-11200:ima:15900008:000:997 ima!johnl Jun 10 12:28:00 1983

Some more recent C compilers implement an "unsigned char" data type. Characters are never required to be signed, the manual explicitly says that it's machine dependent. So, in any case, the morally correct thing to do to ensure unsigned characters is to declare your pointer to them to be "unsigned char." Of course, this won't work on half of the C compilers that really exist. If you want to be sure, you have to write something like:

    i = *cp++ & 0377;

This is clearly an artifact of C being used mostly on PDP-11s in its early years. The 11 is the only machine around (that I know of) that prefers signed chars, due to sign extension in the "mov" instruction. Some machines prefer unsigned chars, e.g. the IBM 370 series. Some swing either way, e.g. the Vax and the 8086. I expect that as Unix runs predominantly on other than 11's and Vaxen, signed characters will wither away.

John Levine, decvax!yale-co!jrl, ucbvax!cbosgd!ima!johnl, {research|alice|allegra|floyd|amd70}!ima!johnl
guy@rlgvax.UUCP (06/11/83)
Unfortunately, the V7 PDP-11 C compiler does not support "unsigned char". The System III PDP-11 C compiler, and the C compiler on all later versions of USG UNIX (USG UNIX n, for n >= 3.0), do support it.

If you take the term "char" at face value, there is no such thing as a "signed" or "unsigned" char. What is the sign of 'q'? But since C doesn't have a "veryshort int" (or "veryshort unsigned int") datatype, "char" is overloaded to mean "one byte int" as well as "character". I suspect "unsigned char" was added because on all 11-family machines, "char" is signed, and somebody wanted a more convenient way of getting an unsigned one byte integer. On some machines, the hardware supports both kinds, and can handle unsigned characters more efficiently than by masking with 0377. "unsigned char" is guaranteed not to be signed, but "char" is not guaranteed to be either signed or unsigned; on the Western Electric 3B machines, "char" is unsigned.

Guy Harris
RLG Corporation
{seismo,mcnc,we13,brl-bmd,allegra}!rlgvax!guy
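Since plain "char" may go either way, a program can at least ask which way its own compiler went. The sketch below assumes the <limits.h> header from later (ANSI) C, which was not available on the compilers discussed in this thread; the function name is hypothetical.

```c
#include <limits.h>

/* Hypothetical helper (not from the thread): returns 1 if plain char
 * is signed on this implementation, 0 if it is unsigned.
 * CHAR_MIN is 0 when char is unsigned and negative when it is signed. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}
```

On a PDP-11 style compiler this would be expected to return 1, and on a machine like the 3B mentioned above, 0.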
chris@umcp-cs.UUCP (06/11/83)
If you're willing to pick up a bunch of other cruft you can #include <sys/param.h> (at least on x.yBSD) and get the constant NBBY, number of bits per byte. You also get NBPW, number of bytes per word. You unfortunately get all sorts of uninteresting stuff as well.

As for generating extract-macros, getting masks is easy:

    #define UNSIGNEDCHAR(c) ((c)&((1<<NBBY)-1))

The shift factor is just NBBY:

    #define MIDDLEBYTE(w) UNSIGNEDCHAR((w)>>NBBY)

How about an include file called <machdep.h>, containing lots of things like the above, and/or the version of C being run (say perhaps #define MANY_UNSIGNEDS means the compiler understands unsigned long and unsigned char)? Possibly these might just be macros, or even typedefs as in <sys/types.h>:

<machdep.h> on a Vax running 4BSD

    typedef unsigned char u_char;
    #define U_CHAR(c) (c)

<machdep.h> on an 11/45 running V7

    typedef char u_char;
    #define U_CHAR(c) ((c)&0377)

and so on. At least the ``fundamental constants'' (NBBY, NBPW) should be available someplace other than <sys/param.h>.

- Chris ({allegra,seismo}!umcp-cs!chris)
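The two macros from this post can be tried out in a self-contained form roughly as follows. As an assumption made for this sketch, NBBY is taken from CHAR_BIT in the later ANSI header <limits.h> rather than from <sys/param.h>, and second_byte is a made-up wrapper used only to exercise MIDDLEBYTE.

```c
#include <limits.h>

/* Stand-in for the NBBY of <sys/param.h>; defined from CHAR_BIT here
 * so the sketch does not depend on a BSD header (an assumption). */
#ifndef NBBY
#define NBBY CHAR_BIT
#endif

/* Macros as given in the post: mask to one byte, and extract the
 * byte just above the lowest one. */
#define UNSIGNEDCHAR(c) ((c) & ((1 << NBBY) - 1))
#define MIDDLEBYTE(w)   UNSIGNEDCHAR((w) >> NBBY)

/* Hypothetical wrapper: second-lowest byte of w. */
unsigned int second_byte(unsigned int w)
{
    return MIDDLEBYTE(w);
}
```

With 8-bit bytes, second_byte applied to 0x1234 gives 0x12, and UNSIGNEDCHAR applied to a sign-extended -1 gives 255.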
ka@spanky.UUCP (06/13/83)
USG UNIX has a <values.h> file which defines things such as the number of bits per byte. Of course, that doesn't help you write code which is supposed to be portable to BSD.

Kenneth Almquist