[net.micro.pc] Signed Chars - What Foolishness Revisited!

jwg@duke.UUCP (Jeffrey William Gillette) (11/01/86)

[]

A few weeks ago I vented my hostilities on MSC's support (or lack 
thereof) for extended ASCII characters - specifically for their
decision to make type 'char' default to a signed quantity.  I asked
if other compilers defaulted to signed, and what justification existed
for such a policy.  I would like to thank those who were kind enough
to respond to my questions, summarize the arguments as I understand
them, and come back for a rebuttal.

1)	Microsoft C

MSC does, in fact, claim quite explicitly in the library manual that
'isupper', 'islower', etc. are defined only when 'isascii' is true.
Thus, with regard to my original complaint about 'isupper', the 
compiler is not broken, it is simply wrong!  

The MSC "Language Reference" distinguishes two types of character
sets.  The "representable" character set includes all symbols which
are meaningful to the host system.  The "C" character set, a subset
of the former, includes all characters which have meaning to the compiler.
I assume this distinction allows the compiler, e.g., to process strings
containing non-ASCII characters, or to handle quoted non-ASCII 
characters in 'if' or 'case' statements.

It seems to me that any 'isbar' macro *ought* to apply to the full set
of characters which can be represented in the system, not only to those
used by the compiler.  For the PCDOS environment this includes characters
with umlauts, acute and grave accents, etc.  Thus I argue that Microsoft
has made the wrong decision in failing to support the full character
environment of their target system.
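
Until the library changes, a caller can at least keep the argument in
range.  What follows is a minimal sketch, assuming a compiler that
supports 'unsigned char' and a <ctype.h> that accepts every value in the
unsigned char range (MSC's manual, as noted above, promises less).  The
helper name 'is_upper_byte' is hypothetical, not part of any vendor's
library.

```c
#include <ctype.h>

/* Hypothetical helper: classify a byte taken from a plain char string.
 * Converting through unsigned char keeps the argument in the range
 * 0..UCHAR_MAX even when plain char is signed and the byte has its
 * high bit set, so isupper() never sees a negative value. */
int is_upper_byte(char c)
{
    return isupper((unsigned char)c) != 0;
}
```

This only avoids the out-of-range argument.  Whether an accented capital
is then reported as upper case still depends on the library's tables.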

2)	Signed char default

It appears that an accident of history - the architecture of the PDP-11 -
brought about the implementation of 'signed' chars.  Since then there 
appears to be a split between compilers that default to signed chars 
and those that default to unsigned.

The only argument for signed char default appears to be that some old 
PDP and VAX code will break without signed char defaults.  I could say
that this seems to me a better argument for rewriting the faulty code,
but I understand why many implementors do not want to rewrite large
amounts of established utilities.

I would suggest that the proper way to handle portability problems is
that of (believe it or not) the Microsoft 4.0 compiler.  Several of you
called attention to the new command line switch that will default chars
to unsigned.  This seems a relatively painless way to support code that
requires a particular char default.  My bone of contention, however, is that this
scheme is exactly backwards.  Code that uses signed chars will not handle
half of the system's character set, and thus I must deliberately and
consciously choose to set a command line switch every time I compile
a program, or my program will not work acceptably on my system!
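
To make the failure concrete, here is a small sketch of what the default
does to a character from the upper half of the set.  It assumes a
compiler that provides <limits.h> with CHAR_MIN; the function names are
hypothetical and chosen only for illustration.

```c
#include <limits.h>

/* Returns 1 when plain char is signed on this compiler, else 0. */
int plain_char_is_signed(void)
{
    return CHAR_MIN < 0;
}

/* The value a stored character takes when it is widened to int.
 * With a signed default, a high-bit character comes out negative. */
int promoted_value(char c)
{
    return c;
}

/* The same character widened without sign extension: the result is
 * always in the range 0..UCHAR_MAX, whatever the default is. */
int byte_value(char c)
{
    return (unsigned char)c;
}
```

With a signed default, promoted_value() of character 0x81 is negative, so
a comparison such as c == 0x81 quietly fails; byte_value() gives 0x81
under either default.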

3)	What is a 'char' anyway?

Some of you called attention to K&R's discussions of the char type.
K&R definitely present 'char' as system specific.

	a single byte, capable of holding one character in 
	the local character set. (p. 34)

Following this statement is a table which presents the 'char' type
as 8-bit ASCII on the PDP-11, 9-bit ASCII on the Honeywell 6000,
8-bit EBCDIC on the IBM 370, and 8-bit ASCII on the Interdata 8/32.
On the following page is an explanation of character constants and
the differing numerical values associated with '0' in ASCII and EBCDIC.

My point is that K&R clearly sets forth the 'char' type as a logical
quantity which is implementation specific.  They are willing to 
include ASCII and EBCDIC in the definition, and, I assume, any other
arbitrary representation scheme that will fit into "a single byte".
By this definition, any code that depends on the mathematical properties 
of characters (e.g. that, in ASCII, A-Z and a-z are contiguous) is 
inherently non-portable!
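
As an illustration, here is a sketch of a letter lookup that does not
depend on the codes being consecutive.  In EBCDIC the letters fall into
three separate runs ('I' is 0xC9 but 'J' is 0xD1), so the usual c - 'A'
gives 16 for 'J' rather than 9.  The helper name 'upper_index' is
hypothetical.

```c
#include <string.h>

/* Hypothetical helper: position of an upper-case letter in the
 * alphabet (0 for 'A' ... 25 for 'Z'), or -1 if c is not one of
 * those letters.  The character is looked up in a string instead of
 * computing c - 'A', so no assumption is made about its code. */
int upper_index(int c)
{
    static const char letters[] = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    const char *p;

    if (c == '\0')
        return -1;              /* strchr would match the terminator */
    p = strchr(letters, c);
    return p ? (int)(p - letters) : -1;
}
```

The lookup is slower than a subtraction, which is the usual price of not
assuming a particular character code.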

4)	What difference does it make?

None - if we want to continue to insist that English is the official
language of C and UNIX!  There is, however, a market of people who 
want to sed with ninyas or awk with cedillas.  There may, in fact,
be a system just around the corner for users who want to diff in
Kanji!  Unfortunately all of these are out of luck, since the
aforementioned code only works with 7-bit characters.  At this point in
time I am still trying to explain to my colleagues in the Humanities
Computing Lab why their new $10,000 Apollo supermicro can't display
a simple umlaut!

I guess the point of this rave should be summarized.  Now that hardware
no longer restricts us to 7-bit character sets, isn't it time we see
*forward* compatible compilers that default to the native character
set of their host system, and isn't it time we start writing (or 
rewriting) portable UNIX code that will work on systems whether 
characters display in ASCII, EBCDIC, Swedish, or Amharic!


Jeffrey William Gillette		uucp: mcnc!ethos!ducall!jeff
Humanities Computing Facility		bitnet: DYBBUK @ TUCCVM
Duke University

metro@asi.UUCP (Metro T. Sauper) (11/05/86)

I would just like to point out that two separate issues are being argued
here.

1.  Should characters be signed or unsigned by default?

2.  Should the character type macros/subroutines support all possible
    values of type char?

The first question is compiler related, the second is library related.

My own preferences follow:

1.  Since the "c" language has an "unsigned" modifier, and not a "signed"
    modifier, I would much rather have a signed character by default and
    be able to define it to be "unsigned char" if needs be.

2.  The ctype routines are trivial at best, and with all the effort put
    to arguing which way they should work, you could have rewritten them
    to do whatever you would like them to do.

Metro T. Sauper, Jr.
..!ihnp4!ll1!bpa!asi!metro

henry@utzoo.UUCP (Henry Spencer) (11/06/86)

> It appears that an accident of history - the architecture of the PDP-11 -
> brought about the implementation of 'signed' chars...

This is correct.

> The only argument for signed char default appears to be that some old 
> PDP and VAX code will break without signed char defaults...

No, sorry, this is wrong.  There are many other machines on which char
is substantially more efficient when it is considered signed than when
it is considered unsigned.  Consigning the PDP11 and the VAX to history
(a dubious decision in itself) does not remove the problem.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (11/07/86)

> 1.  Since the "c" language has an "unsigned" modifier, and not a "signed"
>     modifier, I would much rather have a signed character by default and
>     be able to define it to be "unsigned char" if needs be.

Would you still feel this way if all manipulations of signed char took
three times as long as those of unsigned char?  It can happen.

All members of this debate please attend to the following.

- There exist machines (e.g. pdp11) on which unsigned chars are a lot less
	efficient than signed chars.

- There exist machines (e.g. ibm370) on which signed chars are a lot less
	efficient than unsigned chars.

- Many applications do not care whether the chars are signed or unsigned,
	so long as they can be twiddled efficiently.

- For this reason, char is intended to be the more efficient of the two.

- Many old programs assume that char is signed; this does not make it so.
	Those programs are wrong, and have been all along.  Alas, this is
	not a comfort if you have to run them.

- The Father, the Son, and the Holy Ghost (K&R, H&S, and X3J11 resp.) all
	agree that a character of the machine's normal character set MUST
	appear positive.  Given that the IBM PC has, I understand, a full
	8-bit character set, this means that a PC compiler which treats
	char as signed is wrong, period.  This should be documented as, at
	the very least, a deviation from K&R.

- The "unsigned char" type exists (in most newer compilers) because there
	are a number of situations where sign extension is very awkward.
	For example, getchar() wants to do a non-sign-extended conversion
	from char to int.

- X3J11, in its semi-infinite wisdom, has decided that it would be nice to
	have a signed counterpart to "unsigned char", to wit "signed char".
	Therefore it is reasonable to expect that most new compilers, and
	old ones brought into conformance with the yet-to-be-issued standard,
	will give you the full choice:  signed char if you need signs,
	unsigned char if you need everything positive, and char if you don't
	care but want it to run fast.

- Given that many compilers have not yet been upgraded to match even the
	current X3J11 drafts, much less the final end product (which doesn't
	exist yet), any application which cares about signedness should use
	typedefs or macros for its char types, so that the definitions can
	be revised later.

- The only things you can safely put into a char variable, and depend on
	having them come out unchanged, are characters from the native
	character set and small *positive* integers.
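
The typedef and sign-extension points above can be sketched roughly as
follows.  The type names (ubyte, sbyte, tchar) and both functions are
hypothetical, and the sbyte line assumes a compiler that already accepts
"signed char"; on one that does not, that typedef is the single line to
change (to short, say).

```c
/* Hypothetical application-wide character types.  Only these three
 * lines need revisiting when the compiler changes. */
typedef unsigned char ubyte;    /* everything positive */
typedef signed char   sbyte;    /* needs signs; assumes the compiler
                                   accepts "signed char" */
typedef char          tchar;    /* don't care, want it fast */

/* Convert a stored character to int without sign extension, the way
 * getchar() hands characters back: the result is never negative. */
int char_to_int(tchar c)
{
    return (ubyte)c;
}

/* Count the characters in s that lie outside 7-bit ASCII. */
int count_high_bit(const tchar *s)
{
    int n = 0;

    for (; *s != '\0'; s++)
        if (char_to_int(*s) > 127)
            n++;
    return n;
}
```

Code written this way behaves the same whether the compiler's plain char
is signed or unsigned.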
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (11/10/86)

> - The Father, the Son, and the Holy Ghost (K&R, H&S, and X3J11 resp.) all
> 	agree that a character of the machine's normal character set MUST
> 	appear positive...

It turns out that I have to amend this slightly.  The Father and the Son
are indeed in agreement on this.  The Holy Ghost has chickened out and
watered down this restriction, however:  it only says that the characters
in the "source character set" (roughly, those one uses to write C) must
look positive.  Thus an 8088 C which makes normal ASCII look positive but
lets the "upper-bit" characters look negative is technically legitimate.
Grr.  ("Grr" not just because I goofed, but because I don't like the change.)
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,decvax,pyramid}!utzoo!henry