jwg@duke.UUCP (Jeffrey William Gillette) (11/01/86)
A few weeks ago I vented my hostilities on MSC's support (or lack
thereof) for extended ASCII characters - specifically for their
decision to make type 'char' default to a signed quantity.  I asked
whether other compilers defaulted to signed, and what justification
existed for such a policy.  I would like to thank those who were kind
enough to respond to my questions, summarize the arguments as I
understand them, and come back for a rebuttal.

1) Microsoft C

MSC does, in fact, claim quite explicitly in the library manual that
'isupper', 'islower', etc. are defined only when 'isascii' is true.
Thus, with regard to my original complaint about 'isupper', the
compiler is not broken, it is simply wrong!

The MSC "Language Reference" distinguishes two types of character
sets.  The "representable" character set includes all symbols which
are meaningful to the host system.  The "C" character set, a subset
of the former, includes all characters which have meaning to the
compiler.  I assume this distinction allows, e.g., the compiler to
process strings containing non-ASCII characters, or to handle quoted
non-ASCII characters in 'if' or 'case' statements.

It seems to me that any 'isbar' macro *ought* to apply to the full
set of characters which can be represented in the system, not only to
those used by the compiler.  For the PCDOS environment this includes
characters with umlauts, acute and grave accents, etc.  Thus I argue
that Microsoft has made the wrong decision in failing to support the
full character environment of their target system.

2) Signed char default

It appears that an accident of history - the architecture of the
PDP-11 - brought about the implementation of 'signed' chars.  Since
then there appears to be a split between compilers that default to
signed chars and those that default to unsigned.  The only argument
for signed char default appears to be that some old PDP and VAX code
will break without signed char defaults.  I could say that this seems
to me a better argument for rewriting the faulty code, but I
understand why many implementors do not want to rewrite large amounts
of established utilities.

I would suggest that the proper way to handle such portability
problems is that of (believe it or not) the Microsoft 4.0 compiler.
Several of you called attention to its new command line switch that
defaults chars to unsigned.  This seems a relatively painless way to
support code that requires unsigned chars.  My bone of contention,
however, is that this scheme is exactly backwards.  Code that uses
signed chars will not handle half of the system's character set, and
thus I must deliberately and consciously set a command line switch
every time I compile a program, or my program will not work
acceptably on my own system!  The sketch below shows the failure
mode.
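
To make the complaint concrete, here is a minimal sketch.  It assumes
the IBM PC character set, where 0x84 is a lower-case 'a' with an
umlaut; the code is mine rather than anything from MSC's manuals:

    #include <stdio.h>
    #include <ctype.h>

    int main(void)
    {
        char c = '\204';    /* 0x84: 'a'-umlaut on the IBM PC */

        if (c < 0)
            printf("plain char is signed: c == %d\n", c);
        else
            printf("plain char is unsigned: c == %d\n", c);

        /* Where plain char is signed, c is negative, and handing a
           negative value to the ctype macros can index their lookup
           table out of bounds.  The defensive idiom is to force the
           argument into the 0..255 range first:                     */
        printf("isupper: %d\n", isupper((unsigned char)c));

        return 0;
    }

Under the signed-char default the first branch fires, and every
classification macro has to be guarded with that cast.
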
3) What is a 'char' anyway?

Some of you called attention to K&R's discussion of the char type.
K&R definitely present 'char' as system specific: "a single byte,
capable of holding one character in the local character set" (p. 34).
Following this statement is a table which presents the 'char' type as
8-bit ASCII on the PDP-11, 9-bit ASCII on the Honeywell 6000, 8-bit
EBCDIC on the IBM 370, and 8-bit ASCII on the Interdata 8/32.  On the
following page is an explanation of character constants and the
differing numerical values associated with '0' in ASCII and EBCDIC.

My point is that K&R clearly set forth the 'char' type as a logical
quantity which is implementation specific.  They are willing to
include ASCII and EBCDIC in the definition, and, I assume, any other
arbitrary representation scheme that will fit into "a single byte".
By this definition, any code that depends on the mathematical
properties of characters (e.g. that, in ASCII, A-Z and a-z are
contiguous) is inherently non-portable!  The sketch below shows the
kind of code this rules out.
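
As a concrete (and hedged) illustration - the function names here are
my own, not anything from the thread:

    #include <ctype.h>

    /* Non-portable: assumes the lower-case letters are contiguous,
       as they are in ASCII.  In EBCDIC the alphabet falls into three
       separated runs, so this test also accepts some non-letters
       that lie in the gaps between runs.                           */
    int is_lower_ascii_only(int c)
    {
        return c >= 'a' && c <= 'z';
    }

    /* Portable: defer to the implementation's character set. */
    int is_lower_portable(int c)
    {
        return islower(c);
    }

Under K&R's definition nothing guarantees any particular ordering at
all, so the ctype macros are the only portable spelling of such tests.
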
4) What difference does it make?

None - if we want to continue to insist that English is the official
language of C and UNIX!  There is, however, a market of people who
want to sed with ninyas or awk with cedillas.  There may, in fact, be
a system just around the corner for users who want to diff in Kanji!
Unfortunately all of these users are out of luck, since the
aforementioned code works only with 7-bit characters.  At this point
I am still trying to explain to my colleagues in the Humanities
Computing Lab why their new $10,000 Apollo supermicro can't display a
simple umlaut!

I guess the point of this rave should be summarized.  Now that
hardware no longer restricts us to 7-bit character sets, isn't it
time we saw *forward* compatible compilers that default to the native
character set of their host system, and isn't it time we started
writing (or rewriting) portable UNIX code that will work whether
characters display in ASCII, EBCDIC, Swedish, or Amharic!
-- 
Jeffrey William Gillette        uucp:   mcnc!ethos!ducall!jeff
Humanities Computing Facility   bitnet: DYBBUK @ TUCCVM
Duke University

metro@asi.UUCP (Metro T. Sauper) (11/05/86)
I would just like to point out that there are actually two separate
issues being argued here:

1. Should characters be signed or unsigned by default?
2. Should the character type macros/subroutines support all possible
   values of type char?

The first question is compiler related, the second is library
related.  My own preferences follow:

1. Since the "C" language has an "unsigned" modifier, and not a
   "signed" modifier, I would much rather have a signed character by
   default and be able to define it to be "unsigned char" if need be.

2. The ctype routines are trivial at best; with all the effort put
   into arguing which way they should work, you could have rewritten
   them to do whatever you would like them to do.  (A sketch of one
   table-driven approach follows.)
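
Taking up that challenge, a minimal table-driven sketch - the names
and the two code-page entries are illustrative assumptions, not any
existing library:

    /* Classification table indexed by the full 0..255 range, so
       extended characters can be classified too.                 */

    #define UC 0x01                 /* upper-case letter */
    #define LC 0x02                 /* lower-case letter */

    static unsigned char my_ctype[256];

    void init_ctype(void)           /* call once at startup */
    {
        int c;

        /* ASCII host assumed for the plain letters. */
        for (c = 'A'; c <= 'Z'; c++)  my_ctype[c] |= UC;
        for (c = 'a'; c <= 'z'; c++)  my_ctype[c] |= LC;

        /* IBM PC code page assumed for the extended ones. */
        my_ctype[0x84] |= LC;       /* 'a'-umlaut */
        my_ctype[0x8E] |= UC;       /* 'A'-umlaut */
        /* ...and so on for the rest of the extended set. */
    }

    #define my_isupper(c) (my_ctype[(unsigned char)(c)] & UC)
    #define my_islower(c) (my_ctype[(unsigned char)(c)] & LC)

The (unsigned char) cast in the macros is what keeps them safe even
where plain char is signed.
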
Metro T. Sauper, Jr.            ..!ihnp4!ll1!bpa!asi!metro

henry@utzoo.UUCP (Henry Spencer) (11/06/86)
> It appears that an accident of history - the architecture of the
> PDP-11 - brought about the implementation of 'signed' chars...

This is correct.

> The only argument for signed char default appears to be that some
> old PDP and VAX code will break without signed char defaults...

No, sorry, this is wrong.  There are many other machines on which
char is substantially more efficient when it is considered signed
than when it is considered unsigned.  Consigning the PDP11 and the
VAX to history (a dubious decision in itself) does not remove the
problem.
-- 
Henry Spencer @ U of Toronto Zoology   {allegra,ihnp4,decvax,pyramid}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (11/07/86)
> 1. Since the "C" language has an "unsigned" modifier, and not a
>    "signed" modifier, I would much rather have a signed character
>    by default and be able to define it to be "unsigned char" if
>    need be.

Would you still feel this way if all manipulations of signed char
took three times as long as those of unsigned char?  It can happen.

All members of this debate please attend to the following:

- There exist machines (e.g. pdp11) on which unsigned chars are a lot
  less efficient than signed chars.

- There exist machines (e.g. ibm370) on which signed chars are a lot
  less efficient than unsigned chars.

- Many applications do not care whether the chars are signed or
  unsigned, so long as they can be twiddled efficiently.

- For this reason, char is intended to be the more efficient of the
  two.

- Many old programs assume that char is signed; this does not make it
  so.  Those programs are wrong, and have been all along.  Alas, this
  is not a comfort if you have to run them.

- The Father, the Son, and the Holy Ghost (K&R, H&S, and X3J11
  respectively) all agree that a character of the machine's normal
  character set MUST appear positive.  Given that the IBM PC has, I
  understand, a full 8-bit character set, this means that a PC
  compiler which treats char as signed is wrong, period.  This should
  be documented as, at the very least, a deviation from K&R.

- The "unsigned char" type exists (in most newer compilers) because
  there are a number of situations where sign extension is very
  awkward.  For example, getchar() wants to do a non-sign-extended
  conversion from char to int.

- X3J11, in its semi-infinite wisdom, has decided that it would be
  nice to have a signed counterpart to "unsigned char", to wit
  "signed char".  Therefore it is reasonable to expect that most new
  compilers, and old ones brought into conformance with the
  yet-to-be-issued standard, will give you the full choice: signed
  char if you need signs, unsigned char if you need everything
  positive, and char if you don't care but want it to run fast.

- Given that many compilers have not yet been upgraded to match even
  the current X3J11 drafts, much less the final end product (which
  doesn't exist yet), any application which cares about signedness
  should use typedefs or macros for its char types, so that the
  definitions can be revised later.  (A sketch follows this list.)

- The only things you can safely put into a char variable, and depend
  on having them come out unchanged, are characters from the native
  character set and small *positive* integers.
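
A minimal sketch of both the typedef advice and the getchar() point;
the typedef name is my own:

    #include <stdio.h>

    /* An application that cares about signedness should hide the
       choice behind a name it controls, revisable per compiler.  */
    typedef unsigned char uchar;

    int main(void)
    {
        int c;  /* NOT char: getchar() returns each character as a
                   non-negative int, with EOF outside that range.
                   Stored into a char, EOF either becomes
                   indistinguishable from a real character or some
                   real 8-bit character looks like EOF.            */

        while ((c = getchar()) != EOF)
            putchar(c);
        return 0;
    }
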
-- 
Henry Spencer @ U of Toronto Zoology   {allegra,ihnp4,decvax,pyramid}!utzoo!henry

henry@utzoo.UUCP (Henry Spencer) (11/10/86)
> - The Father, the Son, and the Holy Ghost (K&R, H&S, and X3J11
>   respectively) all agree that a character of the machine's normal
>   character set MUST appear positive...

It turns out that I have to amend this slightly.  The Father and the
Son are indeed in agreement on this.  The Holy Ghost has chickened
out and watered down this restriction, however: it only says that the
characters in the "source character set" (roughly, those one uses to
write C) must look positive.  Thus an 8088 C which makes normal ASCII
look positive but lets the "upper-bit" characters look negative is
technically legitimate.  Grr.  ("Grr" not just because I goofed, but
because I don't like the change.)
-- 
Henry Spencer @ U of Toronto Zoology   {allegra,ihnp4,decvax,pyramid}!utzoo!henry