gnu@hoptoad.uucp (John Gilmore) (12/02/86)
[This is posted to comp.lang.c because mod.std.c seems to be dead.  Love
those mod groups!]

While considering my point of view on trigraphs, Laura Creighton pointed
out that the problem is that Europeans really need more than a 7-bit
character set.

In that vein, one possible change to the ANSI standard would require
"char" to be unsigned.  This would double the number of characters that
a strictly conforming program could easily handle, and European Unix
systems could use an 8-bit character set in which the first 128
characters were USASCII.  I believe that the various Unix
internationalization efforts are already working in this direction.

No strictly conforming programs would be broken by this change, since a
strictly conforming program cannot assume whether char is signed or
unsigned; in fact, it would make MORE programs strictly conforming,
since programs that assume char is unsigned would now conform.  In an
8-bit character set, all the ANSI punctuation as well as all the
national characters could be supported without kludges.
-- 
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu  jgilmore@lll-crg.arpa
Call +1 800 854 7179 or +1 714 540 9870 and order X3.159-198x (ANSI C)
for $65.  Then spend two weeks reading it and weeping.  THEN send in
formal comments!
lamy@ai.toronto.edu (12/02/86)
ISO Latin 1 is an 8-bit character set that is a superset of ASCII.
Portability, then, is a matter of having standard transliteration rules,
e.g. c-cedilla --> c, a-ring --> aa.

But I sincerely doubt that code with native identifiers would ever make
it to public distribution.  Such code is usually commented in English
(or an approximation thereof), with English identifiers (i, j, k, x, y,
z, p, c, s :-).

Jean-Francois Lamy    (one day, I may have all the characters I need
                       to type my name :-( )
AI Group, Dept of Computer Science, University of Toronto
Toronto, ON, Canada M5S 1A4
CSNet: lamy@ai.toronto.edu   EAN: lamy@ai.toronto.cdn   UUCP: lamy@utai.uucp
ahe@k.cc.purdue.edu (Bill Wolfe) (12/03/86)
In article <1382@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> While considering my point of view on trigraphs, Laura Creighton pointed
> out that the problem is that Europeans really need more than a 7-bit
> character set.
>
> In that vein, one possible change to the ANSI standard would require
> "char" to be unsigned.  This would double the number of characters
> that a strictly conforming program could easily handle, and European
> Unix systems could use an 8-bit character set in which the first 128
> characters were USASCII.  I believe that the various Unix
> internationalization efforts are already working in this direction.

Actually, as mentioned in Byte magazine about 9-10 months ago, ANSI is
in the process of soliciting comments regarding its proposed 8-bit ASCII
standard, which does contain 7-bit ASCII as its first 128 characters and
includes all the European characters in the upper 128... check the
Letters section of Byte, around February 1986 or so, for the exact
positions of the various characters in the proposed standard...

Bill Wolfe (ahe!k.cc.purdue.edu...)
Purdue University Computing Center
bandy@lll-crg.ARpA (Andrew Scott Beals) (12/03/86)
In article <1382@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>[This is posted to comp.lang.c because mod.std.c seems to be dead.  Love
>those mod groups!]
>[We really need an 8-bit character set and C needs to acknowledge this]
>[John thinks that perhaps "char" should be unsigned - most programs
> would be more correct since they assume that chars are unsigned]

Well, I heartily agree, but I think that there must be some programs out
there that assume that chars are useful as small signed numbers, which I
would also prefer not to break.

Also, I think that having chars have different semantics (assumed
unsigned rather than signed like int) would be a bad thing in general.
Perhaps what is needed is a "tiny" type (a la long and short) that would
be signed and (for now) essentially a signed char.

Of course, this brings in yet another type (oh no!) and yet another
reserved word, but it would make programs nicer.

andy
-- 
Andrew Scott Beals (member of HASA - A and S divisions)
bandy@lll-crg.arpa  {ihnp4,seismo,ll-xn,ptsfa,pyramid}!lll-crg!bandy
LLNL, P.O. Box 808, Mailstop L-419, Livermore CA 94550  (415) 423-1948
Primates who don't have tails should keep cats who don't have tails.
sjl@ukc.ac.uk (S.J.Leviseur) (12/04/86)
In article <8322@lll-crg.ARpA> bandy@lll-crg.UUCP (Andrew Scott Beals) writes:
>In article <1382@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>>[This is posted to comp.lang.c because mod.std.c seems to be dead.  Love
>>those mod groups!]
>>[We really need an 8-bit character set and C needs to acknowledge this]
>>[John thinks that perhaps "char" should be unsigned - most programs
>> would be more correct since they assume that chars are unsigned]
>
>Well, I heartily agree, but I think that there must be some programs
>out there that assume that chars are useful as small signed numbers,
>which I would also prefer not to break.

Actually this breaks *LOTS* of programs.  We have a compiler with
unsigned chars on some of our machines.  This causes endless problems
with what is 'affectionately' known as the "EOF bug".  The implementors
of that compiler said it was the first thing they would alter if they
reimplemented it, because of the number of problems it caused.

If you want to see for yourself, have a look through your sources and
find every occurrence of a comparison between EOF or -1 and a char.
Typically, where cp is a character pointer:

	if (*cp == EOF)
or
	while (*cp != EOF)

Older code is littered with these constructs.

sean
joemu@nscpdc.NSC.COM (Joe Mueller) (12/04/86)
> >[We really need an 8-bit character set and C needs to acknowledge this]
> >[John thinks that perhaps "char" should be unsigned - most programs
> > would be more correct since they assume that chars are unsigned]
>
> Well, I heartily agree, but I think that there must be some programs
> out there that assume that chars are useful as small signed numbers,
> which I would also prefer not to break.
>
> Also, I think that having chars have different semantics (assumed
> unsigned rather than signed like int) would be a bad thing in general.
> Perhaps what is needed is a "tiny" type (a la long and short) that
> would be signed and (for now) essentially a signed char.
>
> Of course, this brings in yet another type (oh no!) and yet another
> reserved word, but it would make programs nicer.
> andy

Andy,

The standard does allow for a small signed char type called (would you
believe) "signed char".  From section 3.1.2.5 of the draft dated
Oct. 1, 1986:

	A signed char occupies the same amount of storage as a
	"plain" char.  A "plain" int has the natural size suggested
	by the architecture of the execution environment ...

The committee wanted to "fix" the question of signedness of a char but
couldn't arrive at an acceptable compromise.  We thought about having
chars be signed and unsigned chars unsigned, but we were afraid it
would break too much code that depended on chars being unsigned.  We
ended up adopting the compromise of:

	char - signed or unsigned, implementation defined
	unsigned char
	signed char

By the way, the draft is now released for formal public review, so if
you have any other technical comments, fire away now or it will be too
late!

a humble member of X3J11,
Joe Mueller
...!nsc!nscpdc!joemu
karl@haddock.UUCP (Karl Heuer) (12/05/86)
In article <8322@lll-crg.ARpA> bandy@lll-crg.UUCP (Andrew Scott Beals) writes:
>In article <1382@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>>[We really need an 8-bit character set and C needs to acknowledge this]
>>[John thinks that perhaps "char" should be unsigned - most programs
>> would be more correct since they assume that chars are unsigned]
>
>Well, I heartily agree, but I think that there must be some programs
>out there that assume that chars are useful as small signed numbers,
>which I would also prefer not to break.

They are already broken.

>Also, I think that having chars have different semantics (assumed
>unsigned rather than signed like int) would be a bad thing in general.
>Perhaps what is needed is a "tiny" type (a la long and short) that
>would be signed and (for now) essentially a signed char.

ANSI already has it, and it's called "signed char".

Back to the previous question.  I think the original reason for leaving
the signedness of plain "char" unspecified is still valid: a program
that deals with 7-bit USASCII doesn't care whether char is signed or
not, and it's nice to have the compiler use the more efficient mode
(which is signed, on the pdp11).  However, K&R and X3J11 both state
that all members of the normal character set are positive; I would
interpret this to mean that international implementations with
non-USASCII printing characters must make "char" an unsigned type*.
(I think some implementors disagree, so the point needs to be clarified
by X3J11.)

As for the trigraphs (I have to justify continuing to cross-post this),
maybe they're necessary for antique card-punches?  I got the impression
that they were necessary because ANSI didn't want to bind the language
to ASCII, so they only insisted on those characters in some ANSI
standard character set.  (I wonder if there are any programs that will
break because they have "??" in the middle of a string someplace...)

*Or, a signed type with nine or more bits.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
ahe@k.cc.purdue.edu (Bill Wolfe) (12/05/86)
In article <1637@k.cc.purdue.edu>, ahe@k.cc.purdue.edu (Bill Wolfe) writes:
> Actually, as mentioned in Byte magazine about 9-10 months ago, ANSI is
> in the process of soliciting comments regarding its proposed 8-bit ASCII
> standard, which does contain 7-bit ASCII as its first 128 characters, and
> includes all the European characters in the upper 128... check the Letters
> section of Byte, around February 1986 or so for the exact positions of the
> various characters in the proposed standard...
>
> Bill Wolfe (ahe!k.cc.purdue.edu...)
> Purdue University Computing Center

Make that the August or September 1985 issues...
stuart@bms-at.UUCP (Stuart D. Gathman) (12/08/86)
In article <2221@eagle.ukc.ac.uk>, sjl@ukc.ac.uk (S.J.Leviseur) writes:
> >>[John thinks that perhaps "char" should be unsigned - most programs
> >> would be more correct since they assume that chars are unsigned]
> Actually this breaks *LOTS* of programs.  We have a compiler with
> unsigned chars on some of our machines.  This causes endless problems
> with what is 'affectionately' known as the "EOF bug".
> If you want to see for yourself have a look through your sources and
> find every occurrence of a comparison between EOF or -1 and a char.
> Typically, where cp is a character pointer:
>	if (*cp == EOF)
>	while (*cp != EOF)

Not in our code!  This type of code is not likely to work, even under
K & R.  ANSI is only trying not to break *legal* programs.  The above is
essentially trying to use 255 (or whatever) instead of 0 as a string
terminator.  Even if there were a legitimate reason for this, EOF is the
wrong name to use, since it is _already defined as a return value of
stdio functions_.

This code was broken already.

It's too bad that type checking doesn't catch this sort of thing.  I
wish there were a way to define an enum type that is either a char or
EOF, and declare stdio functions to return that type.  Then if only enum
weren't so loose about converting to other types without a cast.  Sigh.

P.S. Are you trying to tell me that official unix utilities are written
like that?
-- 
Stuart D. Gathman	<..!seismo!dgis!bms-at!stuart>
levy@ttrdc.UUCP (Daniel R. Levy) (12/08/86)
In article <2221@eagle.ukc.ac.uk>, sjl@ukc.UUCP writes:
>Actually this breaks *LOTS* of programs.  We have a compiler with
>unsigned chars on some of our machines.  This causes endless problems
>with what is 'affectionately' known as the "EOF bug".
>The implementors of that compiler said it was the first thing they
>would alter if they reimplemented it because of the number of problems
>it caused.
>If you want to see for yourself have a look through your sources and
>find every occurrence of a comparison between EOF or -1 and a char.
>Typically, where cp is a character pointer:
>	if (*cp == EOF)
>or
>	while (*cp != EOF)
>Older code is littered with these constructs.
>	sean

Ugh.  "Littered" is the right term.  Not only is this nonportable, it
will be FOOLED on systems where it normally works (signed char, 2's
complement representation) given the character '\377'.  So if you
getchar(), say, upon an arbitrary binary file and you are looking for
EOF, you are likely scr*wed with this kind of code.  EOF is meant to be
an out-of-band value for things like getchar() etc.  That's why they
return int, and not char.

(Lint should warn about this kind of comparison.  I have learned the
slow, hard way that when I get C code from elsewhere, yes even the
mighty BTL, I lint it first and fix the warnings before compiling on a
system other than the one from whence it came!)

Dan
sjl@ukc.ac.uk (S.J.Leviseur) (12/09/86)
In article <300@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
>In article <2221@eagle.ukc.ac.uk>, sjl@ukc.ac.uk (S.J.Leviseur) writes:
>....
>> If you want to see for yourself have a look through your sources and
>> find every occurrence of a comparison between EOF or -1 and a char.
>> Typically, where cp is a character pointer:
>>	if (*cp == EOF)
>>	while (*cp != EOF)
>
>Not in our code!  This type of code is not likely to work, even under
>K & R.

It will work on any machine that allows signed chars (despite being
ideologically unsound!).

>ANSI is only trying not to break *legal* programs.  The above
>essentially is trying to use 255 (or whatever) instead of 0 as a string
>terminator.  Even if there was a legitimate reason for this, EOF is the
>wrong name to use since it is _already defined as a return value of
>stdio functions_.

The case I was thinking of here is reading on a pipe.  This seems to be
popular.  The use of EOF is valid in this context.  Another favorite is
assigning the result of getchar to a char and then testing to see if the
char is -1.  There are others...

>This code was broken already.
>
>It's too bad that type checking doesn't catch this sort of thing.  I
>wish there were a way to define an enum type that is either a char or
>EOF, and declare stdio functions to return that type.  Then if only
>enum weren't so loose about converting to other types without a cast.
>Sigh.
>
>P.S. Are you trying to tell me that official unix utilities are written
>like that?

Yes, worse luck :-(
jc@piaget.UUCP (John Cornelius) (12/13/86)
In article <2246@eagle.ukc.ac.uk> sjl@ukc.ac.uk (S.J.Leviseur) writes:
>In article <300@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
>>In article <2221@eagle.ukc.ac.uk>, sjl@ukc.ac.uk (S.J.Leviseur) writes:
>....
>>> If you want to see for yourself have a look through your sources and
>>> find every occurrence of a comparison between EOF or -1 and a char.
>>> Typically, where cp is a character pointer:
>>>	if (*cp == EOF)
>>>	while (*cp != EOF)
>>
>>Not in our code!  This type of code is not likely to work, even under
>>K & R.
>
>It will work on any machine that allows signed chars (despite being
>ideologically unsound!)

I believe that the 3B2, to pick an example, places char in the high
order byte of the register.  If you test one for equality with (int) -1
you will never pass the test.

As for small integers, Whitesmiths had a convention where a signed char
was typedef'ed as "tiny" (in this case signed).  Because of the
different architectures we're seeing in the Unix/C environment, I laud
the effort to create a standard that is architecture independent.

As for the above construct working on any machine with signed char, I
doubt that it will work on the 3B2.
-- 
John Cornelius
(...!sdcsvax!piaget!jc)
guy@sun.uucp (Guy Harris) (12/14/86)
> I believe that the 3B2, to pick an example, places char in the high order
> byte of the register.

You may believe that, but it's not the case.  It's also completely
irrelevant; the C language states quite precisely what happens in the
case cited, both in the case where "char" is signed and in the case
where "char" is unsigned.  The implementation can put the character into
any byte it wants, as long as it behaves the way K&R, or the ANSI C
spec, say it should.

> If you test one for equality with (int) -1 you will never pass the test.

That's because characters are UNsigned on the 3B2.

> As for the above construct working on any machine with signed char, I
> doubt that it will work on the 3B2.

See above.  The construct doesn't work on the 3B2 because it *isn't* a
machine with signed char.  The above construct *will* "work", in some
sense, on any C implementation with signed "char".  It *still* won't
work correctly; on a machine with signed "char" and 8-bit bytes, a byte
with the value 0xff will be sign-extended to -1, and thus will compare
equal to EOF even though it's a perfectly legitimate datum.
-- 
Guy Harris
{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
guy@sun.com (or guy@sun.arpa)
ron@brl-sem.ARPA (Ron Natalie <ron>) (12/14/86)
Correct me if I am wrong, but there is no place in STDIO where EOF is
ever meant to be applied to a character.  The only thing that EOF is
defined for is functions returning int.

Second, nowhere is it stated that (unsigned) -1 will give you a word of
all ones.  Be careful when making this assumption.  I spent a lot of
time fixing up the Berkeley network code because of this.

-Ron
ballou@brahms (Kenneth R. Ballou) (12/15/86)
In article <518@brl-sem.ARPA> ron@brl-sem.ARPA (Ron Natalie <ron>) writes:
> ... nowhere is it stated that (unsigned) -1 will give you a word of
>all ones.  Be careful when making this assumption.  I spent a lot of
>time fixing up the Berkeley network code because of this.

Actually, I think (unsigned) -1 does have to give you a bit pattern of
all 1's.  I can not find an explicit reason, but I can deduce this from
the following:

1.  Harbison and Steele, page 89:  "No matter what representation is
    used for signed integers, an unsigned integer represented with n
    bits is always considered to be in straight unsigned binary
    notation, with values ranging from 0 through 2^n-1.  Therefore, the
    bit pattern for a given unsigned value is predictable and portable,
    whereas the bit pattern for a given signed value is not predictable
    and not portable."

2.  Harbison and Steele, pages 126-7 (talking about casting an integral
    type into an unsigned integral type):  "If the result type is an
    unsigned type, then the result must be that unique value of the
    result type that is equal (congruent) mod 2^n to the original
    value, where n is equal to the number of bits used in the
    representation of the result type."

3.  The value in the range 0 to 2^n-1 (inclusive) congruent mod 2^n to
    -1 is 2^n-1.  In straight binary notation this value is represented
    as all 1's.

--------
Kenneth R. Ballou                  ARPA:  ballou@brahms
Department of Mathematics          UUCP:  ...!ucbvax!brahms!ballou
University of California
Berkeley, California  94720