[comp.std.internat] draft ANSI standard: one change that would *really* help Europe

gnu@hoptoad.uucp (John Gilmore) (12/02/86)

[This is posted to comp.lang.c because mod.std.c seems to be dead.  Love
those mod groups!]

While considering my point of view on trigraphs, Laura Creighton pointed
out that the problem is that Europeans really need more than a 7-bit
character set.

In that vein, one possible change to the ANSI standard would require
"char" to be unsigned.  This would double the number of characters
that a strictly conforming program could easily handle, and European
Unix systems could use an 8-bit character set in which the first 128
characters were USASCII.  I believe that the various Unix
internationalization efforts are already working in this direction.

No strictly conforming programs would be broken by this change,
since a strictly conforming program cannot assume whether char is
signed or unsigned; in fact, it will make MORE programs strictly
conform, since programs that assume char is unsigned will now conform.

In an 8-bit character set, all the ANSI punctuation as well as all
the national characters could be supported without kludges.
-- 
John Gilmore  {sun,ptsfa,lll-crg,ihnp4}!hoptoad!gnu   jgilmore@lll-crg.arpa
Call +1 800 854 7179 or +1 714 540 9870 and order X3.159-198x (ANSI C) for $65.
Then spend two weeks reading it and weeping.  THEN send in formal comments! 

lamy@ai.toronto.edu (12/02/86)

ISO Latin 1 is an 8 bit character set that is a superset of ASCII.
Portability, then, is a matter of having standard transliteration rules, e.g.
c-cedilla --> c, a-ring --> aa

But I sincerely doubt that code with native identifiers would ever make it to
public distribution. Such code is usually commented in English (or an
approximation thereof), with English identifiers (i, j, k, x, y, z, p, c, s
:-).

Jean-Francois Lamy 
one day, I may have all the characters I need to type my name :-( 

AI Group, Dept of Computer Science     CSNet: lamy@ai.toronto.edu
University of Toronto		       EAN:   lamy@ai.toronto.cdn
Toronto, ON, Canada M5S 1A4	       UUCP:  lamy@utai.uucp

ahe@k.cc.purdue.edu (Bill Wolfe) (12/03/86)

In article <1382@hoptoad.uucp>, gnu@hoptoad.uucp (John Gilmore) writes:
> While considering my point of view on trigraphs, Laura Creighton pointed
> out that the problem is that Europeans really need more than a 7-bit
> character set.
> 
> In that vein, one possible change to the ANSI standard would require
> "char" to be unsigned.  This would double the number of characters
> that a strictly conforming program could easily handle, and European
> Unix systems could use an 8-bit character set in which the first 128
> characters were USASCII.  I believe that the various Unix
> internationalization efforts are already doing working in this direction.

  Actually, as mentioned in Byte magazine about 9-10 months ago, ANSI is
  in the process of soliciting comments regarding its proposed 8-bit ASCII
  standard, which does contain 7-bit ASCII as its first 128 characters, and
  includes all the European characters in the upper 128...  check the Letters
  section of Byte, around February 1986 or so for the exact positions of the
  various characters in the proposed standard...
  
				      Bill Wolfe (ahe!k.cc.purdue.edu...)
					   
				      Purdue University Computing Center
					   

bandy@lll-crg.ARpA (Andrew Scott Beals) (12/03/86)

In article <1382@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>[This is posted to comp.lang.c because mod.std.c seems to be dead.  Love
>those mod groups!]

>[We really need an 8-bit character set and C needs to acknowledge this]

>[John thinks that perhaps "char" should be unsigned - most programs
> would be more correct since they assume that chars are unsigned]

Well, I heartily agree, but I think that there must be some programs
out there that assume that chars are useful as small signed numbers,
which I would also prefer not to break.  

Also, I think that having chars have different semantics (assumed unsigned
rather than signed like int) would be a bad thing in general.  Perhaps
what is needed is a "tiny" type (ala long and short) that would be signed
and (for now) essentially a signed char.

Of course, this brings in yet another type (oh no!) and yet another
reserved word, but it would make programs nicer.
	andy
-- 
Andrew Scott Beals	(member of HASA - A and S divisions)
bandy@lll-crg.arpa	{ihnp4,seismo,ll-xn,ptsfa,pyramid}!lll-crg!bandy
LLNL, P.O. Box 808, Mailstop L-419, Livermore CA 94550 (415) 423-1948
Primates who don't have tails should keep cats who don't have tails.

sjl@ukc.ac.uk (S.J.Leviseur) (12/04/86)

In article <8322@lll-crg.ARpA> bandy@lll-crg.UUCP (Andrew Scott Beals) writes:
>In article <1382@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>>[This is posted to comp.lang.c because mod.std.c seems to be dead.  Love
>>those mod groups!]
>
>>[We really need an 8-bit character set and C needs to acknowledge this]
>
>>[John thinks that perhaps "char" should be unsigned - most programs
>> would be more correct since they assume that chars are unsigned]
>
>Well, I heartily agree, but I think that there must be some programs
>out there that assume that chars are useful as small signed numbers,
>which I would also prefer not to break.  

Actually this breaks *LOTS* of programs. We have a compiler with unsigned
chars on some of our machines. This causes endless problems with what is
'affectionately' known as the "EOF bug".

The implementors of that compiler said it was the first thing they would
alter if they reimplemented it because of the number of problems it caused.
If you want to see for yourself have a look through your sources and find
every occurrence of a comparison between EOF (or -1) and a char. Typically,
where cp is a character pointer:-

		if (*cp == EOF)
	or
		while (*cp != EOF)

Older code is littered with these constructs.

	sean

joemu@nscpdc.NSC.COM (Joe Mueller) (12/04/86)

> >[We really need an 8-bit character set and C needs to acknowledge this]
> 
> >[John thinks that perhaps "char" should be unsigned - most programs
> > would be more correct since they assume that chars are unsigned]
> 
> Well, I heartily agree, but I think that there must be some programs
> out there that assume that chars are useful as small signed numbers,
> which I would also prefer not to break.  
> 
> Also, I think that having chars have different semantics (assumed unsigned
> rather than signed like int) would be a bad thing in general.  Perhaps
> what is needed is a "tiny" type (ala long and short) that would be signed
> and (for now) essentially a signed char.
> 
> Of course, this brings in yet another type (oh no!) and yet another
> reserved word, but it would make programs nicer.
> 	andy

Andy, 

The standard does allow for a small signed char type called (would you believe)
"signed char". From section 3.1.2.5 of the draft dated Oct. 1, 1986.

	A signed char occupies the same amount of storage as a "plain" char.
	A "plain" int has the natural size suggested by the architecture of the
	execution environment ...

The committee wanted to "fix" the question of signedness of a char but couldn't
arrive at an acceptable compromise. We thought about having chars be signed
and unsigned chars unsigned but we were afraid it would break too much code that
depended on chars being unsigned. We ended up adopting the compromise of:
	char	- signed or unsigned, implementation defined
	unsigned char
	signed char

By the way, the draft is now released for formal public review, so if you
have any other technical comment, fire away now or it will be too late!

					a humble member of X3J11,
					Joe Mueller
					...!nsc!nscpdc!joemu

karl@haddock.UUCP (Karl Heuer) (12/05/86)

In article <8322@lll-crg.ARpA> bandy@lll-crg.UUCP (Andrew Scott Beals) writes:
>In article <1382@hoptoad.uucp> gnu@hoptoad.uucp (John Gilmore) writes:
>>[We really need an 8-bit character set and C needs to acknowledge this]
>>[John thinks that perhaps "char" should be unsigned - most programs
>> would be more correct since they assume that chars are unsigned]
>
>Well, I heartily agree, but I think that there must be some programs
>out there that assume that chars are useful as small signed numbers,
>which I would also prefer not to break.  

They are already broken.

>Also, I think that having chars have different semantics (assumed unsigned
>rather than signed like int) would be a bad thing in general.  Perhaps
>what is needed is a "tiny" type (ala long and short) that would be signed
>and (for now) essentially a signed char.

ANSI already has it, and it's called "signed char".

Back to the previous question.  I think the original reason for leaving the
signedness of plain "char" unspecified is still valid: a program that deals
with 7-bit USASCII doesn't care whether char is signed or not, and it's nice
to have the compiler use the more efficient mode (which is signed, on the
pdp11.)  However, K&R and X3J11 both state that all members of the normal
character set are positive; I would interpret this to mean that international
implementations with non-USASCII printing characters must make "char" an
unsigned type*.  (I think some implementors disagree, so the point needs to
be clarified by X3J11.)

As for the trigraphs (I have to justify continuing to cross-post this), maybe
they're necessary for antique card-punches?  I got the impression that they
were necessary because ANSI didn't want to bind the language to ASCII, so
they only insisted on those characters in some ANSI standard character set.
(I wonder if there are any programs that will break because they have "??" in
the middle of a string someplace...)

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
*Or, a signed type with nine or more bits.

ahe@k.cc.purdue.edu (Bill Wolfe) (12/05/86)

In article <1637@k.cc.purdue.edu>, ahe@k.cc.purdue.edu (Bill Wolfe) writes:
>   Actually, as mentioned in Byte magazine about 9-10 months ago, ANSI is
>   in the process of soliciting comments regarding its proposed 8-bit ASCII
>   standard, which does contain 7-bit ASCII as its first 128 characters, and
>   includes all the European characters in the upper 128... check the Letters
>   section of Byte, around February 1986 or so for the exact positions of the
>   various characters in the proposed standard...
>   
>				      Bill Wolfe (ahe!k.cc.purdue.edu...)
>					   
>				      Purdue University Computing Center

    Make that the August or September 1985 issues...
    

stuart@bms-at.UUCP (Stuart D. Gathman) (12/08/86)

In article <2221@eagle.ukc.ac.uk>, sjl@ukc.ac.uk (S.J.Leviseur) writes:

> >>[John thinks that perhaps "char" should be unsigned - most programs
> >> would be more correct since they assume that chars are unsigned]

> Actually this breaks *LOTS* of programs. We have a compiler with unsigned
> chars on some of our machines. This causes endless problems with what is
> 'affectionately' known as the "EOF bug".

> If you want to see for yourself have a look through your sources and find
> every occurrence of a comparison between EOF (or -1) and a char. Typically,
> where cp is a character pointer:-

> 		if (*cp == EOF)

> 		while (*cp != EOF)

Not in our code!  This type of code is not likely to work, even under K & R.
ANSI is only trying not to break *legal* programs.  The above essentially
is trying to use 255 (or whatever) instead of 0 as a string terminator.
Even if there were a legitimate reason for this, EOF is the wrong name
to use, since it is _already defined as a return value of stdio functions_.

This code was broken already.

It's too bad that type checking doesn't catch this sort of thing.  I wish
there was a way to define an enum type that is either a char or EOF, and
declare stdio functions to return that type.  Then if only enum weren't so
loose about converting to other types without a cast.  Sigh.

P.S.  Are you trying to tell me that official unix utilities are written like
that?
-- 
Stuart D. Gathman	<..!seismo!dgis!bms-at!stuart>

levy@ttrdc.UUCP (Daniel R. Levy) (12/08/86)

In article <2221@eagle.ukc.ac.uk>, sjl@ukc.UUCP writes:
>Actually this breaks *LOTS* of programs. We have a compiler with unsigned
>chars on some of our machines. This causes endless problems with what is
>'affectionately' known as the "EOF bug".
>The implementors of that compiler said it was the first thing they would
>alter if they reimplemented it because of the number of problems it caused.
>If you want to see for yourself have a look through your sources and find
>every occurrence of a comparison between EOF (or -1) and a char. Typically,
>where cp is a character pointer:-
>		if (*cp == EOF)
>	or
>		while (*cp != EOF)
>Older code is littered with these constructs.
>	sean

Ugh.  "Littered" is the right term.  Not only is this nonportable, it will
be FOOLED on systems where it normally works (signed char, 2's complement
representation) given the character '\377'.  So if you getchar(), say, upon
an arbitrary binary file and you are looking for EOF you are likely scr*wed
with this kind of code.

EOF is meant to be an out-of-band value for things like getchar() etc.
That's why they return int, and not char.

(Lint should warn about this kind of comparison.  I have learned the slow,
hard way that when I get C code from elsewhere, yes even the mighty BTL, I
lint it first and fix the warnings before compiling on a system other than
from whence it came!)

Dan

sjl@ukc.ac.uk (S.J.Leviseur) (12/09/86)

In article <300@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
>In article <2221@eagle.ukc.ac.uk>, sjl@ukc.ac.uk (S.J.Leviseur) writes:
....
>> If you want to see for yourself have a look through your sources and find
>> every occurrence of a comparison between EOF (or -1) and a char. Typically,
>> where cp is a character pointer:-
>
>> 		if (*cp == EOF)
>
>> 		while (*cp != EOF)
>
>Not in our code!  This type of code is not likely to work, even under K & R.

It will work on any machine that allows signed chars (despite being
ideologically unsound!)

>ANSI is only trying not to break *legal* programs.  The above essentially
>is trying to use 255 (or whatever) instead of 0 as a string terminator.
>Even if there was a legitimate reason for this, EOF is the wrong name
>to use since it is _already defined as a return value of stdio functions_.

The case I was thinking of here is reading on a pipe. This seems to be
popular. The use of EOF is valid in this context.

Another favorite is assigning the result of getchar to a char and then
testing to see if the char is -1.

There are others ....

>
>This code was broken already.
>
>It's too bad that type checking doesn't catch this sort of thing.  I wish
>there was a way to define an enum type that is either a char or EOF, and
>declare stdio functions to return that type.  Then if only enum weren't so
>loose about converting to other types without a cast.  Sigh.
>
>P.S.  Are you trying to tell me that official unix utilities are written like
>that?

Yes, worse luck :-(

jc@piaget.UUCP (John Cornelius) (12/13/86)

In article <2246@eagle.ukc.ac.uk> sjl@ukc.ac.uk (S.J.Leviseur) writes:
 >In article <300@bms-at.UUCP> stuart@bms-at.UUCP (Stuart D. Gathman) writes:
 >>In article <2221@eagle.ukc.ac.uk>, sjl@ukc.ac.uk (S.J.Leviseur) writes:
 >....
 >>> If you want to see for yourself have a look through your sources and find
 >>> every occurrence of a comparison between EOF (or -1) and a char. Typically,
 >>> where cp is a character pointer:-
 >>
 >>> 		if (*cp == EOF)
 >>
 >>> 		while (*cp != EOF)
 >>
 >>Not in our code!  This type of code is not likely to work, even under K & R.
 >
 >It will work on any machine that allows signed chars (despite being ideologically
 >unsound!)
 >

I believe that the 3B2, to pick an example, places char in the high order
byte of the register.  If you test one for equality with (int) -1 you will
never pass the test.

As for small integers, Whitesmiths had a convention where "tiny" was
typedef'd as a char (in this case signed).

Because of the different architectures we're seeing in the Unix/C environment
I laud the effort to create a standard that is architecture independent.

As for the above construct working on any machine with signed char, I doubt
that it will work on the 3B2.


-- 
John Cornelius
(...!sdcsvax!piaget!jc)

guy@sun.uucp (Guy Harris) (12/14/86)

> I believe that the 3B2, to pick an example, places char in the high order
> byte of the register.

You may believe that, but it's not the case.  It's also completely
irrelevant; the C language states quite precisely what happens in the case
cited, both in the case where "char" is signed and in the case where "char"
is unsigned.  The implementation can put the character into any byte it
wants, as long as it behaves the way K&R, or the ANSI C spec, say it should.

> If you test one for equality with (int) -1 you will never pass the test.

That's because characters are UNsigned on the 3B2.

> As for the above construct working on any machine with signed char, I doubt
> that it will work on the 3B2.

See above.  The construct doesn't work on the 3B2 because it *isn't* a
machine with signed char.  The above construct *will* "work", in some sense,
on any C implementation with signed "char".  It *still* won't work
correctly; on a machine with signed "char" and 8-bit bytes, a byte with the
value 0xff will be sign-extended to -1, and thus will compare equal to EOF
even though it's a perfectly legitimate datum.
-- 
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com (or guy@sun.arpa)

ron@brl-sem.ARPA (Ron Natalie <ron>) (12/14/86)

Correct me if I am wrong, but there is no place in STDIO where EOF is
ever meant to be applied to a character.  The only thing that EOF is
defined for is functions returning int.

Second, nowhere is it stated that (unsigned) -1 will give you a word of
all ones.  Be careful when making this assumption.  I spent a lot of time
fixing up the Berkeley network code because of this.

-Ron

ballou@brahms (Kenneth R. Ballou) (12/15/86)

In article <518@brl-sem.ARPA> ron@brl-sem.ARPA (Ron Natalie <ron>) writes:
> ... nowhere is it stated that (unsigned) -1 will give you a word of
>all ones.  Be careful when making this assumption.  I spent a lot of time
>fixing up the Berkeley network code because of this.

	Actually, I think (unsigned) -1 does have to give you a bit pattern of
all 1's.  I can not find an explicit reason, but I can deduce this from the
following:

	1.  Harbison and Steele, page 89:
	    "No matter what representation is used for signed integers,
	an unsigned integer represented with n bits is always considered to
	be in straight unsigned binary notation, with values ranging from
	0 through 2^n-1.  Therefore, the bit pattern for a given unsigned
	value is predictable and portable, whereas the bit pattern for a
	given signed value is not predictable and not portable."

	2.  Harbison and Steele, pages 126-7 (talking about casting an integral
	    type into an unsigned integral type):
	    "  If the result type is an unsigned type, then the result
	must be that unique alue of the result type that is equal (congruent)
	mod 2^n in the original value, where n is equal to the number of bits
	used in the representation of the result type."

	3.  The value in the range 0 to 2^n-1 (inclusive) congruent mod 2^n
	to -1 is 2^n-1.  In straight binary notation this value is repre-
	sented as all 1's.

--------
Kenneth R. Ballou			ARPA:  ballou@brahms
Department of Mathematics		UUCP:  ...!ucbvax!brahms!ballou
University of California
Berkeley, California  94720
--------