[comp.std.c] Must sizeof(int) > sizeof(char)?

msb@sq.sq.com (Mark Brader) (08/30/89)

I'm transferring this discussion from comp.lang.c to comp.std.c.

The following point of interpretation is one which had never occurred to
me during the standardization process, and which I'd never seen discussed
on the net until just now.

The only thing that Section 3.1.2.5 of the proposed standard (pANS)
guarantees about sizes of integral types is, in effect, that the sequence
"sizeof(char), sizeof(short), sizeof(int), sizeof(long)" is nondecreasing,
and that the presence of "signed" or "unsigned" doesn't affect the size.
Section 2.2.4.2 in effect specifies minimum sizes for each type, but
specifies nothing about maximum sizes.

I believe there is nothing in the whole of Sections 2 or 3 of the pANS
which requires any integral type to be *larger* than any other.

But in a *hosted* implementation, Section 4 applies as well.  And Doug
Gwyn has just called attention in comp.lang.c to the fact that several
library functions specified there, such as getchar(), are expected to
convert an unsigned char value to type int.

Considering an implementation where sizeof(int)==sizeof(char), Doug writes:
> Since in such an implementation an int would be unable to represent
> all possible values in the range of an unsigned char, as required by
> the specification for some library routines, it would not be standard
> conforming.

Setting aside the fact that we're talking only about hosted environments
here, this seems shaky to me.  I can see two ways out of it, which makes
three possibilities in all.

[1] The wording of footnote 16 defining a so-called pure binary numeration
system is so broad that it may allow an unsigned type to simply ignore the
high-order bit position, provided that the corresponding signed type is
at least one bit wider than the minimum otherwise required.  Then int could
be 16 bits, char could be 16 bits, and unsigned char could be 16 bits of
which only the lower 15 are actually used.

[2] The wording of the requirements of the aforementioned functions could
be taken as specifying only that such a conversion be attempted, not that
it be possible for all possible values of the argument.  If int and char
are both 16 bits, and getchar() reads the character 0xFEED from the input,
then getchar() should be allowed to do whatever happens when you assign
the positive value 0xFEED to an int variable, and anything else would be
undefined behavior under the "invalid value" rule of 4.1.6.

[3] The above argument is right and so sizeof(int)>sizeof(char) is required
to be true, in a hosted environment only.

I seem to recall that the Committee explicitly decided not to require that
sizeof(int)>sizeof(char) when it was requested for other reasons, to do
with avoiding surprises with unsigned types in comparisons.  ("It was
decided to allow implementers flexibility in this regard", or some such
words.)  Are they now finding that they did require this all along?


-- 
Mark Brader	"Many's the time when I've thanked the DAG of past years
utzoo!sq!msb	for anticipating future maintenance questions and providing
msb@sq.com	helpful information in the original sources."	-- Doug A. Gwyn

This article is in the public domain.

dfp@cbnewsl.ATT.COM (david.f.prosser) (09/01/89)

In article <1989Aug29.204254.3307@sq.sq.com> msb@sq.com (Mark Brader) writes:
>But in a *hosted* implementation, Section 4 applies as well.  And Doug
>Gwyn has just called attention in comp.lang.c to the fact that several
>library functions specified there, such as getchar(), are expected to
>convert an unsigned char value to type int.
>
>Considering an implementation where sizeof(int)==sizeof(char), Doug writes:
>> Since in such an implementation an int would be unable to represent
>> all possible values in the range of an unsigned char, as required by
>> the specification for some library routines, it would not be standard
>> conforming.
>
>Setting aside the fact that we're talking only about hosted environments
>here, this seems shaky to me.  I can see two ways out of it, which makes
>three possibilities in all.
>
>[1] The wording of footnote 16 defining a so-called pure binary numeration
>system is so broad that it may allow an unsigned type to simply ignore the
>high-order bit position, provided that the corresponding signed type is
>at least one bit wider than the minimum otherwise required.  Then int could
>be 16 bits, char could be 16 bits, and unsigned char could be 16 bits of
>which only the lower 15 are actually used.

I agree that the pure binary numeration definition paraphrased in a footnote
does allow this sort of implementation.

>[2] The wording of the requirements of the aforementioned functions could
>be taken as specifying only that such a conversion be attempted, not that
>it be possible for all possible values of the argument.  If int and char
>are both 16 bits, and getchar() reads the character 0xFEED from the input,
>then getchar() should be allowed to do whatever happens when you assign
>the positive value 0xFEED to an int variable, and anything else would be
>undefined behavior under the "invalid value" rule of 4.1.6.
>
>[3] The above argument is right and so sizeof(int)>sizeof(char) is required
>to be true, in a hosted environment only.

Mark's and Doug's articles got me thinking along the same lines.  I believe
that the library does not force sizeof(char) to be less than sizeof(int).
Mark's [2] is a valid answer to Doug's point, but there are other
library section items to consider:

1. 4.9.2, p126, l32-34:
	"A binary stream is an ordered sequence of characters that can
	transparently record internal data.  Data read in from a binary
	stream shall compare equal to the data that were written out to
	that stream, under the same implementation."

2. 4.9.3, p127, l9-11:
	"All input takes place as if characters were read by successive
	calls to the fgetc function; all output takes place as if by
	successive calls to the fputc function."

3. 4.9.7.1, p142, l7-8:
	"The fgetc function obtains the next character (if present) as
	an unsigned character converted to an int."

Since every object's size is an exact multiple of sizeof(char), all the
bits in a character must be significant for an fwrite/fread round trip
of a negative int value to work.
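
A minimal sketch of the round trip that items 1-3 guarantee: if an
implementation ignored any bit of a char, the final comparison could
fail for a negative int, so every character bit must be significant.

	#include <stdio.h>

	int main(void)
	{
	    int out = -12345, in = 0;
	    FILE *fp = tmpfile();           /* scratch binary stream */

	    if (fp == NULL)
	        return 1;
	    fwrite(&out, sizeof out, 1, fp);
	    rewind(fp);
	    fread(&in, sizeof in, 1, fp);
	    fclose(fp);
	    return in == out ? 0 : 1;       /* 4.9.2 requires in == out */
	}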

Now, if EOF is required to be distinguishable from all unsigned char
values after conversion to int, then it follows that sizeof(char) must
be less than sizeof(int).  There are many strong indications that EOF
"should" be distinct, but I cannot find anything that actually requires
such a distinction.  Two such indications:

4. 4.9.7.1, p142 l12-13:
	"If the stream is at end-of-file, the end-of-file indicator for
	the stream is set and [the] fgetc [function] returns EOF.  If a
	read error occurs, the error indicator for the stream is set
	and [the] fgetc [function] returns EOF."

5. 4.9.7.11, p145, l15-16:
	"If the value of c [the first parameter for ungetc] equals that
	of the macro EOF, the operation fails and the input stream is
	unchanged."

Of course, virtually every program that reads input until EOF is not
portable, since it doesn't check feof when getchar returns EOF!  And
one cannot push back every character, since a character whose value
compares equal to EOF must be rejected by ungetc.
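
A portable "cat" therefore has to treat an EOF return as only a hint,
and ask feof/ferror which case actually occurred; a minimal sketch:

	#include <stdio.h>

	int main(void)
	{
	    int c;

	    for (;;) {
	        c = getchar();
	        if (c == EOF && (feof(stdin) || ferror(stdin)))
	            break;          /* real end-of-file or read error */
	        putchar(c);         /* otherwise c is a valid character
	                               that merely compares equal to EOF */
	    }
	    return 0;
	}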

>I seem to recall that the Committee explicitly decided not to require that
>sizeof(int)>sizeof(char) when it was requested for other reasons, to do
>with avoiding surprises with unsigned types in comparisons.  ("It was
>decided to allow implementers flexibility in this regard", or some such
>words.)  Are they now finding that they did require this all along?

Therefore (while discovering that even "cat" as most simply written is
not portable), the pANS still does not require that sizeof(char) be
less than sizeof(int).

At this point, I'd be happier if there were a requirement that EOF be
distinct from all other values possible to return from fgetc!

Dave Prosser	...not an official X3J11 answer...

gwyn@smoke.BRL.MIL (Doug Gwyn) (09/01/89)

In article <1713@cbnewsl.ATT.COM> dfp@cbnewsl.ATT.COM (david.f.prosser) writes:
>At this point, I'd be happier if there were a requirement that EOF be
>distinct from all other values possible to return from fgetc!

The very issue you discussed arose at an X3J11 meeting, in off-line
discussion with Jervis, myself, and someone else (as I recall).  My
dim recollection is that we decided EOF didn't have to be distinct
if sizeof(int)==sizeof(char), and so far as we could tell the latter
is allowed.  This agrees with your conclusions.  I would rather
construe the description of EOF as requiring that it be distinct,
for the obvious reasons.

Yet another matter for the "interpretations" phase?

mark@cblpf.ATT.COM (Mark Horton) (09/12/89)

In article <10908@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <1713@cbnewsl.ATT.COM> dfp@cbnewsl.ATT.COM (david.f.prosser) writes:
>>At this point, I'd be happier if there were a requirement that EOF be
>>distinct from all other values possible to return from fgetc!
>
>The very issue you discussed arose at an X3J11 meeting, in off-line
>discussion with Jervis, myself, and someone else (as I recall).  My
>dim recollection is that we decided EOF didn't have to be distinct
>if sizeof(int)==sizeof(char), and so far as we could tell the latter
>is allowed.  This agrees with your conclusions.  I would rather
>construe the description of EOF as requiring that it be distinct,
>for the obvious reasons.
>
>Yet another matter for the "interpretations" phase?

The obvious reason why you would want big characters (other than tiny
8-bit machines) is to support eastern character sets, such as the
Japanese Kanji.  There are several encodings of Kanji, generally in
16 bits.  While they don't use the entire 65K possible combinations,
they do use all 16 bits.  As I recall, 7-bit ASCII, 8-bit European,
and 16-bit Kanji characters can be interspersed, and can be recognized
by looking at the high bit of each byte in a pair: 0/0 => two ASCII
characters, 0/1 or 1/0 => ASCII/European or European/ASCII, 1/1 => a
single Kanji character in the remaining 14 bits.  I suspect (but am
not sure) that FFFF is unused, making EOF likely to be distinct, but
it could appear in a file.
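
A sketch of that recognition rule, assuming the high-bit scheme
recalled above (real encodings differ in detail; this is illustrative
only):

	/* Classify a pair of bytes by their high bits. */
	enum pair { TWO_ASCII, MIXED, KANJI };

	enum pair classify(unsigned char b1, unsigned char b2)
	{
	    int h1 = (b1 & 0x80) != 0;
	    int h2 = (b2 & 0x80) != 0;

	    if (h1 && h2)
	        return KANJI;       /* 1/1: one Kanji char in 14 bits */
	    if (h1 || h2)
	        return MIXED;       /* 0/1 or 1/0: European and ASCII */
	    return TWO_ASCII;       /* 0/0: two ASCII characters */
	}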

I would discourage any implementation of unsigned from ignoring or clearing
the high bit.  I think the "assume you won't see the EOF bits in the file"
approach is right for the implementation, while it's better for the
application to use feof instead of EOF.

By the way, some other character sets (such as Chinese) don't fit in
16 bits.  Assuming that, because int is as large as long, characters
will always be smaller than int may not be safe.

chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) (09/13/89)

In article <9487@cbnews.ATT.COM> mark@cblpf.ATT.COM (Mark Horton) writes:
|The obvious reason why you would want big characters (other than tiny
|8 bit machines) is to support eastern character sets, such as the
|Japanese Kanji. 

	Exactly why the identifier wchar_t was placed into the pANS.  A
small note on this is given in the latter parts of K&R.  wchar_t is an
implementation-defined integral type whose range can represent the
largest extended character set among the supported locales.
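
A minimal sketch of wchar_t in use, via the pANS multibyte conversion
function mbtowc (what the bytes of the string mean depends on the
current locale's multibyte encoding):

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
	    char mb[] = "example";  /* a multibyte string in the
	                               current locale */
	    wchar_t wc;
	    int n;

	    /* Convert at most MB_CUR_MAX bytes of the first multibyte
	       character into the corresponding wide character. */
	    n = mbtowc(&wc, mb, MB_CUR_MAX);
	    if (n > 0)
	        printf("first char: %d byte(s), code %ld\n",
	               n, (long)wc);
	    return 0;
	}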

|By the way, some other character sets (such as Chinese) don't fit in
|16 bits.  Assuming that since int=long that characters will always be
|smaller than int may not be safe.

	Possibly safe, but definitely not portable programming.

gwyn@smoke.BRL.MIL (Doug Gwyn) (09/13/89)

In article <9487@cbnews.ATT.COM> mark@cblpf.ATT.COM (Mark Horton) writes:
>The obvious reason why you would want big characters (other than tiny
>8 bit machines) is to support eastern character sets, ...

Yes, but the "international" community bought into "multibyte characters"
instead of a fat implementation of char.  I happen to think that was a poor
decision, because it requires additional programming to properly deal
with such environments.  Fat "char" would be slicker, but if you want
that you also need some way to express "small char" too, and my proposal
for that was not adopted.  Therefore I doubt that many implementations
will actually implement char any fatter than 8 or 9 bits, even in Eastern
markets.

>I would discourage any implementation of unsigned from ignoring or clearing
>the high bit.

They're not allowed to do that already.

>I think the "assume you won't see the EOF bits in the file"
>approach is right for the implementation, while it's better for the
>application to use feof instead of EOF.

Those seem to be the conclusions we've arrived at.  It is unfortunate
that comparing (getchar() == EOF) is not as universal as we've come
to believe.