msb@sq.sq.com (Mark Brader) (08/30/89)
I'm transferring this discussion from comp.lang.c to comp.std.c.  The following point of interpretation is one which had never occurred to me during the standardization process, and which I'd never seen discussed on the net until just now.

The only thing that Section 3.1.2.5 of the proposed standard (pANS) guarantees about sizes of integral types is, in effect, that the sequence "sizeof(char), sizeof(short), sizeof(int), sizeof(long)" is nondecreasing, and that the presence of "signed" or "unsigned" doesn't affect the size.  Section 2.2.4.2 in effect specifies minimum sizes for each type, but specifies nothing about maximum sizes.  I believe there is nothing in the whole of Sections 2 or 3 of the pANS which requires any integral type to be *larger* than any other.

But in a *hosted* implementation, Section 4 applies as well.  And Doug Gwyn has just called attention in comp.lang.c to the fact that several library functions specified there, such as getchar(), are expected to convert an unsigned char value to type int.  Considering an implementation where sizeof(int)==sizeof(char), Doug writes:

> Since in such an implementation an int would be unable to represent
> all possible values in the range of a unsigned char, as required by
> the specification for some library routines, it would not be standard
> conforming.

Setting aside the fact that we're talking only about hosted environments here, this seems shaky to me.  I can see two ways out of it, which makes three possibilities in all.

[1] The wording of footnote 16 defining a so-called pure binary numeration system is so broad that it may allow an unsigned type to simply ignore the high-order bit position, provided that the corresponding signed type is at least one bit wider than the minimum otherwise required.  Then int could be 16 bits, char could be 16 bits, and unsigned char could be 16 bits of which only the lower 15 are actually used.
[2] The wording of the requirements of the aforementioned functions could be taken as specifying only that such a conversion be attempted, not that it be possible for all possible values of the argument.  If int and char are both 16 bits, and getchar() reads the character 0xFEED from the input, then getchar() should be allowed to do whatever happens when you assign the positive value 0xFEED to an int variable, and anything else would be undefined behavior under the "invalid value" rule of 4.1.6.

[3] The above argument is right and so sizeof(int)>sizeof(char) is required to be true, in a hosted environment only.

I seem to recall that the Committee explicitly decided not to require that sizeof(int)>sizeof(char) when it was requested for other reasons, to do with avoiding surprises with unsigned types in comparisons.  ("It was decided to allow implementers flexibility in this regard", or some such words.)  Are they now finding that they did require this all along?

-- 
Mark Brader       "Many's the time when I've thanked the DAG of past years
utzoo!sq!msb       for anticipating future maintenance questions and providing
msb@sq.com         helpful information in the original sources."  -- Doug A. Gwyn

This article is in the public domain.
dfp@cbnewsl.ATT.COM (david.f.prosser) (09/01/89)
In article <1989Aug29.204254.3307@sq.sq.com> msb@sq.com (Mark Brader) writes:
>But in a *hosted* implementation, Section 4 applies as well.  And Doug
>Gwyn has just called attention in comp.lang.c to the fact that several
>library functions specified there, such as getchar(), are expected to
>convert an unsigned char value to type int.
>
>Considering an implementation where sizeof(int)==sizeof(char), Doug writes:
>> Since in such an implementation an int would be unable to represent
>> all possible values in the range of a unsigned char, as required by
>> the specification for some library routines, it would not be standard
>> conforming.
>
>Setting aside the fact that we're talking only about hosted environments
>here, this seems shaky to me.  I can see two ways out of it, which makes
>three possibilities in all.
>
>[1] The wording of footnote 16 defining a so-called pure binary numeration
>system is so broad that it may allow an unsigned type to simply ignore the
>high-order bit position, provided that the corresponding signed type is
>at least one bit wider than the minimum otherwise required.  Then int could
>be 16 bits, char could be 16 bits, and unsigned char could be 16 bits of
>which only the lower 15 are actually used.

I agree that the pure binary numeration definition paraphrased in a footnote does allow this sort of implementation.

>[2] The wording of the requirements of the aforementioned functions could
>be taken as specifying only that such a conversion be attempted, not that
>it be possible for all possible values of the argument.  If int and char
>are both 16 bits, and getchar() reads the character 0xFEED from the input,
>then getchar() should be allowed to do whatever happens when you assign
>the positive value 0xFEED to an int variable, and anything else would be
>undefined behavior under the "invalid value" rule of 4.1.6.
>
>[3] The above argument is right and so sizeof(int)>sizeof(char) is required
>to be true, in a hosted environment only.
Mark's and Doug's articles got me thinking along the same lines.  I believe that the library does not force sizeof(char) to be less than sizeof(int).  Mark's [2] is a valid argument for Doug's point, but there are other library section items:

1. 4.9.2, p126, l32-34: "A binary stream is an ordered sequence of characters that can transparently record internal data.  Data read in from a binary stream shall compare equal to the data that were written out to that stream, under the same implementation."

2. 4.9.3, p127, l9-11: "All input takes place as if characters were read by successive calls to the fgetc function; all output takes place as if by successive calls to the fputc function."

3. 4.9.7.1, p142, l7-8: "The fgetc function obtains the next character (if present) as an unsigned character converted to an int."

Since all objects are exact multiples of characters, this means that all the bits in a character must be significant so that an fwrite/fread of a negative int value works.  Now, if EOF is required to be distinguishable from all unsigned char values after conversion to int, then it follows that sizeof(char) must be less than sizeof(int).  While there are many strong indications that EOF "should" be different, I cannot find anything that actually requires such a distinction.  Two such indications:

4. 4.9.7.1, p142, l12-13: "If the stream is at end-of-file, the end-of-file indicator for the stream is set and [the] fgetc [function] returns EOF.  If a read error occurs, the error indicator for the stream is set and [the] fgetc [function] returns EOF."

5. 4.9.7.11, p145, l15-16: "If the value of c [the first parameter for ungetc] equals that of the macro EOF, the operation fails and the input stream is unchanged."

Of course, on such an implementation virtually every program that reads input until EOF is not portable, since it doesn't check feof when getchar returns EOF!  And one cannot push back every character, since EOF must be rejected by ungetc.
>I seem to recall that the Committee explicitly decided not to require that
>sizeof(int)>sizeof(char) when it was requested for other reasons, to do
>with avoiding surprises with unsigned types in comparisons.  ("It was
>decided to allow implementers flexibility in this regard", or some such
>words.)  Are they now finding that they did require this all along?

Therefore (while discovering that even "cat" as most simply written is not portable), the pANS still does not require that sizeof(char) must be less than sizeof(int).  At this point, I'd be happier if there were a requirement that EOF be distinct from all other values possible to return from fgetc!

Dave Prosser	...not an official X3J11 answer...
gwyn@smoke.BRL.MIL (Doug Gwyn) (09/01/89)
In article <1713@cbnewsl.ATT.COM> dfp@cbnewsl.ATT.COM (david.f.prosser) writes:
>At this point, I'd be happier if there were a requirement that EOF be
>distinct from all other values possible to return from fgetc!

The very issue you discussed arose at an X3J11 meeting, in off-line discussion with Jervis, myself, and someone else (as I recall).  My dim recollection is that we decided EOF didn't have to be distinct if sizeof(int)==sizeof(char), and so far as we could tell the latter is allowed.  This agrees with your conclusions.  I would rather construe the description of EOF as requiring that it be distinct, for the obvious reasons.

Yet another matter for the "interpretations" phase?
mark@cblpf.ATT.COM (Mark Horton) (09/12/89)
In article <10908@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>In article <1713@cbnewsl.ATT.COM> dfp@cbnewsl.ATT.COM (david.f.prosser) writes:
>>At this point, I'd be happier if there were a requirement that EOF be
>>distinct from all other values possible to return from fgetc!
>
>The very issue you discussed arose at an X3J11 meeting, in off-line
>discussion with Jervis, myself, and someone else (as I recall).  My
>dim recollection is that we decided EOF didn't have to be distinct
>if sizeof(int)==sizeof(char), and so far as we could tell the latter
>is allowed.  This agrees with your conclusions.  I would rather
>construe the description of EOF as requiring that it be distinct,
>for the obvious reasons.
>
>Yet another matter for the "interpretations" phase?

The obvious reason why you would want big characters (other than tiny 8 bit machines) is to support eastern character sets, such as the Japanese Kanji.  There are several encodings of Kanji, generally in 16 bits.  While they don't use the entire 65K possible combinations, they do use all 16 bits.  As I recall, 7 bit ASCII, 8 bit European, and 16 bit Kanji characters can be interspersed, and can be recognized by looking at the high bits of each byte: 0/0 => ASCII, 0/1 or 1/0 => ASCII/Eur or Eur/ASCII, 1/1 => a single Kanji character in the remaining 14 bits.  I suspect (but am not sure) that FFFF is unused, making EOF likely to be distinct, but it could appear in a file.

I would discourage any implementation of unsigned from ignoring or clearing the high bit.  I think the "assume you won't see the EOF bits in the file" approach is right for the implementation, while it's better for the application to use feof instead of EOF.

By the way, some other character sets (such as Chinese) don't fit in 16 bits.  Assuming that, because int equals long, characters will always be smaller than int may not be safe.
chad@csd4.csd.uwm.edu (D. Chadwick Gibbons) (09/13/89)
In article <9487@cbnews.ATT.COM> mark@cblpf.ATT.COM (Mark Horton) writes:
|The obvious reason why you would want big characters (other than tiny
|8 bit machines) is to support eastern character sets, such as the
|Japanese Kanji.

Exactly why the identifier wchar_t was placed into the pANS.  A small note of this is given in the latter parts of K&R.  wchar_t is implementation-defined to be large enough to hold any value a particular character set can take on within a given locale.

|By the way, some other character sets (such as Chinese) don't fit in
|16 bits.  Assuming that since int=long that characters will always be
|smaller than int may not be safe.

Possibly safe, but definitely not portable programming.
gwyn@smoke.BRL.MIL (Doug Gwyn) (09/13/89)
In article <9487@cbnews.ATT.COM> mark@cblpf.ATT.COM (Mark Horton) writes:
>The obvious reason why you would want big characters (other than tiny
>8 bit machines) is to support eastern character sets, ...

Yes, but the "international" community bought into "multibyte characters" instead of a fat implementation of char.  I happen to think that was a poor decision, because it requires additional programming to properly deal with such environments.  Fat "char" would be slicker, but if you want that you also need some way to express "small char" too, and my proposal for that was not adopted.  Therefore I doubt that many implementations will actually implement char any fatter than 8 or 9 bits, even in Eastern markets.

>I would discourage any implementation of unsigned from ignoring or clearing
>the high bit.

They're not allowed to do that already.

>I think the "assume you won't see the EOF bits in the file"
>approach is right for the implementation, while it's better for the
>application to use feof instead of EOF.

That seems to be the conclusion we've arrived at.  It is unfortunate that comparing (getchar() == EOF) is not as universal as we've come to believe.