chris@mimsy.umd.edu (Chris Torek) (01/07/90)
In article <11879@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>The paragraph around line 40 on page 136 of the December draft makes it
>clear that the result of the conversion for the n specifier is subject
>to assignment suppression by *. (Yes, there IS a conversion, just no
>input operation.)

Well, I would argue that it makes it the most reasonable interpretation,
but definitely *not* `clear'.

Anyway, here are the (apparent) answers:

%*n suppresses assignment; no action occurs: it is a no-op.
    (%n is a conversion, but is not an assignment; yet it can be
    suppressed with assignment suppression.)

%[efg] reads a floating point number. If the input has one of the forms
        <opt-sign><nondigit>
        <opt-sign>.<nondigit>
    no input is consumed.
    If the input has one of the forms
        <opt-sign><digit-seq><exp><opt-sign><nondigit>
        <opt-sign><digit-seq>.<exp><opt-sign><nondigit>
        <opt-sign><digit-seq>.<digit-seq><exp><opt-sign><nondigit>
        <opt-sign>.<digit-seq><exp><opt-sign><nondigit>
    the <exp>, the second <opt-sign>, and the <nondigit> remain
    unconsumed.
    The definitions of <opt-sign>, <digit-seq>, <exp>, and <nondigit>
    are the obvious. (Note that EOF counts as a nondigit.)

%[dioux] reads an integer. If the input has one of the forms
        <sign><nondigit>
    no input is consumed.
    If the input has the form
        <opt-sign>0x
        <opt-sign>0X
    and the conversion is either `i' or `x', the sign (if any) and
    the zero are consumed; the `x' or `X' remains unconsumed.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
gwyn@smoke.BRL.MIL (Doug Gwyn) (01/07/90)
In article <21675@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
-%*n suppresses assignment; no action occurs: it is a no-op.
Right.
- (%n is a conversion, but is not an assignment; ...
No, %n involves both a conversion and an assignment, but no input action.
The assignment is suppressed by *, in which case there is really no need
for the implementation to perform the conversion since it merely wastes
CPU cycles. (The same applies to other conversions when the assignment is
suppressed, although most implementations will probably combine lexing
with conversion and thus not be conveniently able to skip the conversion.)
-%[efg] reads a floating point number. If the input has one of the forms
- <opt-sign><nondigit>
- <opt-sign>.<nondigit>
- no input is consumed.
Right. Similarly for any other ill-formed string.
- If the input has one of the forms
- <opt-sign><digit-seq><exp><opt-sign><nondigit>
- <opt-sign><digit-seq>.<exp><opt-sign><nondigit>
- <opt-sign><digit-seq>.<digit-seq><exp><opt-sign><nondigit>
- <opt-sign>.<digit-seq><exp><opt-sign><nondigit>
- the <exp>, the second <opt-sign>, and the <nondigit> remain
- unconsumed.
Right.
-%[dioux] reads an integer. If the input has one of the forms
- <sign><nondigit>
- no input is consumed.
Right.
- If the input has the form
- <opt-sign>0x
- <opt-sign>0X
- and the conversion is either `i' or `x', the sign (if any) and
- the zero are consumed; the `x' or `X' remains unconsumed.
Right (assuming that there is no hex digit immediately following the x).
DISCLAIMER: This is my personal interpretation of the Standard;
for an official interpretation you must send a request to X3J11 via X3.
chris@mimsy.umd.edu (Chris Torek) (01/08/90)
[me: (%n is a conversion, but is not an assignment; ...]

In article <11897@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>No, %n involves both a conversion and an assignment, but no input action.

Except that it is not counted in the return value. (Neither are
suppressed assignments, but those are `suppressed assignments', not
`assignments'.) %n, then, is a conversion and an assignment, but cannot
be called an assignment because it is not counted as an assignment.
This is the sort of thing that causes confusion as to whether `%*n'
suppresses the (already not counted as an assignment) assignment.

>> If the input has the form
>> <opt-sign>0x
>> <opt-sign>0X
>> and the conversion is either `i' or `x', the sign (if any) and
>> the zero are consumed; the `x' or `X' remains unconsumed.

>Right (assuming that there is no hex digit immediately following the x).

Oops, I meant to say `<opt-sign>0<hex-indicator><non-hex-digit>'.

Anyway, if the implementation of *scanf() uses lookahead to handle
scanning, it needs at least three bytes of lookahead. If it uses
pushback (which is legal but not required), the implementation must
provide at least its own three bytes of pushback plus one more.

Mine uses a combination of lookahead and pushback: it looks at the first
remaining character in the buffer, and consumes it if it appears to be
valid. If it later discovers that it was not, there might be one
character of lookahead around, and two characters consumed that need to
be pushed back; in this case, both are pushed back, and the
implementation further guarantees at least one more pushback.

Incidentally, it is not clear to me whether the standard requires
the following to work. (The important line is marked with -> on the
left.)

    #include <stdio.h>
    #include "h_defs.h"     /* for H_VALUE values */

    /*
     * Assume `stream' is open to a read stream on which
     * the next few input characters are either
     * `h<optional space><integer>' or perhaps `hello'.
     * If the format is `h<integer>', stuff the value into
     * the given h_value pointer and return 1. Leave *h_value
     * unchanged otherwise.
     *
     * If there is an h, but it is not followed by a space or a digit,
     * leave the h and what follows it unconsumed.
     */
    int find_h_value(FILE *stream, int *h_value) {
        int c, v, n, r;

        c = getc(stream);
        if (c != 'h') {
            /* nb: ungetc(EOF) fails; this is desired */
            (void) ungetc(c, stream);
            return (HV_NO_H);       /* no `h' */
        }
        if ((r = fscanf(stream, " %n%d", &n, &v)) == EOF) {
            /* must have been an input failure: conk out */
            return (HV_H_WITH_EOF);
        }
        if (r == 1) {
            *h_value = v;
            return (HV_WITHVALUE);  /* got an h value */
        }
        /* r must be 0 */
        if (n == 0) {
            /* there was no white space: put back the `h' */
    ->      (void) ungetc('h', stream);
            return (HV_UNCHANGED);  /* input stream unchanged */
        }
        /* there were spaces, so we may not be able to put back
           the `h'; return a code saying `keyword h found,
           followed by something not an integer' */
        return (HV_H_WITH_UNKNOWN_TEXT);
    }

If n is zero, we know the scanf() did not consume any characters. We
may therefore be required to allow the `h' to be pushed back. I am not
sure.

Consider an implementation similar to the old Unix one, however, in
which one fills a buffer whenever a `getc' (or equivalent) is done on an
empty buffer. Here we might have the following:

 A. buffer is nearly exhausted: it has one `h' left
 B. program does a `getc', which returns 'h': buffer now empty
    (at this point, ungetc('h') will work.)
 C. program calls fscanf() which calls __vfscanf(), which starts
    the ` ' directive, needs to skip spaces, and therefore refills
    the buffer
 D. __vfscanf() finds an `e' (from `hello', perhaps) and stops
    skipping spaces
 E. __vfscanf() executes `%n' directive, which stores 0 in n
 F. __vfscanf() tries to execute `%d', finds an `e', and stops
    with a matching failure (returns 0)

At this point, there is probably no room in the input buffer to push
back the `h'.

Then again, the description for `ungetc' does not indicate that any
`getc' must be done in advance. It says that one character of pushback
is guaranteed. Perhaps this is meant to imply that

    FILE *foo = fopen("foo", "r");
    if (foo == NULL) die();
    (void) ungetc('a', foo);

is guaranteed to push back an `a', so that the first getc(foo) returns
'a'.
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@cs.umd.edu Path: uunet!mimsy!chris
karl@haddock.ima.isc.com (Karl Heuer) (01/09/90)
In article <21690@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>[So apparently scanf() requires at least three bytes of lookahead or pushback
>so that it can recover from "1.2e-x", which assigns 1.2 and leaves "e-x" in
>the input stream.]

No no no. That may be the morally correct action, but it's not what the
Standard says. Doug was mistaken on this point.

4.9.6.2 says "If conversion terminates on a conflicting input character, the
offending input character is left unread". In the example, the "x" is the
first conflicting input character, so only it remains unread; the characters
"1.2e-" are consumed (and the conversion fails, since it does not form a valid
floating-point number).

The examples on page 139 of the Dec88 Draft clearly demonstrate that a format
beginning with "%f" matches zero items when presented with "100ergs", because
`"100e" fails to match "%f"'. Moreover, the Rationale says that
"One-character pushback is sufficient for the implementation of |fscanf|.
Given the invalid field `-.x', the characters `-.' are not pushed back" and
"if a `flawed field' is detected, no value is stored for the corresponding
argument".

I think the intent of the Committee is clear.

>Incidentally, it is not clear to me whether the standard requires
>the following to work. [Code equivalent to the following:]
> c = getchar(); assert(c == 'h');
> r = scanf(" %n", &n); assert(r == 0 && n == 0);
> ungetc('h', stdin); /* can this fail? */
>If n is zero, we know the scanf() did not consume any characters. We
>may therefore be required to allow the `h' to be pushed back. I am not
>sure.

I believe this is exactly what the Committee intended to legalize with its
insistence that scanf is "not allowed to consume the ungetc slot". Yes, the
traditional Unix implementation is non-conforming. It will have to be fixed.

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint
karl@haddock.ima.isc.com (Karl Heuer) (01/09/90)
As pointed out in my previous article, scanf() and ungetc() together require
two lookahead/pushback slots. Personally, I'm surprised that the Committee
did this. Providing a single slot is easy, as demonstrated by the traditional
Unix implementation. It's not immediately obvious how to provide more than
one, and still be able to provide the fast macro version of getc().

The solutions I've come up with can be easily generalized to more than two
slots. So it appears that, given the insistence of the Committee to keep
scanf() and ungetc() independent (despite common existing practice!), they
could just as well have required scanf() to do full pushback for cases like
"1.2e-x", which would require a total of four slots (three for scanf and one
for ungetc), and not imposed any undue hardship on the implementation.

There is a clause in the Standard that says that `all input takes place as if
characters were read by successive calls to the |fgetc| function'. Of course
|scanf| can't be written using only |fgetc| as a primitive, and I think it was
probably a mistake to pretend that it can. My early suggestion was to add the
clause `If an extra character is necessary to recognize the end of input, then
scanf behaves as if it called the ungetc function after reading said
character'. This would have kept the model simple, clearly documented the
state of the stream following a scanf, and fixed the tricky implementation
problem noted above. It would also have agreed with existing practice.

Several months later I did find a way to implement |getc|, |ungetc|, and
|scanf| so that they would follow the rules without incurring a substantial
performance penalty. So my current opinion is that |scanf| should have been
specified to do the morally correct thing, i.e. what Doug thought it was
already specifying. Of course, it's too late to change it now.

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint
gwyn@smoke.BRL.MIL (Doug Gwyn) (01/09/90)
In article <21690@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>[me: (%n is a conversion, but is not an assignment; ...]
>In article <11897@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>>No, %n involves both a conversion and an assignment, but no input action.
>Except that it is not counted in the return value. (Neither are
>suppressed assignments, but those are `suppressed assignments', not
>`assignments'.) %n, then, is a conversion and an assignment, but cannot
>be called an assignment because it is not counted as an assignment.
>This is the sort of thing that causes confusion as to whether `%*n'
>suppresses the (already not counted as an assignment) assignment.

The Standard says, "... the fscanf function returns the number of input
items assigned ...". %n does not correspond to an input item. That %n
is not included in the returned assignment count is also stated explicitly
under the description of the n conversion specifier, just to make sure
that there is no question about this. (Generally we avoided redundancy
in the specifications, but some was added to reduce apparent ambiguity
as evidenced by comments received during the public review.)

>Anyway, if the implementation of *scanf() uses lookahead to handle
>scanning, it needs at least three bytes of lookahead. If it uses
>pushback (which is legal but not required), the implementation must
>provide at least its own three bytes of pushback plus one more.

Yup. Some of us would have been happy to eliminate the non-string
forms of *scanf() as well as ungetc(), to avoid having stdio deal
with pushback. Since only a limited amount of pushback is guaranteed,
and then is subject to rigid constraints, it really isn't very useful
for real-world tokenizers, macro processors, etc. which must implement
their own scheme anyway.

However, there was a strong minority that insisted that ungetc() was
important to them, and at the time we thought it highly desirable to
obtain unanimous approval for sending the draft out for (the first)
public review, so the notion of pushback remained in the draft stdio
specs. Later there was little sentiment for revisiting this issue.

>Incidentally, it is not clear to me whether the standard requires
>the following to work. [example program omitted for brevity]

The Standard does guarantee that you can push back one character with
ungetc() ('h' in the example program). As you have noticed, the old
UNIX implementation of stdio pushback does not conform to the Standard.
Perhaps you can get Dave Prosser or some other AT&T implementor to
explain how they dealt with this in SVR4, which is advertised as
Standard conformant.

>At this point, there is probably no room in the input buffer to push
>back the `h'.

Definitely some additional "slop space" must be provided, somehow.
I've heard it said that two bytes of slop is required if fscanf() uses
getc()/ungetc(), although four seems to me to be necessary (without
thinking very hard about it).

>Then again, the description for `ungetc' does not indicate that any
>`getc' must be done in advance. It says that one character of pushback
>is guaranteed. Perhaps this is meant to imply that
> FILE *foo = fopen("foo", "r");
> if (foo == NULL) die();
> (void) ungetc('a', foo);
>is guaranteed to push back an `a', so that the first getc(foo) returns
>'a'.

It is true that ungetc() need not be preceded by getc() (or other
stdio input function). Also, as shown, it is permissible to push back
a character even at the beginning of the stream (fpi becomes sort of
indeterminate as explained in the Standard). A POSIX-conforming
implementation must treat text and binary streams indistinguishably,
which means that after reading back the pushed-back character the fpi
must again have the value 0.
gwyn@smoke.BRL.MIL (Doug Gwyn) (01/09/90)
In article <15591@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>In article <21690@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>>[So apparently scanf() requires at least three bytes of lookahead or pushback
>>so that it can recover from "1.2e-x", which assigns 1.2 and leaves "e-x" in
>>the input stream.]
>No no no. That may be the morally correct action, but it's not what the
>Standard says. Doug was mistaken on this point.
>4.9.6.2 says "If conversion terminates on a conflicting input character, the
>offending input character is left unread". In the example, the "x" is the
>first conflicting input character, so only it remains unread; the characters
>"1.2e-" are consumed (and the conversion fails, since it does not form a valid
>floating-point number).

Again, I think this is a case of focusing too narrowly on only a portion
of the full Standard. Two pages earlier, an "input item" is carefully
defined as the LONGEST MATCHING sequence of input characters, etc. and
that the FIRST CHARACTER after the input item remains unread. I think
too much is being read into the term "conflicting input character".

By the way, this does conflict with a response given to David Hough
during the third public review, as well as the example. Therefore,
it seems appropriate for submission to X3 as an interpretation issue
requiring clarification. Perhaps our response was incorrect (not the
first time that happened), or perhaps my reading of the Standard is.
Anyhow, they're inconsistent.

>The examples on page 139 of the Dec88 Draft clearly demonstrate that a format
>beginning with "%f" matches zero items when presented with "100ergs", because
>`"100e" fails to match "%f"'. Moreover, the Rationale says that
>"One-character pushback is sufficient for the implementation of |fscanf|.
>Given the invalid field `-.x', the characters `-.' are not pushed back" and
>"if a `flawed field' is detected, no value is stored for the corresponding
>argument".
>I think the intent of the Committee is clear.

Unfortunately for this argument, one of the unwritten guiding principles
of the fprintf and fscanf specs was that they should closely follow what
the AT&T UNIX System V implementation actually DID, and I just tested
that and found that %f does match the "100" part of "100ergs", contrary
to the example but in agreement with my interpretation of the Standard.
So it's not so clear that we got this right. (Note that the example
contained a late editing change, which makes it suspect in my opinion.)
sthomas@acorn.co.uk (Steve Thomas) (01/10/90)
In article <21675@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>Anyway, here are the (apparent) answers:
[ list of answers ]

I've been following this discussion with some interest, since I'm trying to
implement scanf() myself. This list of answers is in line with my latest
reading of the standard, though I confess that I went astray a few times.

There is an example which is presumably supposed to clarify matters (although
it's not, of course, officially part of the standard). In the Dec. 88 draft,
page 139 line 24 says

    count = 0; /* "100e" fails to match "%f" */

when trying to match "%f%20s of %20s" with the input `100ergs of energy'.

The example is clearly wrong, since page 151 line 36ff speaks of the `longest
initial subsequence ... of the correct form'.

Caveat lector, I suppose.

Steve Thomas
gwyn@smoke.BRL.MIL (Doug Gwyn) (01/10/90)
In article <15592@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>As pointed out in my previous article, scanf() and ungetc() together require
>two lookahead/pushback slots. Personally, I'm surprised that the Committee
>did this. ...

This whole area was the subject of long, emotionally charged debates.
The resulting specification was the only compromise we could come up
with that wouldn't cause someone or another to vote against sending the
draft proposed standard out for the (first) public review. At the time,
we thought that unanimity for that vote was highly desirable (or even
necessary from X3's point of view, which we later found was not true).
However, seeing that it in effect gave everyone veto power, that was
probably not a wise policy, and we dropped it for later activity. (In
fact, I voted against sending out the draft resulting from one round of
review.)

During later processing of public comments, I don't think many committee
members wanted to revisit the issue, since we had all grudgingly accepted
the compromise specification.
karl@haddock.ima.isc.com (Karl Heuer) (01/12/90)
In article <11907@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>Definitely some additional "slop space" must be provided, somehow.
>[Either two or four bytes, depending on which page of the pANS is correct.]

Hmm, is it really bytes? Or (multibyte) characters?

Suppose MB_LEN_MAX == 2 and that '@' is a two-byte character whose bytes are
{0x84, 0x30}. If I call scanf("@") and the input stream contains 0x84 0x31,
does it push back one byte or two? Would it make any difference if I wrote it
as scanf("\x84\x30")?

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint
karl@haddock.ima.isc.com (Karl Heuer) (01/12/90)
In article <15618@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>Suppose MB_LEN_MAX == 2 and that '@' is a two-byte character whose bytes are
>{0x84, 0x30}. If I call scanf("@") and the input stream contains 0x84 0x31,
>does it push back one byte or two?

To answer my own question: one, apparently. Each ordinary multibyte character
is a single directive, which causes characters (bytes) to be read from the
stream, and in case of mismatch the differing and subsequent characters remain
unread. If "character" had been intended to mean "multibyte character" here,
they would have said so, and it would have been singular instead of plural.

So it seems that the forces in favor of minimal pushback felt so strongly
about it that they were even willing to leave the input stream in an unknown
shift state!

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint
meissner@osf.org (Michael Meissner) (01/12/90)
In article <15620@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
| So it seems that the forces in favor of minimal pushback felt so strongly
| about it that they were even willing to leave the input stream in an unknown
| shift state!

The ungetc wars primarily came before multibyte chars were added to
the language, so it may have been overlooked, instead of agreeing to
leave the input stream that way.
--
Michael Meissner email: meissner@osf.org phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA
Catproof is an oxymoron, Childproof is nearly so
gwyn@smoke.BRL.MIL (Doug Gwyn) (01/13/90)
In article <15618@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>In article <11907@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>>Definitely some additional "slop space" must be provided, somehow.
>>[Either two or four bytes, depending on which page of the pANS is correct.]
>Hmm, is it really bytes? Or (multibyte) characters?

Bytes. While the format is treated as multibyte, the input is not,
and the distinction is relevant only for matching "ordinary" (literal)
multibytes in the format. Input ceases as soon as the first non-matching
character is input, and only the DIFFERING and subsequent characters
remain unread. Matching bytes within the multibyte span simply get lost.
(No, I don't like that behavior.)

>Suppose MB_LEN_MAX == 2 and that '@' is a two-byte character whose bytes are
>{0x84, 0x30}. If I call scanf("@") and the input stream contains 0x84 0x31,
>does it push back one byte or two? Would it make any difference if I wrote it
>as scanf("\x84\x30")?

The 0x84 input byte is consumed and just 0x31 gets pushed back in both cases.
gwyn@smoke.BRL.MIL (Doug Gwyn) (01/13/90)
In article <15620@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>So it seems that the forces in favor of minimal pushback felt so strongly
>about it that they were even willing to leave the input stream in an unknown
>shift state!

Or perhaps failed to consider the ramifications.

This is more evidence in support of the position I espoused (that was
not adopted), that (even in a so-called "multibyte" environment)
characters should be handled everywhere as unanalyzable units, so that
a text stream getc() would return one of them. Think how much simpler
the spec would be (and how much better) if there was no need to be
concerned about "shift states" within applications.