[comp.std.c] ANSI draft interpretation questions

chris@mimsy.umd.edu (Chris Torek) (01/07/90)

In article <11879@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>The paragraph around line 40 on page 136 of the December draft makes it
>clear that the result of the conversion for the n specifier is subject
>to assignment suppression by *.  (Yes, there IS a conversion, just no
>input operation.)

Well, I would argue that it makes it the most reasonable interpretation,
but definitely *not* `clear'.

Anyway, here are the (apparent) answers:

%*n	suppresses assignment; no action occurs: it is a no-op.
	(%n is a conversion, but is not an assignment; yet it can
	be suppressed with assignment suppression.)

%[efg]	reads a floating point number.  If the input has one of the forms

		<opt-sign><nondigit>
		<opt-sign>.<nondigit>

	no input is consumed.  If the input has one of the forms

		<opt-sign><digit-seq><exp><opt-sign><nondigit>
		<opt-sign><digit-seq>.<exp><opt-sign><nondigit>
		<opt-sign><digit-seq>.<digit-seq><exp><opt-sign><nondigit>
		<opt-sign>.<digit-seq><exp><opt-sign><nondigit>

	the <exp>, the second <opt-sign>, and the <nondigit> remain
	unconsumed.

	The definitions of <opt-sign>, <digit-seq>, <exp>, and <nondigit>
	are the obvious ones.  (Note that EOF counts as a nondigit.)

%[dioux] reads an integer.  If the input has one of the forms

		<sign><nondigit>

	no input is consumed.  If the input has the form

		<opt-sign>0x
		<opt-sign>0X

	and the conversion is either `i' or `x', the sign (if any) and
	the zero are consumed; the `x' or `X' remains unconsumed.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

gwyn@smoke.BRL.MIL (Doug Gwyn) (01/07/90)

In article <21675@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
-%*n	suppresses assignment; no action occurs: it is a no-op.

Right.

-	(%n is a conversion, but is not an assignment; ...

No, %n involves both a conversion and an assignment, but no input action.
The assignment is suppressed by *, in which case there is really no need
for the implementation to perform the conversion since it merely wastes
CPU cycles.  (The same applies to other conversions when the assignment is
suppressed, although most implementations will probably combine lexing
with conversion and thus not be conveniently able to skip the conversion.)

-%[efg]	reads a floating point number.  If the input has one of the forms
-		<opt-sign><nondigit>
-		<opt-sign>.<nondigit>
-	no input is consumed.

Right.  Similarly for any other ill-formed string.

-			       If the input has one of the forms
-		<opt-sign><digit-seq><exp><opt-sign><nondigit>
-		<opt-sign><digit-seq>.<exp><opt-sign><nondigit>
-		<opt-sign><digit-seq>.<digit-seq><exp><opt-sign><nondigit>
-		<opt-sign>.<digit-seq><exp><opt-sign><nondigit>
-	the <exp>, the second <opt-sign>, and the <nondigit> remain
-	unconsumed.

Right.

-%[dioux] reads an integer.  If the input has one of the forms
-		<sign><nondigit>
-	no input is consumed.

Right.

-			       If the input has the form
-		<opt-sign>0x
-		<opt-sign>0X
-	and the conversion is either `i' or `x', the sign (if any) and
-	the zero are consumed; the `x' or `X' remains unconsumed.

Right (assuming that there is no hex digit immediately following the x).

DISCLAIMER:  This is my personal interpretation of the Standard;
for an official interpretation you must send a request to X3J11 via X3.

chris@mimsy.umd.edu (Chris Torek) (01/08/90)

[me: 	(%n is a conversion, but is not an assignment; ...]

In article <11897@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>No, %n involves both a conversion and an assignment, but no input action.

Except that it is not counted in the return value.  (Neither are
suppressed assignments, but those are `suppressed assignments', not
`assignments'.)  %n, then, is a conversion and an assignment, but cannot
be called an assignment because it is not counted as an assignment.
This is the sort of thing that causes confusion as to whether `%*n'
suppresses the (already not counted as an assignment) assignment.

>>	If the input has the form
>>		<opt-sign>0x
>>		<opt-sign>0X
>>	and the conversion is either `i' or `x', the sign (if any) and
>>	the zero are consumed; the `x' or `X' remains unconsumed.

>Right (assuming that there is no hex digit immediately following the x).

Oops, I meant to say `<opt-sign>0<hex-indicator><non-hex-digit>'.

Anyway, if the implementation of *scanf() uses lookahead to handle
scanning, it needs at least three bytes of lookahead.  If it uses
pushback (which is legal but not required), the implementation must
provide at least its own three bytes of pushback plus one more.
Mine uses a combination of lookahead and pushback: it looks at the
first remaining character in the buffer, and consumes it if it
appears to be valid.  If it later discovers that it was not, there
might be one character of lookahead around, and two characters
consumed that need to be pushed back; in this case, both are pushed
back, and the implementation further guarantees at least one more
pushback.
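
To see where the three bytes come from, here is a minimal sketch of the
"longest valid prefix" reading assumed at this point in the thread.  It
works on an in-memory string, so giving characters back is just index
arithmetic.  The name float_prefix_len and its backup parameter are
invented for this illustration and belong to no real library.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical helper: return how many leading characters of s a
 * %e/%f/%g conversion would consume if it converts the longest valid
 * number and leaves everything after it unread.  *backup receives the
 * number of characters that had to be examined and then given back
 * (the terminating character counts, unless it is the end of string).
 */
size_t float_prefix_len(const char *s, size_t *backup)
{
    size_t i = 0;    /* read position */
    size_t ok = 0;   /* end of the longest valid number seen so far */
    int digits = 0;  /* has the mantissa contained a digit yet? */

    assert(s != NULL && backup != NULL);

    if (s[i] == '+' || s[i] == '-')
        i++;
    while (isdigit((unsigned char)s[i])) {
        i++;
        digits = 1;
        ok = i;
    }
    if (s[i] == '.') {
        i++;
        if (digits)
            ok = i;                 /* "12." is already a complete number */
        while (isdigit((unsigned char)s[i])) {
            i++;
            digits = 1;
            ok = i;
        }
    }
    if (digits && (s[i] == 'e' || s[i] == 'E')) {
        i++;                        /* tentative: exponent may be malformed */
        if (s[i] == '+' || s[i] == '-')
            i++;
        while (isdigit((unsigned char)s[i])) {
            i++;
            ok = i;                 /* exponent has a digit: it is valid */
        }
    }
    *backup = (i - ok) + (s[i] != '\0');
    return ok;
}
```

For "1.2e-x" this returns 3 with a backup of 3 (the `e', the `-' and
the `x'), which is the worst case for a decimal floating-point field.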

Incidentally, it is not clear to me whether the standard requires
the following to work.  (The important line is marked with -> on the
left.)

	#include <stdio.h>
	#include "h_defs.h" /* for H_VALUE values */

	/*
	 * Assume `stream' is open to a read stream on which
	 * the next few input characters are either
	 * `h<optional space><integer>' or perhaps `hello'.
	 * If the format is `h<integer>', stuff the value into 
	 * the given h_value pointer and return 1.  Leave *h_value
	 * unchanged otherwise.
	 *
	 * If there is an h, but it is not followed by a space or a digit,
	 * leave the h and what follows it unconsumed.
	 */
	int find_h_value(FILE *stream, int *h_value) {
		int c, v, n, r;

		c = getc(stream);
		if (c != 'h') {
			/* nb: ungetc(EOF) fails; this is desired */
			(void) ungetc(c, stream);
			return (HV_NO_H);	/* no `h' */
		}
		if ((r = fscanf(stream, " %n%d", &n, &v)) == EOF) {
			/* must have been an input failure: conk out */
			return (HV_H_WITH_EOF);
		}
		if (r == 1) {
			*h_value = v;
			return (HV_WITHVALUE);	/* got an h value */
		}
		/* r must be 0 */
		if (n == 0) {
			/* there was no white space: put back the `h' */
->			(void) ungetc('h', stream);
			return (HV_UNCHANGED);	/* input stream unchanged */
		}
		/* there were spaces, so we may not be able to put back
		   the `h'; return a code saying `keyword h found, followed
		   by something not an integer' */
		return (HV_H_WITH_UNKNOWN_TEXT);
	}

If n is zero, we know the scanf() did not consume any characters.  We
may therefore be required to allow the `h' to be pushed back.  I am not
sure.  Consider an implementation similar to the old Unix one, however,
in which one fills a buffer whenever a `getc' (or equivalent) is done
on an empty buffer.  Here we might have the following:

	A. buffer is nearly exhausted: it has one `h' left
	B. program does a `getc', which returns 'h': buffer now empty

(at this point, ungetc('h') will work.)

	C. program calls fscanf() which calls __vfscanf(), which starts
	   the ` ' directive, needs to skip spaces, and therefore refills
	   the buffer
	D. __vfscanf() finds an `e' (from `hello', perhaps) and stops
	   skipping spaces
	E. __vfscanf() executes `%n' directive, which stores 0 in n
	F. __vfscanf() tries to execute `%d', finds an `e', and stops
	   with a matching failure (returns 0)

At this point, there is probably no room in the input buffer to push
back the `h'.

Then again, the description for `ungetc' does not indicate that any
`getc' must be done in advance.  It says that one character of pushback
is guaranteed.  Perhaps this is meant to imply that

	FILE *foo = fopen("foo", "r");
	if (foo == NULL) die();
	(void) ungetc('a', foo);

is guaranteed to push back an `a', so that the first getc(foo) returns
'a'.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@cs.umd.edu	Path:	uunet!mimsy!chris

karl@haddock.ima.isc.com (Karl Heuer) (01/09/90)

In article <21690@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>[So apparently scanf() requires at least three bytes of lookahead or pushback
>so that it can recover from "1.2e-x", which assigns 1.2 and leaves "e-x" in
>the input stream.]

No no no.  That may be the morally correct action, but it's not what the
Standard says.  Doug was mistaken on this point.

4.9.6.2 says "If conversion terminates on a conflicting input character, the
offending input character is left unread".  In the example, the "x" is the
first conflicting input character, so only it remains unread; the characters
"1.2e-" are consumed (and the conversion fails, since it does not form a valid
floating-point number).

The examples on page 139 of the Dec88 Draft clearly demonstrate that a format
beginning with "%f" matches zero items when presented with "100ergs", because
`"100e" fails to match "%f"'.  Moreover, the Rationale says that "One-
character pushback is sufficient for the implementation of |fscanf|.  Given
the invalid field `-.x', the characters `-.' are not pushed back" and "if a
`flawed field' is detected, no value is stored for the corresponding
argument".
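
The one-character-pushback reading can be modelled the same way.  The
following is only a sketch over an in-memory string, with an invented
name (scan_float_one_slot); it is not taken from any implementation.
The scanner looks at one character at a time, keeps it if it could
still be part of a number, and stops at the first one that could not.
Nothing already consumed is ever given back.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/*
 * Hypothetical model of a %f conversion with one character of pushback.
 * Returns 1 if the consumed field is a valid number, 0 on a matching
 * failure.  *consumed is the number of characters that are gone from
 * the input either way; s[*consumed] is the conflicting character.
 */
int scan_float_one_slot(const char *s, size_t *consumed)
{
    size_t i = 0;
    int mant_digits = 0;  /* mantissa has at least one digit */
    int valid;

    assert(s != NULL && consumed != NULL);

    if (s[i] == '+' || s[i] == '-')
        i++;
    while (isdigit((unsigned char)s[i])) {
        i++;
        mant_digits = 1;
    }
    if (s[i] == '.') {
        i++;
        while (isdigit((unsigned char)s[i])) {
            i++;
            mant_digits = 1;
        }
    }
    valid = mant_digits;
    if (mant_digits && (s[i] == 'e' || s[i] == 'E')) {
        int exp_digits = 0;

        i++;                         /* the 'e' is consumed for good */
        if (s[i] == '+' || s[i] == '-')
            i++;
        while (isdigit((unsigned char)s[i])) {
            i++;
            exp_digits = 1;
        }
        valid = exp_digits;          /* "100e" with no digits is flawed */
    }
    *consumed = i;
    return valid;
}
```

Under this model "100ergs" fails after consuming "100e", "-.x" fails
after consuming "-.", and "1.2e-x" fails after consuming "1.2e-", in
line with the example and the Rationale text quoted above.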

I think the intent of the Committee is clear.

>Incidentally, it is not clear to me whether the standard requires
>the following to work.  [Code equivalent to the following:]
>	c = getchar();  assert(c == 'h');
>	r = scanf(" %n", &n);  assert(r == 0 && n == 0);
>	ungetc('h', stdin);  /* can this fail? */
>If n is zero, we know the scanf() did not consume any characters.  We
>may therefore be required to allow the `h' to be pushed back.  I am not
>sure.

I believe this is exactly what the Committee intended to legalize with its
insistence that scanf is "not allowed to consume the ungetc slot".  Yes, the
traditional Unix implementation is non-conforming.  It will have to be fixed.

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint

karl@haddock.ima.isc.com (Karl Heuer) (01/09/90)

As pointed out in my previous article, scanf() and ungetc() together require
two lookahead/pushback slots.  Personally, I'm surprised that the Committee
did this.  Providing a single slot is easy, as demonstrated by the traditional
Unix implementation.  It's not immediately obvious how to provide more than
one, and still be able to provide the fast macro version of getc().

The solutions I've come up with can be easily generalized to more than two
slots.  So it appears that, given the Committee's insistence on keeping
scanf() and ungetc() independent (despite common existing practice!), they
could just as well have required scanf() to do full pushback for cases like
"1.2e-x", which would require a total of four slots (three for scanf and one
for ungetc), and not imposed any undue hardship on the implementation.

There is a clause in the Standard that says that `all input takes place as if
characters were read by successive calls to the |fgetc| function'.  Of course
|scanf| can't be written using only |fgetc| as a primitive, and I think it was
probably a mistake to pretend that it can.  My early suggestion was to add the
clause `If an extra character is necessary to recognize the end of input, then
scanf behaves as if it called the ungetc function after reading said
character'.  This would have kept the model simple, clearly documented the
state of the stream following a scanf, and fixed the tricky implementation
problem noted above.  It would also have agreed with existing practice.

Several months later I did find a way to implement |getc|, |ungetc|, and
|scanf| so that they would follow the rules without incurring a substantial
performance penalty.  So my current opinion is that |scanf| should have been
specified to do the morally correct thing, i.e. what Doug thought it was
already specifying.

Of course, it's too late to change it now.

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint

gwyn@smoke.BRL.MIL (Doug Gwyn) (01/09/90)

In article <21690@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>[me: 	(%n is a conversion, but is not an assignment; ...]
>In article <11897@smoke.BRL.MIL> gwyn@smoke.BRL.MIL (Doug Gwyn) writes:
>>No, %n involves both a conversion and an assignment, but no input action.
>Except that it is not counted in the return value.  (Neither are
>suppressed assignments, but those are `suppressed assignments', not
>`assignments'.)  %n, then, is a conversion and an assignment, but cannot
>be called an assignment because it is not counted as an assignment.
>This is the sort of thing that causes confusion as to whether `%*n'
>suppresses the (already not counted as an assignment) assignment.

The Standard says, "... the fscanf function returns the number of input
items assigned ...".  %n does not correspond to an input item.  That %n
is not included in the returned assignment count is also stated
explicitly under the description of the n conversion specifier, just to
make sure that there is no question about this.  (Generally we avoided
redundancy in the specifications, but some was added to reduce apparent
ambiguity as evidenced by comments received during the public review.)
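
A small sketch of the counting rule, using sscanf so that no stream is
involved.  The wrapper name scan_int_with_count is invented here; only
sscanf itself is standard.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical wrapper: scan an integer from s and record, via %n, how
 * many characters had been consumed at that point.  Returns whatever
 * sscanf returned.  *nread is left at -1 if %n was never reached.
 */
int scan_int_with_count(const char *s, int *value, int *nread)
{
    assert(s != NULL && value != NULL && nread != NULL);

    *nread = -1;
    return sscanf(s, "%d%n", value, nread);
}
```

For the input "42 abc" the call returns 1, not 2, although both *value
(42) and *nread (2) have been stored: the store done by %n is real but
is not an assigned input item.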

>Anyway, if the implementation of *scanf() uses lookahead to handle
>scanning, it needs at least three bytes of lookahead.  If it uses
>pushback (which is legal but not required), the implementation must
>provide at least its own three bytes of pushback plus one more.

Yup.  Some of us would have been happy to eliminate the non-string
forms of *scanf() as well as ungetc(), to avoid having stdio deal with
pushback.  Since only a limited amount of pushback is guaranteed, and
then is subject to rigid constraints, it really isn't very useful for
real-world tokenizers, macro processors, etc. which must implement
their own scheme anyway.  However, there was a strong minority that
insisted that ungetc() was important to them, and at the time we thought
it highly desirable to obtain unanimous approval for sending the draft
out for (the first) public review, so the notion of pushback remained
in the draft stdio specs.  Later there was little sentiment for
revisiting this issue.

>Incidentally, it is not clear to me whether the standard requires
>the following to work.  [example program omitted for brevity]

The Standard does guarantee that you can push back one character with
ungetc() ('h' in the example program).  As you have noticed, the old UNIX
implementation of stdio pushback does not conform to the Standard.
Perhaps you can get Dave Prosser or some other AT&T implementor to
explain how they dealt with this in SVR4, which is advertised as Standard
conformant.

>At this point, there is probably no room in the input buffer to push
>back the `h'.

Definitely some additional "slop space" must be provided, somehow.
I've heard it said that two bytes of slop are required if fscanf()
uses getc()/ungetc(), although four seems to me to be necessary
(without thinking very hard about it).

>Then again, the description for `ungetc' does not indicate that any
>`getc' must be done in advance.  It says that one character of pushback
>is guaranteed.  Perhaps this is meant to imply that
>	FILE *foo = fopen("foo", "r");
>	if (foo == NULL) die();
>	(void) ungetc('a', foo);
>is guaranteed to push back an `a', so that the first getc(foo) returns
>'a'.

It is true that ungetc() need not be preceded by getc() (or other stdio
input function).  Also, as shown, it is permissible to push back a
character even at the beginning of the stream (the file position
indicator, or fpi, becomes sort of indeterminate as explained in the
Standard).  A POSIX-conforming implementation must treat text and
binary streams indistinguishably, which means that after reading back
the pushed-back character the fpi must again have the value 0.
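
The case of ungetc() with no preceding read can be tried out with a
sketch like the following.  It assumes tmpfile() is usable, and the
function name is hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical check: write contents to a temporary file, rewind, and
 * push back 'a' before reading anything.  Returns 1 if the first getc
 * yields 'a' and the second yields the first byte of contents, 0 if
 * not, -1 if the temporary file could not be set up.
 */
int ungetc_before_any_read(const char *contents)
{
    FILE *f;
    int first, second, ok;

    assert(contents != NULL && contents[0] != '\0');

    f = tmpfile();
    if (f == NULL)
        return -1;
    if (fputs(contents, f) == EOF) {
        fclose(f);
        return -1;
    }
    rewind(f);                        /* back to position 0, nothing read */

    ok = (ungetc('a', f) == 'a');     /* pushback with no getc before it */
    first = getc(f);
    second = getc(f);
    ok = ok && first == 'a' && second == (unsigned char)contents[0];
    fclose(f);
    return ok;
}
```

The sketch deliberately does not call ftell() between the ungetc() and
the getc(), since that is where the indeterminate fpi would show.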

gwyn@smoke.BRL.MIL (Doug Gwyn) (01/09/90)

In article <15591@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>In article <21690@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>>[So apparently scanf() requires at least three bytes of lookahead or pushback
>>so that it can recover from "1.2e-x", which assigns 1.2 and leaves "e-x" in
>>the input stream.]
>No no no.  That may be the morally correct action, but it's not what the
>Standard says.  Doug was mistaken on this point.
>4.9.6.2 says "If conversion terminates on a conflicting input character, the
>offending input character is left unread".  In the example, the "x" is the
>first conflicting input character, so only it remains unread; the characters
>"1.2e-" are consumed (and the conversion fails, since it does not form a valid
>floating-point number).

Again, I think this is a case of focusing too narrowly on only a portion
of the full Standard.  Two pages earlier, an "input item" is carefully
defined as the LONGEST MATCHING sequence of input characters, etc., and
it is stated that the FIRST CHARACTER after the input item remains unread.
I think too much is being read into the term "conflicting input character".

By the way, this does conflict with a response given to David Hough
during the third public review, as well as the example.  Therefore, it
seems appropriate for submission to X3 as an interpretation issue
requiring clarification.  Perhaps our response was incorrect (not the
first time that happened), or perhaps my reading of the Standard is.
Anyhow, they're inconsistent.

>The examples on page 139 of the Dec88 Draft clearly demonstrate that a format
>beginning with "%f" matches zero items when presented with "100ergs", because
>`"100e" fails to match "%f"'.  Moreover, the Rationale says that "One-
>character pushback is sufficient for the implementation of |fscanf|.  Given
>the invalid field `-.x', the characters `-.' are not pushed back" and "if a
>`flawed field' is detected, no value is stored for the corresponding
>argument".
>I think the intent of the Committee is clear.

Unfortunately for this argument, one of the unwritten guiding principles
of the fprintf and fscanf specs was that they should closely follow what
the AT&T UNIX System V implementation actually DID, and I just tested
that and found that %f does match the "100" part of "100ergs", contrary
to the example but in agreement with my interpretation of the Standard.
So it's not so clear that we got this right.  (Note that the example
contained a late editing change, which makes it suspect in my opinion.)

sthomas@acorn.co.uk (Steve Thomas) (01/10/90)

In article <21675@mimsy.umd.edu> chris@mimsy.umd.edu (Chris Torek) writes:
>Anyway, here are the (apparent) answers:
	[ list of answers ]

I've been following this discussion with some interest, since I'm
trying to implement scanf() myself.  This list of answers is in line
with my latest reading of the standard, though I confess that I went
astray a few times.  There is an example which is presumably supposed
to clarify matters (although it's not, of course, officially part of
the standard).  In the Dec. 88 draft, page 139 line 24 says

	count = 0; /* "100e" fails to match "%f" */

when trying to match "%f%20s of %20s" with the input `100ergs of
energy'.

The example is clearly wrong, since page 151 line 36ff speaks of the
`longest initial subsequence ... of the correct form'.  Caveat lector,
I suppose.

Steve Thomas

gwyn@smoke.BRL.MIL (Doug Gwyn) (01/10/90)

In article <15592@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>As pointed out in my previous article, scanf() and ungetc() together require
>two lookahead/pushback slots.  Personally, I'm surprised that the Committee
>did this.  ...

This whole area was the subject of long, emotionally charged debates.
The resulting specification was the only compromise we could come up
with that wouldn't cause someone or another to vote against sending the
draft proposed standard out for the (first) public review.  At the time,
we thought that unanimity for that vote was highly desirable (or even
necessary from X3's point of view, which we later found was not true).
However, seeing that it in effect gave everyone veto power, that was
probably not a wise policy, and we dropped it for later activity.  (In
fact, I voted against sending out the draft resulting from one round of
review.)  During later processing of public comments, I don't think
many committee members wanted to revisit the issue, since we had all
grudgingly accepted the compromise specification.

karl@haddock.ima.isc.com (Karl Heuer) (01/12/90)

In article <11907@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>Definitely some additional "slop space" must be provided, somehow.
>[Either two or four bytes, depending on which page of the pANS is correct.]

Hmm, is it really bytes?  Or (multibyte) characters?

Suppose MB_LEN_MAX == 2 and that '@' is a two-byte character whose bytes are
{0x84, 0x30}.  If I call scanf("@") and the input stream contains 0x84 0x31,
does it push back one byte or two?  Would it make any difference if I wrote it
as scanf("\x84\x30")?

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint

karl@haddock.ima.isc.com (Karl Heuer) (01/12/90)

In article <15618@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>Suppose MB_LEN_MAX == 2 and that '@' is a two-byte character whose bytes are
>{0x84, 0x30}.  If I call scanf("@") and the input stream contains 0x84 0x31,
>does it push back one byte or two?

To answer my own question: one, apparently.  Each ordinary multibyte character
is a single directive, which causes characters (bytes) to be read from the
stream, and in case of mismatch the differing and subsequent characters remain
unread.  If "character" had been intended to mean "multibyte character" here,
they would have said so, and it would have been singular instead of plural.

So it seems that the forces in favor of minimal pushback felt so strongly
about it that they were even willing to leave the input stream in an unknown
shift state!

Karl W. Z. Heuer (karl@haddock.isc.com or ima!haddock!karl), The Walking Lint

meissner@osf.org (Michael Meissner) (01/12/90)

In article <15620@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl
Heuer) writes:

| So it seems that the forces in favor of minimal pushback felt so strongly
| about it that they were even willing to leave the input stream in an unknown
| shift state!

The ungetc wars primarily came before multibyte chars were added to
the language, so it may have been overlooked, instead of agreeing to
leave the input stream that way.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA

Catproof is an oxymoron, Childproof is nearly so

gwyn@smoke.BRL.MIL (Doug Gwyn) (01/13/90)

In article <15618@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>In article <11907@smoke.BRL.MIL> gwyn@brl.arpa (Doug Gwyn) writes:
>>Definitely some additional "slop space" must be provided, somehow.
>>[Either two or four bytes, depending on which page of the pANS is correct.]
>Hmm, is it really bytes?  Or (multibyte) characters?

Bytes.  While the format is treated as multibyte, the input is not, and
the distinction is relevant only for matching "ordinary" (literal)
multibytes in the format.  Input ceases as soon as the first non-matching
character is input, and only the DIFFERING and subsequent characters
remain unread.  Matching bytes within the multibyte span simply get lost.
(No, I don't like that behavior.)

>Suppose MB_LEN_MAX == 2 and that '@' is a two-byte character whose bytes are
>{0x84, 0x30}.  If I call scanf("@") and the input stream contains 0x84 0x31,
>does it push back one byte or two?  Would it make any difference if I wrote it
>as scanf("\x84\x30")?

The 0x84 input byte is consumed and just 0x31 gets pushed back in both cases.
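
The byte-level behaviour described here can be observed with a sketch
along these lines.  It runs in the default "C" locale, where the bytes
0x84 0x30 are not really one multibyte character, so it only shows the
byte-by-byte matching of literal format text.  The function name is
invented, and tmpfile() is assumed to work.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Hypothetical check: put len bytes of input into a temporary file,
 * match them against the literal format bytes 0x84 0x30, and return
 * the next byte still readable from the stream (EOF if none), or -2
 * if the temporary file could not be set up.
 */
int next_byte_after_literal(const unsigned char *input, size_t len)
{
    FILE *f;
    int r, next;

    assert(input != NULL && len > 0);

    f = tmpfile();
    if (f == NULL)
        return -2;
    if (fwrite(input, 1, len, f) != len) {
        fclose(f);
        return -2;
    }
    rewind(f);

    r = fscanf(f, "\x84\x30");        /* literal bytes only, no conversions */
    next = getc(f);                   /* first byte scanf left unread */
    fclose(f);
    return (r == EOF && next == EOF) ? EOF : next;
}
```

With the input bytes 0x84 0x31 the next byte read is 0x31: the matching
0x84 has been consumed and is not restored.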

gwyn@smoke.BRL.MIL (Doug Gwyn) (01/13/90)

In article <15620@haddock.ima.isc.com> karl@haddock.ima.isc.com (Karl Heuer) writes:
>So it seems that the forces in favor of minimal pushback felt so strongly
>about it that they were even willing to leave the input stream in an unknown
>shift state!

Or perhaps failed to consider the ramifications.

This is more evidence in support of the position I espoused (that was
not adopted), that (even in a so-called "multibyte" environment)
characters should be handled everywhere as unanalyzable units, so
that a text stream getc() would return one of them.  Think how much
simpler the spec would be (and how much better) if there were no need
to be concerned about "shift states" within applications.