[comp.lang.c] Trouble at EOF

EGNILGES@pucc.Princeton.EDU (Ed Nilges) (06/17/91)

According to the description of the standard fgets library function
(reference 1), you are not guaranteed newline at the end of every
line...that is, you'll get one at the end of the LAST line (or is
it the last-1th line, such that the last line is zero length?
enquiring minds want to know) only if it's there in the file.
I guess it's the old IBMer in me, who wants the end of a line to
be the Edge of the World, but this seems a tad bogus, especially
if one is writing a lexical analyser where such issues are
important.  Is there a true line reader in C? One that would
slap on an end of line at the end of the last line if it needed
it?

bhoughto@pima.intel.com (Blair P. Houghton) (06/17/91)

In article <12847@pucc.Princeton.EDU> EGNILGES@pucc.Princeton.EDU writes:
>According to the description of the standard fgets library function
>(reference 1), you are not guaranteed newline at the end of every
>line...that is, you'll get one at the end of the LAST line (or is
>it the last-1th line, such that the last line is zero length?
>enquiring minds want to know) only if it's there in the file.
>I guess it's the old IBMer in me, who wants the end of a line to
>be the Edge of the World, but this seems a tad bogus, especially
>if one is writing a lexical analyser where such issues are
>important.  Is there a true line reader in C? One that would
>slap on an end of line at the end of the last line if it needed
>it?

As confusing as that was, I think I got it.

Yes, fgets at eof may get a line with no newline.

Hence:

    while ( fgets(s, sizeof s, stream) )
	/* not eof */
	process(s);

    /* reached iff eof */
    if ( strlen(s) != 0 ) {
	/* there's something left to process */
	if ( s[strlen(s) - 1] != '\n' )
	    /* it has no newline */
	    strcat(s,"\n");
	process(s);
    }

But notice also that the size of the gotten string is
limited to the number of chars specified in the second
argument to fgets.  (Strictly, one fewer: fgets keeps room for
the terminating null.)  I.e., fgets may also return a line with
no newline when that line is longer than the buffer.  The
balance will then be read on the next call to fgets (possibly
also without a newline).

Why?  Because fgets' responsibility is to fill an array
with bytes, not to alter them.  It should be the
programmer's responsibility to maintain the semantics of
input data.
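
A minimal sketch of the cases described above, assuming the caller
wants to know what a single fgets call actually delivered.  The enum
and the name classify_fgets are hypothetical, made up for
illustration; only fgets, feof and strlen come from the standard
library.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical status codes for this sketch; not part of stdio. */
enum line_status {
    LINE_NONE,               /* nothing read: EOF or read error        */
    LINE_COMPLETE,           /* chunk ends with the file's own newline */
    LINE_PARTIAL,            /* buffer filled before a newline arrived */
    LINE_LAST_UNTERMINATED   /* file ended without a final newline     */
};

/* Perform one fgets call into buf (capacity cap) and classify it. */
enum line_status classify_fgets(FILE *fp, char *buf, int cap)
{
    size_t len;

    if (fgets(buf, cap, fp) == NULL)
        return LINE_NONE;

    len = strlen(buf);
    if (len > 0 && buf[len - 1] == '\n')
        return LINE_COMPLETE;

    /* No newline: either fgets hit end-of-file while reading, or it
       stopped after cap - 1 characters.  If the file ends exactly
       where the buffer fills, feof() is not yet set, so this reports
       LINE_PARTIAL and the following call reports LINE_NONE. */
    if (feof(fp))
        return LINE_LAST_UNTERMINATED;
    return LINE_PARTIAL;
}
```

With a 5-byte buffer, a file holding "ab", "cdefgh" and an
unterminated "xy" would come back as a complete line, a partial
chunk, the rest of that line, and finally the unterminated tail.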

				--Blair
				  "process("what next?")"

kers@hplb.hpl.hp.com (Chris Dollin) (06/17/91)

Ed Nilges says (about fgets and the optional newline):

   be the Edge of the World, but this seems a tad bogus, especially
   if one is writing a lexical analyser where such issues are
   important.  

If I was writing a lexical analyser in C, I certainly would not first read in
the entire line, not even with fgets; I'd read in characters as required. (How
big a buffer should I allocate for fgets? What do I do on line overflow? These
are questions I wish to unask.)

Is the overhead of reading characters with fgetc really so large? (I suppose if
the lexis is suitably bizarre, you may need lots of putback, and being able to
just backbump the line index is easy. I find it a crying shame that stdio
doesn't mandate arbitrary putback. Still, it's not as bad as Lisp - at least C
has an excuse.)
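
For what it's worth, deeper putback can be layered on top of getc.
The sketch below is one possible way to do it, not anything stdio
provides: struct reader, rd_get, rd_unget and PUSHBACK_MAX are all
hypothetical names, and the fixed limit of 16 is arbitrary.  (ISO C
guarantees only one character of pushback from ungetc itself.)

```c
#include <stdio.h>

#define PUSHBACK_MAX 16   /* arbitrary depth chosen for this sketch */

/* A stream plus a small stack of pushed-back characters. */
struct reader {
    FILE *fp;
    int   stack[PUSHBACK_MAX];
    int   depth;          /* number of characters currently pushed back */
};

/* Return the most recently pushed-back character if there is one,
   otherwise the next character from the stream (or EOF). */
int rd_get(struct reader *r)
{
    if (r->depth > 0)
        return r->stack[--r->depth];
    return getc(r->fp);
}

/* Push c back so that the next rd_get returns it.
   Returns c on success, EOF if c is EOF or the stack is full. */
int rd_unget(struct reader *r, int c)
{
    if (c == EOF || r->depth == PUSHBACK_MAX)
        return EOF;
    r->stack[r->depth++] = c;
    return c;
}
```

Characters come back in last-in, first-out order, so a lexer that
has read "xy" and wants to rescan both should push 'y' first and
then 'x'.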
--

Regards, Chris ``GC's should take less than 0.1 second'' Dollin.

datangua@watmath.waterloo.edu (David Tanguay) (06/17/91)

In article <4739@inews.intel.com> bhoughto@pima.intel.com (Blair P. Houghton) writes:
 >    while ( fgets(s, sizeof s, stream) )
 >	/* not eof */
 >	process(s);
 >
 >    /* reached iff eof */
 >    if ( strlen(s) != 0 ) {
 >	/* there's something left to process */
 >	if ( s[strlen(s) - 1] != '\n' )
 >	    /* it has no newline */
 >	    strcat(s,"\n");
 >	process(s);
 >    }

Huh? 4.9.7.2: "If end-of-file is encountered and no characters have been read
into the array, the contents of the array remain unchanged and a null pointer
is returned." fgets does not return NULL for eof when there are characters
read, so (barring I/O errors) the above code will process the last "line"
twice.
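
A sketch of a loop that respects that clause: everything happens
inside the while, because the final unterminated line is returned
by a successful fgets call, not left behind after the NULL.  The
name copy_lines and the buffer size are assumptions for
illustration, and fputs stands in for whatever process(s) would do.

```c
#include <stdio.h>
#include <string.h>

#define LINE_BUF 256   /* hypothetical chunk size for this sketch */

/* Copy in to out one fgets chunk at a time, adding a newline to the
   last line if the file lacks one.  Returns the number of chunks
   written.  Lines longer than LINE_BUF - 1 characters arrive as
   several chunks. */
long copy_lines(FILE *in, FILE *out)
{
    char s[LINE_BUF + 1];   /* one spare byte for an added '\n' */
    long chunks = 0;

    while (fgets(s, LINE_BUF, in) != NULL) {
        size_t len = strlen(s);

        if (len > 0 && s[len - 1] != '\n') {
            /* Peek one character to tell "end of file" apart from
               "line longer than the buffer". */
            int c = getc(in);

            if (c == EOF) {
                s[len] = '\n';        /* fits: len <= LINE_BUF - 1 */
                s[len + 1] = '\0';
            } else {
                ungetc(c, in);        /* one char of pushback is guaranteed */
            }
        }
        fputs(s, out);                /* stands in for process(s) */
        chunks++;
    }
    return chunks;
}
```

Each line is handled exactly once; a read error is treated the same
as end-of-file here, which a real program would probably want to
check with ferror.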
-- 
David Tanguay        datanguay@watmath.waterloo.edu        Thinkage, Ltd.

EGNILGES@pucc.Princeton.EDU (Ed Nilges) (06/17/91)

In article <4739@inews.intel.com>, bhoughto@pima.intel.com (Blair P. Houghton) writes:

>
>As confusing as that was, I think I got it.

The only confusion resulted from the misuse of a line reader in a
lexical analyzer, which is a character-by-character sort of thing.
A minor source of confusion was the omission of the reference.
It was ANSI C: A Lexical Guide, published by the Mark Williams
Company.

I took the advice of Mr. Ken Yap down in Australia at CSIRO, and
this morning completely altered the lexical analyzer to use
getc and ungetc.  It considerably simplified the code.  The use
of a line reader in the first place was the unfortunate byproduct
of having an IBM, unit-record background.  Yes, there may be a
performance penalty on IBM mainframe systems compiling C, in which
case the getc and ungetc can be hand-rolled around a unit record
reader for efficiency.
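
A rough sketch of the getc/ungetc style, assuming a toy token
grammar (runs of letters, digits and underscores, or else single
punctuation characters).  The function next_token and that grammar
are invented for illustration and are not the analyzer discussed in
this thread.

```c
#include <ctype.h>
#include <stdio.h>

/* Read the next token from fp into tok (capacity cap).
   Returns 1 if a token was stored, 0 at end of input. */
int next_token(FILE *fp, char *tok, size_t cap)
{
    size_t n = 0;
    int c;

    if (cap < 2)
        return 0;

    /* Skip whitespace, newlines included. */
    do {
        c = getc(fp);
    } while (c != EOF && isspace(c));
    if (c == EOF)
        return 0;

    if (isalnum(c) || c == '_') {
        do {
            if (n < cap - 1)
                tok[n++] = (char)c;   /* overlong tokens are truncated */
            c = getc(fp);
        } while (c != EOF && (isalnum(c) || c == '_'));
        if (c != EOF)
            ungetc(c, fp);            /* give back the char that ended the token */
    } else {
        tok[n++] = (char)c;           /* single-character token */
    }
    tok[n] = '\0';
    return 1;
}
```

Since the newline is just another whitespace character here,
whether the last line of the file ends with one makes no difference
to the tokens produced.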

No, I don't want to use lex.  I do not like the code it generates
and (once I get rid of these subtle tendencies to think in IBMerese)
I believe I can write more efficient code for the language I am
lexxicating.

Thanks to Mr. Ken Yap and the rest of the gang on comp.lang.c for
your patience.

bhoughto@pima.intel.com (Blair P. Houghton) (06/18/91)

In article <1991Jun17.120927.3802@watmath.waterloo.edu> datangua@watmath.waterloo.edu (David Tanguay) writes:
>In article <4739@inews.intel.com> bhoughto@pima.intel.com (Blair P. Houghton) writes:
> >    while ( fgets(s, sizeof s, stream) )
> >	process(s);
> >    if ( strlen(s) != 0 ) {
> >	/* there's something left to process */
>
>Huh? 4.9.7.2: "If end-of-file is encountered and no characters have been read
[...etc...]

Urp!

Remind me:

	a.  To test code before I post it.
	b.  Not to do it this way...



Spoiler alert.  If there's anything you don't
want to know, don't turn the page...

				--Blair
				  "Uh, er, uh, Bob made me do it.
				   Yeah, that's it.  Everyone thinks
				   that Agent Cooper is the only one
				   he's infested, but he's got me
				   in his clutches now, too.  Yeah.
				   That's the ticket..."

robert@isgtec.UUCP (Robert Osborne) (06/18/91)

In article <4739@inews.intel.com>, bhoughto@pima.intel.com (Blair P. Houghton) writes:
> Yes, fgets at eof may get a line with no newline.
> 
> Hence:
> 
>     while ( fgets(s, sizeof s, stream) )
>         /* not eof */
>         process(s);

Well you really want...
    if ( fgets(s, sizeof s, stream) ) {
        do {
            process(s);
        } while ( fgets(s, sizeof s, stream) );
    }
or something similar.

>     /* reached iff eof */
>     if ( strlen(s) != 0 ) {
>         /* there's something left to process */
>         if ( s[strlen(s) - 1] != '\n' )
>             /* it has no newline */
>             strcat(s,"\n");
>         process(s);
>     }

This has two str??? calls too many...

    /* reached iff eof */
    if ( (s_length = strlen(s)) != 0 ) {
        /* there's something left to process */
        if ( s[s_length - 1] != '\n' ) {
            /* it has no newline */
            s[s_length] = '\n';
            s[s_length + 1] = '\0';
        }
        process(s);
    }

The str??? calls are OFTEN used in this manner and this is a very
common optimization that can be made in string handlers.
I once cut the running time of a key piece of UI functionality from
an intolerable >10 seconds to an almost bearable <5 seconds by performing
this kind of "optimization".
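
A sketch of the same measure-once idea with a bounds check added,
since a buffer that fgets has filled completely has no room left
for an extra character.  The helper name ensure_newline and its cap
parameter are assumptions for illustration.

```c
#include <string.h>

/* Make sure the non-empty string s (in a buffer of cap bytes) ends
   with '\n'.  Returns 1 if s ends with a newline on return, 0 if s
   is empty or there was no room to add one. */
int ensure_newline(char *s, size_t cap)
{
    size_t len = strlen(s);           /* measured once, reused below */

    if (len == 0)
        return 0;                     /* nothing to terminate */
    if (s[len - 1] == '\n')
        return 1;                     /* already terminated */
    if (len + 2 > cap)
        return 0;                     /* no room for '\n' plus '\0' */
    s[len] = '\n';
    s[len + 1] = '\0';
    return 1;
}
```

A caller would pass sizeof s as cap when s is an array, and decide
for itself what to do when the function reports there was no room.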

> But notice also that the size of the gotten string is
> limited to the number of chars specified in the second
> argument to fgets.

And this would be intolerable in a parser.  I'm surprised Blair didn't
mention this (although he did solve the problem asked).

Rob.
-- 
Robert A. Osborne   ...uunet!utai!lsuc!isgtec!robert or robert@isgtec.uucp