[comp.bugs.4bsd] read

karl@haddock.UUCP (Karl Heuer) (06/29/87)

main() {
    char buf[5];
    for (;;) printf("%d\n", read(0, buf, 5));
}

If you type *exactly* 5 characters and terminate the read with EOT (which is
not an EOF in this context, in the middle of a line), the first read returns 5
(as it should) and the second returns 0 (instead of waiting for more input).
Tested on 4.3bsd.

gwyn@brl-smoke.UUCP (06/30/87)

In article <648@haddock.UUCP> karl@haddock.isc.com (Karl Heuer) writes:
>If you type *exactly* 5 characters and terminate the read with EOT (which is
>not an EOF in this context, in the middle of a line), the first read returns 5
>(as it should) and the second returns 0 (instead of waiting for more input).

That is correct behavior.  In cooked mode, the "EOT" character is a delimiter
that is inserted into the stream along with the others.  It is NEVER an "end
of file" character; that is merely a conventional interpretation given to a
delimiter found as the first character of a text line.  Your first read got 5
characters, and the second read encountered the delimiter, which stops input
and returns the number of characters found before the delimiter (0 in this
case).

ron@topaz.rutgers.edu (Ron Natalie) (06/30/87)

Excuse me System V breath, if you look on your own beloved operating system
you will see that EOT works the opposite way that it does on system V, that
is, the following code

    main()  {
	int count;
	char buf[10];

	do {
	    count = read(0, buf, 5);
	    printf("\ncount = %d\n", count);
	} while(count);
    }

Does the following on Berkeley UNIX (SUN 3.2):
    % a.out
    a<NL>

    count = 2
    abcde<EOT>
    count = 5

    count = 0
    %
  note that it doesn't read the keyboard between the last two printfs.

on both a 3B20 running Sys VR2v3 and a 3B2 running Sys VR3

    % a.out
    a<NL>

    count = 2
    abcde<EOT>
    count = 5

...at this point it waits for you to type more input...

I guess System V is wrong for once :-)

-Ron

kre@munnari.oz (Robert Elz) (07/01/87)

In article <13048@topaz.rutgers.edu>, ron@topaz.rutgers.edu (Ron Natalie):
> Does the following on Berkeley UNIX (SUN 3.2):
	...
>   note that it doesn't read the keyboard between the last two printfs.

that's wrong.

> on both a 3B20 running Sys VR2v3 and a 3B2 running Sys VR3
	...
> ...at this point it waits for you to type more input...

that's right.

Sys V is clearly right here, and bsd is wrong, and it should be fixed.
(And for anyone who doesn't know, I'm hardly a Sys V supporter).

kre

rpw3@amdcad.AMD.COM (Rob Warnock) (07/02/87)

My understanding has always been that <EOT> was a "push" which did not
store data in the stream.  By "push" I simply mean "return from the read
with whatever you've got so far."  (Under this interpretation, <LF> usually
means "store an <LF> then 'push'".) The function of <EOF> arises because
if you "push" at the beginning of a line (before data is typed), the "read()"
will return zero.

But if you "push" after N characters have been typed, you get N characters.

Therefore, by the "Principle Of [my own] Least Astonishment":

	abcde<EOT>

should return 5 characters, and the next call to "read()" should block.

In this case, System-V does it *right*!
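
A minimal user-space sketch of the "push" semantics described above; it is a
model, not driver code.  The names push_tty, tty_type, tty_read and
WOULD_BLOCK are invented for illustration, and the model assumes that <EOT>
and <LF> are the only characters that push.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EOT         '\004'  /* ^D */
#define WOULD_BLOCK (-1)    /* model only: a real read() would sleep here */

/* Hypothetical model of a cooked-mode input queue with "push" semantics. */
struct push_tty {
    char   buf[256];   /* typed data; EOT itself is never stored        */
    size_t used;       /* bytes stored so far                           */
    size_t end[32];    /* end[i] = offset just past pushed chunk i      */
    size_t nchunks;    /* number of chunks pushed so far                */
    size_t cur;        /* chunk currently being read                    */
    size_t rd;         /* offset of the next unread byte in buf         */
};

/* Feed keystrokes to the model. <LF> is stored and then pushes;
   <EOT> only pushes. */
static void tty_type(struct push_tty *t, const char *keys)
{
    for (; *keys; keys++) {
        assert(t->used < sizeof t->buf && t->nchunks < 32);
        if (*keys != EOT)
            t->buf[t->used++] = *keys;
        if (*keys == EOT || *keys == '\n')
            t->end[t->nchunks++] = t->used;
    }
}

/* Return up to `want` bytes of the current chunk. An empty chunk yields 0,
   which is what a program sees as end of file. */
static int tty_read(struct push_tty *t, char *dst, size_t want)
{
    size_t avail, n;

    if (t->cur == t->nchunks)
        return WOULD_BLOCK;            /* nothing pushed yet */
    avail = t->end[t->cur] - t->rd;
    n = avail < want ? avail : want;
    memcpy(dst, t->buf + t->rd, n);
    t->rd += n;
    if (t->rd == t->end[t->cur])
        t->cur++;                      /* boundary goes away with the data */
    return (int)n;
}
```

With a zero-initialized struct push_tty, typing "abcde<EOT>" and calling
tty_read with a count of 5 twice gives 5 and then WOULD_BLOCK; a further
<EOT> on the now-empty line makes the next tty_read return 0.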


Rob Warnock
Systems Architecture Consultant

UUCP:	  {amdcad,fortune,sun,attmail}!redwood!rpw3
ATTmail:  !rpw3
DDD:	  (415)572-2607
USPS:	  627 26th Ave, San Mateo, CA  94403

ford@crash.CTS.COM (Michael Ditto) (07/03/87)

In article <13048@topaz.rutgers.edu> ron@topaz.rutgers.edu (Ron Natalie) writes:
>Excuse me System V breath, if you look on your own beloved operating system
>you will see that EOT works the opposite way that it does on system V, that
>is, the following code
>
>    main()  {
>	int count;
>	char buf[10];
>
>	do {
>	    count = read(0, buf, 5);
>	    printf("\ncount = %d\n", count);
>	} while(count);
>    }
>
> [...]
>
>    % a.out
>    a<NL>
>
>    count = 2
>    abcde<EOT>
>    count = 5
>
>...at this point it waits for you to type more input...
>
>I guess System V is wrong for once :-)
>

No, this is the spec for EOF in termio(7) (SysV's equivalent to tty(4)):

	[When EOF is received] all the characters waiting to be read are
	immediately passed to the program, without waiting for a new-line,
	and the EOF is discarded.  Thus, if there are no characters waiting,
	which is to say the EOF occurred at the beginning of a line, zero
	characters will be passed back [...]

This is the way UNIX has always worked, except for Berkeley's versions, and
AT&T still does it this way.

(The above quote from termio(7) is copyright by AT&T, but you can see it on
your SysV system with "man 7 termio".  [Lame attempt at disclaimer]).
-- 

Michael "Ford" Ditto				-=] Ford [=-
P.O. Box 1721					ford@crash.CTS.COM
Bonita, CA 92002				ford%oz@prep.mit.ai.edu

henry@utzoo.UUCP (Henry Spencer) (07/03/87)

> That is correct behavior...

Uh, correct by whose definition, Doug?  The original Unix semantics of
EOT were the "push" semantics (as opposed to the "delimiter" semantics you
describe), in which the EOT forces the existing input queue (possibly
zero-length) to be pushed through to the user, and then disappears utterly.
-- 
Mars must wait -- we have un-         Henry Spencer @ U of Toronto Zoology
finished business on the Moon.     {allegra,ihnp4,decvax,pyramid}!utzoo!henry

ron@topaz.rutgers.edu (Ron Natalie) (07/04/87)

>I guess System V is wrong for once :-)
>

>No, this is the spec for EOF in termio(7) (SysV's equivalent to tty(4)):
>
>	[When EOF is received] all the characters waiting to be read are
>	immediately passed to the program, without waiting for a new-line,
>	and the EOF is discarded.  Thus, if there are no characters waiting,
>	which is to say the EOF occurred at the beginning of a line, zero
>	characters will be passed back [...]

You seem to have missed the fact that I was jeering at Doug Gwyn
(notable System V proponent) for putting forth an opinion that
is exactly contrary to what System V does:

    That is correct behavior.  In cooked mode, the "EOT" character is a
    delimiter that is inserted into the stream along with the others.  It is
    NEVER an "end of file" character; that is merely a conventional
    interpretation given to a delimiter found as the first character of a
    text line.  Your first read got 5 characters, and the second read
    encountered the delimiter, which stops input and returns the number of
    characters found before the delimiter (0 in this case).

He's right except that System V associates the delimiter with the characters
before it, out of band.  Berkeley places the delimiter (still an EOF) in
band, which causes it not to be noticed if the read size exactly matches
the number of characters queued before the delimiter.  In neither case
is the ^D merely discarded; that would imply that

	sleep(10); read(0, buf, 10);
with
	a<EOT>b<EOT>c<NL>

typed during the sleep would return "abc\n".

The belief that the EOF should not be treated as in BSD is confirmed
by the statement earlier in the termio manual page that states that
the read size may be smaller than the number of characters in the queue,
even a single character, without loss of information.  Thus, this implies
that the loop:

    while(1)  {
	i = read(0, buf, N);
	if(i == 0) break;
	write(1, buf, i);
    }

will work regardless of the size of N, which is not true on Berkeley
as setting N to 1 will cause any EOT terminated lines to return
apparent EOF indications.
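
A rough user-space model of the in-band behaviour just described, with the
copy loop run against it.  This is a sketch under assumptions, not the 4BSD
driver: bsd_canq, canq_put, canq_read, copy_loop and WOULD_BLOCK are made-up
names, only complete lines are ever queued, and a literal (quoted) ^D is not
modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define EOT         '\004'  /* ^D */
#define WOULD_BLOCK (-1)    /* model only: a real read() would sleep here */

/* Hypothetical canonical queue in which EOT is kept as an in-band byte. */
struct bsd_canq {
    char   q[256];
    size_t head, tail;      /* unread bytes are q[head..tail) */
};

/* Queue one complete line, terminator ('\n' or EOT) included. */
static void canq_put(struct bsd_canq *c, const char *line)
{
    size_t n = strlen(line);

    assert(c->tail + n <= sizeof c->q);
    memcpy(c->q + c->tail, line, n);
    c->tail += n;
}

/* Copy bytes until the count is met, a newline is copied, or an EOT is
   found. The EOT is stripped only if the loop actually reaches it. */
static int canq_read(struct bsd_canq *c, char *dst, size_t want)
{
    size_t n = 0;

    if (c->head == c->tail)
        return WOULD_BLOCK;
    while (n < want && c->head < c->tail) {
        char ch = c->q[c->head++];
        if (ch == EOT)
            break;                 /* delimiter: consume it and stop */
        dst[n++] = ch;
        if (ch == '\n')
            break;
    }
    return (int)n;                 /* count met exactly: a following EOT stays queued */
}

/* The copy loop from the text: returns the bytes copied before read gave
   0 or would block, and stores that final status in *last. */
static size_t copy_loop(struct bsd_canq *c, size_t n_req, int *last)
{
    char   buf[64];
    size_t total = 0;
    int    i;

    assert(n_req <= sizeof buf);
    while ((i = canq_read(c, buf, n_req)) > 0)
        total += (size_t)i;
    *last = i;
    return total;
}
```

With "ab<EOT>cd<NL>" queued, copy_loop with a request size of 1 stops after 2
bytes on a zero return, while a request size of 10 copies all 5 bytes and
then finds the model queue empty.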

To make BSD work like Sys V you can kludge it by changing tty.c routine
ttread (around line 2191 in mine) where it says

	if(u.u_resid == 0)
		break;

to say something like

    if(u.u_resid == 0)  {
        if(				    /* IF there                   */
	    p->c_cc > 0  &&		    /* are more characters AND    */
	    (*p->c_cf & 0x377) == eof &&    /* ..the next is EOF AND      */
	    (t_flags & CBREAK) == 0   &&    /* ..we're in cooked mode AND */
	    (ttbreakc(c, tp) == 0)	    /* .. last char wasn't break  */
        ) getc(tp);	    /* Throw away EOF that goes with this data. */
        break;
    }

I don't feel like remaking the kernel now, so I can't tell you if
it works.

-Ron

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/05/87)

In article <17345@amdcad.AMD.COM> rpw3@amdcad.UUCP (Rob Warnock) writes:
>My understanding has always been that <EOT> was a "push" which did not
>store data in the stream.

At one time, a special "delimiter" marker was inserted into the stream
at that point.  Apparently, some UNIXy implementations do it one way
and some another.  I seem to recall that SVR3.0 STREAMS was missing the
M_DELIM message type, so whenever AT&T finally gets the whole character
I/O system converted to STREAMS, they couldn't insert a delimiter if
they wanted to (according to Ron, that would be consistent with current
UNIX System V behavior).

Alas, another difference among UNIX variants.  What does POSIX have to
say about this?

guy%gorodish@Sun.COM (Guy Harris) (07/05/87)

> At one time, a special "delimiter" marker was inserted into the stream
> at that point.  Apparently, some UNIXy implementations do it one way
> and some another.

Non-STREAMS tty drivers generally have a "raw" queue and a
"canonical" queue.  Reads in "cooked" mode take place from the
"canonical" queue.

In the AT&T drivers, of various flavors (V7, S3, S5), characters
accumulate in the "raw" queue until a "read" is done.  If the
terminal is in cooked mode when the "read" is done, the "read" blocks
until a line terminator (newline, EOF, or "secondary end-of-line"
character) is received.  At that point, one and only one line is
canonicalized (erase/kill processing is done) and is moved to the
"canonical" queue.  If the "line" is terminated by an EOF rather than
an end-of-line character, the EOF does NOT appear in the canonical
queue.  Thus, the top-level reading code won't see delimiters.

The 4BSD driver(s) move data from the "raw" queue to the "canonical"
queue as soon as a line terminator is received.  "Canonicalization"
is done on the fly; for example, as soon as an "erase" character is
received, the character it erases is removed from the "raw" queue.
(This makes it easier to implement more correct handling of the
"erase" character - it's easier for the driver to know what character
is being erased, so it can do a better job of erasing it from the
screen - and also makes it easier to handle a "reprint" character
that causes the current queued-up input to be re-echoed.  It also
means that erase, kill, etc. characters do NOT count against the
256-character limit of uncanonicalized characters, but subtract from
that count.)  If the line ended with EOF, the EOF is left in the
canonical queue as a delimiter.  It is stripped out when the "read" is
done; however, if there are five characters in the queue, and the
"read" asks for five bytes, only those five characters are looked at.
If an EOF follows them, it is left in the queue and seen by the next
"read".

> I seem to recall that SVR3.0 STREAMS was missing the M_DELIM message type,
> so whenever AT&T finally gets the whole character I/O system converted to
> STREAMS, they couldn't insert a delimiter if they wanted to (according to
> Ron, that would be consistent with current UNIX System V behavior).

This is true.  STREAMS messages somewhat resemble "mbuf" chains;
delimiters are implicit in the structure of these chains (when you
get to the end of one, you're at the end of a message).  A line would
be a single STREAMS message; the EOF would be discarded ASAP, since
it is not needed as a delimiter.  As such, any driver based on the
S5R3 STREAMS code will give the "push", rather than the "delimiter"
behavior (regardless of whether it implements "canonicalize at read
time" or "canonicalize at input time" behavior).

The "streams" code described in Dennis Ritchie's paper in the BSTJ (I
have no idea if that implementation is called STREAMS or just
"streams") has a "delimiter" message type.  I don't know what sort of
behavior the various V8 "streams"-based (as opposed to S5R3
STREAMS-based) tty drivers provide; Dennis' paper described two
drivers, one giving the 4.1BSD "old" line discipline behavior (which
may resemble V7 behavior) and one giving the 4.1BSD "new" line
discipline behavior (which probably resembles other 4BSD systems).

I agree with most of the people here; the non-4BSD behavior is
correct.  When I type ^D, it doesn't mean that I'm putting a ^D into
the input queue, it means I'm terminating a record.

> Alas, another difference among UNIX variants.  What does POSIX have to
> say about this?

From the draft of Draft 10 (*sic*) we have here:

7.1.1.11 Special Characters

	...

	EOF	...When received, all the characters waiting to be
		read are immediately passed to the program, without
		waiting for a new-line, and the EOF is discarded.
		Thus, if there are no characters waiting (that is,
		the EOF occurred at the beginning of a line), zero
		characters shall be passed back, representing an
		end-of-file indication.
	Guy Harris
	{ihnp4, decvax, seismo, decwrl, ...}!sun!guy
	guy@sun.com

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/05/87)

In article <1325@crash.CTS.COM> ford@crash.CTS.COM (Michael Ditto) writes:
>	[When EOF is received] all the characters waiting to be read are
>	immediately passed to the program, without waiting for a new-line,
>	and the EOF is discarded.  ...

This is of course nonsense, because the characters are NOT necessarily
"passed to the program".  (What program?  Terminal I/O proceeds
asynchronously, and there is no telling in advance which process will
ultimately read the terminal input.)  Typical UNIXy terminal handlers
have a "canonical" input queue and a "raw" queue; in "cooked mode"
(ICANON on), characters are passed from the canonical queue to the raw
queue by a canonicalization gnome that "knows" that a newline or an EOF
(also an EOL in System V) delimits a chunk of input (so that the chunk
is immune to a subsequent char-erase or line-kill).  In order to keep
track of chunk ("line") boundaries in the absence of a newline, it is
traditional to store a special "delimiter" marker in the input queue.

There is an earlier section of the TERMIO spec that mentions line
delimiters.  The above quotation from the manual (same as in the SVID)
is incomplete (as well as erroneous), in that it does not specify the
boundary behavior of delimiters (i.e., the phenomena Ron reported on).

>This is the way UNIX has always worked, except for Berkeley's versions, ...

I dispute that.  It MAY be the way that the USG 3.0 and derivative
terminal handler (the "termio" one)  has always worked.

gwyn@brl-smoke.ARPA (Doug Gwyn ) (07/05/87)

In article <6055@brl-smoke.ARPA> gwyn@brl.arpa (Doug Gwyn (VLD/VMB) <gwyn>) writes:
>..., characters are passed from the canonical queue to the raw queue ...

Oops, I got the queue names backwards (it's been a long time since I've had
to work on the terminal handler).  Guy Harris's explanation, which I hadn't
seen when I posted my previous note, looks accurate to me.

>The above quotation from the manual (same as in the SVID) is incomplete
>(as well as erroneous), ...

Ahem, this also applies to Draft 10 of IEEE 1003.1.  "Passed to the program"
indeed.

henry@utzoo.UUCP (Henry Spencer) (07/08/87)

> The "streams" code described in Dennis Ritchie's paper in the BSTJ (I
> have no idea if that implementation is called STREAMS or just
> "streams") has a "delimiter" message type.

(Probably just "streams" -- I've never seen Dennis capitalize it that I
recall.)

> I don't know what sort of
> behavior the various V8 "streams"-based (as opposed to S5R3
> STREAMS-based) tty drivers provide...

The tty drivers just put a delimiter message on after they pass a line
through.  However, there is some subtlety in the behavior of a stream
read when the count is exactly satisfied that causes a trailing delimiter
to be swallowed.  So the "push" behavior is what is provided.  Unless I'm
much mistaken, this applies to both tty drivers.
-- 
Mars must wait -- we have un-         Henry Spencer @ U of Toronto Zoology
finished business on the Moon.     {allegra,ihnp4,decvax,pyramid}!utzoo!henry

thorinn@diku.UUCP (Lars Henrik Mathiesen) (07/09/87)

In article <648@haddock.UUCP> karl@haddock.UUCP (Karl Heuer) writes:
>main() {
>    char buf[5];
>    for (;;) printf("%d\n", read(0, buf, 5));
>}

>If you type *exactly* 5 characters and terminate the read with EOT (which is
>not an EOF in this context, in the middle of a line), the first read returns 5
>(as it should) and the second returns 0 (instead of waiting for more input).
>Tested on 4.3bsd.

I agree that this seems wrong, but look at it this way: If you had tried to
read, say, 6 characters, you would still have got only 5; you could therefore
conclude that the user had typed an EOF. According to the 4.3 tty(4) manual:

   It is not, however, necessary to read a whole line at once; any number of
   characters may be requested in a read, even one, without losing information.

But if the next read (after the read( , , 5)) returned some further input,
you would never know that the EOF was there, thus information is lost. This
seems to be the way AT&T systems behave.
  If we can agree that the user-interface definition of an EOF indication is
something like "An EOF immediately following a newline or another EOF", AND
if we want this to be the only way to provoke a return of zero characters
from read, the AT&T behaviour is best.
  But if we want to be able to detect arbitrary EOFs even when it is not
practical to provide a buffer large enough for any input, the BSD behaviour
is necessary. Regrettably you have to use code like this:

	/*
	 * new_canon is a boolean variable that is true if we've just read
	 * past a "canonicalization point". Assume that there's no t_brkc.
	 */
	int new_canon = 1;

	...
    nextline:
	do {
		if ((n = read(0, buf, BUFSIZ)) < 0)
			/* ERROR */
			exit(1);
		if (n == 0 && new_canon)
			/* EOF */
			exit(0);
		newline = n && buf[n - 1] == '\n'; 
		new_canon = newline || n < BUFSIZ;
		/* PROCESS buf */
	} while (new_canon == 0);
	if (!newline)
		/* Input was terminated by EOF */
		putchar('\n');
	...
	goto nextline;
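
The same bookkeeping can be pulled out into a helper that can be exercised
without a terminal.  This is only a sketch of the approach above under the
same assumptions (BSD delimiter behaviour, no t_brkc); read_kind,
canon_tracker and classify_read are names invented here.

```c
#include <assert.h>

enum read_kind {
    RK_ERROR,     /* read failed                                     */
    RK_EOF,       /* zero bytes at a canonicalization point: real EOF */
    RK_MORE,      /* buffer filled; the rest of the chunk is queued   */
    RK_LINE_END,  /* chunk ended with a newline                       */
    RK_EOT_END    /* chunk ended with an EOT in mid-line              */
};

/* new_canon is true if the previous read ended at a canonicalization
   point. Initialize it to 1 before the first read. */
struct canon_tracker { int new_canon; };

/* Classify the result n of read(fd, buf, bufsiz) and update the tracker. */
static enum read_kind classify_read(struct canon_tracker *t,
                                    const char *buf, int n, int bufsiz)
{
    int newline;

    assert(bufsiz > 0 && n <= bufsiz);
    if (n < 0)
        return RK_ERROR;
    if (n == 0 && t->new_canon)
        return RK_EOF;
    newline = n > 0 && buf[n - 1] == '\n';
    t->new_canon = newline || n < bufsiz;
    if (!t->new_canon)
        return RK_MORE;
    return newline ? RK_LINE_END : RK_EOT_END;
}
```

For "abcde<EOT>" read with a 5-byte buffer, the first call reports RK_MORE
and the following zero-length read reports RK_EOT_END rather than RK_EOF; a
second zero-length read in a row reports RK_EOF.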
--
Lars Mathiesen, DIKU, U of Copenhagen, Denmark		..mcvax!diku!thorinn
Institute of Datalogy -- we're scientists, not engineers.

laman@ncr-sd.UUCP (07/09/87)

In article <13145@topaz.rutgers.edu> ron@topaz.rutgers.edu (Ron Natalie) writes:
	:
	:
	:
 >
 >To make BSD work like Sys V you can kludge it by changing tty.c routine
 >ttread (around line 2191 in mine) where it says
 >
 >	if(u.u_resid == 0)
 >		break;
 >
 >to say something like
 >
 >    if(u.u_resid == 0)  {
 >        if(				    /* IF there                   */
 >	    p->c_cc > 0  &&		    /* are more characters AND    */
 >	    (*p->c_cf & 0x377) == eof &&    /* ..the next is EOF AND      */
			 ^
	Get rid of the 'x' so you get an octal constant
 >	    (t_flags & CBREAK) == 0   &&    /* ..we're in cooked mode AND */
 >	    (ttbreakc(c, tp) == 0)	    /* .. last char wasn't break  */
 >        ) getc(tp);	    /* Throw away EOF that goes with this data. */
 >        break;
 >    }
 >
 >I don't feel like remaking the kernel now, so I can't tell you if
 >it works.
 >
 >-Ron

Just thought I'd point this out in case someone did want to try this.

Not having access to a BSD kernel, I can't comment on the rest of the code.
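
To show why the 'x' matters, here is a small self-checking sketch of the two
masks.  It assumes the intent was to keep the low 8 bits of the queued
character and that eof is ^D (value 4); low8, posted_mask and check_masks are
names made up for the example.

```c
#include <assert.h>

/* Intended mask: octal 0377 == 0xFF == 255, the low 8 bits. */
static int low8(int c)        { return c & 0377; }

/* Mask as posted: hex 0x377 == 887 == binary 11 0111 0111, which keeps
   bits 8 and 9 and drops bits 3 and 7 of the low byte. */
static int posted_mask(int c) { return c & 0x377; }

/* Aborts if any of the arithmetic claimed above is wrong. */
static void check_masks(void)
{
    assert(0377 == 0xFF);
    assert(0x377 == 887);
    assert(low8(0x04) == 0x04);         /* ^D survives either mask        */
    assert(posted_mask(0x04) == 0x04);
    assert(low8(0x0C) == 0x0C);         /* ^L stays ^L with the octal mask */
    assert(posted_mask(0x0C) == 0x04);  /* but looks like ^D with the hex one */
}
```

So with the hex constant a queued form feed would compare equal to ^D, which
is presumably not what was wanted.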

		Mike Laman
		UUCP: {ihnp4,sdcsvax,noscvax,...}!ncr-sd!laman

karl@haddock.UUCP (07/15/87)

In article <3320@diku.UUCP> thorinn@diku.UUCP (Lars Henrik Mathiesen) writes:
>According to the 4.3 tty(4) manual:  "... any number of characters may be
>requested in a read ... without losing information."  But if the next read
>(after the read( , , 5)) returned some further input, you would never know
>that the EOF was there, thus information is lost.

True.  (I don't think that's what they meant by "information", though.)

>But if we want to be able to detect arbitrary EOFs even when it is not
>practical to provide a buffer large enough for any input, the BSD behaviour
>is necessary.  Regrettably you have to use code like this: [complicated code
>that uses an extra variable and observes newlines and full buffers].

But the fact is that existing code -- e.g. the guts of getchar() -- does not
do anything of the sort, and therefore will behave as if a real end-of-file
were signalled.

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint