[comp.unix.wizards] Is Unix stdio slow?

chris@mimsy.UUCP (Chris Torek) (09/25/88)

[moved from comp.os.vms]
In article <3951@psuvax1.cs.psu.edu> schwartz@shire (Scott Schwartz) writes:
>I've seen unix programs (things like a grep replacement) that got
>similar speedups by replacing stdio calls with read/write and a large
>buffer.  I wonder how much of that 2-6x is from overhead in stdio,
>rather than in the filesystem.  

Quite a bit.  The overhead is not all in stdio itself, but in inefficient
expansions of stdio macros:

>	while ((c = getchar()) != EOF)
>		putchar(c);

If you are not going to inspect each character, it seems a bit silly
to bring them in and put them out one at a time, but let us take a
look at what the traditional dumb Unix C compiler generates for this:

		jmp	while_test
	loop:
		<<code for putchar>>
	while_test:	| while ((c = getchar()) != EOF)
		sub	$1,stdin_cnt	| --cnt >= 0 ?
		jls	refill
		mov	stdin_ptr,r1	|	*(unsigned char *)p++
		clr	r0
		movb	@r1+,r0
		mov	r1,stdin_ptr
		jmp	merge
	refill:	push	stdin		| : _filbuf(&stdin)
		call	_filbuf
		tst	@sp+		| (some sort of pop)
	merge:	mov	r0,C(fp)	| c = ...
		cmp	r0,$-1		| c == EOF?
		jne	loop		| no, loop
		<<code to exit>>

Now, stdio reads in blocks, typically 512 to 65536 characters at
a time, so the most common path through this code is

	sub $1,stdin_cnt; no_jmp; mov stdin_ptr,r1	| 3
	clr r0; movb @r1+,r0; mov r1,stdin_ptr		| 3
	jmp merge; merge:				| 1
	mov r0,C(fp); cmp r0,$-1; jmp loop		| 3 = 10 instrs

Most of this is unnecessary!  If we put stdin_ptr in a register
in advance, and note that r0 (and hence C) can never be -1 since it
is zero-extended from a byte, we should see something like

	sub $1,stdin_cnt; no_jmp			| 2
	clr r0; movb @p+,r0				| 2
	mov r0,C(fp); jmp loop				| 2 = 6 instrs

The code to call `filbuf' has to save this extra pointer, but that is
no big deal.  If BUFSIZ is 1024, we saved 4*1023=4092 instructions at
the expense of one instruction when calling filbuf, for a total savings
of 4091 instructions per read!  We might save `only' ~2000 instructions
if stdin_ptr must be left global (which is true if the loop is
nontrivial).

What about putchar?  It is even worse:

	| putchar(C(fp))
		sub	$1,stdout_cnt	| --cnt >= 0 ?
		jls	flush
		mov	stdout_ptr,r0	| *p = c,
		mov	C(fp),@r0
		tstbit	$7,stdout_flg	| flag&IS_LINE_BUFFERED
		jz	no_nl_flush
		cmp	@r0,$10		| && *p == '\n' ?
		jne	no_nl_flush
		push	$10		|	_flsbuf(stdout, '\n')
		jmp	fls0		| (merge)
	no_nl_flush:
		clr	r1		| : *(unsigned char *)p++
		movb	@r0+,r1
		mov	r0,stout_ptr
		mov	r1,r0
		jmp	merge
	flush:	push	C(fp)		| : _flsbuf(stdout, c)
	fls0:	push	stdout
		call	_flsbuf
		cmp	@sp+,@sp+
	merge:

The most common case is, again, that the buffer does *not* fill.
Unfortunately, this case still has to test for line buffered
output, and the flag can (and in fact does) change after calls
to _flsbuf, so the test cannot easily be hoisted out of the
loop (it can be done, by splitting and cross-branching in the
flush code, but this is not often a good optimisation: this is
a peculiar case).

Eliminating line-buffered output descriptors shrinks putchar()
considerably:

	| p is `unsigned char *'
	| --cnt >= 0 ? *p++ = c : _flsbuf(stdout, c)
		sub	$1,stdout_cnt
		jls	flush
		mov	stdout_ptr
		mov	C(fp),@r1
		clr	r0
		mov	@r1+,r0
		mov	r1,stdout_ptr
		jmp	merge
	flush:	push	C(fp)
		push	stdout
		call	_flsbuf
		cmp	@sp+,@sp+
	merge:

The copy-a-buffer-at-a-time loop looks like this:

		jmp	while_tst
	loop:
		push	n	| write(1, buf, n)
		push	buf
		push	1
		call	write
		add	$6,sp	| (pop)
	while_tst:
		push	size	| n = read(0, buf, size)
		push	buf
		push	0
		call	read
		add	$6,sp
		mov	r0,n
		tst	r0	| while (... > 0)
		jgt	loop

which has *no* per-character overhead, compared with a minimum 6+11
(getchar+putchar) instruction per-character overhead for stdio.  When
the buffer size is large (e.g., 1024), that is 17000 instructions of
overhead per read and write call.  read and write are each thousands of
instructions, but they no longer swamp the overhead in stdio.  In
programs that do real work with each character, at least some of the
stdio `overhead' is in fact necessary, so you lose less.

Using fread and fwrite can help, if they are intelligently written.
(They are NOT so written in old BSD releases.)
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

edward@ucbarpa.Berkeley.EDU (Edward Wang) (10/04/88)

Chris Torek writes:
>What about putchar?  It is even worse:
>
>	| putchar(C(fp))
>		sub	$1,stdout_cnt	| --cnt >= 0 ?
>		jls	flush
>		mov	stdout_ptr,r0	| *p = c,
>		mov	C(fp),@r0
>		tstbit	$7,stdout_flg	| flag&IS_LINE_BUFFERED
>		jz	no_nl_flush
>		cmp	@r0,$10		| && *p == '\n' ?
>		jne	no_nl_flush
>		push	$10		|	_flsbuf(stdout, '\n')
>		jmp	fls0		| (merge)
>	no_nl_flush:
> . . .

I should point out that this is not 4.3 putc().
The real 4.3 putc looks like this

#define putc(x, p)	(--(p)->_cnt >= 0 ?\
	(int)(*(unsigned char *)(p)->_ptr++ = (x)) :\
	(((p)->_flag & _IOLBF) && -(p)->_cnt < (p)->_bufsiz ?\
		((*(p)->_ptr = (x)) != '\n' ?\
			(int)(*(unsigned char *)(p)->_ptr++) :\
			_flsbuf(*(unsigned char *)(p)->_ptr, p)) :\
		_flsbuf((unsigned char)(x), p)))

The block buffered path is the second line and does not
check the _IOLBF bit.  It is, in fact, exactly as fast
as the old putc.  Of course this doesn't mean the C compiler
generates good code for it.

Peter da Silva writes:
>It always depresses me when I think that people are still doing line
>buffered output. The problem of handling stdout and stdin is a solved
>problem: do a flushbuf on all interactive streams whenever you do a
>fillbuf on any interactive stream. . .

The reason for implicit line buffering is to make certain programs
more responsive when writing to a tty while still run at full speed
when output is redirected.  These are not necessarily interactive
programs.  For example, "last -10 foo" isn't interactive (the flsbuf
when filbuf fix won't work on it), but you do want it to give you
output line-by-line in case foo logs in only infrequently.

I will go as far as to say that standard output to pipes should be
line buffered, since we no longer put pipes in files.

chris@mimsy.UUCP (Chris Torek) (10/04/88)

In article <26315@ucbvax.BERKELEY.EDU> edward@ucbarpa.Berkeley.EDU
(Edward Wang) writes:
>I should point out that [Chris Torek's expansion] is not 4.3 putc().
>The real 4.3 putc looks like this
>
>#define putc(x, p)	(--(p)->_cnt >= 0 ?\
>	(int)(*(unsigned char *)(p)->_ptr++ = (x)) :\
>	(((p)->_flag & _IOLBF) && -(p)->_cnt < (p)->_bufsiz ?\
[rest deleted]
>The block buffered path is the second line and does not
>check the _IOLBF bit.  It is, in fact, exactly as fast
>as the old putc.  Of course this doesn't mean the C compiler
>generates good code for it.

And, also of course, it does not mean that the code is right!  4.3BSD
putc and printf (actually _doprnt) do not speak to each other.  _doprnt
assumes that a negative _cnt field is an error, and clobbers the
count.  Every once in a while, this results in putc scribbling all over
memory.  Fortunately, instances of this bug are rare (it requires
mixing calls to printf and calls to putc in a particular way).  This is
fixed in 4.3BSD-tahoe, which no longer has _doprnt in Vax assembly.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

peter@ficc.uu.net (value) (10/05/88)

In article <26315@ucbvax.BERKELEY.EDU>, edward@ucbarpa.Berkeley.EDU (Edward Wang) writes:
> Peter da Silva writes:
> >It always depresses me when I think that people are still doing line
> >buffered output. The problem of handling stdout and stdin is a solved
> >problem: do a flushbuf on all interactive streams whenever you do a
> >fillbuf on any interactive stream. . .

> The reason for implicit line buffering is to make certain programs
> more responsive when writing to a tty while still run at full speed
> when output is redirected.

A laudable goal. I'm not so sure that making the output line-buffered
is the way to do it. After all you may want to make them run at full
speed to a terminal device. And certainly interactive programs want
to be able to do this...

Really the best thing to do would be to have seperate flags for line
buffering and interactive buffering.

I'm more worried about block buffering on pipes on input, a long-standing
practice that greatly reduces the effectiveness of shared file descriptors.

Oh well... stdio certainly has its warts.
-- 
Peter da Silva  `-_-'  Ferranti International Controls Corporation.
"Have you hugged  U  your wolf today?"            peter@ficc.uu.net

edward@ucbarpa.Berkeley.EDU (Edward Wang) (10/05/88)

In article <13853@mimsy.UUCP> chris@mimsy.UUCP (Chris Torek) writes:
>. . . _doprnt
>assumes that a negative _cnt field is an error, and clobbers the
>count.  Every once in a while, this results in putc scribbling all over
>memory.  Fortunately, instances of this bug are rare (it requires
>mixing calls to printf and calls to putc in a particular way).  This is
>fixed in 4.3BSD-tahoe, which no longer has _doprnt in Vax assembly.

Well I don't know about this, but the negative _cnt field
is as old as line buffering, before the fast putc.  The old
putc depended on the negative _cnt to invoked _flsbuf
for line buffered streams (the --fp->_cnt >= 0 test always
fails so _flsbuf is always called).

But enough on this already.

andrew@alice.UUCP (Andrew Hume) (10/08/88)

to my way of thinking the bufering problem is not solved;
the best youcan say is that we recognise several different situations
that seem to need different kind of buffering solutions.

the solution fio (fast i/o) supports best is that all I/O is buffered
and if you want unbuffered i/o, use the unbuffered routines
(say, fprint(2, instead of Fprint(2 ) which do one system call.
There is one extra hook; Ftie(ifd, ofd) has the menaing that if fio
wants to do a read() system call on ifd, it will flush any output
associated with ofd.

jc@minya.UUCP (John Chambers) (10/11/88)

> I will go as far as to say that standard output to pipes should be
> line buffered, since we no longer put pipes in files.

And I will go as far as to say that, in my experience, this is one of
the biggest pains in the [expletive deleted] when you want to get your
software to work reliably.  I've long since lost count of how many times
I've wasted time trying to solve the problem that program A has written
to a pipe (or socket), program B has done a read on the pipe, and both
are sitting there getting no cpu time.  True, there are always ways to
get it to work right.  But the ways are different on every release of
Unix, and so far, when I ask how to do it right and portably, all I get
are insults from supposed experts (who I suspect are just trying to cover
up the fact that they don't know the answer, either).

Perhaps with POSIX, there will be a way for B to say "wake me up when there
is input on a file (pipe or disk or stream or socket or ...), and give me
whatever is there".  On the other hand, I can't quite tell from the manual
just how to do it *exactly* right, so maybe they won't help either, and I'll
have to add a new #ifdef for every vendor's POSIX-compliant system.

OK, maybe I'm just a stupid idiot, not to understand something so trivial.
But I suspect I'm not the only one.

BTW, on every Unix system I've ever used, a pipe is a file.  What could
the stuff after the comma in the above quote possibly mean?  Is there a
new implementation of pipes that doesn't use file descriptors?  I sure
hope not; that sure would be a giant step backwards.

-- 
John Chambers <{adelie,ima,maynard,mit-eddie}!minya!{jc,root}> (617/484-6393)

[Any errors in the above are due to failures in the logic of the keyboard,
not in the fingers that did the typing.]

domo@riddle.UUCP (Dominic Dunlop) (10/11/88)

Those interested in the speed of stdio may care to follow up this
reference:

Grep Wars
Andrew Hume
AT&T Bell Laboratories, Murray Hill, New Jersey 07974
Published in proceedings of European UNIX Systems User Group spring 1988
	conference, ISBN 0 9513181 0 1 (available at pounds 20 + pounds 6.08
	surface mail or pounds 22.63 airmail from EUUG, Owles Hall,
	Buntingford, Herts SG9 9PL, England; phone +44 763 73039 -- Visa
	cards accepted)

Abstract:
	Subsequent to the sixth edition of the UNIX operating system there
	have been different versions of the searching tool grep using
	different algorithms tuned for different types of search patterns.
	Friendly competition between the tools has yielded a succession of
	performance enhancements.

	We describe the latest round of improvements, based on the fio fast
	I/O library and incorporating the Boyer-Moore algorithm.  Although
	grep is now 3-4 times faster than it was, egrep is typically 8-10
	(for some common patterns 30-40) times faster than the new grep.

I believe that a similar paper has been presented at Usenix, possibly at
the summer 1987 conference, but I'm afraid I don't have the proceedings to
hand.

At the EUUG conference, Andrew said (as I recall) that he'd like to see
fio hit the public domain, but that whether it did or not was a matter
which was out of his hands.  It was also not clear whether fio would ever
be incorporated into a licensed release of the UNIX operating system,
either as a set of library functions, or even merely as a part of
[ef]*grep.
-- 
Dominic Dunlop
domo@sphinx.co.uk  domo@riddle.uucp

chris@mimsy.UUCP (Chris Torek) (10/17/88)

In article <104@minya.UUCP> jc@minya.UUCP (John Chambers) writes:
>...  I've long since lost count of how many times
>I've wasted time trying to solve the problem that program A has written
>to a pipe (or socket), program B has done a read on the pipe, and both
>are sitting there getting no cpu time.  True, there are always ways to
>get it to work right.  But the ways are different on every release of
>Unix, and so far, when I ask how to do it right and portably, all I get
>are insults from supposed experts (who I suspect are just trying to cover
>up the fact that they don't know the answer, either).

Use fflush() in the writing program (since you are using stdio [see
Subject]).  If you do not have control over the writing program, you
are out of luck.  Many standard programs never bother to call fflush
since they know that the output is obviously line buffered since it
is *of course* going to a terminal.

Be aware that if you use buffered input in the reading program, it
may make a second call to read() on the pipe or socket descriptor
after first consuming all the available data, and then go back to sleep
waiting for more data.  You can disable this portably and incredibly
inefficiently by setting the input FILE to unbuffered.

[Happy? :-) ]

>BTW, on every Unix system I've ever used, a pipe is a file. ... Is there a
>new implementation of pipes that doesn't use file descriptors?  I sure
>hope not; that sure would be a giant step backwards.

In 4.2BSD and 4.3BSD, and systems based upon them, pipes are sockets rather
than files, but sockets are named by file descriptors just like files.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain:	chris@mimsy.umd.edu	Path:	uunet!mimsy!chris

allbery@ncoast.UUCP (Brandon S. Allbery) (10/18/88)

As quoted from <104@minya.UUCP> by jc@minya.UUCP (John Chambers):
+---------------
| > I will go as far as to say that standard output to pipes should be
| > line buffered, since we no longer put pipes in files.
| 
| BTW, on every Unix system I've ever used, a pipe is a file.  What could
| the stuff after the comma in the above quote possibly mean?  Is there a
| new implementation of pipes that doesn't use file descriptors?  I sure
| hope not; that sure would be a giant step backwards.
+---------------

He means that on V7-based systems (including SVR3), pipes are essentially
implemented in the kernel in a way that's almost identical to:

	p[1] = creat(some_temp_file, 0);
	p[1] = open(some_temp_file, 0);
	unlink(some_temp_file);

whereas under BSD pipes are linked socketpairs and under SVR4 they're
supposed to be a variant of the "clone" device (essentially, the STREAMS
version of sockets) and are implemented quite differently.

++Brandon
-- 
Brandon S. Allbery, comp.sources.misc moderator and one admin of ncoast PA UN*X
uunet!hal.cwru.edu!ncoast!allbery  <PREFERRED!>	    ncoast!allbery@hal.cwru.edu
allbery@skybridge.sdi.cwru.edu	      <ALSO>		   allbery@uunet.uu.net
comp.sources.misc is moving off ncoast -- please do NOT send submissions direct
  (But the aliases are NOT on UUNET yet, use the aliases at backbone sites!)

allbery@ncoast.UUCP (Brandon S. Allbery) (10/18/88)

As quoted from <12837@ncoast.UUCP> by allbery@ncoast.UUCP (Brandon S. Allbery):
+---------------
| 	p[1] = creat(some_temp_file, 0);
| 	p[1] = open(some_temp_file, 0);
+---------------

Urk.  Anyone want a slightly-used brain with flaky neurons?  :-(  The second
"p[1]" should of course be "p[0]".

AAAAARRRRRRRRRGGGHHHHHHHHHHH!!!!!!!!!  (I shoulda stood in bed!)

++Brandon
-- 
Brandon S. Allbery, comp.sources.misc moderator and one admin of ncoast PA UN*X
uunet!hal.cwru.edu!ncoast!allbery  <PREFERRED!>	    ncoast!allbery@hal.cwru.edu
allbery@skybridge.sdi.cwru.edu	      <ALSO>		   allbery@uunet.uu.net
comp.sources.misc is moving off ncoast -- please do NOT send submissions direct
  (But the aliases are NOT on UUNET yet, use the aliases at backbone sites!)