[comp.unix.wizards] 4.3BSD stdio, and so forth

chris@trantor.umd.edu (Chris Torek) (03/13/88)
>>	For example, if you want to print a string fast, you would write
>>a routine that uses only putc...

In article <1988Mar13.005841.12770@utzoo.uucp>
henry@utzoo.uucp (Henry Spencer) writes:
>!!NO!!  If you want to print a string fast, you use fputs and insist that
>your supplier implements it properly.

>Unfortunately, not all implementors make the effort to make stdio fast.
>In general, System V stdios are fast and 4BSD ones are not.  (There are
>exceptions to this, both ways, and I believe the 4BSD one is going to be
>improved.)

`Has been', not `going to be'.  The 4.3BSD stdio fputs() is written
in assembly, so that it can use the Vax `locc' and `movc3' instructions
to push bytes around.  In addition, the putc() macro was redone so
as to avoid a subroutine call for every character put to a line
buffered stream (such as stdout).

Unfortunately, whoever came up with the `putc' hack did it by
negating the count on line buffered streams, without changing the
innards of the printf formatter to handle this.  Mixing [f]printf
and putc[har] can overflow stdio buffers.  This leads to bugs that
are often difficult to diagnose.

I have fixed this particular bug in my revised stdio by removing
the negation.  Instead, the putc macro tests the line-buffered flag
directly.  This affects a relatively small number of modules in
/usr/src/lib/libc/stdio (and of course /usr/src/include/stdio.h).
While I was at it I also fixed stdio to return errors if one attempts
to use printf, fprintf, puts, or fputs on a read-only stream.
(getc and putc still work on improperly directed streams; I could
fix this by adding a second count, at the expense of making one of
the two somewhat slower.)  But I made enough other changes that
this stdio is not likely to be released very soon.

As a workaround which incidentally improves efficiency, I suggest
writing programs that set block buffering on stdout.  It is
unfortunate that there is no way (portable) to say `clear the line
buffering flag without changing anything else';

	stdout->_flag &= ~_IOLBF;
	if (stdout->_cnt < 0)
		stdout->_cnt = -stdout->_cnt;

happens to work when done after printing something to stdout, but
is horribly ugly.  Instead, for reasonably portable efficiency,
one must duplicate the part of stdio that discovers the proper
buffer size for a stream and allocates and attaches such a buffer,
and allocate and attache it `by hand' (here via setbuffer()), and
then later free it `by hand' when the file is closed.  Simply using
setbuf() is rather unsatisfactory because then if stdout is sent
to a file, the file system is obliged to allocate and expand
fragments.  This is particularly bad under 4.2BSD, which would
often end up copying a 1K fragment to a 2K fragment, only to have
to copy that to a 3K fragment, then to a 4K fragment, and so forth.
4.3BSD notices when a program behaves this way, and when expanding
the 1K fragment, finds a whole block first, so that the 2K fragment
can be expanded all the way to that whole block without any copies,
provided, of course, that no one else takes the rest first.

Someone has suggested privately that we stop automatically line
buffering files for which `isatty' is true.  Programmers would be
forced to use fflush() whenever appropriate, or to use setlinebuf()
(or setvbuf() from System V) to set line buffering.  Were we to
delete line buffering entirely, putc() could become considerably
shorter:

	/* approximately: */
	#define putc(c, f) \
		(--f->count < 0 ? _flsbuf(c, f) : *f->ptr++ = c)

vs

	/* again approximately */
	#define putc(c, f) \
		(--f->count < 0 ? _flsbuf(c, f) : (*f->ptr = c, \
			(f->flags & _IOLBF && *f->ptr == '\n' ? \
				_flsbuf('\n', f) : *f->ptr++)))
	/* the acutal macros are uglier */

On another related topic, given the latter macro and the following
C source:

	register FILE *f;
	register int c;
	...
	if (putc(c, fp) == EOF)
		die();

the Vax compiler produces this:

		decl	(r11)		# --f->count < 0?
		jgeq	L99999		# no, skip this
		pushl	r11		# yes, push f and c
		movzbl	r10,-(sp)
		jbr	L2000000	# and go call _flsbuf
	L2000002:		# get here if line buffered
		cmpb	*4(r11),$10	# *f->ptr == '\n' ?
		jneq	L99997		# no, go increment
		pushl	r11		# push f and newline
		pushl	$10
	L2000000:
		calls	$2,__flsbuf	# and go call flsbuf
		jbr	L99998		# ... expression value in r0
	L99999:
		cvtlb	r10,*4(r11)	# *f->ptr = c
		jbs	$7,16(r11),L2000002	# if line buffered, go test
	L99997:
		movl	4(r11),r0	# save f->ptr
		incl	4(r11)		# f->ptr++
		movzbl	(r0),r0		# evaluate *(old f->ptr)
	L99998:
		cmpl	r0,$-1		# putc(...) == EOF ?
		jneq	L26		# no, skip
		calls	$0,_die		# die()
	L26:

which is not quite as bad as it looks; the most common flow,
assuming that `fp' is not line buffered, goes thus:

		decl	(r11)
		branch
		cvtlb	r10,*4(r11)
		test-bit-without-branching
		movl	4(r11),r0
		incl	4(r11)
		movzbl	(r0),r0
		cmpl	r0,$-1
		branch

which is `only' 9 instructions and two branches.  But note that
several of these instructions (notably those near the end) are
unnecessary.  With what I think is a fairly small compiler change,
we could get instead this:

		sobgeq	(r11),L99994	# if (--f->count < 0),
		pushl	r11		#	push f
		movzbl	r10,-(sp)	#	push c
		calls	$2,__flsbuf	#	call flsbuf
		cmpl	r0,$-1		#	if EOF,
		jeql	L99995		#		go die
		jbr	L28		#	otherwise continue
	L99994:
		cvtlb	r10,*4(r11)	# *f->ptr = c
		jbc	$7,16(r11),L99992 # if line buffered,
		cmpb	*4(r11),$10	# and if *f->ptr == '\n',
		jneq	L99992
		pushl	r11		#	push f
		pushl	$10		#	push c
		calls	$2,__flsbuf	#	call flsbuf
		cmpl	r0,$-1		#	if EOF,
		jeql	L99995		#		go die
		jbr	L28		#	otherwise continue
	L99992:
		incl	4(r11)		# f->ptr++
		jbr	L28		# and continue
	L99995:
		calls	$0,_die
	L28:

Here the common case looks like this:

		sobgeq	(r11),<>	# decrement and do branch
		cvtlb	r10,*4(r11)	# store the byte
		jbc	$7,16(r11),<>	# test and do branch
		incl	4(r11)		# increment the pointer
		jbr	L28		# and continue

or 5 instructions and three branches.  The extra branch is bad on
a pipelined Vax, but saving four instructions probably beats the
pipe loss.  If we could also get rid of the line-buffered test, we
wind up with 4 instructions and two branches.  Reordering the macro
gets us 5 instructions and one branch (or 6 instructions and two
branches, if we do nothing about line buffering).

The same compiler change simplifies the code for getc() as well.
The common case for the loop

	while ((c = getc(in)) != EOF)
		if (putc(c, out) == EOF)
			die();

drops from 19 instructions with 5 branches taken to 12 instructions
with 4 branches taken (both with line buffering left in) or 10
instructions and 3 branches taken (line buffering out).  Hence,
a simple `cat' might be able to go about twice as fast.

For those who are interested, the compiler change basically
involves rewriting

	(a ? b : c) == CONST

as
	(a ? b == CONST : c == CONST)

In general, this should be done if and only if either `b' or `c'
(or both) can produce a value that is never == CONST, e.g.,

	int a, b;
	unsigned char c;
	if ((a ? b : ++c) == -1) ...

ultimately becomes

	if (a ? b == -1 : (++c, 0)) ...

There are more general range-related optimisations, but those would
be much harder to add to PCC.

Oh well, enough rambling.  Maybe I will go see if the air conditioning
in our computer room is working again....
-- 
In-Real-Life: Chris Torek, Univ of MD Computer Science, +1 301 454 7163
Domain: chris@mimsy.umd.edu		Path: ...!uunet!mimsy!chris