chris@trantor.umd.edu (Chris Torek) (03/13/88)
>> For example, if you want to print a string fast, you would write >>a routine that uses only putc... In article <1988Mar13.005841.12770@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >!!NO!! If you want to print a string fast, you use fputs and insist that >your supplier implements it properly. >Unfortunately, not all implementors make the effort to make stdio fast. >In general, System V stdios are fast and 4BSD ones are not. (There are >exceptions to this, both ways, and I believe the 4BSD one is going to be >improved.) `Has been', not `going to be'. The 4.3BSD stdio fputs() is written in assembly, so that it can use the Vax `locc' and `movc3' instructions to push bytes around. In addition, the putc() macro was redone so as to avoid a subroutine call for every character put to a line buffered stream (such as stdout). Unfortunately, whoever came up with the `putc' hack did it by negating the count on line buffered streams, without changing the innards of the printf formatter to handle this. Mixing [f]printf and putc[har] can overflow stdio buffers. This leads to bugs that are often difficult to diagnose. I have fixed this particular bug in my revised stdio by removing the negation. Instead, the putc macro tests the line-buffered flag directly. This affects a relatively small number of modules in /usr/src/lib/libc/stdio (and of course /usr/src/include/stdio.h). While I was at it I also fixed stdio to return errors if one attempts to use printf, fprintf, puts, or fputs on a read-only stream. (getc and putc still work on improperly directed streams; I could fix this by adding a second count, at the expense of making one of the two somewhat slower.) But I made enough other changes that this stdio is not likely to be released very soon. As a workaround which incidentally improves efficiency, I suggest writing programs that set block buffering on stdout. It is unfortunate that there is no way (portable) to say `clear the line buffering flag without changing anything else'; stdout->_flag &= ~_IOLBF; if (stdout->_cnt < 0) stdout->_cnt = -stdout->_cnt; happens to work when done after printing something to stdout, but is horribly ugly. Instead, for reasonably portable efficiency, one must duplicate the part of stdio that discovers the proper buffer size for a stream and allocates and attaches such a buffer, and allocate and attache it `by hand' (here via setbuffer()), and then later free it `by hand' when the file is closed. Simply using setbuf() is rather unsatisfactory because then if stdout is sent to a file, the file system is obliged to allocate and expand fragments. This is particularly bad under 4.2BSD, which would often end up copying a 1K fragment to a 2K fragment, only to have to copy that to a 3K fragment, then to a 4K fragment, and so forth. 4.3BSD notices when a program behaves this way, and when expanding the 1K fragment, finds a whole block first, so that the 2K fragment can be expanded all the way to that whole block without any copies, provided, of course, that no one else takes the rest first. Someone has suggested privately that we stop automatically line buffering files for which `isatty' is true. Programmers would be forced to use fflush() whenever appropriate, or to use setlinebuf() (or setvbuf() from System V) to set line buffering. Were we to delete line buffering entirely, putc() could become considerably shorter: /* approximately: */ #define putc(c, f) \ (--f->count < 0 ? _flsbuf(c, f) : *f->ptr++ = c) vs /* again approximately */ #define putc(c, f) \ (--f->count < 0 ? _flsbuf(c, f) : (*f->ptr = c, \ (f->flags & _IOLBF && *f->ptr == '\n' ? \ _flsbuf('\n', f) : *f->ptr++))) /* the acutal macros are uglier */ On another related topic, given the latter macro and the following C source: register FILE *f; register int c; ... if (putc(c, fp) == EOF) die(); the Vax compiler produces this: decl (r11) # --f->count < 0? jgeq L99999 # no, skip this pushl r11 # yes, push f and c movzbl r10,-(sp) jbr L2000000 # and go call _flsbuf L2000002: # get here if line buffered cmpb *4(r11),$10 # *f->ptr == '\n' ? jneq L99997 # no, go increment pushl r11 # push f and newline pushl $10 L2000000: calls $2,__flsbuf # and go call flsbuf jbr L99998 # ... expression value in r0 L99999: cvtlb r10,*4(r11) # *f->ptr = c jbs $7,16(r11),L2000002 # if line buffered, go test L99997: movl 4(r11),r0 # save f->ptr incl 4(r11) # f->ptr++ movzbl (r0),r0 # evaluate *(old f->ptr) L99998: cmpl r0,$-1 # putc(...) == EOF ? jneq L26 # no, skip calls $0,_die # die() L26: which is not quite as bad as it looks; the most common flow, assuming that `fp' is not line buffered, goes thus: decl (r11) branch cvtlb r10,*4(r11) test-bit-without-branching movl 4(r11),r0 incl 4(r11) movzbl (r0),r0 cmpl r0,$-1 branch which is `only' 9 instructions and two branches. But note that several of these instructions (notably those near the end) are unnecessary. With what I think is a fairly small compiler change, we could get instead this: sobgeq (r11),L99994 # if (--f->count < 0), pushl r11 # push f movzbl r10,-(sp) # push c calls $2,__flsbuf # call flsbuf cmpl r0,$-1 # if EOF, jeql L99995 # go die jbr L28 # otherwise continue L99994: cvtlb r10,*4(r11) # *f->ptr = c jbc $7,16(r11),L99992 # if line buffered, cmpb *4(r11),$10 # and if *f->ptr == '\n', jneq L99992 pushl r11 # push f pushl $10 # push c calls $2,__flsbuf # call flsbuf cmpl r0,$-1 # if EOF, jeql L99995 # go die jbr L28 # otherwise continue L99992: incl 4(r11) # f->ptr++ jbr L28 # and continue L99995: calls $0,_die L28: Here the common case looks like this: sobgeq (r11),<> # decrement and do branch cvtlb r10,*4(r11) # store the byte jbc $7,16(r11),<> # test and do branch incl 4(r11) # increment the pointer jbr L28 # and continue or 5 instructions and three branches. The extra branch is bad on a pipelined Vax, but saving four instructions probably beats the pipe loss. If we could also get rid of the line-buffered test, we wind up with 4 instructions and two branches. Reordering the macro gets us 5 instructions and one branch (or 6 instructions and two branches, if we do nothing about line buffering). The same compiler change simplifies the code for getc() as well. The common case for the loop while ((c = getc(in)) != EOF) if (putc(c, out) == EOF) die(); drops from 19 instructions with 5 branches taken to 12 instructions with 4 branches taken (both with line buffering left in) or 10 instructions and 3 branches taken (line buffering out). Hence, a simple `cat' might be able to go about twice as fast. For those who are interested, the compiler change basically involves rewriting (a ? b : c) == CONST as (a ? b == CONST : c == CONST) In general, this should be done if and only if either `b' or `c' (or both) can produce a value that is never == CONST, e.g., int a, b; unsigned char c; if ((a ? b : ++c) == -1) ... ultimately becomes if (a ? b == -1 : (++c, 0)) ... There are more general range-related optimisations, but those would be much harder to add to PCC. Oh well, enough rambling. Maybe I will go see if the air conditioning in our computer room is working again.... -- In-Real-Life: Chris Torek, Univ of MD Computer Science, +1 301 454 7163 Domain: chris@mimsy.umd.edu Path: ...!uunet!mimsy!chris