jimv@radix (Jim Valerio) (01/18/88)
The recent discussion here about frequency statements has surprised me. I've wanted to be able to tag unlikely if-statements for years. I've always thought that at least some of the information that I knew about an algorithm I coded would be useful for code generation but wouldn't be seen by a compiler.

Upon reflection, it seems to me that much of the code I have been writing the last few years has been case-sensitive (if you will). I've written a lot of operating system code, math library code, and exception handling code, all of which does parameter validation and special case handling before ever getting to the "real" work. Now I am not saying that the execution of the checking code dominates the total execution time (and I'm not saying that it can't, either). Instead, I'm saying that I've found that checking for rare cases is often on an important execution path. I would like to minimize the costs of those rare-case checks.

I have no desire to merge profiling information into my code. The cases I'm concerned about are the obvious ones, those cases where the software is *designed* so that they're rare. These are the cases where I would expect to find significant gains, if any are to be had.

For example, consider an implementation of a hash table lookup. The reason for using a hash table is so that the first lookup is most likely to either find a successful match or determine that there is none. Protracted searches are rare. Consider that the whole point of buffered I/O is that most of the time the next character can be found in the buffer. In fact, think of how rare an end-of-file is when processing a file. But whenever characters are read in using getc(), for example, a check for EOF is usually performed.
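To make the getc() point concrete, here is a minimal sketch (the function name and the counting task are mine, not from any program discussed here): the EOF test sits on the hot path of every single character read, yet by design it succeeds exactly once per file.

```c
#include <stdio.h>

/* Count newlines in a stream.  The EOF comparison is executed once
 * per character, but the EOF case itself is taken only once per file:
 * a branch the software is *designed* to make rare. */
long count_newlines(FILE *fp)
{
    long lines = 0;
    int c;

    while ((c = getc(fp)) != EOF) { /* rare-case check on every iteration */
        if (c == '\n')
            lines++;
    }
    return lines;
}
```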
The optimization I want the compiler to make is simple: given an indication of a highly-likely branch of an if-then-else construct (consider an if-then to be an if-then-else with an empty else section), arrange for the execution of that branch to fall through, moving the unlikely or rare-case code branch out of the straight-line common path. The rare-case code, perhaps placed at the end of the subroutine, could then jump back into the main execution flow when appropriate. With this arrangement, the common-case cost is kept down to the check itself and one not-taken branch, and the rare-case cost is the check and two taken branches.

I would expect this arrangement to benefit those machines with aggressive prefetch units. It might help the instruction cache hit ratio, but I wouldn't count on it. If there's no rare-case code to execute out-of-line, then there's no additional savings. But when there is rare-case code the savings is there, and it is accentuated for machines like the 80386 and 68020 that suffer penalties for taken branches.

At the end of this article, I supply a single data point on the effectiveness of this optimization. It shows how the optimization can have a significant effect (10% for the 80386; I'd guess maybe 2.5% for a RISC machine with delayed branches). The data point also shows how the optimization can have a negligible effect (0.7%) on total execution time. In short, the data point is delightfully inconclusive.

Please note that the data point doesn't really measure what I care about: the latency to starting the "real" work. I would be interested in any results that others might have on that. Meanwhile, I've found that after coding on the 80386 for some months now, my programming style has changed somewhat to accommodate the common-case fall-through behavior.
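The transformation can be applied by hand with a goto, which is roughly what I mean for the compiler to do automatically. This is an illustrative sketch (the buffering setting and all names are mine): the rare-case code is placed after the return, so the common case pays the check plus one not-taken branch, and the rare case pays the check plus two taken branches.

```c
enum { CAP = 8 };
static char buf[CAP];
static int  len;
static long flushed;        /* bytes written out by the rare path */

/* Buffer a character; flush when the buffer fills (the rare case).
 * The rare-case code is moved below the return, out of the
 * straight-line common path, and jumps back in when done. */
void put_char(char c)
{
    if (len == CAP)         /* common case: branch not taken, fall through */
        goto flush;
store:
    buf[len++] = c;
    return;

flush:                      /* rare case, out of the hot path */
    flushed += len;         /* stand-in for a real write() of the buffer */
    len = 0;
    goto store;             /* second taken branch, back into main flow */
}
```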
--
Jim Valerio	{verdix,omepd}!radix!jimv

-----

For the 80386, generally 6 clocks (roughly two instructions) are lost by taking a branch: not-taken branches require 3 clocks, and taken branches take 7+m clocks (`m' is the number of parcels in the next instruction, usually 2). This means that if the number of taken branches can be significantly reduced, the program will run measurably faster. I measured the potential execution speed gains on our 16 MHz 80386 system by reorganizing some code in a program I just finished writing.

This C program reads in a board schematic (netlist) in FutureNet format, performs a number of consistency checks on it, and then writes out the netlist in a different format. The code is neither highly tuned nor sloppy. Let's look at the two most time-consuming routines in the program: get_field (24% of the execution time) and getstr (17% of the execution time). The profiling information is based on processing a 79501-byte netlist (a medium-sized schematic). (Actually, I ran it 10 times to lessen the 60 Hz time resolution noise, but all the following numbers are scaled to the single run.)

The FutureNet file is made up of records, one per line, containing between 1 and 6 comma-separated fields. Get_field merely reads in the next field, terminating on a newline or comma (stripping leading and trailing blanks and tracking line numbers), and then calls getstr with the field. Getstr looks for the string in a hash table; if it is not present, it is added.

In get_field, I found 15 candidates for labeling rare-code sequences. These included the check for buffer overflow, reading end-of-file, and getc needing to call _filbuf. Of these 15, 3 happened to have the "better" code (i.e., the optimization was already in place). The 79501-byte netlist (78 1K stdio buffers) has 5282 lines and 21386 fields (455 empty, average length 2.7 characters). By recoding the branches following the a priori known frequencies, a savings of 256 msec out of 2544 msec was realized (10%).
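For readers who want a picture of what get_field does, here is a hypothetical reconstruction from the description above (not the actual code; line-number tracking is omitted, and all names are mine). The buffer-overflow and EOF checks are the kinds of rare-case branches counted in the figures.

```c
#include <stdio.h>

/* Read the next comma- or newline-terminated field into buf,
 * stripping leading and trailing blanks.  Returns the terminating
 * character (',', '\n', or EOF). */
int get_field(FILE *fp, char *buf, int size)
{
    int c, len = 0;

    do {                                /* strip leading blanks */
        c = getc(fp);                   /* _filbuf call hides here: rare */
    } while (c == ' ');

    while (c != ',' && c != '\n' && c != EOF) { /* EOF check: rare */
        if (len < size - 1)             /* buffer-overflow check: rare */
            buf[len++] = c;
        c = getc(fp);
    }
    while (len > 0 && buf[len - 1] == ' ')      /* strip trailing blanks */
        len--;
    buf[len] = '\0';
    return c;
}
```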
85% of the savings was due to assuming that EOF and _filbuf were uncommon cases.

In getstr, I found 8 candidates for labeling rare-code sequences. These included checks for matching an entry in the hash table, finding an empty slot, and growing the table if it got too full. Of these 8, 6 happened to have the "better" code. With 22173 string lookups (20505 matching), 4702 initial entry collisions, and 11864 passes through the rehash code, a savings of 13 msec out of 1819 msec was realized (0.7%). If the other 6 cases had been reversed and not had the "better" code, then another 32 msec, or 1.7%, would have been added to the total time.