jimv@radix (Jim Valerio) (01/18/88)
The recent discussion here about frequency statements has surprised me. I've wanted to be able to tag unlikely if-statements for years. I've always thought that at least some of the information that I knew about an algorithm I coded would be useful for code generation but wouldn't be seen by a compiler.

Upon reflection, it seems to me that much of the code I have been writing the last few years has been case-sensitive (if you will). I've written a lot of operating system code, math library code, and exception handling code, all of which does parameter validation and special case handling before ever getting to the "real" work. Now I am not saying that the execution of the checking code dominates the total execution time (and I'm not saying that it can't, either). Instead, I'm saying that I've found that checking for rare cases is often on an important execution path. I would like to minimize the costs of those rare-case checks.

I have no desire to merge profiling information into my code. The cases I'm concerned about are the obvious ones, those cases where the software is *designed* so that they're rare. These are the cases where I would expect to find significant gains, if any are to be had.

For example, consider an implementation of a hash table lookup. The reason for using a hash table is so that the first lookup is most likely to either find a successful match or determine that there is none. Protracted searches are rare. Consider that the whole point of buffered I/O is that most of the time the next character can be found in the buffer. In fact, think of how rare an end-of-file is when processing a file. But whenever characters are read in using getc(), for example, a check for EOF is usually performed.
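To make the getc() point concrete, here is a minimal sketch (the function name and the counting task are mine, not from any program discussed here): the EOF test sits on the hot path of every single character read, yet by design it succeeds exactly once per file.

```c
#include <stdio.h>

/* Count newlines in a stream.  The EOF comparison is executed once
 * per character, but the EOF case itself is taken only once per file:
 * a branch the software is *designed* to make rare. */
long count_newlines(FILE *fp)
{
    long lines = 0;
    int c;

    while ((c = getc(fp)) != EOF) { /* rare-case check on every iteration */
        if (c == '\n')
            lines++;
    }
    return lines;
}
```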
The optimization I want the compiler to make is simple: given an indication of a highly-likely branch of an if-then-else construct (consider an if-then to be an if-then-else with an empty else section), arrange for the execution of that branch to fall through, moving the unlikely or rare-case code branch out of the straight-line common path. The rare-case code, perhaps placed at the end of the subroutine, could then jump back into the main execution flow when appropriate. With this arrangement, the common-case cost is kept down to the check itself and one not-taken branch, and the rare-case cost is the check and two taken branches.

I would expect this arrangement to benefit those machines with aggressive prefetch units. It might help the instruction cache hit ratio, but I wouldn't count on it. If there's no rare-case code to execute out-of-line, then there's no additional savings. But when there is rare-case code the savings is there, and it is accentuated for machines like the 80386 and 68020 that suffer penalties for taken branches.

At the end of this article, I supply a single data point on the effectiveness of this optimization. It shows how the optimization can have a significant effect (10% for the 80386; I'd guess maybe 2.5% for a RISC machine with delayed branches). The data point also shows how the optimization can have a negligible effect (0.7%) on total execution time. In short, the data point is delightfully inconclusive.

Please note that the data point doesn't really measure what I care about: the latency to starting the "real" work. I would be interested in any results that others might have on that. Meanwhile, I've found that after coding on the 80386 for some months now, my programming style has changed somewhat to accommodate the common-case fall-through behavior.
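The transformation can be applied by hand with a goto, which is roughly what I mean for the compiler to do automatically. This is an illustrative sketch (the buffering setting and all names are mine): the rare-case code is placed after the return, so the common case pays the check plus one not-taken branch, and the rare case pays the check plus two taken branches.

```c
enum { CAP = 8 };
static char buf[CAP];
static int  len;
static long flushed;        /* bytes written out by the rare path */

/* Buffer a character; flush when the buffer fills (the rare case).
 * The rare-case code is moved below the return, out of the
 * straight-line common path, and jumps back in when done. */
void put_char(char c)
{
    if (len == CAP)         /* common case: branch not taken, fall through */
        goto flush;
store:
    buf[len++] = c;
    return;

flush:                      /* rare case, out of the hot path */
    flushed += len;         /* stand-in for a real write() of the buffer */
    len = 0;
    goto store;             /* second taken branch, back into main flow */
}
```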
--
Jim Valerio	{verdix,omepd}!radix!jimv

-----

For the 80386, generally 6 clocks (roughly two instructions) are lost by taking a branch: not-taken branches require 3 clocks, and taken branches take 7+m clocks (`m' is the number of parcels in the next instruction, usually 2). This means that if the number of taken branches can be significantly reduced, the program will run measurably faster. I measured the potential execution speed gains on our 16 MHz 80386 system by reorganizing some code in a program I just finished writing.

This C program reads in a board schematic (netlist) in FutureNet format, performs a number of consistency checks on it, and then writes out the netlist in a different format. The code is neither highly tuned nor sloppy. Let's look at the two most time-consuming routines in the program: get_field (24% of the execution time) and getstr (17% of the execution time). The profiling information is based on processing a 79501-byte netlist (a medium-sized schematic). (Actually, I ran it 10 times to lessen the 60 Hz time resolution noise, but all the following numbers are scaled to the single run.)

The FutureNet file is made up of records, one per line, containing between 1 and 6 comma-separated fields. Get_field merely reads in the next field, terminating on a newline or comma (stripping leading and trailing blanks and tracking line numbers), and then calls getstr with the field. Getstr looks for the string in a hash table; if it is not present, it is added.

In get_field, I found 15 candidates for labeling rare-code sequences. These included the check for buffer overflow, reading end-of-file, and getc needing to call _filbuf. Of these 15, 3 happened to have the "better" code (i.e., the optimization was already in place). The 79501-byte netlist (78 1K stdio buffers) has 5282 lines and 21386 fields (455 empty, average length 2.7 characters). By recoding the branches following the a priori known frequencies, a savings of 256 msec out of 2544 msec was realized (10%).
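For readers who want a picture of what get_field does, here is a hypothetical reconstruction from the description above (not the actual code; line-number tracking is omitted, and all names are mine). The buffer-overflow and EOF checks are the kinds of rare-case branches counted in the figures.

```c
#include <stdio.h>

/* Read the next comma- or newline-terminated field into buf,
 * stripping leading and trailing blanks.  Returns the terminating
 * character (',', '\n', or EOF). */
int get_field(FILE *fp, char *buf, int size)
{
    int c, len = 0;

    do {                                /* strip leading blanks */
        c = getc(fp);                   /* _filbuf call hides here: rare */
    } while (c == ' ');

    while (c != ',' && c != '\n' && c != EOF) { /* EOF check: rare */
        if (len < size - 1)             /* buffer-overflow check: rare */
            buf[len++] = c;
        c = getc(fp);
    }
    while (len > 0 && buf[len - 1] == ' ')      /* strip trailing blanks */
        len--;
    buf[len] = '\0';
    return c;
}
```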
85% of the savings was due to assuming that EOF and _filbuf were uncommon cases.

In getstr, I found 8 candidates for labeling rare-code sequences. These included checks for matching an entry in the hash table, finding an empty slot, and growing the table if it got too full. Of these 8, 6 happened to have the "better" code. With 22173 string lookups (20505 matching), 4702 initial entry collisions, and 11864 passes through the rehash code, a savings of 13 msec out of 1819 msec was realized (0.7%). If the other 6 cases had been reversed and not had the "better" code, then another 32 msec, or 1.7%, would have been added to the total time.