ka@hropus.UUCP (Kenneth Almquist) (08/04/86)
> As I understand it, a switch/case setup compiles exactly the same as > if (var == const1) {.....}; > else if (var == const2) {.....}; > else {default_action}; > anyway. A compiler should be able to do better than that. Generating good code for switches is one place where compiler writers can show they know something about searching. Using a branch table will produce faster and smaller code if the cases are consecutive integers. The only thing that may be non-obvious is the best way to test whether the expression being switched on is within range before accessing the branch table. A straightforward coding of this test requires two comparisons, but Ritchie's compiler for the PDP-11 generated only one comparison using the machine language equivalent of the following code: if ((unsigned)(value -= MIN) > (MAX - MIN) goto default; On the PDP-11, this generates one less instruction (or two less, if MIN is zero) than: if (value < MIN || value > MAX) goto default; (Learning hacks like these is one of the benefits of reading the compiler.) Since some people like to count starting at one, a possible enhancement would be to change MIN from 1 to 0 in this case by adding an extra entry to the beginning of the branch table. But real C programmer count from 0; wimps who count starting with 1 deserve to have their switch statements execute slower. :-) If the cases are not consecutive, the branch table must be filled with entries for the "missing" cases, making the table larger. The Ritchie compiler will generate a branch table if the number of entries in the branch table would be less than three times the number of cases. If the cases are mostly consecutive with a few outliers, a branch table could be augmented with a few specific tests for the outliers, but I don't know of any compiler that does this. There are two general methods for implementing tables which do not rely on the keys being nearly consecutive: hashing and the binary search. Hashing, which is what the Ritchie compiler uses, has the advantage that its performance is independent of the number of cases, just as with a branch table. The modulo instruction is used as the hash function; the C compiler tries a variety of values for the modulus and selects the best one which means that the time required to generate the hash table is proportional to the square of the number of cases. The UN*X VAX compiler, on the other hand, uses a binary search. A binary search executes in time proportional to the log of the number of cases. This makes it sound worse that the hash search; I presume someone discovered that most switch statements contained few enough statements that the binary search was superior. The binary search works best on a machine with a three way branch like that provided by the Fortran arithmentic IF statement. On a machine with condition codes like the VAX, each step of the binary search is implemented by a compare instruction followed by two branches on the condition codes. Several things may be done to improve the performance of the binary search. First, many machines execute a conditional branch instruction faster when no branch is taken, so that rather than testing the middle value in the table one can test a value somewhat to one side in order to decrease the chance that a branch will be taken. This is implemented in the VAX compiler, but some other possible improvements are not. The binary search could be abandoned in favor of the linear search at some level (e. g. when the number of possibilities has been reduced to 2 or 3). If there are a series of consecutive cases, this can be used to eliminate extra tests. (For example, if a number is known to be less than 5 it cannot be greater than 4, and if a number is greater than 2 and less than 4 then it must be equal to 3, but the VAX compiler will code to test for both these cases.) Finally, each comparison generated by the VAX compiler looks like: cmpl r0,$7 # compare switch value with 7 beql L1 # if equal, branch to the code for case 7 bgtr L8 # if greater, branch to L8 for the next cmpl. # The next cmpl instruction goes here. Since the branch to L8 is more likely to be executed that the branch to L1, the bgtr instruction should come first to minimize the number of instructions executed. For small cases, the good old linear search is used by both the Ritchie and the VAX compilers. Several things are done to speed up the linear search. First, the switch expression is copied into a register. This helps in general but loses when the switch instruction is a register variable or the switch contains only one case. (This is hard to fix because the compilers generate the code for a switch in two pieces; first they generate code to place the value to be switched on in r0, and then they generate the code to implement the switch. Another problem with this separation appears on the PDP-11, where it may be easier to compute an expression in r1 than in r0 due to a brain damaged multiply instruction.) Second, the compiler implements the search as a series of compares followed by branch if equal instructions. This minimizes the time required to reach the default case because in the default case none of the branches will be taken. In contrast, the code at the top of this article will probably be compiled into a series of branch if not equal instructions, all of which will branch if there is no match. The Ritchie compiler at one time actually generated a table of values and labels, and did a linear search on it using a loop. This saves space but takes more time. Furthermore, it doesn't even save space if there are 1 or 2 cases. (The Ritchie compiler has special handling of switch statements with no cases except the default.) The Ritchie compiler was changed to generate linear searches using a series of compare instructions quite a long time ago, but I don't know if Dennis was responsible for the change. I don't know why the VAX compiler uses with a linear search at all, rather than using the binary search. Probably this was a holdover from the Ritchie compiler, which must be prepared to generate a linear search because a linear search will outperform a hash search on a few elements. In general, then, a compiler should do pretty well by using a jump table where possible and a binary search otherwise. What is actually optimum must be determined by measuring various approaches on a specific machine. Kenneth Almquist ihnp4!houxm!hropus!ka (official name) ihnp4!opus!ka (shorter path)
zben@umd5 (Ben Cranston) (08/07/86)
In article <610@hropus.UUCP> ka@hropus.UUCP (Kenneth Almquist) writes: > If the cases are not consecutive, the branch table must be filled with > entries for the "missing" cases, making the table larger. The Ritchie > compiler will generate a branch table if the number of entries in the > branch table would be less than three times the number of cases. This bit me once. In one of an innumerable number of Impress-understanding programs I converted an IF nest into a switch statement, and was chagrined when the program didn't show a vast efficiency improvement. After peeking at the (BSD4.x) C compiler code generator we found the 1/3 fudge factor, and that the Impress codes were only two short of being dense enough to compile into a jump table. We didn't actually try this, but I suppose that if we had defined two or three bogus Impress codes and put in cases for them then the switch statement would have suddenly started being much more efficient. Which brings me to another point: many people seem to make their programs "more efficient" by minimizing the size of the (higher level language) SOURCE code, rather than minimizing the output object. Usually these go hand-in-hand, but the above is an interesting case where MORE source code resulted in LESS (and FASTER) object code... And we won't even TALK about MOVE CORRESPONDING or SORT... -- umd5.UUCP <= {seismo!umcp-cs,ihnp4!rlgvax}!cvl!umd5!zben Ben Cranston zben @ umd2.UMD.EDU Kingdom of Merryland Sperrows 1100/92 umd2.BITNET "via HASP with RSCS"
stevesu@copper.UUCP (Steve Summit) (08/07/86)
In article <610@hropus.UUCP>, ka@hropus.UUCP writes: > > As I understand it, a switch/case setup compiles exactly the same as > > if (var == const1) {.....}; > > else if (var == const2) {.....}; > > else {default_action}; > > anyway. > > A compiler should be able to do better than that. Generating good code for > switches is one place where compiler writers can show they know something > about searching. > > Using a branch table will produce faster and smaller code if the cases > are consecutive integers. And then there was version 1 of DEC's VAX/VMS C compiler. I have this image of the people working on the code generator, saying something like "well, we work for DEC, and we're writing this compiler for this Vax, and our friends here at DEC gave us this lovely CASEB instruction, so let's use it!" Now, the CASEB instruction is basically a microcode implementation of a branch table. The trouble is, early versions of VAX11C used it for _a_l_l switch statements, whether the labels were reasonably dense or not. Woe betide the poor programmer who wrote something like switch(x) { case 1: case 10000: Yup, you got this enormous object file, which took forever to compile and link, because it had a CASEB instruction with a ridiculously sparse branch table in it. By the way, this problem has been fixed. Version 2 of the VMS C compiler is quite nice, and I have very few complaints with it. Steve Summit tektronix!copper!stevesu
chris@umcp-cs.UUCP (Chris Torek) (08/08/86)
>In article <610@hropus.UUCP> ka@hropus.UUCP (Kenneth Almquist) writes: >>If the cases are not consecutive, the branch table must be filled with >>entries for the "missing" cases, making the table larger [; eventually >>this becomes a waste of space and the compiler uses a different approach]. In article <1171@umd5> zben@umd5.umd.edu (Ben Cranston) writes: >... In one of an innumerable number of Impress-understanding >programs I converted an IF nest into a switch statement, and was >chagrined when the program didn't show a vast efficiency improvement. >After peeking at the (BSD4.x) C compiler code generator we found >the 1/3 fudge factor, and that the Impress codes were only two >short of being dense enough to compile into a jump table. [... followed by speculation on the effects of adding two cases.] One can force a consecutive sequence in another, less compiler dependent, way. Write a compressing array: char input_class[256] = { /* classes for 0-127 */ is_char, is_char, ..., is_char, /* classes for 128 on up */ is_obj1, is_obj2, is_obj2, is_badobj, is_obj1, ... }; Then use switch (input_class[input]) { case is_char: ... case is_obj1: ... } Those who are familiar with Knuth's TeX system may have seen some of the code supplied to convert TeX .DVI files to device-dependent data (a la Imagen's imPress language). In these, someone---Knuth himself most likely---has written code of the form /* translated to C for this newsgroup */ #define two_cases(base) \ base: case base+1 #define four_cases(base) \ two_cases(base): case two_cases(base+2) #define eight_cases(base) \ four_cases(base): case four_cases(base+4) #define thirty_two_cases(base) \ eight_cases(base): case eight_cases(base+8): \ case eight_cases(base+16): case eight_cases(base+24) #define sixty_four_cases(base) \ thirty_two_cases(base): case thirty_two_cases(base+32) then, later in the program, used these as (e.g.) #define w0 147 ... #define fntnum0 171 ... case four_cases(w0): w = p; x += w; break; ... case sixty_four_cases(fntnum0): font = p; break; ... I originally used rather similar code in my own Imagen and Versatec DVI conversion programs. For some reason I decided one day to try a compressing array like the one above. To my surprise, the code was not only clearer to me, but also ran rather faster. I verified that the generated code indeed used a vax `casel' instruction in both versions. My guess is that the speedup came entirely from fitting more of the main loop into the cache. -- In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516) UUCP: seismo!umcp-cs!chris CSNet: chris@umcp-cs ARPA: chris@mimsy.umd.edu
jso@edison.UUCP (John Owens) (08/15/86)
In article <610@hropus.UUCP>, ka@hropus.UUCP (Kenneth Almquist) writes: > If the cases are not consecutive, the branch table must be filled with > entries for the "missing" cases, making the table larger. [....] If > the cases are mostly consecutive with a few outliers, a branch table > could be augmented with a few specific tests for the outliers, but I > don't know of any compiler that does this. The Microsoft Pascal/Fortran compilers, at least some of the old ones (3.04 and 3.11) do this. I haven't tried it to see if their wonderful new code generators do. John Owens @ General Electric Company (+1 804 978 5726) jso%edison.UUCP@seismo.CSS.GOV [arpa] jso@edison.UUCP [w/ uucp domains] {cbosgd allegra ncsu xanth}!uvacs!edison!jso [roll your own]
asw@tony.UUCP (Tony Williams) (08/20/86)
In article <840@edison.UUCP> jso@edison.UUCP (John Owens) writes: > >In article <610@hropus.UUCP>, ka@hropus.UUCP (Kenneth Almquist) writes: >> If the cases are not consecutive, the branch table must be filled with >> entries for the "missing" cases, making the table larger. [....] If >> the cases are mostly consecutive with a few outliers, a branch table >> could be augmented with a few specific tests for the outliers, but I >> don't know of any compiler that does this. > >The Microsoft Pascal/Fortran compilers, at least some of the old >ones (3.04 and 3.11) do this. I haven't tried it to see if their >wonderful new code generators do. > I am not sure that this is original, as I think it has been used in BCPL compilers and the like for ages. I implemented this in a C compiler for an unusual machine (a Three Rivers Perq running the CMU Accent operating system, with our own Unix emulation layered on top). See comments following for a discussion of the advantages. The algorithm is roughly as follows (barring off-by-one errors - my favourite): genSwitch( caseLabels, addresses, labelCount ) int *caseLabels; /* array of label values */ <mumble> *addresses; /* array of addresses */ int labelCount; /* number of labels */ { int labelRange = caseLabels[labelCount-1] - caseLabels[0]; if( labelRange <= acceptable_sparseness * labelCount ) /* suitably dense */ genJumpTable( caseLabels, addresses, labelCount ) else { /* too sparse - generate some tests * first find some suitable median value of label, * say one that divides the table in half for simplicity */ int median = findMedian( caseLabels, labelCount ); /* eg labelCount / 2; */ /* result is place in table, not value */ genIfStmt( switchExpr, caseLabel[median], less_than ); /* now do the lower half (expr < median) */ genSwitch( caseLabels, addresses, median ); startElse(); /* now the upper half (expr >= median) */ genSwitch( &caseLabels[median], &addresses[median], labelCount-median ); } } This algorithm has several nice features: - the sparsesness threshold can be tuned without hacking code - the use of recursion for each portion of the divided switch means that locally dense clusters in a sparse switch can each have their own jump table, reached by test-and-jumps - the tests for out-of-range expressions can be factored in to the algorithm, - the choice of division point can be made more intelligently if you wish (eg by maximising the density of one or both portions), by only changing findMedian() - non-dense cases are effectively handle by binary search, coded as a jump tree. This approaches optimal, if there is no information on dynamic frequencies. The generated code (subject to findMedian) is usually within one extra test of being optimal. Not bad for a simple-minded algorithm. As a footnote, the virtual machine architecture had no indirect jump instruction, so the same technique (and code) was used here by Trudy Watson in implementing FORTRASH computed goto's and (shudder) assigned goto's. --------------------------------------------------------------------------- Tony Williams |Informatics Division UK JANET: asw@uk.ac.rl.vd |Rutherford Appleton Lab Usenet: {... | mcvax}!ukc!rlvd!asw |Chilton, Didcot ARPAnet: asw%vd.rl.ac.uk@ucl-cs.arpa |Oxon OX11 0QX, UK
gary@darth.UUCP (Gary Wisniewski) (08/21/86)
In article <840@edison.UUCP> jso@edison.UUCP (John Owens) writes: >In article <610@hropus.UUCP>, ka@hropus.UUCP (Kenneth Almquist) writes: >> If the cases are not consecutive, the branch table must be filled with >> entries for the "missing" cases, making the table larger. [....] If >> the cases are mostly consecutive with a few outliers, a branch table >> could be augmented with a few specific tests for the outliers, but I >> don't know of any compiler that does this. > >The Microsoft Pascal/Fortran compilers, at least some of the old >ones (3.04 and 3.11) do this. I haven't tried it to see if their >wonderful new code generators do. Lattice C (for IBM PC/XT/AT) actually generates three possible code sequences for switch statements: 1. For a small number of cases, individual comparisons are generated. 2. For a large number of cases, a branch table is generated, with default branches for the missing cases. 3. For cases which compromise space (in 2) or speed (in 1), the final approach builds a compact branch table and scans it sequentially. This is true for versions 2.14/2.15 of Lattice C (which were the same as the Microsoft C Compiler before version 3.) Lattice no longer supplies the Microsoft compiler, but has continued the traditional switch statement code generation trio at Lattice version 3.00. It is certainly not clear whether or not it is worth the trouble. Especially since the Lattice code generator does not take advantage of some of the fast compare/increment/repeat instructions for the 8086/88 when implementing option (3). Instead, they generate a loop which takes several times the speed and space possible with a better technique. Another interesting note about Lattice's handling of switch statements. Lattice reduces the following: switch (i) { case '0': case '1': case '2': case '3': break; } to: if (i >= '0' || i <= '3') ; However, it will not perform the same optimization if other cases, not serial with the rest, are included in the switch statement. I hope that this is a side effect, since this particular situation would occur seldom, if ever, in real C code. In any case, they seem to have spent some time trying to generate interesting, if not useful, code for switch statements. (I do wish they had recognized the above example as useless and removed it completely, however.) Gary Wisniewski Pittsburgh, PA (412) 363-4685 uucp: {allegra,bellcore,cadre}!pitt!darth!gary