[net.lang.c] Generating code for the switch statement

ka@hropus.UUCP (Kenneth Almquist) (08/04/86)

> As I understand it, a switch/case setup compiles exactly the same as
>	if (var == const1) {.....};
>	else if (var == const2) {.....};
>	else  {default_action};
> anyway.

A compiler should be able to do better than that.  Generating good code for
switches is one place where compiler writers can show they know something
about searching.

Using a branch table will produce faster and smaller code if the cases
are consecutive integers.  The only thing that may be non-obvious is the
best way to test whether the expression being switched on is within
range before accessing the branch table.  A straightforward coding of
this test requires two comparisons, but Ritchie's compiler for the
PDP-11 generated only one comparison using the machine language
equivalent of the following code:

	if ((unsigned)(value -= MIN) > (MAX - MIN)) goto default;

On the PDP-11, this generates one less instruction (or two less, if MIN
is zero) than:

	if (value < MIN || value > MAX) goto default;

(Learning hacks like these is one of the benefits of reading the
compiler.)
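
Recast in C, the trick can be checked directly (MIN and MAX here are
invented for illustration; the real compiler also reuses the subtracted
value to index the branch table):

```c
#define MIN 3
#define MAX 9

/* Straightforward range check: two comparisons. */
static int in_range_slow(int value)
{
	return !(value < MIN || value > MAX);
}

/* Ritchie's trick: one comparison.  Subtracting MIN and viewing the
 * result as unsigned makes any value below MIN wrap around to a huge
 * number, so a single unsigned compare rejects both directions. */
static int in_range_fast(int value)
{
	return (unsigned)(value - MIN) <= (unsigned)(MAX - MIN);
}
```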

Since some people like to count starting at one, a possible enhancement
would be to change MIN from 1 to 0 in this case by adding an extra entry
to the beginning of the branch table.  But real C programmers count from
0; wimps who count starting with 1 deserve to have their switch statements
execute slower.  :-)

If the cases are not consecutive, the branch table must be filled with
entries for the "missing" cases, making the table larger.  The Ritchie
compiler will generate a branch table if the number of entries in the
branch table would be less than three times the number of cases.  If
the cases are mostly consecutive with a few outliers, a branch table
could be augmented with a few specific tests for the outliers, but I
don't know of any compiler that does this.

There are two general methods for implementing tables which do not rely
on the keys being nearly consecutive:  hashing and the binary search.
Hashing, which is what the Ritchie compiler uses, has the advantage that
its performance is independent of the number of cases, just as with a
branch table.  The modulo instruction is used as the hash function; the
C compiler tries a variety of values for the modulus and selects the
best one, which means that the time required to generate the hash table
is proportional to the square of the number of cases.
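
The article doesn't give the Ritchie compiler's actual table layout, so
the following C sketch only illustrates the idea; the case values and the
modulus are invented (the modulus chosen so all four keys land in
distinct slots):

```c
static const int keys[] = { 2, 19, 37, 101 };	/* hypothetical case values */
#define NKEYS	4
#define MODULUS	13	/* 2, 19, 37, 101 hash to 2, 6, 11, 10: no collisions */

static int slots[MODULUS];	/* slot -> index into keys[], or -1 for empty */

static void build_table(void)
{
	int i;
	for (i = 0; i < MODULUS; i++)
		slots[i] = -1;
	for (i = 0; i < NKEYS; i++)
		slots[keys[i] % MODULUS] = i;
}

/* Returns the index of the matching case, or -1 for the default. */
static int hash_dispatch(int value)
{
	int h = value % MODULUS;
	if (h < 0)
		h += MODULUS;	/* C's % can be negative for negative values */
	return (slots[h] >= 0 && keys[slots[h]] == value) ? slots[h] : -1;
}
```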

The UN*X VAX compiler, on the other hand, uses a binary search.  A
binary search executes in time proportional to the log of the number of
cases.  This makes it sound worse than the hash search; I presume
someone discovered that most switch statements contained few enough
statements that the binary search was superior.  The binary search works
best on a machine with a three way branch like that provided by the
Fortran arithmetic IF statement.  On a machine with condition codes
like the VAX, each step of the binary search is implemented by a compare
instruction followed by two branches on the condition codes.
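
Written out by hand in C, the compare tree such a strategy builds might
look like this for a four-case switch (the case values are invented; a
real compiler emits the equivalent compares and branches directly):

```c
/* Binary-search dispatch over the sorted case values {1, 5, 9, 12}:
 * one compare splits the table in half, then each half is searched. */
static int tree_dispatch(int v)
{
	if (v < 9) {
		if (v == 1) return 0;
		if (v == 5) return 1;
	} else {
		if (v == 9)  return 2;
		if (v == 12) return 3;
	}
	return -1;	/* default */
}
```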

Several things may be done to improve the performance of the binary
search.  First, many machines execute a conditional branch instruction
faster when no branch is taken, so that rather than testing the middle
value in the table one can test a value somewhat to one side in order to
decrease the chance that a branch will be taken.  This is implemented in
the VAX compiler, but some other possible improvements are not.  The
binary search could be abandoned in favor of a linear search at some
level (e.g. when the number of possibilities has been reduced to 2 or
3).  If there is a series of consecutive cases, this can be used to
eliminate extra tests.  (For example, if a number is known to be less
than 5 it cannot be greater than 4, and if a number is greater than 2 and
less than 4 then it must be equal to 3, but the VAX compiler will code
to test for both these cases.)  Finally, each comparison generated by
the VAX compiler looks like:

	cmpl	r0,$7		# compare switch value with 7
	beql	L1		# if equal, branch to the code for case 7
	bgtr	L8		# if greater, branch to L8 for the next cmpl.
	# The next cmpl instruction goes here.

Since the branch to L8 is more likely to be executed than the branch to
L1, the bgtr instruction should come first to minimize the number of
instructions executed.

For small numbers of cases, the good old linear search is used by both
the Ritchie and the VAX compilers.  Several things are done to speed up
the linear
search.  First, the switch expression is copied into a register.  This
helps in general but loses when the switch expression is a register
variable or the switch contains only one case.  (This is hard to fix
because the compilers generate the code for a switch in two pieces;
first they generate code to place the value to be switched on in r0, and
then they generate the code to implement the switch.  Another problem
with this separation appears on the PDP-11, where it may be easier to
compute an expression in r1 than in r0 due to a brain damaged multiply
instruction.)  Second, the compiler implements the search as a series of
compares followed by branch if equal instructions.  This minimizes the
time required to reach the default case because in the default case none
of the branches will be taken.  In contrast, the code at the top of this
article will probably be compiled into a series of branch if not equal
instructions, all of which will branch if there is no match.
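
In C terms, the compiled form amounts to a chain of equality tests that
falls straight through to the default when nothing matches (case values
invented for illustration):

```c
/* Linear-search dispatch: compare, branch only on a match.  When no
 * case matches, none of the branches is taken and control falls
 * directly through to the default. */
static int linear_dispatch(int v)
{
	if (v == 4)  return 0;
	if (v == 17) return 1;
	if (v == 23) return 2;
	return -1;	/* default */
}
```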

The Ritchie compiler at one time actually generated a table of values
and labels, and did a linear search on it using a loop.  This saves
space but takes more time.  Furthermore, it doesn't even save space if
there are 1 or 2 cases.  (The Ritchie compiler has special handling of
switch statements with no cases except the default.)  The Ritchie
compiler was changed to generate linear searches using a series of
compare instructions quite a long time ago, but I don't know if Dennis
was responsible for the change.

I don't know why the VAX compiler uses a linear search at all,
rather than using the binary search.  Probably this was a holdover from
the Ritchie compiler, which must be prepared to generate a linear search
because a linear search will outperform a hash search on a few elements.

In general, then, a compiler should do pretty well by using a jump table
where possible and a binary search otherwise.  What is actually optimum
must be determined by measuring various approaches on a specific machine.
				Kenneth Almquist
				ihnp4!houxm!hropus!ka	(official name)
				ihnp4!opus!ka		(shorter path)

zben@umd5 (Ben Cranston) (08/07/86)

In article <610@hropus.UUCP> ka@hropus.UUCP (Kenneth Almquist) writes:

> If the cases are not consecutive, the branch table must be filled with
> entries for the "missing" cases, making the table larger.  The Ritchie
> compiler will generate a branch table if the number of entries in the
> branch table would be less than three times the number of cases.  

This bit me once.  In one of innumerable Impress-understanding
programs I converted an IF nest into a switch statement, and was chagrined
when the program didn't show a vast efficiency improvement.  After peeking
at the (BSD4.x) C compiler code generator we found the 1/3 fudge factor, 
and that the Impress codes were only two short of being dense enough to 
compile into a jump table.  We didn't actually try this, but I suppose that 
if we had defined two or three bogus Impress codes and put in cases for 
them then the switch statement would have suddenly started being much more 
efficient.

Which brings me to another point: many people seem to make their programs
"more efficient" by minimizing the size of the (higher level language)
SOURCE code, rather than minimizing the output object.  Usually these go
hand-in-hand, but the above is an interesting case where MORE source code
resulted in LESS (and FASTER) object code...

And we won't even TALK about MOVE CORRESPONDING or SORT...

-- 
                    umd5.UUCP    <= {seismo!umcp-cs,ihnp4!rlgvax}!cvl!umd5!zben
Ben Cranston zben @ umd2.UMD.EDU    Kingdom of Merryland Sperrows 1100/92
                    umd2.BITNET     "via HASP with RSCS"

stevesu@copper.UUCP (Steve Summit) (08/07/86)

In article <610@hropus.UUCP>, ka@hropus.UUCP writes:
> > As I understand it, a switch/case setup compiles exactly the same as
> >	if (var == const1) {.....};
> >	else if (var == const2) {.....};
> >	else  {default_action};
> > anyway.
> 
> A compiler should be able to do better than that.  Generating good code for
> switches is one place where compiler writers can show they know something
> about searching.
> 
> Using a branch table will produce faster and smaller code if the cases
> are consecutive integers.

And then there was version 1 of DEC's VAX/VMS C compiler.  I have
this image of the people working on the code generator, saying
something like "well, we work for DEC, and we're writing this
compiler for this Vax, and our friends here at DEC gave us this
lovely CASEB instruction, so let's use it!"

Now, the CASEB instruction is basically a microcode
implementation of a branch table.  The trouble is, early versions
of VAX11C used it for *all* switch statements, whether the
labels were reasonably dense or not.  Woe betide the poor
programmer who wrote something like

	switch(x)
		{
		case 1:
		case 10000:

Yup, you got this enormous object file, which took forever to
compile and link, because it had a CASEB instruction with a
ridiculously sparse branch table in it.

By the way, this problem has been fixed.  Version 2 of the VMS C
compiler is quite nice, and I have very few complaints with it.

                                         Steve Summit
                                         tektronix!copper!stevesu

chris@umcp-cs.UUCP (Chris Torek) (08/08/86)

>In article <610@hropus.UUCP> ka@hropus.UUCP (Kenneth Almquist) writes:
>>If the cases are not consecutive, the branch table must be filled with
>>entries for the "missing" cases, making the table larger [; eventually
>>this becomes a waste of space and the compiler uses a different approach].

In article <1171@umd5> zben@umd5.umd.edu (Ben Cranston) writes:
>... In one of an innumerable number of Impress-understanding
>programs I converted an IF nest into a switch statement, and was
>chagrined when the program didn't show a vast efficiency improvement.
>After peeking at the (BSD4.x) C compiler code generator we found
>the 1/3 fudge factor, and that the Impress codes were only two
>short of being dense enough to compile into a jump table.

[... followed by speculation on the effects of adding two cases.]

One can force a consecutive sequence in another, less compiler
dependent, way.  Write a compressing array:

	char	input_class[256] = {
		/* classes for 0-127 */
		is_char, is_char, ..., is_char,
		/* classes for 128 on up */
		is_obj1, is_obj2, is_obj2, is_badobj, is_obj1, ...
	};

Then use

	switch (input_class[input]) {
	case is_char:
		...
	case is_obj1:
		...
	}
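
A self-contained toy version of the technique (the class names and the
classification itself are invented here, not taken from any DVI program):

```c
enum { is_digit, is_alpha, is_other };	/* invented input classes */

static char input_class[256];

static void init_classes(void)
{
	int c;
	for (c = 0; c < 256; c++) {
		if (c >= '0' && c <= '9')
			input_class[c] = is_digit;
		else if ((c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z'))
			input_class[c] = is_alpha;
		else
			input_class[c] = is_other;
	}
}

static int classify(unsigned char input)
{
	/* The switch is now over a small dense range, so the compiler
	 * can use a jump table no matter how scattered the raw input
	 * values are. */
	switch (input_class[input]) {
	case is_digit:	return 1;
	case is_alpha:	return 2;
	default:	return 0;
	}
}
```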

Those who are familiar with Knuth's TeX system may have seen some
of the code supplied to convert TeX .DVI files to device-dependent
data (a la Imagen's imPress language).  In these, someone---Knuth
himself most likely---has written code of the form

	/* translated to C for this newsgroup */
	#define two_cases(base) \
			base: case base+1
	#define four_cases(base) \
			two_cases(base): case two_cases(base+2)
	#define eight_cases(base) \
			four_cases(base): case four_cases(base+4)
	#define thirty_two_cases(base) \
			eight_cases(base): case eight_cases(base+8): \
			case eight_cases(base+16): case eight_cases(base+24)
	#define	sixty_four_cases(base) \
			thirty_two_cases(base): case thirty_two_cases(base+32)

then, later in the program, used these as (e.g.)

	#define w0	147
	...
	#define	fntnum0	171
	...

		case four_cases(w0):
			w = p;
			x += w;
			break;
		...

		case sixty_four_cases(fntnum0):
			font = p;
			break;
		...

I originally used rather similar code in my own Imagen and Versatec
DVI conversion programs.  For some reason I decided one day to try
a compressing array like the one above.  To my surprise, the code
was not only clearer to me, but also ran rather faster.  I verified
that the generated code indeed used a vax `casel' instruction in
both versions.  My guess is that the speedup came entirely from
fitting more of the main loop into the cache.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1516)
UUCP:	seismo!umcp-cs!chris
CSNet:	chris@umcp-cs		ARPA:	chris@mimsy.umd.edu

jso@edison.UUCP (John Owens) (08/15/86)

In article <610@hropus.UUCP>, ka@hropus.UUCP (Kenneth Almquist) writes:
> If the cases are not consecutive, the branch table must be filled with
> entries for the "missing" cases, making the table larger.  [....]  If
> the cases are mostly consecutive with a few outliers, a branch table
> could be augmented with a few specific tests for the outliers, but I
> don't know of any compiler that does this.

The Microsoft Pascal/Fortran compilers, at least some of the old
ones (3.04 and 3.11) do this.  I haven't tried it to see if their
wonderful new code generators do.

	John Owens @ General Electric Company	(+1 804 978 5726)
	jso%edison.UUCP@seismo.CSS.GOV			[arpa]
	jso@edison.UUCP					[w/ uucp domains]
	{cbosgd allegra ncsu xanth}!uvacs!edison!jso	[roll your own]

asw@tony.UUCP (Tony Williams) (08/20/86)

In article <840@edison.UUCP> jso@edison.UUCP (John Owens) writes:
>
>In article <610@hropus.UUCP>, ka@hropus.UUCP (Kenneth Almquist) writes:
>> If the cases are not consecutive, the branch table must be filled with
>> entries for the "missing" cases, making the table larger.  [....]  If
>> the cases are mostly consecutive with a few outliers, a branch table
>> could be augmented with a few specific tests for the outliers, but I
>> don't know of any compiler that does this.
>
>The Microsoft Pascal/Fortran compilers, at least some of the old
>ones (3.04 and 3.11) do this.  I haven't tried it to see if their
>wonderful new code generators do.
>

I am not sure that this is original, as I think it has been used in
BCPL compilers and the like for ages.
I implemented this in a C compiler for an unusual machine (a Three Rivers
Perq running the CMU Accent operating system, with our own Unix emulation
layered on top).  See comments following for a discussion of the advantages.

The algorithm is roughly as follows (barring 
off-by-one errors - my favourite):

genSwitch( caseLabels, addresses, labelCount )
    int *caseLabels;		/* array of label values */
    <mumble> *addresses;	/* array of addresses */
    int labelCount;		/* number of labels */
{
    int labelRange = caseLabels[labelCount-1] - caseLabels[0];
    if( labelRange <= acceptable_sparseness * labelCount )
		/* suitably dense */
	genJumpTable( caseLabels, addresses, labelCount );
    else
    {

	/* too sparse - generate some tests
	 * first find some suitable median value of label,
	 * say one that divides the table in half for simplicity
	 */

	int median = findMedian( caseLabels, labelCount );
				/* eg labelCount / 2; */
				/* result is place in table, not value */
	genIfStmt( switchExpr, caseLabels[median], less_than );
		
	/* now do the lower half (expr < median) */
	genSwitch( caseLabels, addresses, median );
		
	startElse();
		
	/* now the upper half (expr >= median) */
	genSwitch( &caseLabels[median], &addresses[median], labelCount-median );

    }
}

This algorithm has several nice features:
  - the sparseness threshold can be tuned without hacking code
  - the use of recursion for each portion of the divided switch
    means that locally dense clusters in a sparse switch can each have their
    own jump table, reached by test-and-jumps
  - the tests for out-of-range expressions can be factored in to the
    algorithm,
  - the choice of division point can be made more intelligently if you
    wish (eg by maximising the density of one or both portions),
    by only changing findMedian()
  - non-dense cases are effectively handled by binary search,
    coded as a jump tree.  This approaches optimal, if there is
    no information on dynamic frequencies.

The generated code (subject to findMedian) is usually within one extra test
of being optimal.  Not bad for a simple-minded algorithm.

As a footnote, the virtual machine architecture had no indirect jump
instruction, so the same technique (and code) was used here by Trudy Watson
in implementing FORTRASH computed goto's and (shudder) assigned goto's.
---------------------------------------------------------------------------
Tony Williams					|Informatics Division
UK JANET:	asw@uk.ac.rl.vd			|Rutherford Appleton Lab
Usenet:		{... | mcvax}!ukc!rlvd!asw	|Chilton, Didcot
ARPAnet:	asw%vd.rl.ac.uk@ucl-cs.arpa	|Oxon OX11 0QX, UK

gary@darth.UUCP (Gary Wisniewski) (08/21/86)

In article <840@edison.UUCP> jso@edison.UUCP (John Owens) writes:
>In article <610@hropus.UUCP>, ka@hropus.UUCP (Kenneth Almquist) writes:
>> If the cases are not consecutive, the branch table must be filled with
>> entries for the "missing" cases, making the table larger.  [....]  If
>> the cases are mostly consecutive with a few outliers, a branch table
>> could be augmented with a few specific tests for the outliers, but I
>> don't know of any compiler that does this.
>
>The Microsoft Pascal/Fortran compilers, at least some of the old
>ones (3.04 and 3.11) do this.  I haven't tried it to see if their
>wonderful new code generators do.

Lattice C (for IBM PC/XT/AT) actually generates three possible code sequences
for switch statements:

	1.  For a small number of cases, individual comparisons
	    are generated.
	2.  For a large number of cases, a branch table is generated, with
	    default branches for the missing cases.
	3.  For cases which compromise space (in 2) or speed (in 1), the
	    final approach builds a compact branch table and scans it
	    sequentially.

This is true for versions 2.14/2.15 of Lattice C (which were the same
as the Microsoft C Compiler before version 3.)  Lattice no longer supplies
the Microsoft compiler, but has continued the traditional switch statement
code generation trio at Lattice version 3.00.

It is not clear whether all this is worth the trouble, especially
since the Lattice code generator does not take advantage of some of
the fast compare/increment/repeat instructions of the 8086/88 when
implementing option (3).  Instead, it generates a loop that takes
several times the time and space of a better technique.

Another interesting note about Lattice's handling of switch statements.
Lattice reduces the following:

	switch (i)
	  {
	    case '0':
	    case '1':
	    case '2':
	    case '3':
	      break;
	  }

to:

	if (i >= '0' && i <= '3')
	  ;

However, it will not perform the same optimization if other cases, not
serial with the rest, are included in the switch statement.  I hope that
this is a side effect, since this particular situation would occur
seldom, if ever, in real C code.
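
The reduction is easy to verify: for four consecutive cases, the switch
and an equivalent range test agree for every input.  A minimal sketch:

```c
/* The dense switch above, and the range test it can be reduced to. */
static int via_switch(int i)
{
	switch (i) {
	case '0': case '1': case '2': case '3':
		return 1;
	}
	return 0;
}

static int via_range(int i)
{
	return i >= '0' && i <= '3';
}
```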

In any case, they seem to have spent some time trying to generate
interesting, if not useful, code for switch statements.  (I do
wish they had recognized the above example as useless and removed it
completely, however.)

Gary Wisniewski
Pittsburgh, PA
(412) 363-4685
uucp:	{allegra,bellcore,cadre}!pitt!darth!gary