[comp.arch] AM29000 Booleans numbers; long

mash@mips.UUCP (John Mashey) (05/07/87)

In article <16560@amdcad.AMD.COM> tim@amdcad.UUCP (Tim Olson) writes:
>In article <1270@aw.sei.cmu.edu> firth@sei.cmu.edu (Robert Firth) writes:

>|  The machine has comparison instructions that
>|  yield a Boolean result in a register.  The
>|  processor description says that TRUE is
>|  represented by a 1 in the MOST significant
>|  bit.  Is this a typo?
>No, it is not a typo.  One of the restrictions to the Boolean
>representation is that it had to be a single bit to allow quick
>determination of the target of a conditional jump.  Given this, it could
>either be placed in the most-significant bit (msb) or the
>least-significant bit (lsb).  The code generated for either of these
>representations is, in general, equivalent in terms of code space and
>cycles, except for two cases: the msb representation has the benefit of
>a quick "jump on negative", while the lsb representation "looks like a C
>Boolean", i.e.  it has the correct value when a Boolean is assigned to a
>variable. 
>
>For example, the code sequences that are generated for these cases are:
>			if (x < 0) .....
	(generated code)
>			x = (y < z);
	(generated code)
>Since the first code sequence is *much* more prevalent than the second in
>typical C code, it is better to "optimize" the first sequence at the
>expense of the second.

This is clearly correct [i.e., optimize for more frequent case],
although I suspect compiler writers may moan a little.
However, it does raise an interesting question:  was it not possible
with the 29K pipeline to offer the "other" fast branches, i.e., those that
do no arithmetic comparison, but that include the following set:
beqz
beq	(compares 2 regs)
bne	(compares 2 regs)
bnez
bgtz
blez
bgez*
bltz*

The *'d ones are the ones equivalent to the 29K's instructions.
Here is some data: over a set of 12 programs [as, ccom, compress, dhrystone,
espresso, hspice, nroff, terse, uopt, whetd, whets, timberwolf, i.e.,
mostly large real programs], we get the following data for dynamic frequencies,
as percentage of the instruction cycles [not cache/TLB miss]:

		mean	min	max
beqz/bnez	8.7%	3.7%	16.7%	A
bltz/bgez	0.83%	<0.1%	 4.6%	B	 [29K equiv]
beq,bne,bgtz,
blez		3.6%	1.3%	 8.4%	C

We have a set of compare instructions that let one materialize all of
the combinations.  The numbers for them are:

setlessthans	3.2%	1.1%	 8.0%	D

In general, in most cases, such instructions are shortly followed by
a beqz/bnez [I'll ignore the cases where one is just computing a 0 or 1].
Thus: E = % (beqz/bnez used WITHOUT compare) = A - D = 5.5%.

Let us grossly estimate the average instruction cycle hit [for us, as usual,
does not necesarily apply to anyone else] for several design choices.

Case 1: what increase in instruction cycles would we get if we didn't
have the fast branches [but instead, used compare + test condition code]:
= (A - D) + B + C = 5.5% + .8% + 3.6% = 9.9%, i.e., almost a 10% hit.

Case 2: how much would we get back if we added bltz/bgez back:
= B = .8%

As usual, there are all sorts of caveats about special cases, but
I think this is a reasonable estimate.

Bottom line:
a) Having the bltz/bgez (alone) is worth about .8%, which is at least
a modest win, and is probably worth the minor hassle of the shift to
get the 1-bit back.

b) Not having the other fast branches is about a 9% hit.  If the
cycle time is improved that much by not supporting them [unlikely,
but possible], then not having them is a win, else, it would have better
to do the full set, and then the 1-bit can go back in the other end
of the register.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bcase@amdcad.AMD.COM (Brian Case) (05/07/87)

In article <369@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>This is clearly correct [i.e., optimize for more frequent case],
>although I suspect compiler writers may moan a little.

Why?  The straight-forward approach works very well:  always generate
compare/branch sequences, then look at the pairs and see if they can
be optimized to eliminate the compare.  Believe me on this one, there
is no cause for moaning over adding this optimization to the peepholer.

>However, it does raise an interesting question:  was it not possible
>with the 29K pipeline to offer the "other" fast branches, i.e., those that
>do no arithmetic comparison, but that include the following set:
>beqz
>beq	(compares 2 regs)
>bne	(compares 2 regs)
>bnez
>bgtz
>blez
>bgez*
>bltz*
>
>The *'d ones are the ones equivalent to the 29K's instructions.
>Here is some data: over a set of 12 programs [as, ccom, compress, dhrystone,

[lots off good data, typical of John's postings (thanks again!)]

>b) Not having the other fast branches is about a 9% hit.  If the
>cycle time is improved that much by not supporting them [unlikely,
>but possible], then not having them is a win, else, it would have better
>to do the full set, and then the 1-bit can go back in the other end
>of the register.

Well, with our four-stage pipeline, branches must be executed in the
decode stage in order to avoid having double-delayed branches (something
that we *very* much want to avoid since at least some of our customers
will be doing some non-trivial assembly coding.  Single-delayed is
tough enough for human beings).  In order to execute a conditional
branch in the decode stage, all the information must be present then:
the branch condition and the target address.  The Am29000 has a
dedicated branch-offset adder to form target addresses, and the branch
condition is available, at the VERY end of the cycle, from either the
register file or from the ALU (if a compare instruction is generating
the branch condition, then it is in the execute stage when the branch
is in the decode stage, and forwarding will send the boolean result
to the branch target mux select line just in time).  Since our register
file does write before read to avoid two levels of forwarding (and
for other reasons, such as overlapping the local register file sp+
offset calculation with writing), there is no time to do a full,
32-bit zero detect to implement the most common branch if register==
zero/notzero construct.  We realized that there would be significant
benefit if it were possible, but there just didn't seem to be a way
to fit it in with all the other features we wanted.  Plus, we wanted
to clear the way for future implementation in other technologies.
Granted, we may have made other decisions that will make those
implementations less than pefect, but who's perfect?  (Don't answer
that. :-)

Does the MIPS processor have double-delayed branches?  That would
explain why that processor can get away with these things.

    bcase

mash@mips.UUCP (John Mashey) (05/07/87)

In article <16588@amdcad.AMD.COM> bcase@amdcad.UUCP (Brian Case) writes:
>In article <369@winchester.UUCP> mash@winchester.UUCP (John Mashey) writes:
>>This is clearly correct [i.e., optimize for more frequent case],
>>although I suspect compiler writers may moan a little.

(No big deal: compiler writers would prefer that 0/1 be generated;
this is not a major complaint.)

>>However, it does raise an interesting question:  was it not possible
>>with the 29K pipeline to offer the "other" fast branches....

> ...Brian explains that the 29K pipeline design was incompatible
> with other fast branches, unless they were made double-delayed.

>Well, with our four-stage pipeline, branches must be executed in the
>decode stage in order to avoid having double-delayed branches (something
>that we *very* much want to avoid since at least some of our customers
>will be doing some non-trivial assembly coding.....

Even more important, the statistics (from Stanford) show that
it's very hard to fill the 2nd delay slot, much harder than the first.

>Does the MIPS processor have double-delayed branches?  That would
>explain why that processor can get away with these things.

No, we found a way to do those branches with just one delay cycle.
As noted, double-delays can hurt a lot.  It took a lot of work,
and the timing is very tight, but the statistics claimed it was a win.
All of this is very dependent on pipeline design, so it may or may
not be the right tradeoff for different architectures.  We thought
it was real important, so we made the design go that way.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash, DDD:  	408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086