[comp.arch] Nitpicking on Steve's explanation of 960 instruction speeds

jimv@radix (Jim Valerio) (04/17/88)
As has been pointed out by others, the performance bottlenecks visible
in the current 960 offering are largely implementation issues and
not architecture issues.  There are two interesting architectural
performance issues: no delayed branches (branch squashing is encouraged
instead), and call/ret instructions that allocate and release new register
frames (branch-and-link instructions exist also to short-circuit this).

But this article doesn't talk about the architectural performance limitations.
Instead, I clarify (or nitpick) Steve McGeady's recent performance numbers for
the current 960KB implementation.


In article <3375@omepd> mcg@iwarpo3.UUCP (Steve McGeady) implies that the
major reason the 960 averages more than 2 cycles per instruction is
that the register file has an insufficient number of ports.  This shows
up not only in register ALU ops, which he mentioned explicitly, but also
for pipelined (data, not instruction) memory loads which share the ALU's
datapath to the register file.

What Steve forgot to mention is what I'll call the IFU Botch (sorry, R :-),
which fundamentally limits the "peak MIPS" to 3/4 the clock rate, even
though statistically the impact is less.  In retrospect, there turns
out to be a reasonably straightforward way to remove this limitation,
although I don't believe the design has been changed to incorporate it.
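
To put a number on that 3/4 ceiling, here is a minimal sketch that assumes
nothing beyond the factor stated above.  The clock rates in the loop are
illustrative values chosen for the example, not figures quoted in this
article, and the function names are made up for the illustration.

```python
from fractions import Fraction

# Peak-rate ceiling described above: at most 3 instructions per 4 clocks.
IFU_PEAK_FRACTION = Fraction(3, 4)

def peak_mips(clock_mhz):
    """Upper bound on native MIPS at a given clock rate in MHz."""
    return float(IFU_PEAK_FRACTION * clock_mhz)

def min_cycles_per_instruction():
    """Best-case cycles per instruction implied by the same ceiling (4/3)."""
    return float(1 / IFU_PEAK_FRACTION)

if __name__ == "__main__":
    for mhz in (16, 20, 25):   # illustrative clock rates (assumption)
        print(f"{mhz} MHz -> at most {peak_mips(mhz):.2f} MIPS")
```

Put the other way around, the ceiling means the best case is 4/3 cycles per
instruction, which is still well under the 2+ cycle average discussed above.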

Steve is also correct in pointing out that compare-and-branch timing
is weak (a 2-clock latency instead of 3 would be nice), and that
the call/ret overhead is relatively expensive.

The three-deep memory bus request fifo appears at first glance to be too
deep for systems with many wait-states, but is no problem as long as the
instruction prefetcher is throttled when the fifo is nonempty.  I don't
know if the throttle is in the announced processor.

Technically, the multiplier is not an early-out multiplier, but rather a
2-bit-per-cycle multiplier with speedups for strings of 0's and 1's; this
tends to be slightly faster than an early-out multiplier such as is found
in the 386.
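
One classical way to get "2 bits per cycle, with nothing to do for strings
of 0's and 1's" is radix-4 Booth recoding.  The sketch below illustrates
that technique only; this article does not say that the 960KB multiplier
works exactly this way, and the function names are invented for the example.
The multiplier is treated as a 32-bit two's-complement value.

```python
def booth_radix4_digits(multiplier, bits=32):
    """Recode a two's-complement multiplier into radix-4 Booth digits.

    Each digit is in {-2, -1, 0, 1, 2} and covers two multiplier bits,
    least significant digit first.
    """
    assert bits % 2 == 0
    m = multiplier & ((1 << bits) - 1)
    digits = []
    prev = 0                      # implicit 0 bit to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits

def booth_multiply(multiplicand, multiplier, bits=32):
    """Multiply via the recoded digits; returns (product, nonzero_steps)."""
    product = 0
    nonzero_steps = 0
    for k, digit in enumerate(booth_radix4_digits(multiplier, bits)):
        if digit != 0:            # zero digits need no add or subtract
            product += (digit * multiplicand) << (2 * k)
            nonzero_steps += 1
    return product, nonzero_steps
```

A multiplier such as 0x0000FFFF recodes to only two nonzero digits (one at
each end of the run of 1's), whereas an alternating pattern such as
0x55555555 needs an add at every one of the 16 digit positions.  An early-out
multiplier only benefits from leading zeros, which is the difference being
pointed out above.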

Very few floating-point instructions (or any instructions, for that matter)
are interruptible or resumable.  The design requirement was a maximum
interrupt latency of 5 microseconds, which every floating-point instruction
meets (in all precisions) except square root, remainder, and the
transcendental approximation instructions.
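
The 5 microsecond requirement translates into a cycle budget that depends on
the clock rate.  A rough sketch of that conversion follows; the clock rate
and the per-instruction cycle counts passed in are hypothetical inputs, since
this article gives no cycle counts for individual instructions.

```python
import math

def latency_budget_cycles(clock_mhz, max_latency_us=5.0):
    """Whole clock cycles available within the latency limit.

    MHz times microseconds gives cycles directly.
    """
    return math.floor(clock_mhz * max_latency_us)

def fits_budget(instruction_cycles, clock_mhz, max_latency_us=5.0):
    """True if a non-interruptible instruction of the given length
    (hypothetical cycle count) stays within the latency limit."""
    return instruction_cycles <= latency_budget_cycles(clock_mhz, max_latency_us)
```

At an assumed 20 MHz, for instance, the budget works out to 100 cycles, so
any instruction that can run longer than that has to be interruptible or
resumable to meet the requirement.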


By the way, Steve, hasn't the 960 C compiler fixed that long-outstanding
performance nit in its entry code?

	_foo:	# foo takes four integer args, has int [100] auto array
		ldconst	400,r15
		addo	sp,r15,sp	# allocate auto space on stack

This should be:

	_foo:	# foo takes four integer args, has int [100] auto array
		lda	400(sp),sp	# allocate auto space on stack

(For those of you not familiar with the 960 instruction set: lda is
"load effective address".  One pleasant attribute of the 960 load/store
addressing modes is that they may use any subset of the terms in the
generalized form:

	basereg + offset + indexreg*scale

where offset can be up to 32 bits and scale is 1, 2, 4, 8, or 16.)
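
For readers who prefer to see the address arithmetic spelled out, here is a
small model of the generalized form above.  It is a sketch, not a description
of the hardware: the function name is made up, and wrapping the result to
32 bits is my assumption about how overflow behaves.

```python
MASK32 = 0xFFFFFFFF
VALID_SCALES = (1, 2, 4, 8, 16)

def effective_address(base=0, offset=0, index=0, scale=1):
    """Compute basereg + offset + indexreg*scale.

    Any term may be omitted by leaving it at its default.  The result is
    wrapped to 32 bits (assumption).
    """
    if scale not in VALID_SCALES:
        raise ValueError("scale must be 1, 2, 4, 8, or 16")
    return (base + offset + index * scale) & MASK32

if __name__ == "__main__":
    sp = 0x1000                                   # arbitrary example value
    sp = effective_address(base=sp, offset=400)   # models: lda 400(sp),sp
    print(hex(sp))
```

The single lda in the corrected entry code is just the base-plus-offset
subset of this form, which is why it can replace the ldconst/addo pair.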

--
Jim Valerio	jimv%radix@omepd.intel.com, {verdix,omepd}!radix!jimv