[comp.arch] lets start another processor war - i860 vs RIOS

grunwald@foobar.colorado.edu (Dirk Grunwald) (06/14/90)

well, there's been too much talk about Mac OS's and trivia. Time to
get back to what comp.arch is good at.

I just finished reading most of the architecture articles in the IBM
journal about the RIOS/America architecture. I've also read articles &
microprocessor manuals for the i860.

The frank opinion I got was that the i860 was a total barf for
compilers and people. You can certainly manage the pipeline, but it's
a royal pain. Witness that the i860 has been available for a year, and
no one has an (available) compiler that gives you more than about 5
MFLOPS for the 40MHz parts, while the RIOS is getting >7 with a 20MHz
part and >10 with a 25MHz part. The additional interlocking logic on
the RIOS simplifies compiler and programming tasks considerably. You
also get the nice addition of precise exceptions.
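The difference between the two styles can be sketched in a toy model (not
actual i860 or RIOS behavior; the 3-cycle latency and register names are
assumptions for illustration): with hardware interlocks a dependent read
just stalls, while with an exposed pipeline a badly scheduled read silently
gets a stale value, and it's the compiler's job to pad the gap.

```python
# Toy contrast of an exposed pipeline (i860-style) vs. a hardware-
# interlocked one (RIOS-style).  Latency and names are illustrative
# assumptions, not taken from either manual.

FP_LATENCY = 3  # assumed cycles before a result becomes readable

def run(program, interlocked):
    """program: list of (dest, srcs), issued one per cycle.
    Returns (results_correct, total_cycles)."""
    ready_at = {}          # register -> cycle its value becomes valid
    cycle = 0
    for dest, srcs in program:
        if interlocked:
            # Hardware stalls issue until every source is ready.
            for s in srcs:
                cycle = max(cycle, ready_at.get(s, 0))
        else:
            # Exposed pipeline: reading a not-yet-ready source silently
            # yields a stale value -- a compiler scheduling bug.
            for s in srcs:
                if cycle < ready_at.get(s, 0):
                    return (False, cycle)
        ready_at[dest] = cycle + FP_LATENCY
        cycle += 1
    return (True, cycle)

back_to_back = [("f1", []), ("f2", ["f1"])]   # dependent, no padding
print(run(back_to_back, interlocked=True))    # correct, at the cost of stalls
print(run(back_to_back, interlocked=False))   # wrong unless the compiler pads
```

The interlocked machine pays in stall cycles; the exposed machine pays in
compiler complexity, which is exactly the complaint above.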

The only things that would seem to stop the RS/6000 from sweeping the
Intel juggernaut off the earth are:

	(a) Intel will sell you chips, IBM will sell you systems
	(b) RIOS is four chips + cache + everything else, Intel
	    is one chip + more cache + everything else

The question I had from their articles was about the expansion path for
the architecture. Certainly, you can crank up the clock - the part's
only doing 25MHz in the marketplace now, but allegedly, they're going
to produce parts that give you 50 -> 100 SPECmarks within a year or
two, which is either a conservative doubling or a not-so-conservative
(without more integration) quadrupling of clock speed (& memory &
cache).

What I'm wondering is, can they stick in more FXU's or FPU's without
changing the architecture semantics? Right now, the single FXU handles
loads/stores for the FPU as well as the integer core insns. There's
interlocking between the two for precise exceptions. With the current
wires to memory, you could probably add another FXU -- you can fetch 4
insns at once, and there are currently four functional units. Clearly,
everything isn't going to be kept maximally busy w/o more insns being
fetched, but for straight-line basic blocks the additional FXU might
make an improvement.

So, for people who've looked at the architecture, does it look like you
could toss in another FXU and still keep precise exceptions? If not,
would the FPU/FXU blocking make the advantages of an additional FXU
moot? Likewise, insn fetch - would you really need one insn fetched
per functional unit, or would ``sharing'' the bandwidth between the
branch/extra FXU be an advantage?
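The fetch-bandwidth half of that question can be played with on the back
of an envelope. The model below is a sketch under stated assumptions (a
4-wide fetch, a made-up dynamic instruction mix, and one retirement slot
per unit per class); none of the numbers come from the IBM articles.

```python
# Back-of-the-envelope model: how much does a second FXU buy, given a
# fixed fetch width?  Mix fractions and widths are assumptions.

def sustained_ipc(mix, units, fetch_width=4):
    """mix: fraction of each instruction class in a fetch packet.
    units: issue slots per class per cycle.  Each packet of
    fetch_width insns drains at the rate of its most oversubscribed
    class; return the resulting sustained instructions per cycle."""
    cycles = 1.0
    for cls, frac in mix.items():
        demand = frac * fetch_width          # insns of this class per packet
        cycles = max(cycles, demand / units[cls])
    return fetch_width / cycles

mix = {"fxu": 0.5, "fpu": 0.3, "branch": 0.2}   # assumed dynamic mix
one_fxu = sustained_ipc(mix, {"fxu": 1, "fpu": 1, "branch": 1})
two_fxu = sustained_ipc(mix, {"fxu": 2, "fpu": 1, "branch": 1})
print(one_fxu, two_fxu)
```

With an integer-heavy mix the single FXU is the bottleneck class, so a
second one lifts sustained IPC even at the same fetch width -- which is
the intuition behind the question above.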

Dirk Grunwald -- Univ. of Colorado at Boulder	(grunwald@foobar.colorado.edu)
						(grunwald@boulder.colorado.edu)

marc@oahu.cs.ucla.edu (Marc Tremblay) (06/16/90)

In article <22257@boulder.Colorado.EDU> grunwald@foobar.colorado.edu writes:
>What I'm wondering is, can they stick in more FXU's or FPU's without
>changing the architecture semantics.
...
>So, for people who've looked at the architecture, does it look like you
>could toss in another FXU and still keep precise exceptions? If not,
>would the FPU/FXU blocking make the advantages of an additional FXU
>moot?

The Fixed-Point Unit is very limited in the amount of parallelism
it can achieve. Even though the chip set is superscalar in
terms of issuing several instructions per cycle, only one fixed-point
arithmetic instruction can be executed per cycle. Adding another ALU
seems to be a good way to take advantage of the additional parallelism
available. According to [Smith, Johnson, Horowitz ASPLOS 89], adding
an ALU is more profitable than adding another load pipe.
Of course the addition of another ALU would introduce out-of-order
completions of instructions since the latencies of the different
functional units are different. Then a scheme such as a result buffer
would be required in order to maintain precise interrupts.
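The result-buffer idea can be sketched in a few lines (a minimal reorder
buffer, not IBM's design; the class and method names are mine): completions
may arrive in any order, but architectural state is only updated for the
in-order prefix of finished instructions, so an interrupt always lands on a
precise boundary.

```python
# Minimal reorder-buffer sketch: out-of-order completion, in-order
# commit.  Illustrative only -- not the RIOS mechanism.

class ReorderBuffer:
    def __init__(self):
        self.entries = []   # program order: [dest, value, done]
        self.head = 0       # oldest not-yet-committed entry

    def issue(self, dest):
        self.entries.append([dest, None, False])
        return len(self.entries) - 1       # tag handed to the unit

    def complete(self, tag, value):        # completions in any order
        self.entries[tag][1] = value
        self.entries[tag][2] = True

    def commit(self, regfile):
        # Retire only the contiguous done prefix, in program order.
        while self.head < len(self.entries) and self.entries[self.head][2]:
            dest, value, _ = self.entries[self.head]
            regfile[dest] = value
            self.head += 1

regs = {}
rob = ReorderBuffer()
t_add = rob.issue("r1")    # integer op, first in program order
t_fp  = rob.issue("f1")    # long-latency FP op, issued second
rob.complete(t_fp, 2.5)    # FP finishes first (out of order)...
rob.commit(regs)           # ...but nothing commits past the older add
rob.complete(t_add, 7)
rob.commit(regs)           # now both retire, in program order
print(regs)
```

At the first commit the register file is untouched, so an interrupt taken
there would see a clean, precise state -- the whole point of the buffer.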

The way that the IBM chip set maintains precise interrupts is simple,
but it also impairs performance. For example, the FXU and the FPU are
interlocked in the first few pipeline stages, and loads (even floating-
point loads) are all handled by the FXU so that out-of-order loads
do not occur. That's fine, but it also means that the FXU can't decode
other instructions while it is handling loads -- a real limitation.
Also, since parallelism can occur between the FPU and the FXU, whenever
an instruction in the FXU can cause an interrupt, subsequent instructions
in the FPU are blocked until the FXU finds out whether there was an
interrupt. This reduces the possible overlap between the two units
in order to keep the synchronization simple. Parallelism is sacrificed
for the "rare" cases when interrupts occur. In future implementations,
we may see a more elaborate scheme where the FPU is allowed to proceed,
given that its state can be restored upon an interrupt.

Notice that most of the performance-impairing problems mentioned above
can be reduced by proper instruction scheduling. By mixing FXU and FPU
instructions, and by scheduling potentially-interrupt-causing instructions
appropriately, the FPU *will not* have to wait for the FXU interrupt
to be resolved (it will already be resolved by the time the FPU instruction
reaches the execute stage).
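That scheduling effect can be made concrete with a toy model (the 2-cycle
interrupt-check latency and the instruction sequences are assumptions, not
RIOS numbers): if independent FXU work is placed between a trap-capable FXU
instruction and the next FPU instruction, the check has resolved before the
FPU op arrives and no stall is charged.

```python
# Toy model of the scheduling argument: an FPU op stalls if it arrives
# before a preceding FXU instruction's interrupt check has resolved.
# The check latency and the sequences are illustrative assumptions.

CHECK = 2   # assumed cycles for the FXU to rule out an interrupt

def stalls(schedule):
    """schedule: instruction kinds issued one per cycle; kinds are
    'fxu_trap' (can interrupt), 'fxu', and 'fpu'."""
    total, last_trap = 0, None
    for t, kind in enumerate(schedule):
        if kind == "fpu" and last_trap is not None:
            total += max(0, CHECK - (t - last_trap))
        if kind == "fxu_trap":
            last_trap = t
    return total

naive     = ["fxu_trap", "fpu", "fpu", "fxu", "fxu"]   # FPU right behind the trap
scheduled = ["fxu_trap", "fxu", "fxu", "fpu", "fpu"]   # FXU work fills the gap
print(stalls(naive), stalls(scheduled))
```

Same five instructions, zero stalls after reordering -- which is why Marc's
question below about how often the compiler can achieve this is the one
that matters.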

So we ask the question: 

	"How often can the compiler schedule instructions in a way
	 that the FPU and FXU are synchronized so that they don't wait
	 on each other?"

	-It doesn't matter, only SPECmark counts!

			Marc Tremblay
			internet: marc@CS.UCLA.EDU
			UUCP: ...!{uunet,ucbvax,rutgers}!cs.ucla.edu!marc