[comp.arch] RPM-40 really forwarding

tim@amdcad.AMD.COM (Tim Olson) (03/04/88)

In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
| IMHO, a pipelined processor should run as fast as its ALU
| lets it. ...
|
| Even a simple bypass path adds to this delay. It means
| that whatever the setup and delay times of this path,
| it must be added to the basic machine cycle time, IF
| that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
| This is LESS of a penalty than adding a register access,
| but still a penalty. So is it a win ?

It depends upon how often alu forwarding occurs (see below).  If it is
frequent, it is much better to slow the pipeline by the small amount of
time it takes to forward the result, rather than stalling a whole cycle. 
For example <numbers taken out of a hat>, if the cycle time through the
ALU is 20ns, forwarding takes 2ns, and forwarding occurs for 30% of all
instructions, then

                Processor A (no forwarding)    Processor B (forwarding)
CPI             1.3                            1.0
cycle time      20ns                           22ns

Raw MIPS        38.5                           45.5
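The arithmetic behind that table can be reproduced with a short sketch.
The function and variable names are made up for illustration, and the
model assumes every forwarded result would otherwise cost exactly one
stall cycle.

```python
def raw_mips(cpi, cycle_ns):
    """Native MIPS = 1000 / (CPI * cycle time in ns)."""
    return 1000.0 / (cpi * cycle_ns)

alu_ns = 20.0        # cycle time set by the ALU alone
forward_ns = 2.0     # extra delay of the bypass path
forward_frac = 0.30  # fraction of instructions needing a forwarded result

# Processor A: no bypass, each dependent instruction stalls one cycle.
mips_a = raw_mips(1.0 + forward_frac, alu_ns)
# Processor B: bypass, no stalls, but a longer cycle.
mips_b = raw_mips(1.0, alu_ns + forward_ns)

print(f"A: {mips_a:.1f} MIPS, B: {mips_b:.1f} MIPS")
```

Changing forward_frac or forward_ns shows how quickly the balance tips
one way or the other.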

| To be honest, I don't know. Although I have read plenty of
| research on BRANCH latency, I haven't seen much research on
| how often ALU result latency would result in interlocks, or
| even on how often LOAD latency would result in interlocks.
| Perhaps John Mashey has. If so, I'd like to see the
| references. Until then, I don't know what John means when he
| says "any high-performance system" will "likely" have zero latency.

Here are some numbers from the Am29000 simulator running a small "nroff"

instructions executed:				89435
instructions requiring alu forwarding:		41420 (46%)
instructions forwarding from load buffer:	13669 (15%)

I haven't seen published studies on dynamic forwarding frequencies --
does anyone know of such papers?

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/05/88)

An article by tim@amdcad.UUCP (Tim Olson) says:
] In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
] | IMHO, a pipelined processor should run as fast as its ALU
] | lets it. ...
] |
] | Even a simple bypass path adds to this delay. It means
] | that whatever the setup and delay times of this path,
] | it must be added to the basic machine cycle time, IF
] | that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
] | This is LESS of a penalty than adding a register access,
] | but still a penalty. So is it a win ?
] 
] It depends upon how often alu forwarding occurs (see below).  If it is
] frequent, it is much better to slow the pipeline by the small amount of
] time it takes to forward the result, rather than stalling a whole cycle. 
] [... example deleted ...]

 
So far I agree, but there's more ...
How often forwarding is needed is only PART of the story. The other
part is how often you could "fill" the delay from forwarding.


] Here are some numbers from the Am29000 simulator running a small "nroff"
] 
] instructions executed:				89435
] instructions requiring alu forwarding:		41420 (46%)
] instructions forwarding from load buffer:	13669 (15%)

But if I can fill 90%, say, of the one-cycle latency delays with
a reorganizer, then I only incur a penalty of 5%, which means,
for RPM40, that a bypass path is justified only if it incurs
a penalty of 1.2 nanoseconds or less. If I can fill 80% of
the latencies, then a bypass that inflicts a penalty on the
basic cycle time of 2.5 nanoseconds or less is a win. SO
not only do we need data like you've provided, we need to
know how often we can reorganize the delay away. Unfortunately,
I don't really have good data for either of these factors.
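The break-even reasoning can be written out as a small sketch. It
assumes a 25 ns cycle (40 MHz), takes the 46% ALU-forwarding frequency
from the nroff run quoted above, and charges one stall cycle for every
dependency the reorganizer fails to fill. The function name is
hypothetical.

```python
def break_even_penalty_ns(cycle_ns, forward_frac, fill_frac):
    """Largest bypass delay (ns) that still beats stalling.

    Without bypass: time/instr = cycle_ns * (1 + unfilled stalls per instr).
    With bypass:    time/instr = cycle_ns + penalty_ns.
    Setting the two equal gives penalty_ns = cycle_ns * unfilled.
    """
    unfilled = forward_frac * (1.0 - fill_frac)
    return cycle_ns * unfilled

CYCLE_NS = 25.0      # assumed: 40 MHz clock
FORWARD_FRAC = 0.46  # ALU-forwarding frequency from the nroff run

for fill in (0.90, 0.80):
    d = break_even_penalty_ns(CYCLE_NS, FORWARD_FRAC, fill)
    print(f"fill {fill:.0%}: bypass wins if it costs under {d:.2f} ns")
```

With these inputs the thresholds come out near 1.15 ns and 2.3 ns,
close to the rounded 1.2 ns and 2.5 ns figures above.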

] I haven't seen published studies on dynamic forwarding frequencies --
] does anyone know of such papers?
] 	-- Tim Olson

I, too, would be VERY interested in any such works.
--
    Dennis O'Connor			      oconnor%sungod@steinmetz.UUCP
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

earl@mips.COM (Earl Killian) (03/08/88)

In article <9799@steinmetz.steinmetz.UUCP> oconnor@sungoddess.steinmetz (Dennis M. O'Connor) writes:

   So far I agree, but there's more ...
   How often forwarding is needed is only PART of the story. The other
   part is how often you could "fill" the delay from forwarding.

   ] Here are some numbers from the Am29000 simulator running a small "nroff"
   ] instructions executed:				89435
   ] instructions requiring alu forwarding:		41420 (46%)
   ] instructions forwarding from load buffer:	13669 (15%)

   But if I can fill 90%, say, of the one-cycle latency delays with
   a reorganizer, then I only incur a penalty of 5%, which means,
   for RPM40, that a bypass path is justified only if it incurs
   a penalty of 1.2 nanoseconds or less. If I can fill 80% of
   the latencies, then a bypass that inflicts a penalty on the
   basic cycle time of 2.5 nanoseconds or less is a win. SO
   not only do we need data like you've provided, we need to
   know how often we can reorganize the delay away. Unfortunately,
   I don't really have good data for either of these factors.

   ] I haven't seen published studies on dynamic forwarding frequencies --
   ] does anyone know of such papers?

   I, too, would be VERY interested in any such works.

In article <475@imagine.PAWL.RPI.EDU> jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) writes:

   1) Slows down critical path.  Any finely tuned risc CPU will most
   probably have its cycle time determined by the latency through the
   ALU.  Using a loopback of ALU results might result (depending on
   layout, tech, etc) in up to a 20% slowdown in the ALU, plus
   increase the chip area and layout problems.  This doesn't mean a
   loopback is a loss necessarily, but that it does have a measurable
   cost which must be weighed against the benefits.

   2) In combination with (1) above, I'm not sure that having a
   one-cycle delay in ALU results causes any large loss.  A good
   reorganizer can fill those latencies, or move the ALU op forward
   into, for example, a load delay.  In high-speed (> 15 MHz) RISCs
   (and maybe slower ones as well), load delays are usually the
   determining factor, or a large part of it.  What studies do you
   have that compare RISCs with 1-cycle ALU delays and 0-cycle?  I'd
   like to see anything you can drag up.

To answer these questions I reran a local analysis program on the
results of 13 program runs.

First a note on terminology: I call the latency of an op the time it
takes until you can reference the result.  The delay is the latency
minus the time to issue the instruction itself (usually latency - 1).

The program defaults to
	-alu_rate 1 -alu_latency 1 -shift_rate 1 -shift_latency 1
	-load_rate 1 -load_latency 2
i.e. a model where you can use the result of an alu/shift instruction
in the next instruction and the result of a load one after that.  E.g.
the MIPSco R2000.  I instead specified
	-alu_rate 1 -alu_latency 2 -shift_rate 1 -shift_latency 2
	-load_rate 1 -load_latency 3 -reorganize
which simulates no bypassing (i.e. increase latencies by 1, but leave
rates alone).  The -reorganize says to reorganize to the new
constraints before analysis.  I then took the ratio of the new cycle
count and the old count and averaged:

13 samples
minimum		    1.024 (-1.7 sigma)
harmonic mean	    1.207 (-0.091 sigma)
geometric mean	    1.212 (-0.045 sigma)
mean		    1.217 sigma=0.1150, cov=0.09449
median		    1.228 (+0.096 sigma)
maximum		    1.408 (+1.7 sigma)
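The parenthesized figures in the table appear to be each statistic's
offset from the mean in units of the standard deviation, and cov the
standard deviation divided by the mean. A quick check under that
reading, using only the rounded values printed above (so the last digit
can differ slightly from the table):

```python
mean, sigma = 1.217, 0.1150   # from the summary table
summary = {
    "minimum": 1.024,
    "harmonic mean": 1.207,
    "geometric mean": 1.212,
    "median": 1.228,
    "maximum": 1.408,
}

offsets = {}
for name, value in summary.items():
    # Distance from the mean, measured in standard deviations.
    offsets[name] = (value - mean) / sigma
    print(f"{name:15s} {value:.3f} ({offsets[name]:+.2f} sigma)")

cov = sigma / mean            # coefficient of variation
print(f"cov = {cov:.4f}")
```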

I.e. the lack of bypassing is equivalent to a cycle time increase of
20%.  I.e. 5ns @ 40MHz.  The effect was as low as 2.4% and as high as
41%, which simply proves you can prove anything you like by looking at
single data points.
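For readers who want to try the comparison themselves, here is a
minimal sketch of the kind of latency model described above. It is not
the analysis program itself: the trace is invented, issue is strictly
in order at one instruction per cycle, and nothing is reorganized, so
it overstates the loss a reorganizer would leave.

```python
def count_cycles(trace, latency):
    """Cycles to issue `trace` in order, one instruction per cycle at best.

    trace:   list of (kind, dest_reg, source_regs)
    latency: kind -> cycles from issue until the result may be referenced
    """
    ready = {}   # register -> first cycle its value can be read
    cycle = 0    # cycle in which the next instruction may issue
    for kind, dest, srcs in trace:
        # Interlock: wait until every source operand is available.
        issue = max([cycle] + [ready.get(r, 0) for r in srcs])
        if dest is not None:
            ready[dest] = issue + latency[kind]
        cycle = issue + 1
    return cycle

# Hypothetical four-instruction trace, invented for illustration.
trace = [
    ("load", "r1", ["r9"]),
    ("alu",  "r2", ["r1", "r8"]),   # needs the load result
    ("alu",  "r3", ["r2", "r8"]),   # needs the previous ALU result
    ("alu",  "r4", ["r7", "r8"]),   # independent
]

bypass    = {"alu": 1, "shift": 1, "load": 2}   # the default model
no_bypass = {"alu": 2, "shift": 2, "load": 3}   # every latency + 1

base = count_cycles(trace, bypass)
slow = count_cycles(trace, no_bypass)
print(base, slow, slow / base)
```

On this toy trace the cycle count goes from 5 to 7. Moving the
independent instruction between the dependent ones is what the
-reorganize step would do to recover part of that.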

Anyway, I hope the hard data helps the discussion.

jesup@pawl21.pawl.rpi.edu (Randell E. Jesup) (03/09/88)

In article <1800@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <475@imagine.PAWL.RPI.EDU> jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) writes:
>
>   1) Slows down critical path.  Any finely tuned risc CPU will most
>   probably have its cycle time determined by the latency through the
>   ALU.  Using a loopback of ALU results might result (depending on
>   layout, tech, etc) in up to a 20% slowdown in the ALU, plus
>   increase the chip area and layout problems.  This doesn't mean a
...
>To answer these questions I reran a local analysis program on the
>results of 13 program runs.

[data indicating 20% loss on MIPS R2000 by removing loopback AND increasing
 load delay to 3]

>I.e. the lack of bypassing is equivalent to a cycle time increase of
>20%.  I.e. 5ns @ 40MHz.  The effect was as low as 2.4% and as high as
>41%, which simply proves you can prove anything you like by looking at
>single data points.

	Thanks for the data!  Sounds like a nice piece of software for
playing with architectures.

	Two points:  1)  The RPM-40 does have bypass on loads, you can use the
result of a load in the cycle it's going into the register file.  Bypass is
only missing on ALU ops.  I'd appreciate it if you'd re-run using just an
increased ALU latency.  2)  I suspect that the software is assuming that it
can't store the result of an ALU op in the next cycle.  In the RPM-40, you can
store it in the next cycle, as the store accesses the register in its WB
phase; it's using its ALU phase for address calculation.  Also, we have a
smaller number of GP registers, which causes more modify-store and
load-modify-store operations.

	It looks like my 20% figure (off the top of my head) was 'interesting'.
Of course that was just chance.  I agree that there is a cost due to not having
ALU bypassing, but I think your 20% figure is an upper limit for the average
loss.  I suspect maybe more like 5-15% will be the case, given the factors
above.

>Anyway, I hope the hard data helps the discussion.

	Most certainly!  Thank you.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)