tim@amdcad.AMD.COM (Tim Olson) (03/04/88)
In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
| IMHO, a pipelined processor should run as fast as its ALU
| lets it. ...
|
| Even a simple bypass path adds to this delay. It means
| that whatever the setup and delay times of this path,
| it must be added to the basic machine cycle time, IF
| that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
| This is LESS of a penalty than adding a register access,
| but still a penalty. So is it a win ?

It depends upon how often alu forwarding occurs (see below). If it is
frequent, it is much better to slow the pipeline by the small amount of
time it takes to forward the result, rather than stalling a whole cycle.

For example <numbers taken out of a hat>, if the cycle time through the
ALU is 20ns, forwarding takes 2ns, and forwarding occurs for 30% of all
instructions, then

              Processor A (no forwarding)    Processor B (forwarding)
cpi                   1.3                            1.0
cycle time            20ns                           22ns
Raw MIPS              38.5                           45.5

| To be honest, I don't know. Although I have read plenty of
| research on BRANCH latency, I haven't seen much research on
| how often ALU result latency would result in interlocks, or
| even on how often LOAD latency would result in interlocks.
| Perhaps John Mashey has. If so, I'd like to see the
| references. Until then, I don't know what John means when he
| says "any high-performance system" will "likely" have zero latency.

Here are some numbers from the Am29000 simulator running a small "nroff"

instructions executed:                      89435
instructions requiring alu forwarding:      41420 (46%)
instructions forwarding from load buffer:   13669 (15%)

I haven't seen published studies on dynamic forwarding frequencies --
does anyone know of such papers?

	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)
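The Processor A / Processor B comparison in the post above is simple arithmetic and can be reproduced in a few lines. This is a minimal sketch using only the hat-drawn numbers from the post (20ns ALU cycle, 2ns forwarding, 30% of instructions needing a forwarded result); the function and variable names are illustrative, and it assumes each unforwarded dependency costs exactly one stall cycle.

```python
def raw_mips(cpi, cycle_ns):
    """Millions of instructions per second for a given CPI and cycle time."""
    return 1000.0 / (cpi * cycle_ns)

alu_ns = 20.0            # cycle time through the ALU (from the post)
forward_ns = 2.0         # extra delay added by the forwarding path
forward_fraction = 0.30  # fraction of instructions needing a forwarded result

# Processor A: no forwarding, so each dependent instruction stalls one cycle.
cpi_a = 1.0 + forward_fraction
mips_a = raw_mips(cpi_a, alu_ns)

# Processor B: forwarding, so no stalls, but every cycle is slower.
cpi_b = 1.0
mips_b = raw_mips(cpi_b, alu_ns + forward_ns)

print(f"A (no forwarding): cpi {cpi_a:.1f}, {mips_a:.1f} MIPS")
print(f"B (forwarding):    cpi {cpi_b:.1f}, {mips_b:.1f} MIPS")
```

Changing `forward_fraction` or `forward_ns` shows how quickly the balance shifts: the forwarding path pays for itself whenever the stall fraction exceeds the relative cycle-time increase.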
oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/05/88)
An article by tim@amdcad.UUCP (Tim Olson) says:
] In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
] | IMHO, a pipelined processor should run as fast as its ALU
] | lets it. ...
] |
] | Even a simple bypass path adds to this delay. It means
] | that whatever the setup and delay times of this path,
] | it must be added to the basic machine cycle time, IF
] | that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
] | This is LESS of a penalty than adding a register access,
] | but still a penalty. So is it a win ?
]
] It depends upon how often alu forwarding occurs (see below). If it is
] frequent, it is much better to slow the pipeline by the small amount of
] time it takes to forward the result, rather than stalling a whole cycle.
] [... example deleted ...]
So far I agree, but there's more ...
How often forwarding is needed is only PART of the story. The other
part is how often you could "fill" the delay from forwarding.
] Here are some numbers from the Am29000 simulator running a small "nroff"
]
] instructions executed: 89435
] instructions requiring alu forwarding: 41420 (46%)
] instructions forwarding from load buffer: 13669 (15%)
But if I can fill 90%, say, of the one-cycle latency delays with
a reorganizer, then I only incur a penalty of 5%, which means,
for RPM40, that a bypass path is justified only if it incurs
a penalty of 1.2 nanoseconds or less. If I can fill 80% of
the latencies, then a bypass that inflicts a penalty on the
basic cycle time of 2.5 nanoseconds or less is a win. So
not only do we need data like you've provided, we need to
know how often we can reorganize the delay away. Unfortunately,
I don't really have good data for either of these factors.
] I haven't seen published studies on dynamic forwarding frequencies --
] does anyone know of such papers?
] -- Tim Olson
I, too, would be VERY interested in any such works.
--
Dennis O'Connor oconnor%sungod@steinmetz.UUCP
ARPA: OCONNORDM@ge-crd.arpa
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)
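The break-even argument in the post above can be written out explicitly. This is a sketch under stated assumptions, not the poster's own calculation: it assumes a 25ns basic cycle (40MHz, inferred from the "40MIPS" in the signature), that each unfilled ALU dependency costs one stall cycle, and that the 5% penalty figure comes from rounding the 46% forwarding frequency up to about 50%. The function name is hypothetical.

```python
def break_even_penalty_ns(cycle_ns, forward_fraction, fill_fraction):
    """Largest cycle-time increase (ns) a bypass path can add and still win.

    Without bypass: CPI = 1 + forward_fraction * (1 - fill_fraction).
    With bypass:    CPI = 1, cycle time = cycle_ns + penalty.
    Equating time per instruction gives penalty = cycle_ns * stall CPI.
    """
    stall_cpi = forward_fraction * (1.0 - fill_fraction)
    return cycle_ns * stall_cpi

CYCLE_NS = 25.0  # assumption: 40 MHz clock

for forward_fraction in (0.50, 0.46):   # rounded figure, then the measured 46%
    for fill_fraction in (0.90, 0.80):  # share of delay slots the reorganizer fills
        limit = break_even_penalty_ns(CYCLE_NS, forward_fraction, fill_fraction)
        print(f"forwarding {forward_fraction:.0%}, filled {fill_fraction:.0%}: "
              f"bypass wins if it adds less than {limit:.2f} ns")
```

With the rounded 50% figure this gives roughly the 1.2ns and 2.5ns thresholds quoted in the post; using the measured 46% directly lowers both thresholds slightly.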
earl@mips.COM (Earl Killian) (03/08/88)
In article <9799@steinmetz.steinmetz.UUCP> oconnor@sungoddess.steinmetz (Dennis M. O'Connor) writes:

   So far I agree, but there's more ...
   How often forwarding is needed is only PART of the story. The other
   part is how often you could "fill" the delay from forwarding.
   ] Here are some numbers from the Am29000 simulator running a small "nroff"
   ] instructions executed: 89435
   ] instructions requiring alu forwarding: 41420 (46%)
   ] instructions forwarding from load buffer: 13669 (15%)
   But if I can fill 90%, say, of the one-cycle latency delays with
   a reorganizer, then I only incur a penalty of 5%, which means,
   for RPM40, that a bypass path is justified only if it incurs
   a penalty of 1.2 nanoseconds or less. If I can fill 80% of
   the latencies, then a bypass that inflicts a penalty on the
   basic cycle time of 2.5 nanoseconds or less is a win. So
   not only do we need data like you've provided, we need to
   know how often we can reorganize the delay away. Unfortunately,
   I don't really have good data for either of these factors.
   ] I haven't seen published studies on dynamic forwarding frequencies --
   ] does anyone know of such papers?
   I, too, would be VERY interested in any such works.

In article <475@imagine.PAWL.RPI.EDU> jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) writes:

   1) Slows down critical path. Any finely tuned risc CPU will most
   probably have its cycle time determined by the latency through the
   ALU. Using a loopback of ALU results might result (depending on
   layout, tech, etc) in up to a 20% slowdown in the ALU, plus
   increase the chip area and layout problems. This doesn't mean a
   loopback is a loss necessarily, but that it does have a measurable
   cost which must be weighed against the benefits.

   2) In combination with (1) above, I'm not sure that having a
   one-cycle delay in ALU results causes any large loss. A good
   reorganizer can fill those latencies, or move the ALU op forward
   into, for example, a load delay. In high-speed (> 15 MHz) RISCs
   (and maybe slower ones as well), load delays are usually the
   determining factor, or a large part of it.

   What studies do you have that compare RISCs with 1-cycle ALU delays
   and 0-cycle? I'd like to see anything you can drag up.

To answer these questions I reran a local analysis program on the
results of 13 program runs. First a note on terminology: I call the
latency of an op the time it takes until you can reference the result.
The delay is the latency minus the time to issue the instruction
itself (usually latency - 1). The program defaults to

	-alu_rate 1	-alu_latency 1
	-shift_rate 1	-shift_latency 1
	-load_rate 1	-load_latency 2

i.e. a model where you can use the result of an alu/shift instruction
in the next instruction and the result of a load one after that.
E.g. the MIPSco R2000. I instead specified

	-alu_rate 1	-alu_latency 2
	-shift_rate 1	-shift_latency 2
	-load_rate 1	-load_latency 3
	-reorganize

which simulates no bypassing (i.e. increase latencies by 1, but leave
rates alone). The -reorganize says to reorganize to the new
constraints before analysis. I then took the ratio of the new cycle
count and the old count and averaged:

	13 samples
	minimum		1.024	(-1.7o)
	harmonic mean	1.207	(-0.091o)
	geometric mean	1.212	(-0.045o)
	mean		1.217	o=0.1150, cov=0.09449
	median		1.228	(+0.096o)
	maximum		1.408	(+1.7o)

I.e. the lack of bypassing is equivalent to a cycle time increase of
20%. I.e. 5ns @ 40MHz. The effect was as low as 2.4% and as high as
41%, which simply proves you can prove anything you like by looking at
single data points.

Anyway, I hope the hard data helps the discussion.
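The latency terminology in the post above can be illustrated with a toy in-order, single-issue cycle counter. This is a minimal sketch and not the analysis program the post describes: the `Instr` type, the `count_cycles` function, the register names, and the five-instruction sequence are all hypothetical, the issue rate is fixed at 1, and no reorganization is attempted. The two latency tables mirror the two sets of options given in the post.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    kind: str          # "alu", "shift", or "load"
    dest: str          # register written
    srcs: tuple = ()   # registers read

def count_cycles(program, latency):
    """Issue cycles for an in-order, single-issue machine with interlocks.

    latency[kind] is the number of cycles after issue until the result
    can first be referenced (1 means usable by the very next instruction).
    """
    ready = {}   # register -> first cycle in which its value can be referenced
    cycle = 0    # earliest cycle in which the next instruction may issue
    for ins in program:
        # Stall until every source operand is available.
        issue = max([cycle] + [ready.get(reg, 0) for reg in ins.srcs])
        ready[ins.dest] = issue + latency[ins.kind]
        cycle = issue + 1   # rate of 1: at most one instruction per cycle
    return cycle

# Hypothetical dependent sequence, made up for illustration only.
program = [
    Instr("load",  "r1", ("r9",)),
    Instr("alu",   "r2", ("r1", "r3")),
    Instr("alu",   "r4", ("r2", "r5")),
    Instr("shift", "r6", ("r4",)),
    Instr("alu",   "r7", ("r8", "r3")),
]

bypassed = {"alu": 1, "shift": 1, "load": 2}     # default options in the post
no_bypass = {"alu": 2, "shift": 2, "load": 3}    # every latency increased by 1

old = count_cycles(program, bypassed)
new = count_cycles(program, no_bypass)
print(f"bypassed: {old} cycles, no bypass: {new} cycles, ratio {new / old:.3f}")
```

A tightly dependent chain like this one is close to a worst case. A reorganizer would move independent instructions (such as the last one here) into the stall slots, which is why ratios measured on real programs after reorganization can be far lower.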
jesup@pawl21.pawl.rpi.edu (Randell E. Jesup) (03/09/88)
In article <1800@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <475@imagine.PAWL.RPI.EDU> jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) writes:
>
>   1) Slows down critical path. Any finely tuned risc CPU will most
>   probably have its cycle time determined by the latency through the
>   ALU. Using a loopback of ALU results might result (depending on
>   layout, tech, etc) in up to a 20% slowdown in the ALU, plus
>   increase the chip area and layout problems. This doesn't mean a
...
>To answer these questions I reran a local analysis program on the
>results of 13 program runs.
[data indicating 20% loss on Mips R2000 by removing loopback AND increasing
load delay to 3]
>I.e. the lack of bypassing is equivalent to a cycle time increase of
>20%. I.e. 5ns @ 40MHz. The effect was as low as 2.4% and as high as
>41%, which simply proves you can prove anything you like by looking at
>single data points.

	Thanks for the data! Sounds like a nice piece of software for
playing with architectures. Two points:

	1) The RPM-40 does have bypass on loads; you can use the result of
a load in the cycle it's going into the register file. Bypass is only
missing on ALU ops. I'd appreciate it if you'd re-run using just an
increased ALU latency.

	2) I suspect that the software is assuming that it can't store the
result of an ALU op in the next cycle. In the RPM-40, you can store it in
the next cycle, as the store accesses the register in its WB phase; it's
using its ALU phase for address calculation. Also, we have a smaller
number of GP registers, which causes more modify-store and
load-modify-store operations.

	It looks like my 20% figure (off the top of my head) was
'interesting'. Of course that was just chance. I agree that there is a
cost due to not having ALU bypassing, but I think your 20% figure is an
upper limit for the average loss. I suspect maybe more like 5-15% will be
the case, given the factors above.

>Anyway, I hope the hard data helps the discussion.

	Most certainly! Thank you.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer	      13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/	(uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup
(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)