ad@romeo.cs.duke.edu (Apostolos Dollas) (05/03/89)
A few days ago, we announced an architecture for sustained instruction
pipeline performance. Clearly it was not possible to cover all aspects
of this work. Here are the answers to the two questions posed through
the network.

In response to our original posting, Mark Smotherman
(mark@hubcap.clemson.edu) writes:

>How does this proposed architecture attack the short runlengths between
>conditional branches?  Riseman and Foster thought about an infinite
>resource machine in '71 or so (IEEE TC paper) and decided that they
>could get a mere ten-fold speedup if they allowed 64K (yes, 64*1024)
>prefetch paths.  How does compile-time info solve the power-of-two
>branch path problem?  Do you profile and add compensation code (as does
>Multiflow Trace)?  Do you back out of prefetched instructions when you
>decide you have predicted the wrong path (as does HPS)?
>
>Are you publishing this work?  Comp. Arch. Symp.?  ASPLOS?  IEEE TC?

Our architecture is designed specifically to handle the short
runlengths typically found between branches. Unlike the scheme studied
by Riseman and Foster, we prefetch and decode only far enough ahead to
keep the pipeline full; in our prototype, three instruction decoding
units (IDUs) are sufficient. The bounded pipeline depth thus limits the
power-of-two branch path problem to small numbers. We will next develop
the Program Execution Controller to load the IDUs dynamically (at
present we load them statically). This requires adequate aggregate
bandwidth on the main-memory-to-IDU bus; the specific numbers depend on
the instruction set, the pipeline design, and the ratio of the primary
cycle time to the main memory cycle time. Achieving this bandwidth is
quite feasible for most typical architectures.

On the subject of code arrangement: we do not backtrack, but prefetched
segments which are not used (e.g. the untaken arm of a conditional) are
removed to make space for other segments.
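To make the idea concrete, here is a toy sketch (my own illustration,
not the authors' implementation) of bounded-lookahead prefetch: with a
fixed number of IDUs, both successors of a conditional are prefetched
along the static flow graph only as deep as the IDUs allow, and the
untaken segment is discarded once the branch resolves. The graph,
node names, and `NUM_IDUS` constant are all assumptions for the sketch.

```python
# Toy sketch of bounded-lookahead prefetch (illustrative only).
# Assumption: a static program flow graph known at compile time,
# and three IDUs as in the posting's prototype.

from collections import deque

NUM_IDUS = 3  # three instruction decoding units suffice in the prototype

# Static flow graph: node -> list of successor nodes.
# "A" is a conditional branch with two successors.
flow_graph = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

def prefetch(start, taken_path):
    """Prefetch breadth-first along all known successors, bounded by
    NUM_IDUS; when the branch resolves, keep only segments on the
    taken path (the rest are removed to free IDUs, no backtracking,
    since the single execution stage never ran them)."""
    pending = deque([start])
    fetched = []
    while pending and len(fetched) < NUM_IDUS:
        node = pending.popleft()
        fetched.append(node)            # decode into an IDU
        pending.extend(flow_graph[node])  # follow the static flow graph
    kept = [n for n in fetched if n in taken_path]
    return fetched, kept

fetched, kept = prefetch("A", taken_path={"A", "B", "D"})
print(fetched)  # prefetched into IDUs: ['A', 'B', 'C']
print(kept)     # after the branch resolves: ['A', 'B']
```

The point of the bound is visible here: only three nodes are ever held,
so the number of speculative paths never approaches the power-of-two
explosion Riseman and Foster measured.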
Since the execution stage is unique, we never have to compensate. We
will be submitting this work for publication in the near future; anyone
interested in preprints/technical reports as they become available can
write to me. We should have a fairly complete version of all aspects of
the architecture by late June.

Also, Andrew Klossner (andrew@frip.wv.tek.com) writes:

>Never an instruction miss?  What happens when executing code for a
>switch statement, such as:
>
>	load	r2,jumptable(r1)
>	jmp	r2
>
>The number of IDUs required to avoid all instruction misses would seem
>to be unbounded.

This is a very good point, but clearly, we don't work miracles. As I'm
sure Andrew knows, one can easily prove that if the entry points of the
program are not known *a priori*, the resources needed to always
prefetch are bounded only by the program size, and hence impractical.
Switch statements, and execution-dependent program counter computation
in general, will lead to bubbles. With the exception of returns from
subroutines, we assume that all other program flow changes are known at
compilation time, so the program flow graph uniquely characterizes the
program flow. Returns from subroutines are handled as nodes of the
program flow graph whose targets are determined at call time, so the
same prefetching techniques apply.

One last note: I have received lots of e-mail with many interesting
comments. I will try to answer directly in order to avoid cluttering
the network. More questions/comments are very welcome.

Apostolos Dollas                    | ad@duke.cs.duke.edu
Assistant Professor                 |
Dept. of Electrical Engineering     |
Duke University                     |
Durham, NC 27706                    |