[comp.arch] More on the Sustained Performance Architecture

ad@romeo.cs.duke.edu (Apostolos Dollas) (05/03/89)

A few days ago, we announced an architecture for sustained instruction 
pipeline performance. Clearly it was not possible to cover all aspects of this
work. Here are the answers to the two questions posed through the network:

In response to our original posting, Mark Smotherman (mark@hubcap.clemson.edu)
writes:
>
>How does this proposed architecture attack the short runlengths between
>conditional branches?  Riseman and Foster thought about an infinite
>resource machine in '71 or so (IEEE TC paper) and decided that they
>could get a mere ten-fold speedup if they allowed 64K (yes, 64*1024)
>prefetch paths.  How does compile-time info solve the power-of-two
>branch path problem?  Do you profile and add compensation code (as does
>Multiflow Trace)?  Do you back out of prefetched instructions when you
>decide you have predicted the wrong path (as does HPS)?
>
>Are you publishing this work?  Comp. Arch. Symp.?  ASPLOS?  IEEE TC?

Our architecture is designed specifically to handle the short runlengths 
typically found between branches. Unlike the scheme studied by Riseman 
and Foster, we are only prefetching and decoding far enough into the 
future to keep the pipeline full. In our prototype, three instruction 
decoding units (IDU) are sufficient. Thus, the bounded pipeline depth limits
the power-of-two branch path problem to small numbers. We will be developing
the Program Execution Controller next, which will load the
IDUs dynamically (at present we load them statically). This requires adequate
aggregate bandwidth on the main memory to IDU bus. The specific numbers 
depend on the instruction set, pipeline design, and the ratio of the primary 
cycle time to the main memory cycle time. Achieving this bandwidth is quite
feasible for most typical architectures. On the subject of code arrangement:
we do not backtrack; prefetched segments that turn out not to be used (e.g.,
the untaken arm of a conditional) are simply discarded to make room for other
segments. Since there is a single execution stage, we never have to compensate.
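To make the power-of-two argument concrete, here is a small illustrative
sketch (mine, not the authors' implementation; the flow graph, depth, and
names are invented for the example). With a pipeline bounded to depth d,
prefetching both arms of every pending branch needs at most 2^d segments,
which stays small for small d:

```python
from collections import deque

def prefetch_paths(flow_graph, block, depth):
    """Enumerate all basic-block paths of length <= depth starting at
    `block`, following every successor (i.e. both arms of each branch)."""
    paths = []
    queue = deque([[block]])
    while queue:
        path = queue.popleft()
        succs = flow_graph.get(path[-1], [])
        if len(path) == depth or not succs:
            paths.append(path)
            continue
        for s in succs:          # fan out at each branch
            queue.append(path + [s])
    return paths

# Toy flow graph: block A branches to B or C; both rejoin at D.
g = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}

paths = prefetch_paths(g, "A", 3)   # both arms prefetched
# Once the branch at A resolves (say toward B), the segments on the
# other arm are simply discarded: nothing past the decode stage was
# executed, so no compensation code is ever needed.
taken = [p for p in paths if "B" in p]
```

The point of the sketch is only that the fan-out is bounded by the
pipeline depth, not by the distance to the next branch.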

We will be submitting this work for publication in the near future. Anyone
interested in preprints/technical reports as they become available can write to
me. We should have a fairly complete version of all aspects of the architecture
by late June.


Also, Andrew Klossner (andrew@frip.wv.tek.com) writes:
>Never an instruction miss?  What happens when executing code for a
>switch statement, such as:
>
>	load	r2,jumptable(r1)
>	jmp	r2
>
>The number of IDUs required to avoid all instruction misses would seem
>to be unbounded.

This is a very good point, but clearly, we don't work miracles. As I'm sure
Andrew knows, one can easily prove that if the entry points of the program are
not known *a priori*, the resources needed to always prefetch are bounded
only by the program size, and hence impractical. Switch statements, and other
execution-dependent program counter computations, will lead to bubbles. With
the exception of
returns from subroutines, we assume that all other program flow changes are 
known at compilation time, hence the program flow graph uniquely characterizes 
the program flow. Returns from subroutines are handled as nodes in the program
flow graph whose targets are determined at call time, so the same prefetching
techniques apply.
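One way to picture the return-address handling is a stack maintained by the
prefetcher (an assumed mechanism for illustration, not the published design;
the class and node names are invented). At each call the return-point node
becomes known, so from that moment the eventual return looks to the
prefetcher like any statically known flow change:

```python
class PrefetchReturnStack:
    """Records return targets as they become known at call time, so the
    prefetcher can treat each return like a static branch."""

    def __init__(self):
        self._stack = []

    def on_call(self, return_node):
        # The return target is fixed the moment the call is made;
        # record it as a node of the flow graph.
        self._stack.append(return_node)

    def on_return(self):
        # The prefetcher already knows where fetching resumes.
        return self._stack.pop()

rs = PrefetchReturnStack()
rs.on_call("after_call_site_1")   # e.g. main calls f
rs.on_call("after_call_site_2")   # f calls g
assert rs.on_return() == "after_call_site_2"
assert rs.on_return() == "after_call_site_1"
```

An indirect jump through a computed register, as in Andrew's switch-table
example, has no such call-time moment at which the target becomes known,
which is exactly why it produces a bubble.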

One last note: I have received lots of e-mail with many interesting comments. 
I will try to answer directly in order to avoid cluttering the network. More
questions/comments are very welcome.

Apostolos Dollas                  |    ad@duke.cs.duke.edu
Assistant Professor               |
Dept. of Electrical Engineering   |
Duke University                   |
Durham, NC 27706                  |