[comp.arch] Frequency

crick@bnr-rsc.UUCP (Bill Crick) (01/27/88)

Given that frequency statements or profiling results of some sort are
available, and are useful for increasing performance (the jury is still
out?), how does one build a machine that takes best advantage of this
info to get lots of speed?  There have been several compiler suggestions,
but what do you do to the actual machine hardware to take advantage of
them?  This could include, but shouldn't be limited to, the CPU itself.
What about the memory hierarchy?  Any ideas? <- I will probably regret this!

Bill Crick
Computo, Ergo Sum! (V Pbzchgr, Gurersber, V Nz!)


Disclaimer: I don't speak, I type!

root@mfci.UUCP (SuperUser) (01/29/88)

In article <606@bnr-rsc.UUCP> crick@bnr-rsc.UUCP (Bill Crick) writes:
>
>Given that frequency statements or profiling results of some sort are
>available, and are useful for increasing performance (the jury is still
>out?), how does one build a machine that takes best advantage of this
>info to get lots of speed?  There have been several compiler suggestions,
>but what do you do to the actual machine hardware to take advantage of
>them?  This could include, but shouldn't be limited to, the CPU itself.
>What about the memory hierarchy?  Any ideas? <- I will probably regret this!
>
>Bill Crick
>Computo, Ergo Sum! (V Pbzchgr, Gurersber, V Nz!)
>
>
>Disclaimer: I don't speak, I type!


Frequency directives and profiling results are not the only ways to
go; our compiler does a very good job of deciding, with heuristics,
which way conditional branches will go.  Here is the standard VLIW
party line:  once you know which way the branches go,
you suddenly have a lot of "sequential" code for which you can do
reasonably straightforward "compaction" -- if you have enough HW
resources, then you can do a lot of things at the same time, and with
enough code to choose from, you have a lot of things to do.  Of
course, you need to control all that parallel hardware, hence the
Very Long Instruction Word.
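
To give the flavor of what such heuristics look like, here is a toy
sketch in C.  The struct fields, rules, and predictions are invented
for illustration; they are not Multiflow's actual heuristics.

    /* Toy static branch-prediction heuristics of the kind a
     * trace-scheduling compiler might fall back on when no profile
     * is available.  Everything here is hypothetical. */
    #include <stdio.h>

    struct branch {
        int is_loop_back;      /* branch targets an earlier address      */
        int compares_to_zero;  /* condition is a comparison against zero */
        int guards_error_path; /* taken side leads to an error handler   */
    };

    /* Return 1 to predict "taken", 0 to predict "fall through". */
    static int predict_taken(const struct branch *b)
    {
        if (b->is_loop_back)
            return 1;   /* loops usually iterate more than once     */
        if (b->guards_error_path)
            return 0;   /* error paths are rarely executed          */
        if (b->compares_to_zero)
            return 0;   /* assume tests against zero usually fail   */
        return 1;       /* default: assume taken                    */
    }

    int main(void)
    {
        struct branch loop_back = { 1, 0, 0 };
        struct branch err_check = { 0, 1, 1 };

        printf("loop back edge: predict %s\n",
               predict_taken(&loop_back) ? "taken" : "fall through");
        printf("error check:    predict %s\n",
               predict_taken(&err_check) ? "taken" : "fall through");
        return 0;
    }

Whatever the exact rules, the point is that the compiler commits to a
most-likely path and schedules along it.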

The memory hierarchy is a good question.  If you're running
scientific code, you are often going to need a very large physical
memory to hold the data set (you don't want to page yourself to
death).  If the memory is to be large, then (except for Cray) the
only reasonable way to build it is to use the densest available
memory chips, which are usually organized as "long and skinny" --
256K x 1, 512K x 1, 1M x 1.  The problems with those are a) their
cycle time is 2 to 4 times longer than the cycle time of the CPU; and
b) since they are so long, the data you want next often sits at some
other address inside the same chip, while the chip is still busy with
the last request.  If you are up on bank-organized memory
design, sorry for dwelling on the basics so much.
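
Here is a back-of-the-envelope sketch of why the access stride matters
with bank-organized memory.  The bank count, the word-interleaved
mapping, and the strides are made-up numbers for illustration only.

    /* Word-interleaved memory banks: consecutive words live in
     * consecutive banks.  NBANKS and the strides are invented. */
    #include <stdio.h>

    #define NBANKS 8

    static int bank_of(unsigned long word_addr)
    {
        return (int)(word_addr % NBANKS);
    }

    static void show_stride(int stride)
    {
        int hits[NBANKS] = { 0 };
        unsigned long a;
        int i;

        for (a = 0, i = 0; i < 16; i++, a += stride)
            hits[bank_of(a)]++;

        printf("stride %2d: references per bank ->", stride);
        for (i = 0; i < NBANKS; i++)
            printf(" %d", hits[i]);
        printf("\n");
    }

    int main(void)
    {
        show_stride(1);  /* spreads evenly over all 8 banks            */
        show_stride(8);  /* every reference lands in bank 0: conflicts */
        show_stride(9);  /* stride relatively prime to 8: spreads out  */
        return 0;
    }

A unit stride keeps all the banks busy; a stride that is a multiple of
the bank count hammers one bank and waits out its cycle time every
single reference.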

There are some things you can do about this.  Obviously, one is to
make sure that the data is laid out across the RAMs according to the
expected access patterns.  But you can't always do that, even on
clearly-vectorizable code.  One of the components in our compiler
performs memory-bank disambiguation in order to decide whether any
two memory references are likely to collide.  This stuff is covered
in John Ellis' thesis, and mentioned in our ASPLOS-II paper.
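
As a rough illustration of the sort of question a bank disambiguator
has to answer, here is a simplistic test (not the algorithm from the
thesis or the paper) for two references whose word addresses are
assumed to be affine in the same loop index, base + stride*i:

    /* Can base1 + s1*i and base2 + s2*i ever land in the same bank?
     * They collide when (s1 - s2)*i == (base2 - base1) (mod NBANKS)
     * for some integer i, which is solvable exactly when
     * gcd(s1 - s2, NBANKS) divides (base2 - base1).
     * NBANKS and the affine address form are assumptions. */
    #include <stdio.h>
    #include <stdlib.h>

    #define NBANKS 8

    static long gcd(long a, long b)
    {
        a = labs(a);
        b = labs(b);
        while (b != 0) {
            long t = a % b;
            a = b;
            b = t;
        }
        return a;
    }

    static int may_collide(long base1, long s1, long base2, long s2)
    {
        long g = gcd(s1 - s2, NBANKS);   /* g >= 1 since NBANKS > 0 */
        return (base2 - base1) % g == 0;
    }

    int main(void)
    {
        /* a[2i] vs. a[2i+1]: always one word apart, never the same bank */
        printf("a[2i] vs a[2i+1]: %s\n",
               may_collide(0, 2, 1, 2) ? "may collide" : "never collide");

        /* a[i] vs. a[i+8]: always 8 words apart, same bank every time */
        printf("a[i]  vs a[i+8]:  %s\n",
               may_collide(0, 1, 8, 1) ? "may collide" : "never collide");
        return 0;
    }

When the compiler can prove "never collide," it can schedule the two
references into the same long instruction without fear of a bank stall.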

For a different point of view (a hardware one), you can check out
the Structured Control Flow work of Wedig and Flynn at Stanford, 
extended by Uht at Carnegie-Mellon in 1985.

  Bob Colwell   <I don't speak for Multiflow>
  Multiflow Computer
  175 N. Main St. 
  Branford, CT 06405   203-488-6090

  mfci!colwell@uunet.uucp