crick@bnr-rsc.UUCP (Bill Crick) (01/27/88)
Given that some sort of frequency statements or profiling results are available, and are useful for increasing performance (jury is still out?), how does one build a machine to take best advantage of this info to get lots of speed?  There have been several compiler suggestions, but what do you do to the actual machine hardware to take advantage of this?  This could include, but shouldn't be limited to, the CPU itself.  What about the memory hierarchy?  Any ideas?  <- I will probably regret this!

Bill Crick
Computo, Ergo Sum!  (V Pbzchgr, Gurersber, V Nz!)

Disclaimer: I don't speak, I type!
root@mfci.UUCP (SuperUser) (01/29/88)
In article <606@bnr-rsc.UUCP> crick@bnr-rsc.UUCP (Bill Crick) writes:
>Given that some sort of frequency statement or profiling results are
>available, and are useful for increasing performance (jury is still out?),
>how does one build a machine to take the best advantage of this info to
>get lots of speed? There have been several compiler suggestions, but
>what do you do to the actual machine hardware to take advantage of this?
>This could include but shouldn't be limited to the CPU itself. What about
>the memory heirachy? Any ideas? <- I will probably regret this!

Frequency directives and profiling results are not the only ways to go; our compiler does a very good job of deciding, with heuristics, which way conditional branches are going to go.

Here is the standard VLIW party line: once you know which way branches are going to go, you suddenly have a lot of "sequential" code for which you can do reasonably straightforward "compaction" -- if you have enough HW resources, then you can do a lot of things at the same time, and with enough code to choose from, you have a lot of things to do.  Of course, you need to control all that parallel hardware, hence the Very Long Instruction Word.

The memory hierarchy is a good question.  If you're running scientific code, you are often going to need a very large physical memory to hold the data set (you don't want to page yourself to death).  If the memory is to be large, then (except for Cray) the only reasonable way to build it is to use the densest available memory chips, which are usually organized "long and skinny" -- 256K x 1, 512K x 1, 1M x 1.  The problems with those are a) their cycle time is 2 - 4 times longer than the cycle time of the CPU; and b) since they are so long, there are lots of other addresses inside them that may contain the data you want next (while the chip is still busy with the last request).
If you are up on bank-organized memory design, sorry for dwelling on the basics so much.

There are some things you can do about this.  Obviously, one is to make sure that the data is stored in the RAMs according to its expected access patterns.  But you can't always do that, even on clearly vectorizable code.  One of the components in our compiler performs memory-bank disambiguation in order to decide whether any two memory references are likely to collide.  This stuff is covered in John Ellis' thesis, and mentioned in our ASPLOS-II paper.

For a different (hardware) point of view, you can check out the Structured Control Flow work of Wedig and Flynn at Stanford, extended by Uht at Carnegie-Mellon in 1985.

Bob Colwell            <I don't speak for Multiflow>
Multiflow Computer
175 N. Main St.
Branford, CT 06405
203-488-6090
mfci!colwell@uunet.uucp
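The core of bank disambiguation for simple subscripts can be sketched as a congruence test.  This is a hypothetical simplification of my own (it ignores loop bounds and handles only affine subscripts; it is not the algorithm from Ellis' thesis): two references a[c1*i + d1] and a[c2*i + d2] can land in the same one of NBANKS low-order-interleaved banks for some i exactly when (c1 - c2)*i = (d2 - d1) (mod NBANKS) has a solution, i.e. when gcd(c1 - c2, NBANKS) divides (d2 - d1).

```c
/* Hypothetical compile-time bank-disambiguation sketch (affine
   subscripts only; conservatively ignores loop bounds, so a "yes"
   means "collides for SOME integer i", possibly outside the loop). */

static long gcd(long a, long b)
{
    if (a < 0) a = -a;
    if (b < 0) b = -b;
    while (b) { long t = a % b; a = b; b = t; }
    return a;
}

/* Can a[c1*i + d1] and a[c2*i + d2] ever hit the same bank
   (low-order interleave across nbanks word-wide banks)? */
int may_collide(long c1, long d1, long c2, long d2, long nbanks)
{
    long g = gcd(c1 - c2, nbanks);   /* g = nbanks when c1 == c2 */
    return (d2 - d1) % g == 0;
}
```

For example, with 8 banks, a[i] and a[i+8] always collide; a[2*i] and a[2*i+1] never do; and a[3*i] versus a[i+1] never do, since 2*i = 1 (mod 8) has no solution.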