sporer@stellar.COM (Michael Sporer @stellar) (08/23/89)
Hi, this is my first posting, so please excuse any awkwardness or incorrect
usage.  I am interested in superscalar architectures.

We at Stellar designed our own ISA which has lots of RISC characteristics:
32 integer registers, 8 DP FP registers, single-cycle execution, and delayed
branch with conditional squash (nullification).  We did extend the RISC ideas
by incorporating memory operates, complex addressing (effective addressing
in the operates as well as in load/store), and the incorporation of vector
operations and concurrency operations.  The vector unit operates in a
load/store mode, with the architecture defining 6 vector registers, each
with 32 DP elements.

After all this was done, we found that not all our functional units were
busy all the time, so we added a feature called packetization - which the
industry now calls superscalar.  However, after much work, we found that we
only get a 10 percent increase in overall performance.  We can packetize
2 integer, or 2 fp (a + and a *), or 1 fp and 1 integer.  The biggest
restriction is that only one instruction in a packet can be a memory
reference.  (A sketch of this pairing rule follows below.)

Now, I am beginning to hear about RISC chips (or chip sets) that will have
multiple functional units and an ability to examine 2, 3, or even 4
instructions for dispatch.  However, there seems to be some limit (like a
1.3x increase) to the performance increase over a single functional unit.

First off, is this true?  Are there really limits in our current instruction
set descriptions of programs that limit this type of parallelism?

Secondly, if this is true, is there really any work being done on true
superscalar ISAs?  I am aware of the work being done by Bill Wulf at the
University of Virginia ("The WM Computer Architecture: Definition and
Rationale", Computer Science Report No. TR-88-22, Oct. 21, 1988).  It seems
that he has defined an elegant method of specifying multiple memory
references at once.

Anyway, I've gone on long enough.  Thanks for your interest.
-- 
Michael Sporer                          uunet!stellar!sporer
Stellar Computer Inc                    sporer@stellar.com
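A minimal sketch of the packet-legality test implied by the pairing rule
described above.  The instruction classification here is hypothetical (it is
not Stellar's actual encoding); it only spells out the stated constraint:
two integer ops, an FP add paired with an FP multiply, or one FP and one
integer op, with at most one memory reference per packet.

    /* Hypothetical instruction classes -- not Stellar's encoding. */
    enum iclass { INT_OP, FP_ADD, FP_MUL };

    struct insn {
        enum iclass kind;
        int         is_mem_ref;    /* nonzero if it references memory */
    };

    /* May instructions a and b be packetized together? */
    int can_packetize(const struct insn *a, const struct insn *b)
    {
        if (a->is_mem_ref && b->is_mem_ref)
            return 0;                       /* one memory ref per packet */

        if (a->kind == INT_OP && b->kind == INT_OP)
            return 1;                       /* 2 integer                 */
        if ((a->kind == FP_ADD && b->kind == FP_MUL) ||
            (a->kind == FP_MUL && b->kind == FP_ADD))
            return 1;                       /* fp + paired with fp *     */
        if ((a->kind == INT_OP) != (b->kind == INT_OP))
            return 1;                       /* 1 fp, 1 integer           */
        return 0;
    }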
slackey@bbn.com (Stan Lackey) (08/24/89)
In article <1989Aug22.212330.8119@stellar.COM> sporer@stellar.com writes:
>We at Stellar designed our own ISA which has lots of RISC characteristics:
>We did extend the RISC ideas by incorporating memory operates, complex
>addressing (effective addressing in the operates as well as load/store),
>and the incorporation of vector operations and concurrency operations.
>[We found] that not all our functional units were busy all the time so we
>added a feature called packetization - which the industry now calls
>superscalar.  However, after much work, we found that we only get a 10
>percent increase in overall performance.  We can packetize 2 integer, or
>2 fp (a + and a *), or 1 fp and 1 integer.

You already have 90%? of what 'superscalar' gets in your instruction set:
mem-to-reg operations, complex addressing, and vectors.  Now, a true RISC
doesn't have those features, and so it will see very significant
improvements.  Go ahead and flame me, but that's why I say that superscalar
and CISC get the same thing, but in slightly different ways.  Sure,
superscalar is more general and, according to what I get from the posting,
10% better than CISC.  But it can be more expensive, depending upon issues
like backwards compatibility.
-Stan
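A small illustration of the point being made here; the mnemonics in the
comments are invented for the example, not any real machine's assembly.  For
the statement in the loop below, a memory-operate CISC does the work in one
instruction, while a load/store RISC emits two, which a two-wide superscalar
then tries to issue in the same cycle.

    /* For the statement  s += *p  a memory-operate CISC might emit
     *     addf  f1, (r_p)          ; load *p and add it into s
     * while a load/store RISC emits
     *     ldf   f4, 0(r_p)         ; t = *p
     *     addf  f1, f1, f4         ; s += t
     * and a two-wide superscalar tries to issue both in one cycle.
     */
    double sum(const double *a, int n)
    {
        double s = 0.0;
        const double *p;

        for (p = a; p < a + n; p++)
            s += *p;                 /* the statement discussed above */
        return s;
    }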
mcg@mipon2.intel.com (08/25/89)
In article <1989Aug22.212330.8119@stellar.COM> sporer@stellar.com writes:
>
>We at Stellar designed our own ISA which has lots of RISC characteristics:
>32 integer registers, 8 DP FP registers, single-cycle execution, delayed
>branch with conditional squash (nullification).  We did extend the RISC
>ideas by incorporating memory operates, complex addressing (effective
>addressing in the operates as well as load/store), and the incorporation
>of vector operations and concurrency operations.

Characterizing this ISA as having "lots of RISC characteristics" begs the
perennial question of what RISC means, but let's put that aside for the
moment.

>The vector unit ...
>
>After all this was done, we found that not all our functional units were
>busy all the time so we added a feature called packetization - which the
>industry now calls superscalar ...

Here is a stab at a definition of superscalar.  (Note that [Jouppi] is an
excellent paper defining superscalar and superpipelined machines, and it
explores the isomorphism between them and vector machines.  See the
bibliography at the end of this article.)

Superscalar: an implementation technique for an instruction set architecture
that decodes, dispatches, executes, and returns results from more than one
independent instruction, drawn from an otherwise linear instruction stream,
during each of the processor's basic clock cycles.

It is equally important to say what superscalar is NOT.  Most importantly,
it is not a VLIW (Very Long Instruction Word) processor [Nicolau].  I say
"most importantly" because this is superscalar's most immediate heritage.
The difference between VLIW and superscalar is that VLIW processors fail the
test of executing "from an otherwise linear instruction stream".  Processors
which resort to a "dual-instruction mode" or other artifice also fail this
test.

Another important point is that superscalar is an *implementation
technique*, not a fundamental ISA design principle like RISC.  With
sufficient resources, one could probably build a superscalar VAX
implementation.  However, the effectiveness (degree of speed-up vs.
implementation cost) of a superscalar implementation does depend on many
aspects of the ISA, notably how easy or hard it is to detect
inter-instruction dependencies and potential parallelism between functional
units.

> ... [and] found that we only get a 10 percent increase in overall
>performance.

There are two recent papers and many older ones (see the attached partial
bibliography) that characterize the potential speedups one might hope to see
from fine-grained parallel architectures.  The results are mixed.  In all
but one case, the research has focused on adapting an existing architecture
(e.g. Cray, IBM 360, MIPS, DECWRL Titan) to superscalar implementation.
Most of the papers agree that the speedup begins to level off at somewhere
between 2x and 4x.  There is disagreement on what the limiting factor is:
everyone agrees that a major problem is the distance between conditional
branches, while some go on to argue that the fundamental parallelism of
programs is itself limited.  All, however, seem to point out that a 2x
improvement is achievable.  This can also be seen in our implementation of a
superscalar microprocessor, mentioned in [Hinton].

Your 10% improvement might be due to a hiding of potential parallelism by
the architecture, or to poor instruction scheduling by the compiler - there
is no way to tell without a much more thorough examination of the
architecture.
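To make the dependency-detection point above concrete, here is a minimal
sketch of the test a two-wide issue stage makes before pairing instruction b
with the instruction a immediately ahead of it in program order.  The
operand fields are hypothetical; a real decoder also checks structural
hazards such as functional-unit and memory-port conflicts.

    /* Hypothetical decoded-operand fields; -1 means "not used". */
    struct decoded {
        int dest;               /* destination register  */
        int src1, src2;         /* source registers      */
    };

    /* May b issue in the same cycle as a, where a precedes b? */
    int independent(const struct decoded *a, const struct decoded *b)
    {
        /* RAW hazard: b must not read what a writes */
        if (a->dest >= 0 && (b->src1 == a->dest || b->src2 == a->dest))
            return 0;
        /* WAW hazard: the two results must land in distinct registers */
        if (a->dest >= 0 && a->dest == b->dest)
            return 0;
        /* WAR is harmless here: both read the register file this cycle */
        return 1;
    }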
>We can packetize 2 integer, or 2 fp (a + and a *), or 1 fp and 1 integer.
>The biggest restriction is that only one instruction can be a memory
>reference.

Executing an FP add and multiply in parallel (if you can truly do that, and
complete them in one cycle) will give you excellent speedups on many
scientific algorithms, provided that they are hand-coded or that you have a
good compiler.  In particular, matrix multiply can be improved by up to 2x
this way.  So if you are doing this and still only getting 10%, your memory
system or some other aspect of your implementation is out of balance.
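A sketch of why this helps matrix multiply.  Within one iteration of the
dot-product inner loop the add depends on the multiply, so the loop has to
be software-pipelined before an independent add and multiply are available
to pair every cycle.  This is hand-restructured C, standing in for what a
good scheduler would do; it is not any particular compiler's output.

    /* Dot-product inner loop of matrix multiply, software-pipelined so
     * that the multiply for element k issues alongside the add that
     * consumes the multiply from element k-1.
     */
    double dot(const double *a, const double *b, int n)
    {
        double sum = 0.0, prod, next;
        int k;

        if (n <= 0)
            return 0.0;

        prod = a[0] * b[0];            /* prologue: first multiply      */
        for (k = 1; k < n; k++) {
            next = a[k] * b[k];        /* multiply for element k ...    */
            sum += prod;               /* ... pairs with add for k-1    */
            prod = next;
        }
        return sum + prod;             /* epilogue: last add            */
    }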
>Now, I am beginning to hear about RISC chips (or chip sets) that will have
>multiple functional units and an ability to examine 2, 3, or even 4
>instructions for dispatch.  However, there seems to be some limit (like a
>1.3x increase) to the performance increase over a single functional unit.

AMD, MIPS, National, Motorola, and Intel are all known to be working on
multi-function-unit processors.  I have commented in the past on my beliefs
about the difficulties some of these architectures will place on superscalar
implementations.

>First off, is this true?  Are there really limits in our current
>instruction set descriptions of programs that limit this type of
>parallelism?

As described above, yes and no.  Current research and practice show that
2-3x is achievable, given an appropriate balance of memory bandwidth
(especially instruction fetch) and functional units.  VLIW people would have
you believe that 1000x speedup is possible with VLIW, and anything VLIW can
do, superscalar can do also, possibly with a little more difficulty.  The
difference comes down to: a) what benchmarks are important to you - are they
loop-dominated or straight-code dominated, and what is their fundamental
parallelism?; and b) how much effort you put into the compiler to do loop
unrolling, software pipelining, and inter-basic-block code scheduling.  Does
your architecture support branch prediction, out-of-order instruction
execution, etc.?  Can you hope to implement them effectively?

>Secondly, if this is true, is there really any work being done on true
>superscalar ISAs?  I am aware of the work being done by Bill Wulf at the
>University of Virginia ("The WM Computer Architecture: Definition and
>Rationale", Computer Science Report No. TR-88-22, Oct. 21, 1988).  It seems
>that he has defined an elegant method of specifying multiple memory
>references at once.

I don't have a copy of the WM stuff handy, but I remember that it struck me
as closer to the VLIW end of the spectrum than to superscalar.  As I
mentioned, Intel has published a paper, and I have spoken several times
about research we are doing.  I'm speaking on this subject at Michael
Slater's Microprocessor Forum, the Embedded Systems Conference, and Wescon,
so maybe you can attend those conferences.  We have announced that a product
will be available during 1989.  Other than that, you'll have to wait.

In article <44677@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>You already have 90%? of what 'superscalar' gets in your instruction
>set: mem-to-reg operations, complex addressing, and vectors.  Now, a
>true RISC doesn't have those features, and so it will see very
>significant improvements.  Go ahead and flame me, but that's why I say
>that superscalar and CISC get the same thing, but in slightly
>different ways.  Sure, superscalar is more general and, according to
>what I get from the posting, 10% better than CISC.  But it can be more
>expensive, depending upon issues like backwards compatibility.

Here is the flame you asked for: I do not believe that superscalar and CISC
have anything to do with one another.  It is true (and I have said many
times) that certain instruction set features expand the amount of
easy-to-see parallelism in the architecture - among them slightly more
complex addressing modes, including scaled-index and
scaled-index-plus-base-register - but we believe that it is most effective
to base a superscalar implementation on a quite simple load/store
architecture.  Keeping loads and stores distinct from register-register
operations is, in my opinion, quite important to the ease of superscalar
implementation.  And, as mentioned before, superscalar techniques now in
existence, coupled with known techniques for compilation and code
scheduling, can get 2x to 3x speedups *over RISC processors*.

Also, I don't know the meaning of the statement "depending on issues like
backwards compatibility".  Superscalar implementations, done correctly, are
more likely to be binary backwards compatible than RISC architectures,
because RISC designers often exposed the pipeline - a pipeline that they
will have to change for future implementations.

The next two years are going to see a lot of talk about superscalar; you may
want to read the attached papers.

S. McGeady
Intel Corp.

Bibliography
------------

R. Acosta, J. Kjelstrup & H. Torng, "An Instruction Issuing Approach to
   Enhancing Performance in Multiple Functional Unit Processors", IEEE
   Transactions on Computers, Vol. C-35, No. 9, Sept. 1986

W. Hwu & Y. Patt, "Checkpoint Repair for Out-of-Order Execution Machines",
   1987, ACM 0084-7495

N. Jouppi & D. Wall, "Available Instruction-Level Parallelism for
   Superscalar and Superpipelined Machines", ASPLOS-III Proceedings, ACM
   SIGPLAN Notices, Vol. 24, May 1989

G. Hinton, "80960 - Next Generation", Proceedings of the 34th COMPCON,
   IEEE, March 1989

D. Kuck, Y. Muraoka & S.-C. Chen, "On the Number of Operations
   Simultaneously Executable in Fortran-Like Programs and Their Resulting
   Speedup", IEEE Transactions on Computers, Vol. C-21, No. 12, Dec. 1972

A. Nicolau & J. Fisher, "Measuring the Parallelism Available for Very Long
   Instruction Word Architectures", IEEE Transactions on Computers,
   Vol. C-33, No. 11, Nov. 1984

E. Riseman & C. Foster, "The Inhibition of Potential Parallelism by
   Conditional Jumps", IEEE Transactions on Computers, Dec. 1972

M. Smith, M. Johnson & M. Horowitz, "Limits on Multiple Instruction Issue",
   ASPLOS-III Proceedings, ACM SIGPLAN Notices, Vol. 24, May 1989

J. Thornton, "Parallel Operation in the Control Data 6600", AFIPS
   Proceedings FJCC, Vol. 26, 1964

R. Tomasulo, "An Efficient Algorithm for Exploiting Multiple Arithmetic
   Units", IBM Journal, Vol. 11, Jan. 1967

G. Tjaden & M. Flynn, "Detection and Parallel Execution of Independent
   Instructions", IEEE Transactions on Computers, Vol. C-19, No. 10,
   Oct. 1970

S. Weiss & J. Smith, "Instruction Issue Logic in Pipelined Supercomputers",
   IEEE Transactions on Computers, Vol. C-33, No. 11, Nov. 1984