dwc@hou2b.UUCP (D.CHEN) (06/16/85)
> A reasonable approach to producing CISC machines, then, would be to have the
> compiler writers design the instruction set; and to have the instruction set
> be changeable for different languages.

Funny how things come full circle. One way to implement the above is to have
dynamically loaded microprogrammed instruction sets. But as Dave Patterson
mentions in his paper (I think), the difficulty of writing and supporting
microprograms, along with the time penalty of microprogrammed logic (versus
hardwired logic), led to the RISC idea.
deel@sdcsvax.UUCP (Don Deel) (06/17/85)
>> A reasonable approach to producing CISC machines, then, would be to have the
>> compiler writers design the instruction set; and to have the instruction set
>> be changeable for different languages.
>
> Funny how things come full circle. One way to implement the above is to
> have dynamically loaded microprogrammed instruction sets. But as Dave
> Patterson mentions in his paper (I think), the difficulty of writing
> and supporting microprograms, along with the time penalty of
> microprogrammed logic (versus hardwired logic), led to the RISC idea.

Actually, there's a well-known large company that's already been around this
dynamically-changing-instruction-set bush... Anyone from Burroughs care to
comment on how the B1700 worked out after all these years?

For the uninitiated, the B1700 made (makes?) extensive use of its ability to
change dynamically which instruction set it was executing. The operating
system used one instruction set, the COBOL machine emulation used another,
and so on. The idea was that different languages have different needs as far
as instruction sets are concerned.
jer@peora.UUCP (J. Eric Roskos) (06/17/85)
> Funny how things come full circle. One way to implement the above is to
> have dynamically loaded microprogrammed instruction sets. But as Dave
> Patterson mentions in his paper (I think), the difficulty of writing
> and supporting microprograms, along with the time penalty of
> microprogrammed logic (versus hardwired logic), led to the RISC idea.

At this point I think we have more or less exhausted the basic tenets of the
two viewpoints here. You are certainly correct that this is the RISC
argument; yet I had it in mind when I made the statement you are commenting
on.

[Now, I am naturally a little biased in favor of dynamically loaded
microprogrammed instruction sets; my Master's thesis research was on the
foundations of the tools necessary to produce exactly that on a PE 3220 (see
Micro-14). It was subsequently implemented by Len B. Reed for his Master's
thesis, providing a microprogram image that was swappable on a per-user
basis; this was then built upon by E. Mike Carter, for part of his PhD
dissertation, to implement dynamic type-oriented vertical migration. All this
was done under the direction of R. I. Winner.]

A weak point of the RISC argument is that a lot of it rests on notions of
what is "easy" and what is "difficult": it is "difficult" to write and
maintain microcode; it is "easy" to build RISC machines. Furthermore, some
CISC machines are poorly designed, which lends some vague support to the
benefits of RISC machines. And of course, it has been shown that the
instruction set best suited to one program may be quite different from that
suited to another.

Thus we come down, in some sense, to the very old argument over the merits
and limitations of generating subroutines inline. I think the two issues have
fundamental similarities. If you have instruction sequences that recur
throughout a program, is it better to generate them anew each time, or to
generate a subroutine call to one copy of them? Is it possible to produce a
general-purpose subroutine that can be used by many programs, or will such a
routine be unusable for one reason or another? And what about the possibility
of discovering a new, better way to do the code sequence? In one case you can
simply change the subroutine; in the other you must generate the code all
over again. On the other hand, with compilers it's not that hard to
regenerate it.

I think the real merits of the RISC architecture lie outside the "RISC vs.
CISC" debate. People have a tendency to become conceptually entrenched in the
design of architectures of various types. Every once in a while, someone
comes along and does it all in a simple way, and people marvel at how well it
works and how simple it is. Operating systems were like this; Unix was a
demonstration of what you can and can't do with a much simpler approach than
what had been the focus of research.
-- 
Full-Name:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC; 2486 Sand Lake Road, Orlando, FL 32809-7642

	"Gnyx gb gur fhayvtug, pnyyre..."
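[To make the inline-versus-subroutine tradeoff above concrete, here is a
minimal C sketch; the function names and the arithmetic are invented for
illustration, not taken from anyone's actual code:

    #include <stdio.h>

    /* The recurring sequence factored into one subroutine: a later
     * improvement to the sequence is made here, once. */
    static int scale(int x) { return (x * 7 + 4) / 8; }

    int f(int a) { return scale(a) + 1; }   /* many callers, one copy */
    int g(int a) { return scale(a) - 1; }

    /* The inline alternative: the same sequence expanded at each use.
     * No call overhead, but changing the sequence means regenerating
     * every copy of it. */
    int f2(int a) { return ((a * 7 + 4) / 8) + 1; }
    int g2(int a) { return ((a * 7 + 4) / 8) - 1; }

    int main(void)
    {
        printf("%d %d %d %d\n", f(16), f2(16), g(16), g2(16)); /* 15 15 13 13 */
        return 0;
    }

The factored version pays a call on every use but can be improved in one
place; the expanded version is faster per use but must be regenerated
everywhere when the sequence changes -- which, as noted, a compiler can do
far more cheaply than a human.]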
mat@amdahl.UUCP (Mike Taylor) (06/18/85)
The implementation of Amdahl's large S/370-compatible processors might be of
interest to this discussion. While Amdahl has no choice about the
instruction-set architecture (defined by IBM), there is considerable latitude
in implementation.

The primary emphasis in the machine is fast execution of the common
instructions (load, store, branch, etc.). These can be executed at a
sustained rate of one per cycle (no interlocks or misses) and are effectively
implemented in hardware. More complex instructions are implemented by
microcode sequences, using a so-called nanostore which maps short microwords
into long control words. The most complex instructions are implemented in
software, called macrocode. Any operation (including one already supported in
the machine) can be defined as being implemented in macrocode. Loadable
tables in the CPU control the effective architecture.

Hardware support for macrocode includes a second register file and private
memory. When a macrocode-supported operation is detected, the second register
file is set up with the decoded instruction and operands, and a vectored
branch to the software routine is stuffed into the pipe. This costs very few
cycles. The software routine uses the second register file, avoiding
save/restore, and has special support to set up the return conditions. This
allows new instructions to be added readily, with good performance. The
software routines to support new instructions can be written using special
tools to generate essentially interlock-free code, which will keep the
functional units as busy as microcode could. The only real additional cost is
cache misses caused by the software routine which would not happen from
microstore.

This ability to redefine the instruction-set architecture dynamically is used
to support the simultaneous operation of SCP's requiring different
architectures, for example MVS/370 and MVS/XA (or UTS!). Other facilities
(addressing, I/O message and interrupt routing, timers, etc.) combined with
the above provide a kind of hardware virtual machine support called the
"Multiple Domain Facility" (MDF). Switches between domains simply cause the
architectural controls to be reloaded; the cost of this is only slightly
higher than any other state switch.
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's. ]
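[A rough sketch in C of the dispatch idea just described -- a loadable
per-opcode table plus the second register file -- may help readers follow
the mechanism. Every name and structure below is a hypothetical illustration,
not Amdahl's actual implementation:

    #include <stdint.h>

    enum impl { HARDWIRED, MICROCODE, MACROCODE };

    /* Loadable per-opcode table: says how each operation is implemented
     * in the currently effective architecture. Reloading this table is
     * part of a domain switch. */
    struct op_entry {
        enum impl how;
        void    (*macro_routine)(void);     /* macrocode entry point */
    };

    struct op_entry op_table[256];
    uint32_t shadow_regs[16];               /* the "second register file" */

    void dispatch(uint8_t opcode, uint32_t op1, uint32_t op2)
    {
        struct op_entry *e = &op_table[opcode];
        if (e->how == MACROCODE) {
            /* Hand the decoded instruction to the software routine in
             * the second register file, so it needn't save/restore the
             * user's registers, then vector to it. */
            shadow_regs[0] = opcode;
            shadow_regs[1] = op1;
            shadow_regs[2] = op2;
            e->macro_routine();
        }
        /* HARDWIRED and MICROCODE cases are handled by the machine. */
    }

Reloading op_table (and the other architectural controls) per domain is then
all a domain switch needs, which is why it costs little more than any other
state switch.]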
freeman@spar.UUCP (Jay Freeman) (06/20/85)
[There's no such thing as the line-ea<*GULP*>]

In article <1691@amdahl.UUCP> mat@amdahl.UUCP (Mike Taylor) writes:
>> The implementation of Amdahl's large S/370-compatible processors
>> might be of interest to this discussion. [concise description follows
>> -- JF]
>> More complex instructions are implemented by microcode
>> sequences, using a so-called nanostore which maps short microwords
>> into long control words. The most complex instructions are implemented
>> in software, called macrocode.

It sounds like the macrocode is user-definable; how 'bout the microcode?

>> Hardware support for macrocode includes a second register file

Interesting. Some of these concepts crop up even in the smallest of
microprocessors: the Z-80 has its "alternate" registers, and the recently
notorious 8<censored>86 has 256 different software interrupts, which many
people use as hooks to commonly-used library routines.

>> Switches between domains
>> simply cause the architectural controls to be reloaded; the cost of this
>> is only slightly higher than any other state switch.

Does microcode reload (or switch) as well?

The cited architecture sounds neat: it looks to me as if, in principle,
users have access to the machine at four levels (assuming the microcode is
rewritable):

    (ordinary) code      e.g., call a library routine
    macrocode            e.g., execute "machine instruction" FOO,
                         where FOO traps to a macrocode routine
    microcode/hardware   e.g., execute "machine instruction" BAR,
                         which actually IS a machine instruction
    hardware             e.g., change some microcode

Have I got that right?

Clever programming software ought to be able to look at an algorithm to be
implemented and perform an optimization that includes moving code (should we
call that "metacode") across all of these level boundaries.
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research) (canonical disclaimer)
larry@mips.UUCP (Larry Weber) (06/21/85)
> It is also "difficult" to write system software for that subclass of RISC
> machine that simplifies hardware to the point of requiring the software to
> allow for the resolution of pipeline data-dependencies. It would be an
> Interesting Task to do a code-generator for such a beast.
> ...
> Jay Reynolds Freeman (Schlumberger Palo Alto Research) (canonical disclaimer)

This is simply not true. The simplest way to resolve pipeline dependencies is
to put into the instruction stream exactly the number of NOP instructions
that the hardware would insert in a machine which has interlocks. This does
not make the program run any slower, because the hardware would have delayed
the program anyway. The real gain comes when you modify the order of
instructions to eliminate delay cycles. The complexity of the machine
determines the complexity of the algorithm; on simpler machines you can
actually resolve delays in about 200 lines of code.

A number of years ago I modified IBM's compiler for their systems language
for the 370 to resolve an interlock between computation and address
calculation. It resulted in a 6% improvement in performance; this modest
improvement is respectable when you consider that the compiler was already
highly tuned to produce very good code. On RISC machines it is possible to
achieve a much larger improvement, because you can schedule a wider class of
delays: branch delays, load delays, coprocessor delays, etc. I believe that
pipeline scheduling is an excellent "peephole" optimization whether the
hardware has interlocks or not.

The problem of code generation needn't be any harder than on any other
machine. In theory you would like to generate instructions that are good
candidates to resolve delays just produced or about to be produced; in
practice, all you have to do is not "over-work" a single register, which
would make all the instructions interdependent.
-- 
-Larry B Weber
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!larry
DDD:  415-960-1200
USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
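[As a concrete, if toy, illustration of the NOP-insertion half of this, here
is a sketch in C of a pass that enforces a one-cycle load delay in software.
The instruction representation is invented for the example, not taken from
any real compiler:

    #include <stdbool.h>
    #include <stddef.h>

    enum opcode { OP_NOP, OP_LOAD, OP_ADD, OP_STORE, OP_BRNZ };

    struct insn {
        enum opcode op;
        int dest;          /* register written, -1 if none */
        int src1, src2;    /* registers read,   -1 if none */
    };

    static bool reads(const struct insn *i, int reg)
    {
        return reg >= 0 && (i->src1 == reg || i->src2 == reg);
    }

    /* Copy code[0..n) to out[], inserting a NOP wherever an instruction
     * uses the result of the load immediately before it. out[] needs
     * room for 2*n entries; returns the new length. */
    size_t insert_load_delay_nops(const struct insn *code, size_t n,
                                  struct insn *out)
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++) {
            out[m++] = code[i];
            if (code[i].op == OP_LOAD && i + 1 < n &&
                reads(&code[i + 1], code[i].dest))
                out[m++] = (struct insn){ OP_NOP, -1, -1, -1 };
        }
        return m;
    }

The scheduling gain Larry describes would come from a second pass that, where
possible, swaps an earlier independent instruction into the slot instead of
leaving the NOP.]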
freeman@spar.UUCP (Jay Freeman) (06/21/85)
[beware of the line-eat<*CHOMP*>]

In article <145@mips.UUCP> larry@mips.UUCP (Larry Weber) writes:
>> It is also "difficult" to write system software for that subclass of RISC
>> machine that simplifies hardware to the point of requiring the software to
>> allow for the resolution of pipeline data-dependencies. It would be an
>> Interesting Task to do a code-generator for such a beast.
>> ...
>> [from me -- JF]
>
> This is simply not true. [...] The problem of code generation needn't be
> any harder than on any other machine [...]

You are certainly correct that putting in NOPs provides a quick and dirty
resolution of these dependencies. I erred in not stating precisely that I
meant "doing it right". I believe it is important to do it right, because
adding substantial numbers of NOPs to code -- possibly as many as or more
than the code has "real" instructions -- would appear likely to slow down
execution enough that a RISC no longer has a speed advantage over a CISC.
(I have no data -- is that correct?)

I will also accept that a peephole optimizer for removing pipeline
dependencies need not be longer than a few hundred lines. However, at least
some global data-flow analysis appears desirable, to ensure that cycles are
not wasted in unnecessary resolution after jumps and calls. It is well known
that many programs spend most of their time in relatively short loops, so
adding NOPs needlessly at the start of each loop could hurt. (Again, no data
-- the key issues would appear to be how short the "typical" loop is, how
many NOPs the particular RISC requires to flush its pipelines, and what
actual advantage the RISC with fully-optimized code has over its competitors
anyway.) A similar issue may exist with respect to call overhead.

Notwithstanding, I think I still see a problem with respect to getting good
system software for these beasties. I reason as follows: much of system
software requires well-optimized code, so we either write it in a high-level
language for which we have a compiler that generates good code, or else we
go in and hand-code the major bottlenecks in assembler. I submit that
optimizing pipeline dependencies by hand, in assembler, is likely to provide
a rich source of subtle, persistent, and fascinating bugs. Great fun [insert
:-) here, I think], but not necessarily conducive to lots of good code.

I also suggest that decent optimizing compilers are not really
across-the-board state-of-the-art for the microcomputer industry yet. I have
a sense that although they exist, they are often slow to come out, slow to
be updated and made bug-free, and relatively scarce. If the Acme RISC
Machine Co. introduces a new wonder chip, how long will it be before such
tools are available in sufficient variety to suit all the eager developers?

To the extent that these arguments are convincing, I am induced to speculate
that the success or failure of computer systems based on RISC-machine
hardware may be embarrassingly tightly linked to the state of the
system-software industry. I am not sure that my arguments are correct, and I
am not confident of my perception of the software industry, either. What do
other people think?
-- 
Jay Reynolds Freeman (Schlumberger Palo Alto Research) (canonical disclaimer)
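[For a rough sense of the slowdown in question, here is a back-of-envelope
sketch in C. The 30% load/branch frequency and the one unfilled delay slot
per instance are invented placeholders, not measurements of any machine:

    /* If a fraction f of instructions carry d unfilled delay slots each,
     * the effective CPI of an otherwise one-cycle-per-instruction RISC
     * is 1 + f*d. */
    #include <stdio.h>

    int main(void)
    {
        double f = 0.30;  /* loads + branches, as fraction of all insns */
        double d = 1.0;   /* average unfilled delay slots per such insn */
        double cpi = 1.0 + f * d;
        printf("effective CPI = %.2f (%.0f%% slower than ideal)\n",
               cpi, (cpi - 1.0) * 100.0);   /* prints 1.30, 30% */
        return 0;
    }

Under those made-up numbers, unfilled slots cost 30% -- large enough to
matter, and small enough that a scheduler which fills most slots recovers
nearly all of it.]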
cdshaw@watmum.UUCP (Chris Shaw) (06/23/85)
Does not the average RISC-type machine do a purge-pipe action upon receipt of
a branch instruction, or is there something so incredibly obvious going on
here that I am missing? The reason I ask is that in all but the most trivial
loops, throwing in enough NOPs to fill the pipe so that the next instruction
is the right one strikes me as grossly inefficient.

The alternative with a simple "purge-pipe" is to repeat enough of the
beginning of the loop that you end up doing the right instructions if you
branch to continue the loop. Otherwise you purge, and continue after the
repeated instructions without actually executing any of them. Thus we have a
bogus program with a 2-element pipe:

    start:   load a
             store b
    start+2: add c
             store d
             decr e
             brnz start+2
             load a    \  do these 2 if branch taken, else throw them away
             store b   /  ... (purge) and do the xor
             xor a     ---> continue here because "load" & "store" were
             ...            purged when the branch was not taken

(Forgive me if this was utterly stupid.)

Chris Shaw    watmath!watmum!cdshaw  or  cdshaw@watmath
University of Waterloo

In doubt? Eat hot high-speed death -- the experts' choice in gastric vileness!
mat@amdahl.UUCP (Mike Taylor) (06/24/85)
>> More complex instructions are implemented by microcode
>> sequences, using a so-called nanostore which maps short microwords
>> into long control words. The most complex instructions are implemented
>> in software, called macrocode.
>
> It sounds like the macrocode is user-definable; how 'bout the microcode?

In principle, yes. As a matter of policy, customers are discouraged from
changing either, simply because of the maintenance and diagnostic
implications. However, both macrocode and microcode are stored on a hard disk
which is part of the console (service processor) subsystem, and loaded during
the system initialization sequence.

> Does microcode reload (or switch) as well?

No. All microcode is loaded all the time. If a particular instruction is not
part of the architecture for a given domain, it is logically disabled, but
the microcode is still there.

> The cited architecture sounds neat: it looks to me as if, in principle,
> users have access to the machine at four levels (assuming the microcode is
> rewritable):
>
>     (ordinary) code      e.g., call a library routine
>     macrocode            e.g., execute "machine instruction" FOO,
>                          where FOO traps to a macrocode routine
>     microcode/hardware   e.g., execute "machine instruction" BAR,
>                          which actually IS a machine instruction
>     hardware             e.g., change some microcode
>
> Have I got that right?

Yes.

> Clever programming software ought to be able to look at an algorithm to be
> implemented and perform an optimization that includes moving code (should
> we call that "metacode") across all of these level boundaries.

Probably. However, in practice, the biggest hits come from optimizing
pipeline performance by removing interlocks (i.e., reordering instructions so
that they don't interfere) and from optimizing cache performance. The 580 has
a 5-stage pipeline, so execute/generate interlocks such as:

    l    5,4(3,4)
    a    6,0(5)

where the result of the first instruction is needed to calculate the
effective address of the second, cost three cycles; fixing this common
sequence by doing something else helps a lot. Also, the 580 has dual caches
(instruction and operand). The instruction cache is read-only, so stores to
lines in the I-cache are very expensive. Good code will also know how to use
the background prefetching properly, to reduce cache misses dramatically.
Good macrocode software will mostly achieve the theoretical maximum execution
rate (41 MIPS on RX-type instructions), whereas something like 12 MIPS would
be more normal when the MVS operating system is running.
-- 
Mike Taylor                        ...!{ihnp4,hplabs,amd,sun}!amdahl!mat

[ This may not reflect my opinion, let alone anyone else's. ]
mash@mips.UUCP (John Mashey) (06/30/85)
Chris Shaw (watmath!watmum!cdshaw) writes:
> Does not the average RISC-type machine do a purge-pipe action upon receipt
> of a branch instruction, or is there something so incredibly obvious going
> on here that I am missing? ....
> Thus we have a bogus program with a 2-element pipe:
>
>     start:   load a
>              store b
>     start+2: add c
>              store d
>              decr e
>              brnz start+2
>              load a    \  do these 2 if branch taken, else throw them away
>              store b   /  ... (purge) and do the xor
>              xor a     ---> continue here because "load" & "store" were
>              ...            purged when the branch was not taken

The idea in RISCs with branch-delay slots is to NEVER purge the pipe, except
in interrupt processing. A typical RISC design like the above would ALWAYS
execute the 2 instructions following the branch, so the pipe would never be
purged. A more likely code sequence would be:

    start:   load a
    start+1: store b
    start+2: add c
             decr e
             brnz start+1
             store d
             load a
             xor a

[Note: the program is bogus, so it's hard to tell exactly what is going on,
but the 2nd example is much more in the spirit of the way RISCs work.]

Worth reading are 2 excellent articles: Dave Patterson's Jan 85 CACM article
and John Hennessy's Dec 84 IEEE Trans. on Computers one.
-- 
-john mashey
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  415-960-1200
USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
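[The transformation shown by hand above -- filling the slot after the branch
with useful work instead of purging -- is mechanical enough to sketch. Below
is a toy single-delay-slot version in C; the instruction representation and
function names are invented for illustration, not MIPS's actual tools:

    /* Rewrite [X, BRNZ, NOP] as [BRNZ, X]: X moves into the branch delay
     * slot and the NOP disappears. Legal when X is not itself a branch,
     * nothing branches to X, and the branch does not read a register X
     * writes -- since X executed unconditionally before, and still does. */
    #include <stddef.h>

    enum opcode { OP_NOP, OP_LOAD, OP_ADD, OP_STORE, OP_BRNZ };

    struct insn {
        enum opcode op;
        int dest, src;     /* -1 if unused; BRNZ reads src */
    };

    /* Copy code[0..n) to out[] with slots filled; returns new length. */
    size_t fill_delay_slots(const struct insn *code, size_t n,
                            struct insn *out)
    {
        size_t m = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 2 < n &&
                code[i + 1].op == OP_BRNZ && code[i + 2].op == OP_NOP &&
                code[i].op != OP_BRNZ &&
                !(code[i].dest >= 0 && code[i].dest == code[i + 1].src)) {
                out[m++] = code[i + 1];   /* the branch          */
                out[m++] = code[i];       /* X, now in the slot  */
                i += 2;                   /* the NOP disappears  */
            } else {
                out[m++] = code[i];
            }
        }
        return m;
    }

A real scheduler would also try the branch target (as in the rewritten loop
above) when nothing before the branch can move, which is where the global
analysis Jay asked for earlier comes in.]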
alan@drivax.UUCP (Alan Fargusson) (06/30/85)
> Does not the average RISC-type machine do a purge-pipe action upon receipt
> of a branch instruction, or is there something so incredibly obvious going
> on here that I am missing? The reason I ask is that in all but the most
> trivial loops, throwing in enough NOPs to fill the pipe so that the next
> instruction is the right one strikes me as grossly inefficient.
>
>     start:   load a
>              store b
>     start+2: add c
>              store d
>              decr e
>              brnz start+2
>              load a    \  do these 2 if branch taken, else throw them away
>              store b   /  ... (purge) and do the xor
>              xor a     ---> continue here because "load" & "store" were
>              ...            purged when the branch was not taken

The Berkeley-type RISC (and most others, I think) delays the action of the
branch for one cycle. Thus, for your example program, a NOP would have to be
placed after the "brnz start+2". In many cases the NOP can be optimized out
by a smart compiler.
-- 
Alan Fargusson.

{ ihnp4, amdahl, mot }!drivax!alan