joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) (12/07/90)
I would like some input on the following idea to extend the life of CISC processors.

Consider a hypothetical machine: IM 68386C (CISCized). First, determine the dynamic instruction profile of the target mix. If the target is engineering programs, then determine the dynamic frequency of all instructions. (A LOAD with indirect addressing and a LOAD with direct addressing are considered different instructions in the context of this posting.) Then, rank the instructions from highest frequency to lowest. Exclude I/O instructions. Suppose that there are a total of "n" non-I/O instructions, that I[1] is the instruction with the highest frequency, and that I[n] is the instruction with the lowest frequency. The ranking might look something like the following:

    instruction    dynamic frequency
    I[1]           22%
    I[2]           8%
      .              .
      .              .
      .              .
    I[n - 1]       0.002%
    I[n]           0.001%

In a CISC chip, there is a certain redundancy. In other words, some of the complex instructions can be written in terms of the simpler instructions. An instruction to move a block of data from one place in memory to another place can be replaced by a loop of simpler LOAD and STORE instructions.

Now, from the ranking of the instructions, determine the smallest "i" such that all I[j] with "j > i" can be written in terms of the I[k] with "k <= i". Designate the set of the first i instructions from the above ranking to be the "RISC Set". Designate the set of I/O instructions to be the "I/O Set". Relabel "i" to be "M", the minimal number. Designate the remaining instructions from the above ranking to be the "CISC Set".

Now, using timing analysis, estimate the performance of implementing the RISC Set and the I/O Set in hardware and implementing the CISC Set as subroutines in a microcode store. These subroutines are written with instructions from the RISC Set. Whenever an instruction from the CISC Set is encountered in the instruction stream, it causes a trap to the appropriate subroutine in the microcode store.
Essentially, what we have is a RISC machine with some subroutines coded into ROM. There might need to be additional registers over and above those in the programmer's model of the IM 68386C in order to maintain information like the following:

(1) whether the processor is executing instructions in a subroutine in microcode rather than in the normal instruction stream from main memory
(2) the address of the current source byte of memory and the destination to which the byte is transferred by a CISC Set block-move instruction
(3) etc.

Designate these additional registers "Extra Registers". Naturally, they would be saved just prior to the servicing of an interrupt.

The great thing about the IM 68386R (RISCized) processor is that super-scalarizing it will be no harder than for a RISC processor, because we now essentially have a RISC processor (one with subroutines microcoded to handle CISC Set instructions). We will only be super-scalarizing the RISC Set, _not_ the full set of the IM 68386C. The other great thing is that the IM 68386R is upward compatible with the IM 68386C and can use its large installed base of programs.

By the way, IM 68386C is a labeling derived from 68xxx (Motorola = M) and xx386 (Intel = I).
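[Editor's sketch: the ranking-and-partition procedure described above, in Python. The frequency table and the expressibility test are invented placeholders, not measured data; a real study would profile actual binaries and encode real instruction semantics.]

```python
# Sketch of the RISC Set / CISC Set partition described in the posting.
# profile: (instruction, dynamic frequency) pairs, any order.
# expressible(instr, subset): True if instr can be written using only
# instructions in subset. Both inputs here are hypothetical stand-ins.

def partition(profile, expressible):
    ranked = [name for name, _ in sorted(profile, key=lambda p: -p[1])]
    # Find the smallest i such that every instruction ranked below i
    # can be expressed with the instructions ranked at or above it.
    for i in range(1, len(ranked) + 1):
        risc = ranked[:i]
        if all(expressible(instr, risc) for instr in ranked[i:]):
            return risc, ranked[i:]
    return ranked, []

profile = [("LOAD", 0.22), ("STORE", 0.08), ("ADD", 0.07),
           ("MOVBLK", 0.002), ("EDIT", 0.001)]
# Invented rule: only the two complex instructions reduce to the rest.
expressible = lambda instr, subset: instr in ("MOVBLK", "EDIT")

risc, cisc = partition(profile, expressible)
# risc is the hardwired set; cisc traps to microcode subroutines.
```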
msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) (12/07/90)
In <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp> joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:

>I would like some input on the following idea to extend the life of
>CISC processors.
>Consider a hypothetical machine: IM 68386C (CISCized).
>First, determine the dynamic instruction profile of the target mix.
>If the target is engineering programs, then determine the dynamic
>frequency of all instructions. (A LOAD with indirect addressing
>and a LOAD with direct addressing are considered different instructions
>in the context of this posting.)
>Then, rank the instructions from highest frequency to lowest.
>Exclude I/O instructions. Suppose that there are a total of "n" non-I/O
>instructions, that I[1] is the instruction with the highest frequency,
>and that I[n] is the instruction with the lowest frequency.

[deletion]

>In a CISC chip, there is a certain redundancy. In other words,
>some of the complex instructions can be written in terms of the
>simpler instructions. An instruction to move a block of data
>from one place in memory to another place can be replaced
>by a loop of simpler LOAD and STORE instructions.

[deletion]

>Now, using timing analysis, estimate the performance of
>implementing the RISC Set and the I/O Set in hardware
>and implementing the CISC Set as subroutines in
>a microcode store. These subroutines are written with
>instructions from the RISC Set. Whenever an instruction from
>the CISC Set is encountered in the instruction stream, it
>causes a trap to the appropriate subroutine in the
>microcode store. Essentially, what we have is a
>RISC machine with some subroutines coded into ROM.

[deletion]

What about instruction decode? RISC machines tend to have fixed-format instructions that are easy to decode (e.g. all instructions are 32 bits, the first 6 are opcode, the next 15 specify registers, the rest immediate data). CISCs tend to have instructions of varying length and format.
RISCs tend to have alignment restrictions to a greater extent than CISCs. You lose some of the benefits of RISC if you have to deal with these things.

Does anyone know how the internals of the 80486 and 68040 compare to this scheme?
--
Michael Pereckas  *  InterNet: m-pereckas@uiuc.edu  *
just another student...   (CI$: 72311,3246)
Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado
lewine@cheshirecat.rtp.dg.com (Donald Lewine) (12/07/90)
In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
|>
|> In a CISC chip, there is a certain redundancy. In other words,
|> some of the complex instructions can be written in terms of the
|> simpler instructions. An instruction to move a block of data
|> from one place in memory to another place can be replaced
|> by a loop of simpler LOAD and STORE instructions.
|>
|> Now, using timing analysis, estimate the performance of
|> implementing the RISC Set and the I/O Set in hardware
|> and implementing the CISC Set as subroutines in
|> a microcode store.

That was exactly what was done in the MicroVAX architecture back in 1982. The more complex instructions were emulated using the simple instructions. There was some cleverness in using hardware to decode the full set of VAX instructions and then call software to do the rest.

This does not give you a RISC in the sense of architectural purity. The VAX (or 386 or 68K) instruction stream is still a bear to decode and does many things that violate the RISC religion. You have merely proposed a new way to implement a CISC machine. The VAX 9000 also uses a technique very similar to the one you describe.

***HOWEVER***, the advantage of RISC is moving work from runtime to compile time. The big speedup comes from compiler work, not hardware. At Data General we have modified some of the compilers for our CISC MV-series to compile simple code instead of using instructions like WEDIT. This has produced major performance enhancements because a compiler can generate special-case code.
--------------------------------------------------------------------
Donald A. Lewine                 (508) 870-9008 Voice
Data General Corporation         (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580  U.S.A.
uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com
brandis@inf.ethz.ch (Marc Brandis) (12/07/90)
In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp> joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
>I would like some input on the following idea to extend the life of
>CISC processors.
>
>Consider a hypothetical machine: IM 68386C (CISCized).
>First, determine the dynamic instruction profile of the target mix.
[ some stuff deleted ]
>Then, rank the instructions from highest frequency to lowest.
[ some more deleted ]
>In a CISC chip, there is a certain redundancy. In other words,
>some of the complex instructions can be written in terms of the
>simpler instructions.
[ some stuff deleted ]
>Now, from the ranking of the instructions, determine the
>smallest "i" such that all I[j] with "j > i" can be written
>in terms of the I[k] with "k <= i". Designate the set of
>the first i instructions from the above ranking to
>be the "RISC Set".
[ some stuff deleted ]
>Now, using timing analysis, estimate the performance of
>implementing the RISC Set and the I/O Set in hardware
>and implementing the CISC Set as subroutines in
>a microcode store. These subroutines are written with
>instructions from the RISC Set.

This is exactly what modern computer architecture is all about: look for the frequently encountered cases and optimize those, while accepting some overhead for the less common ones. This technique has been heavily used in the design of RISC processors, but it is not restricted to this area, of course. Dynamic distributions have also been used to optimize modern CISC chips.

The feasibility of encoding the less common instructions as combinations of the "RISC set" in microcode ROM depends heavily on how well you can express their functionality using the "RISC set". It also depends on how much overhead you have to pay for the switch to microcode. The Intel i960 CA User's Manual states that the switch to microcode can be done in 0 cycles, but I am not sure that this is easily achieved.
Note that it is not always easy to express complex instructions on a CISC processor in terms of simple instructions. Complex instructions often have a lot of side effects, and you have to simulate them correctly. One thing causing trouble is the condition-code register: some of the simpler instructions that you would like to use to simulate the complex ones may change the condition code in a way that does not match the semantics of the complex instruction. You can get rid of the problem by introducing some new instructions (whether they are usable only from the microcode ROM or not is a different issue), but it is not an easy task.

Moreover, one of the tough parts in designing high-performance CISC processors is making the instruction decoder fast. RISC processors typically have very simple and regular instruction sets, where each instruction has the same size. This makes decoding them straightforward. Implementing an instruction decoder that can decode one instruction per cycle for a complex instruction set is hard to do, as there are a lot of different formats to be considered. Note that instruction decoding in a CISC environment does not naturally lead to pipelined solutions, as you need the size of the previous instruction in order to begin decoding the current instruction in the right place.

>The great thing about the IM 68386R (RISCized) processor is
>that super-scalarizing it will be no harder than for
>a RISC processor, because we now essentially have
>a RISC processor (one with subroutines microcoded to
>handle CISC Set instructions). We will only be super-scalarizing
>the RISC Set, _not_ the full set of the IM 68386C.

No, here I disagree, for two reasons. First, you have to treat the stream of instructions as if the complex instruction had been replaced in place by the stream from the microcode ROM.
As this stream was originally designed as one instruction, it has a high likelihood of having a lot of dependencies in it, so that there is not a lot of parallelism to be gained. You may get rid of this problem by using huge reservation stations and at least one level of speculative execution, but this means a lot of hardware.

Second, as you said before, instructions from the RISC set are often encountered in the program. If you want to achieve an execution rate of more than one instruction per cycle, your decoder (the one decoding the CISC instruction set) has to decode more than one instruction per cycle. As I already said, it is pretty hard to design such a decoder that is able to decode one instruction per cycle, let alone one that can do multiple instructions per cycle. Note that each instruction has to be decoded after the other because of the varying size of the instructions. One way to solve this is to speculatively decode instructions starting at different offsets and then to discard the wrong ones. Let us assume you want to decode three instructions in the 386 instruction set per cycle on the average. The average instruction length on the 386 is 4.6 bytes, as I remember. So with 14 (!!!) instruction decoders you should have a reasonable chance to get 3 instructions decoded per cycle.

>The other great thing is that the IM 68386R is upward compatible
>with the IM 68386C and can use its large installed base of
>programs.

Here I heavily disagree. It would be better to get away from these architectures as soon as possible. Note that every hour these machines are around, new software is being written for them (software that may not be easily ported to other architectures), giving more and more weight to your argument.

Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch
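[Editor's sketch: the decoder arithmetic above checked directly. With one speculative decoder per possible byte offset in the window of bytes consumed per cycle, the count falls out of the quoted average instruction length.]

```python
import math

# Figures as quoted in the posting above.
avg_len = 4.6          # average 386 instruction length in bytes
target_per_cycle = 3   # desired sustained decode rate

# A variable-length instruction may start at any byte, so a speculative
# decoder bank needs one decoder per byte offset in the window of bytes
# consumed per cycle (avg_len * target_per_cycle bytes on average).
decoders = math.ceil(avg_len * target_per_cycle)
```

This reproduces the "14 (!!!)" decoders in the posting; it is an average-case estimate, since a run of long instructions would need an even wider window.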
kls30@duts.ccc.amdahl.com (Kent L Shephard) (12/07/90)
That sounds like the '040 and the i486. Both have RISC cores and use microcode for the more complicated instructions. Both companies saw that the only way to gain speed was to hardwire most of the processor and put a decent pipeline in them. They also solve some memory problems with on-chip cache. The i486 now does loads and stores in one clock cycle, and if you don't worry about the pipeline latency, the i486 running sequential code is very fast.

Both processors have performance better than 1st-generation RISC, i.e. the first SPARC from Sun. (The 460 was the first.) I've heard that the i586 will be out around '92 and will be superscalar. (But that's just rumours.)

Kent
--
/* -The opinions expressed are my own, not my employer's.  */
/* For I can only express my own opinions.                 */
/*                                                         */
/* Kent L. Shephard : email - kls30@DUTS.ccc.amdahl.com    */
herrickd@iccgcc.decnet.ab.com (12/08/90)
In article <1990Dec7.061826.28241@ux1.cso.uiuc.edu>, msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:
> In <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp> joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
>
> [Description of RISC architecture with CISC instructions microcoded deleted]
>
> What about instruction decode? RISC machines tend to have
> fixed-format instructions that are easy to decode. (i.e. all
> instructions are 32 bits, the first 6 are opcode, the next 15 specify
> registers, the rest immediate data). CISCs tend to have instructions
> of varying length and format. RISCs tend to have alignment
> restrictions to a greater extent than CISCs. You lose some of the
> benefits of RISC if you have to deal with these things.
> Does anyone know how the internals of the 80486 and 68040 compare to
> this scheme?

Cannot we preserve the intent of the original poster by adding one RISC instruction, "Nonsense Coming", that means the next few words of instruction memory contain data to be interpreted by the microcoded CISC program? Constrain the length of the nonsense to preserve the RISC program alignment requirements.

Or, even, put the ROMs holding the microcode in the primary address space of the machine and invoke these CISC instruction subroutines the same way as any other subroutine: with a RISC call.

dan herrick
herrickd@astro.pc.ab.com
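[Editor's sketch: a toy fetch loop for the "Nonsense Coming" escape described above. The opcode value and the fixed payload length are invented for illustration; a real design would pick them to fit the encoding and alignment rules.]

```python
# Toy instruction stream: fixed one-word RISC opcodes, plus an escape
# opcode telling the decoder that the next PAYLOAD_WORDS words are CISC
# data to hand to the microcoded emulator. Values are made up.
NONSENSE_COMING = 0xFF
PAYLOAD_WORDS = 2          # a constrained length preserves alignment

def decode(stream):
    out, pc = [], 0
    while pc < len(stream):
        op = stream[pc]
        if op == NONSENSE_COMING:
            # Skip over the payload; the emulator interprets it.
            payload = stream[pc + 1 : pc + 1 + PAYLOAD_WORDS]
            out.append(("cisc_emulate", tuple(payload)))
            pc += 1 + PAYLOAD_WORDS
        else:
            out.append(("risc", op))
            pc += 1
    return out
```

The point of the fixed payload length is visible in the loop: the decoder always knows where the next instruction starts without interpreting the CISC data itself.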
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (12/08/90)
In article <d5FA02R903ql01@JUTS.ccc.amdahl.com> kls30@DUTS.ccc.amdahl.com (Kent L. Shephard) writes:
| Both processors have performance better than 1st-generation RISC,
| i.e. the first SPARC from Sun. (The 460 was the first.) I've heard
| that the i586 will be out around '92 and will be superscalar. (But
| that's just rumours.)

As editor of the 386-users mailing list I see a lot more rumors than almost anyone else in the world, and believe fewer ;-) A number of magazines have reported that Compaq and IBM are pushing Intel to get the 586 out in '91 because clone makers like AST are talking about making RISC-based clone PCs. Mars Microsystems makes a SPARC clone with a 386 added, running DOS, which certainly looks like a step in this direction. Note what I said about belief: this is not written in stone, but I hear it from a lot of people.

As to the 586, the only consistent thing I hear is that there will be on-chip support for windows. I don't know if that means MS or X, but I find it hard to believe that Intel would be so stupid as to do anything which wouldn't at least be highly useful for both. And I would also suspect that the number of customers for the 586, at least initially, will be greater for UNIX than DOS. That may be true of the 486 now, I don't know.

Assuming that the 586 does have support for windows in general, the price of performance will go down again. One chip with FPU, MMU, and windows hardware takes less {power, space, pins, glue chips} than any multichip solution. That could lead to some killer workstation-class machines at PC prices.
--
bill davidsen (davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
VMS is a text-only adventure game. If you win you can use unix.
msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) (12/08/90)
In <2339.275f7e44@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:
>In article <1990Dec7.061826.28241@ux1.cso.uiuc.edu>, msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:
>
>> [Description of RISC architecture with CISC instructions microcoded,
>> and my questions about instruction decode, deleted]
>
>Cannot we preserve the intent of the original poster by adding
>one RISC instruction, "Nonsense Coming", that means the next
>few words of instruction memory contain data to be interpreted
>by the microcoded CISC program. Constrain the length of the
>nonsense to preserve the RISC program alignment requirements.
>Or, even, put the roms holding the microcode in the primary
>address space of the machine and invoke these CISC instruction
>subroutines the same way as any other subroutine. With a RISC
>call.

But then it wouldn't be compatible anymore. What's the point? It might be possible to design a system that allows you to automagically translate a binary compiled for the old CISC, but I suspect that this would be very hard to do and that it wouldn't work very well. And chances are that nobody would want to use it. If the translator is imperfect (likely) then end-users won't want to try it and probably couldn't (no sources to work from). The vendor might well decide it would be easier to port it to a normal RISC.

This might make it easier to port stuff written in assembly, but then, you wrote it in assembly to get speed, right? Rewrite it for the RISC and it will go faster.
--
Michael Pereckas  *  InterNet: m-pereckas@uiuc.edu  *
just another student...   (CI$: 72311,3246)
Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado
mike@cs.umn.edu (Mike Haertel) (12/08/90)
Certainly RISCizing a CISC processor has been done, in the 68040 and 80486. The big problem I see with it is: why waste all that silicon space on the hair necessary to pipeline a complex instruction set? I think it would be far more worthwhile to spend it on things like faster multipliers or larger caches.
--
Mike Haertel <mike@ai.mit.edu>
"There are two ways of constructing a software design. One way is to make it so simple that there are obviously no deficiencies, and the other is to make it so complicated that there are no obvious deficiencies." -- C. A. R. Hoare
suitti@ima.isc.com (Stephen Uitti) (12/08/90)
> What about instruction decode? RISC machines tend to have
> fixed-format instructions that are easy to decode.

The 386 & 68K are easy compared to a VAX. However, this sort of thing has been done for VAXen.

> RISCs tend to have alignment
> restrictions to a greater extent than CISCs.

This is one of those things that must be dealt with. It can be done with traps - slowly. If you have unaligned references on a VAX 780, it is slower. Most people won't notice. Even relatively dumb pcc-based C compilers attempted to make this unlikely. Some examples: static data is aligned, malloc returns aligned pointers. It still happens.

The uVAX II, for example, does not implement the entire VAX set. There wasn't enough room, or something. The OS gets traps and emulates the instructions. This has been done for floating point for years on all sorts of machines.

> If the translator is imperfect (likely) then end-users won't want
> to try it and probably couldn't (no sources to work from).

This isn't a new thing. For example, Interactive's UNIX for the 386 emulates a 387 if one isn't there. Think a 387 is simple to emulate? ...easy to test? Intel doesn't always get the chips right the first time either. In fact, people who produce CPUs often get it wrong for a while on their 2nd and later generations. That doesn't mean it can't or won't be done. And it doesn't mean it isn't worth doing. Actually, there are lots of timing problems in systems that get shipped. Sometimes customers find them. I've had software that ran properly have a couple of non-repeatable glitches here and there after only three or four months of CPU time on what most of us would call very reliable machines. It happens.

The other thing you can do is design your hardware so that most of the instructions (that are used) get run in a cycle, and the CPU does the less-used stuff in microcode. It can still be on chip. You won't get the advantage of not using a big chip. You won't get the high-speed instruction decode you'd get from RISC. These are solvable - larger chips, more chips, multiple decoders, etc. It can be OK to spend money on the CPU for systems if the CPU costs are low relative to the systems. On the other end of the spectrum, people still produce weird 4-bit systems that are hard to program, that don't have lots of RAM or ROM, that don't have expandability for RAM, ROM, I/O, or anything, just because the CPU chip is the system, and thousands or millions of them are to be produced.

Everybody wants a faster system. Yet, there are lots of people whose primary programming vehicle is the shell...

Stephen.
suitti@ima.isc.com

"We Americans want peace, and it is now evident that we must be prepared to demand it. For other peoples have wanted peace, and the peace they received was the peace of death." - the Most Rev. Francis J. Spellman, Archbishop of New York. 22 September, 1940
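[Editor's sketch: what trap-based handling of an unaligned reference, as described above, amounts to - reassembling the word from individual byte loads. The little-endian byte order and the list-as-memory model are illustrative assumptions.]

```python
# Emulate an unaligned 32-bit little-endian load with four byte loads,
# roughly the slow path an alignment-trap handler takes. 'mem' is a
# plain list of byte values standing in for memory.

def unaligned_load32(mem, addr):
    word = 0
    for i in range(4):
        # Byte loads are alignment-safe; shift each into position.
        word |= mem[addr + i] << (8 * i)
    return word

mem = [0x78, 0x56, 0x34, 0x12, 0xAA, 0xBB, 0xCC, 0xDD]
```

A hardware fast path would handle the aligned case in one access and trap only when `addr` is not a multiple of four, which is why the penalty shows up only on unaligned references.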
hrubin@pop.stat.purdue.edu (Herman Rubin) (12/08/90)
In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
> I would like some input on the following idea to extend the life of
> CISC processors.
.......................
>    instruction    dynamic frequency
>    I[1]           22%
>    I[2]           8%
>      .              .
>      .              .
>      .              .
>    I[n - 1]       0.002%
>    I[n]           0.001%
>
> In a CISC chip, there is a certain redundancy. In other words,
> some of the complex instructions can be written in terms of the
> simpler instructions. An instruction to move a block of data
> from one place in memory to another place can be replaced
> by a loop of simpler LOAD and STORE instructions.

What about the operations which did not appear in the sample? Calculations using high-precision arithmetic may not even be identified as such. What about conversion between integer and floating point? On machines with the possibility of unnormalized floating point, would they even be recognized? On machines such as the IBM 360 series or the RS/6000, for which the conversions are clumsy and already complex, would they be noticed?

Suppose that the CISC machine being analyzed has common integer and floating registers. Would the analysis catch the cases in which Boolean operations are used on floats? Suppose the machine has unaligned capabilities. Would the analysis catch those cases in which it was deliberately used in the algorithm?

What we need is not the analysis of the current bad software for the needed instructions, but to ask the few who can think up new ways of using the natural capabilities of hardware what can be useful. Even then, much will be missed. Also, what is a simple instruction? Which is conceptually simpler: finding the distance to the next one in a bit stream, with the attendant problems about running out of bits, etc., or the clumsy way this must be approached on the so-called "efficient" architectures?
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907-1399
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin (UUCP)
dlau@mipos2.intel.com (Dan Lau) (12/11/90)
In article <1200@dg.dg.com> uunet!dg!lewine writes:
>In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
> ***HOWEVER***, the advantage of RISC is moving work from
> runtime to compile time. The big speedup comes from compiler
> work not hardware. At Data General we have modified some of
> the compilers for our CISC MV-series to compile simple code
> instead of using instructions like WEDIT. This has produced
> major performance enhancements because a compiler can generate
> special case code.

I don't understand the comment above about the MV-series compilers. Are you saying that after DG changed the MV-series compilers to generate simple code, there was a major performance improvement (over the complex code)? Or are you saying that "because a compiler can generate special case code" (i.e., very complex instructions like WEDIT), there was a major performance enhancement over the simple code?

I am confused; can you please clarify the above? Thanks.

Dan Lau
hassey@matrix.rtp.dg.com (John Hassey) (12/11/90)
In article <1311@inews.intel.com> dlau@mipos2.UUCP (Dan Lau) writes:
>In article <1200@dg.dg.com> uunet!dg!lewine writes:
>>In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
>> ***HOWEVER***, the advantage of RISC is moving work from
>> runtime to compile time. The big speedup comes from compiler
>> work not hardware. At Data General we have modified some of
>> the compilers for our CISC MV-series to compile simple code
>> instead of using instructions like WEDIT. This has produced
>> major performance enhancements because a compiler can generate
>> special case code.
>
>I don't understand the comment above about the MV-series compilers.
>Are you saying that after DG changed the MV-series compilers to generate
>simple code, there was a major performance improvement (over the complex
>code)? Or are you saying that "because a compiler can generate special
>case code" (i.e., very complex instructions like WEDIT), there was a
>major performance enhancement over the simple code?

While not the original poster, I think I can clarify the above statement. The DG Eclipse MV series has quite a few very complex instructions to handle things like Cobol data types (packed decimal etc...). WEDIT is used to do an "edited" store of a decimal number. It takes a source and destination pointer and the address of an "edit" subprogram. Most of these instructions had a fairly high startup cost, and a per-byte cost equivalent to a load/store, when they were implemented in micro-code. However, the commercial instruction set used up a lot of micro-code space and did nothing to improve the typical Fortran benchmarks, and so these instructions were often emulated by taking an instruction trap (making them very slow).

By making the compilers smarter and detecting special cases, it is possible to avoid the use of these expensive instructions (especially when emulated) and generate code that is quite a bit faster. When implemented in micro-code these instructions weren't all that bad, and they sure made building the Cobol compiler easier.

john hassey
hassey@dg-rtp.dg.com
meissner@osf.org (Michael Meissner) (12/12/90)
In article <1311@inews.intel.com> dlau@mipos2.intel.com (Dan Lau) writes:
| In article <1200@dg.dg.com> uunet!dg!lewine writes:
| >In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
| > ***HOWEVER***, the advantage of RISC is moving work from
| > runtime to compile time. The big speedup comes from compiler
| > work not hardware. At Data General we have modified some of
| > the compilers for our CISC MV-series to compile simple code
| > instead of using instructions like WEDIT. This has produced
| > major performance enhancements because a compiler can generate
| > special case code.
|
| I don't understand the comment above about the MV-series compilers.
| Are you saying that after DG changed the MV-series compilers to generate
| simple code, there was a major performance improvement (over the complex
| code)? Or are you saying that "because a compiler can generate special
| case code" (i.e., very complex instructions like WEDIT), there was a
| major performance enhancement over the simple code?
|
| I am confused, can you please clarify the above. Thanks.
| Dan Lau

Let me try to clarify some things. Only certain compilers actually generated WEDIT (notably Cobol and PL/1, possibly Basic). The {,W}EDIT instruction was actually a secondary instruction set that read a byte stream to figure out how to convert a number to a stream of bytes (I'm slightly fuzzy here, because in my ten years at Data General, I never once used a WEDIT instruction). Most programs do not need the complex interpretation, since the format is known at compile time. On these programs, the code generator would issue multiple simple instructions instead of WEDIT. I believe for some machines at least, WEDIT was removed, and the kernel would then simulate it if a WEDIT was actually used (old program, etc.).

While I'm talking about the MV, let me expound on a successful way the MV was extended, and an unsuccessful one.
For those of you who have never looked at the DG Nova/Eclipse/MV instruction set, there are 4 integer registers (on all versions), and 4 floating point registers (on the Eclipse and MV/Eclipse). Only two of the integer registers can be used as index registers. On the MV/Eclipse, the 4 stack values (stack pointer, frame pointer, stack base, and stack limit) are also held in registers, but there is no direct addressing mode to use these registers. The standard save instruction puts the frame pointer in one of the index registers. Needless to say, this put a crimp in code generation, particularly in doing things like:

	p1->field1 = p2->field1;
	p1->field2 = auto_var;
	p1->field3 = p2->field3;

So we in Languages requested an addition to the instruction set that would give frame-pointer-relative addressing (and possibly stack-pointer-relative as well). For existing machines in the field, there was a slight penalty to the upgrade, but one of the machines (the MV/7800 if I remember correctly) that was under development but not yet shipped could only do this instruction in 27 clocks (i.e., it would be faster on that machine to do a push, load register, whatever, pop). So this feature had to be scrapped, because the hardware people didn't/couldn't respin the silicon. Sigh....

The more successful upgrade was how the sine, cosine, etc. instructions were added. For the high-end machines (MV/10000 with FPU, MV/20000, and presumably MV/40000), the machine would have a hardware accelerator which would do the operation, but it was important to have the same binaries run on the low-end machines as well, with as little slowdown as possible relative to the old method of calling library functions.
The architect noticed that the standard long call instruction had a leftover bit that was easy for the microcode to access, so the new instructions had the format:

	<16 bit opcode> <32 bit address of emulator> <16 bit subopcode>

(On the long call instruction, the <16 bit subopcode> field was the argument count that was pushed on top of the stack, so the return instruction could know how many words to pop off.) This way, you did not have to trap to the kernel to implement the instructions, which can be much too slow, but instead just called the emulator directly.
--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?
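[Editor's sketch: the compile-time specialization discussed in the last few postings, as a toy model. An EDIT-style instruction interprets a little byte-coded program at runtime; when the compiler knows that program at compile time, it can emit straight-line code instead. The two-opcode "edit program" encoding here is invented for illustration, not DG's actual format.]

```python
# Toy EDIT-style formatter: a byte-coded program drives digit output.
# The opcodes are made up: "D" copies the next source digit, "," emits
# a literal comma.

def edit_interpret(program, digits):
    """Runtime interpretation - what microcode or trap emulation does."""
    out, src = [], iter(digits)
    for op in program:
        if op == "D":
            out.append(next(src))
        elif op == ",":
            out.append(",")
    return "".join(out)

def edit_specialized(digits):
    """What a compiler could emit for the known program 'D,DDD':
    no interpreter loop, just straight-line moves."""
    return digits[0] + "," + digits[1:4]

program = ["D", ",", "D", "D", "D"]
```

Both paths produce the same string for four-digit input; the specialized version simply pays no per-opcode interpretation cost, which is the performance effect the DG posters describe.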
rst@cs.hull.ac.uk (Rob Turner) (12/14/90)
I first heard about this technique a few years ago when I was reading the documentation for the Clipper microprocessor. From what I can remember, the designers did *exactly* the thing you describe. Rob