[comp.arch] RISCizing a CISC processor

joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) (12/07/90)

I would like some input on the following idea to extend the life of
CISC processors.

Consider a hypothetical machine: IM 68386C (CISCized).
First, determine the dynamic instruction profile of the target mix.
If the target is engineering programs, then determine the dynamic
frequency of all instructions.  (A LOAD with indirect addressing
and a LOAD with direct addressing are considered different instructions
in the context of this posting.)

Then, rank the instructions from highest frequency to lowest.
Exclude I/O instructions.  Suppose that there are a total of "n" non-I/O
instructions.  Suppose that I[1] is the instruction with the highest
frequency and that I[n] is the instruction with the lowest frequency.
The ranking might look something like the following:

                 instruction       dynamic frequency
                    I[1]                22%
                    I[2]                8%
                     .                  .
                     .                  .
                     .                  .
                    I[n - 1]            0.002%
                    I[n]                0.001%

In a CISC chip, there is a certain redundancy.  In other words,
some of the complex instructions can be written in terms of the
simpler instructions.  An instruction to move a block of data
from one place in memory to another place can be replaced
by a loop of simpler LOAD and STORE instructions.
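As a concrete illustration, the block-move reduction might look like the
following (a C sketch; the function name is invented here):

```c
#include <stddef.h>

/* A CISC "move block of data" instruction, expressed as the loop of
   simple LOAD and STORE operations a RISC Set routine would use.
   Purely illustrative. */
void block_move(unsigned char *dst, const unsigned char *src, size_t n)
{
    while (n-- > 0)
        *dst++ = *src++;    /* one LOAD and one STORE per byte */
}
```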

Now, from the ranking of the instructions, determine the
smallest "i" such that all I[j] with "j > i" can be written
in terms of all I[k] with "k <= i".  Designate the set of
the first i instructions from the above ranking to
be the "RISC Set".  Designate the set of I/O instructions
to be the "I/O Set".  Relabel "i" to be "M", the minimal
number.  Designate the remaining instructions from the
above ranking to be the "CISC Set".
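The partitioning step can be sketched as a small search.  Assume we
already know, for each rank j, the smallest prefix of the ranking that
suffices to express I[j]; this "min_cutoff" input is a simplifying
assumption of the sketch, not something the scheme above specifies:

```c
/* For each instruction rank j (1..n), min_cutoff[j] is the smallest m
   such that I[j] can be written using only I[1..m] (take j itself if
   the instruction cannot be decomposed).  Return the smallest i such
   that every I[j] with j > i is expressible using only the first i
   instructions; I[1..i] is then the "RISC Set". */
int smallest_risc_set(const int min_cutoff[], int n)
{
    for (int i = 1; i <= n; i++) {
        int ok = 1;
        for (int j = i + 1; j <= n; j++)
            if (min_cutoff[j] > i) { ok = 0; break; }
        if (ok)
            return i;
    }
    return n;
}
```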

Now, using timing analysis, estimate the performance of
implementing the RISC Set and the I/O Set in hardware
and implementing the CISC Set as subroutines in
a microcode store.  These subroutines are written with
instructions from the RISC Set.  Whenever an instruction from
the CISC Set is encountered in the instruction stream, it
causes a trap to the appropriate subroutine in the
microcode store.  Essentially, what we have is a
RISC machine with some subroutines coded into ROM.
There might need to be additional registers over and above
those in the programmer's model in the IM 68386C in order
to maintain information like the following:

(1) the processor is executing instructions in a subroutine
    in microcode and is not executing instructions in the 
    normal instruction stream from main memory
(2) the address of the current byte of memory and the destination 
    to which the byte is transferred by a CISC Set block-move 
    instruction
(3) etc.

Designate these additional registers "Extra Registers".  Naturally,
they would be saved just prior to the servicing of an interrupt.

The great thing about the IM 68386R (RISCized) processor is
that super-scalarizing it will be no harder than for
a RISC processor because we now essentially have
a RISC processor (one with subroutines microcoded to
handle CISC Set instructions).  We will only be super-scalarizing
the RISC Set, _not_ the full set of the IM 68386C.

The other great thing is that the IM 68386R is upward compatible
with the IM 68386C and can use its large installed base of
programs.

By the way, IM 68386C is a labeling derived from
68xxx (Motorola = M) and xx386 (Intel = I).

msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) (12/07/90)

In <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp> joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:

>I would like some input on the following idea to extend the life of
>CISC processors.

>Consider a hypothetical machine: IM 68386C (CISCized).
>First, determine the dynamic instruction profile of the target mix.
>If the target is engineering programs, then determine the dynamic
>frequency of all instructions.  (A LOAD with indirect addressing
>and a LOAD with direct addressing are considered different instructions
>in the context of this posting.)

>Then, rank the instructions from highest frequency to lowest.
>Exclude I/O instructions.  Suppose that there are a total of "n" non-I/O
>instructions.  Suppose that I[1] is the instruction with the highest
>frequency and that I[n] is the instruction with the lowest frequency.

[deletion]

>In a CISC chip, there is a certain redundancy.  In other words,
>some of the complex instructions can be written in terms of the
>simpler instructions.  An instruction to move a block of data
>from one place in memory to another place can be replaced
>by a loop of simpler LOAD and STORE instructions.

[deletion]

>Now, using timing analysis, estimate the performance of
>implementing the RISC Set and the I/O Set in hardware
>and implementing the CISC Set as subroutines in
>a microcode store.  These subroutines are written with
>instructions from the RISC Set.  Whenever an instruction from
>the CISC Set is encountered in the instruction stream, it
>causes a trap to the appropriate subroutine in the
>microcode store.  Essentially, what we have is a
>RISC machine with some subroutines coded into ROM.

[deletion]



What about instruction decode?  RISC machines tend to have
fixed-format instructions that are easy to decode.  (i.e.  all
instructions are 32 bits, the first 6 are opcode, the next 15 specify
registers, the rest immediate data).  CISCs tend to have instructions
of varying length and format.  RISCs tend to have alignment
restrictions to a greater extent than CISCs.  You lose some of the
benefits of RISC if you have to deal with these things.
Does anyone know how the internals of the 80486 and 68040 compare to
this scheme?
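For what it's worth, the fixed-format case really is just a few shifts
and masks.  A C sketch, with the exact field positions invented here to
match the 6-bit-opcode / 15-register-bit split above:

```c
#include <stdint.h>

/* One-cycle decode of a hypothetical fixed 32-bit format:
   bits 31..26 opcode, 25..21 rd, 20..16 rs1, 15..11 rs2,
   10..0 immediate.  Every field sits at a fixed position, so a
   handful of shifts and masks suffices. */
struct decoded { unsigned op, rd, rs1, rs2, imm; };

struct decoded decode_fixed(uint32_t w)
{
    struct decoded d;
    d.op  = (w >> 26) & 0x3f;   /* 6-bit opcode        */
    d.rd  = (w >> 21) & 0x1f;   /* three 5-bit         */
    d.rs1 = (w >> 16) & 0x1f;   /*   register          */
    d.rs2 = (w >> 11) & 0x1f;   /*   specifiers        */
    d.imm =  w        & 0x7ff;  /* 11-bit immediate    */
    return d;
}
```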
--


Michael Pereckas               * InterNet: m-pereckas@uiuc.edu *
just another student...          (CI$: 72311,3246)
Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado

lewine@cheshirecat.rtp.dg.com (Donald Lewine) (12/07/90)

In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
|> 
|> In a CISC chip, there is a certain redundancy.  In other words,
|> some of the complex instructions can be written in terms of the
|> simpler instructions.  An instruction to move a block of data
|> from one place in memory to another place can be replaced
|> by a loop of simpler LOAD and STORE instructions.
|> 
|> Now, using timing analysis, estimate the performance of
|> implementing the RISC Set and the I/O Set in hardware
|> and implementing the CISC Set as subroutines in
|> a microcode store. 

    That was exactly what was done in the MicroVAX architecture
    back in 1982.  The more complex instructions were emulated
    using the simple instructions.

    There was some cleverness in using hardware to decode the
    full set of VAX instructions and then call software to do
    the rest.

    This does not give you a RISC in the sense of architectural
    purity.  The VAX (or 386 or 68K) instruction stream is still
    a bear to decode and does many things that violate the RISC
    religion.  You have merely proposed a new way to implement
    a CISC machine.  The VAX 9000 also uses a technique very 
    similar to the one you describe.

	***HOWEVER***, the advantage of RISC is moving work from 
    runtime to compile time.  The big speedup comes from compiler
    work not hardware. At Data General we have modified some of
    the compilers for our CISC MV-series to compile simple code
    instead of using instructions like WEDIT.  This has produced
    major performance enhancements because a compiler can generate
    special case code. 

--------------------------------------------------------------------
Donald A. Lewine                (508) 870-9008 Voice
Data General Corporation        (508) 366-0750 FAX
4400 Computer Drive. MS D112A
Westboro, MA 01580  U.S.A.

uucp: uunet!dg!lewine   Internet: lewine@cheshirecat.webo.dg.com

brandis@inf.ethz.ch (Marc Brandis) (12/07/90)

In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp> joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
>I would like some input on the following idea to extend the life of
>CISC processors.
>
>Consider a hypothetical machine: IM 68386C (CISCized).
>First, determine the dynamic instruction profile of the target mix.
[ some stuff deleted ]
>Then, rank the instructions from highest frequency to lowest.
[ some more deleted ]
>In a CISC chip, there is a certain redundancy.  In other words,
>some of the complex instructions can be written in terms of the
>simpler instructions. 
[ some stuff deleted ]
>Now, from the ranking of the instructions, determine the
>smallest "i" such that all I[j] with "j > i" can be written
>in terms of all I[k] with "k <= i".  Designate the set of
>the first i instructions from the above ranking to
>be the "RISC Set". 
[ some stuff deleted ]
>Now, using timing analysis, estimate the performance of
>implementing the RISC Set and the I/O Set in hardware
>and implementing the CISC Set as subroutines in
>a microcode store.  These subroutines are written with
>instructions from the RISC Set. 

This is exactly what modern computer architecture is all about. Look for the
often encountered cases and optimize these while accepting some overhead for
the less common ones. This technique has been heavily used in the design of
RISC processors, but it is not restricted to this area, of course. Dynamic
distributions have also been used to optimize modern CISC chips. 

The feasibility of the approach of encoding the less common instructions as
combinations of the "RISC set" in microcode ROM depends heavily on how well
you can express their functionality using the "RISC set".  It also depends
on how much overhead you pay for the switch to microcode.  The switch to
microcode can be done in 0 cycles, as the Intel i960 CA User's Manual states,
but I am not sure that it can easily be done.

Note that it is not always easy to express complex instructions on a CISC
processor in terms of simple instructions. Complex instructions often have a
lot of side-effects and you have to simulate them correctly. One thing causing 
trouble is the condition-code register. Some of the simpler instructions that
you would like to use to simulate the complex ones may change the condition
code in a way that does not match the semantics of the complex instruction.
You can get rid of the problem by introducing some new instructions (whether
they are only usable from the microcode ROM or not is a different issue), but
it is not an easy task.

Moreover, one of the tough parts in designing high-performance CISC processors
is making the instruction decoder fast.  RISC processors typically have very
simple and regular instruction sets, where each instruction has the same size.
This makes decoding them straightforward. Implementing an instruction decoder
that can decode one instruction per cycle for a complex instruction set is
hard to do, as there are a lot of different formats to be considered. Note
that instruction decoding in a CISC environment does not naturally lead to
pipelined solutions, as you need the size of the previous instruction in order
to begin decoding the current instruction in the right place.
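The serial dependency can be made explicit in a few lines of C.  Here
length_of() stands in for whatever (possibly multi-cycle) computation the
CISC format requires; the length table is invented:

```c
#include <stddef.h>

/* Variable-length decode is inherently serial: the start of
   instruction k+1 is unknown until the length of instruction k has
   been determined.  This length table is made up for the sketch. */
static size_t length_of(unsigned char opcode)
{
    if (opcode < 0x40) return 2;   /* pretend: short instructions  */
    if (opcode < 0x80) return 4;   /* medium                       */
    return 6;                      /* long                         */
}

/* Count the instructions occupying the first n bytes; each step of
   the loop depends on the previous step's computed length. */
int count_insns(const unsigned char *stream, size_t n)
{
    size_t pc = 0;
    int count = 0;
    while (pc < n) {
        pc += length_of(stream[pc]);   /* serial dependency */
        count++;
    }
    return count;
}
```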

>The great thing about the IM 68386R (RISCized) processor is
>that super-scalarizing it will be no harder than for
>a RISC processor because, we now essentially have
>a RISC processor (one with subroutines microcoded to
>handle CISC Set instructions).  We will only be super-scalarizing
>the RISC Set, _not_ the full set of the IM 68386C.

No, here I disagree for two reasons. First, you have to treat the stream of
instructions as if the complex instruction had been replaced in place by the
stream from the microcode ROM.  As this stream was originally designed
as one instruction, it has a high likelihood of having a lot of dependencies
in it, so that there is not a lot of parallelism to be gained. You may get
rid of this problem by using huge reservation stations and at least one level
of speculative execution, but this means a lot of hardware.

Second, as you said before, instructions from the RISC set are often 
encountered in the program. If you want to achieve an execution rate of more
than one instruction per cycle, your decoder (the one decoding the CISC
instruction set) has to decode more than one instruction per cycle. As I
already said, it is pretty hard to design such a decoder that is able to
decode one instruction per cycle, let alone one that can do multiple
instructions per cycle.  Note that the instructions have to be decoded one
after the other because of their varying size.  One way to solve
this is to speculatively decode instructions starting at different offsets
and then to discard the wrong ones. Let us assume you want to decode three
instructions in the 386 instruction set per cycle on the average. The average
instruction length on the 386 is 4.6 bytes as I remember. So with 14 (!!!)
instruction decoders you should have a reasonable chance to get 3 instructions 
decoded per cycle. 

>The other great thing is that the IM 68386R is upward compatible
>with the IM 68386C and can use its large installed base of
>programs.

Here I strongly disagree.  It would be better to get away from these
architectures as soon as possible.  Note that every hour these machines are
around, new software is being written for them (software that may not be
easily ported to other architectures), giving more and more weight to your
argument.


Marc-Michael Brandis
Computer Systems Laboratory, ETH-Zentrum (Swiss Federal Institute of Technology)
CH-8092 Zurich, Switzerland
email: brandis@inf.ethz.ch

kls30@duts.ccc.amdahl.com (Kent L Shephard) (12/07/90)

That sounds like the '040 and the i486.  Both have RISC cores and use
microcode for the more complicated instructions.  Both companies saw
that the only way to gain speed was to hardwire most of the processor, and
put a decent pipeline in them.  They also solve some memory problems
with on-chip cache.  The i486 now does loads & stores in one clock cycle,
and if you don't worry about the pipeline latency, the i486 running
sequential code is very fast.

Both processors have performance better than 1st-generation RISC, i.e.
the first SPARC from Sun. (The 460 was the first.)  I've heard that
the i586 will be out around '92 and will be superscalar.  (But that's
just rumours.)

                Kent
--
/*  -The opinions expressed are my own, not my employers.    */
/*      For I can only express my own opinions.              */
/*                                                           */
/*   Kent L. Shephard  : email - kls30@DUTS.ccc.amdahl.com   */

herrickd@iccgcc.decnet.ab.com (12/08/90)

In article <1990Dec7.061826.28241@ux1.cso.uiuc.edu>, msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:
> In <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp> joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
> 
[Description of RISC architecture with CISC instructions microcoded
 deleted]
> 
> 
> 
> What about instruction decode?  RISC machines tend to have
> fixed-format instructions that are easy to decode.  (i.e.  all
> instructions are 32 bits, the first 6 are opcode, the next 15 specify
> registers, the rest immediate data).  CISCs tend to have instructions
> of varying length and format.  RISCs tend to have alignment
> restrictions to a greater extent than CISCs.  You lose some of the
> benefits of RISC if you have to deal with these things.
> Does anyone know how the internals of the 80486 and 68040 compare to
> this scheme?
> --
Cannot we preserve the intent of the original poster by adding
one RISC instruction, "Nonsense Coming", that means the next
few words of instruction memory contain data to be interpreted
by the microcoded CISC program.  Constrain the length of the 
nonsense to preserve the RISC program alignment requirements.
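A C sketch of that escape mechanism, treating instruction memory as
fixed-size words; the opcode value and field layout are invented for
illustration:

```c
#include <stdint.h>

/* "Nonsense Coming": a normal fixed-size RISC opcode whose operand
   says how many of the following 32-bit words hold CISC payload for
   the microcoded interpreter, so the RISC stream stays word-aligned.
   Opcode value and field layout are made up for this sketch. */
#define OP_NONSENSE 0x3f    /* hypothetical escape opcode */

static int cisc_words_seen = 0;
static void interpret_cisc(const uint32_t *payload, unsigned nwords)
{
    (void)payload;
    cisc_words_seen += nwords;   /* microcoded CISC program runs here */
}

/* Execute n fixed-size words; return how many were ordinary RISC
   instructions (the rest were escape headers or CISC payload). */
int run(const uint32_t *mem, unsigned n)
{
    unsigned pc = 0;
    int risc_count = 0;
    while (pc < n) {
        unsigned op = mem[pc] >> 26;
        if (op == OP_NONSENSE) {
            unsigned len = mem[pc] & 0xff;   /* payload word count */
            interpret_cisc(&mem[pc + 1], len);
            pc += 1 + len;                   /* stays word-aligned */
        } else {
            risc_count++;                    /* normal RISC execute */
            pc++;
        }
    }
    return risc_count;
}
```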

Or, even, put the roms holding the microcode in the primary
address space of the machine and invoke these CISC instruction
subroutines the same way as any other subroutine.  With a RISC
call.

dan herrick
herrickd@astro.pc.ab.com

> 
> 
> Michael Pereckas               * InterNet: m-pereckas@uiuc.edu *
> just another student...          (CI$: 72311,3246)
> Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado

davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (12/08/90)

In article <d5FA02R903ql01@JUTS.ccc.amdahl.com> kls30@DUTS.ccc.amdahl.com (Kent L. Shephard) writes:

| Both processors have performance better than 1st generation RISC ie.
| the first SPARC from Sun. (The 460 was the first.)  I've heard that
| the i586 will be out around '92' and will be super scalar. (But that
| just rumours.)

  As editor of the 386-users mailing list I see a lot more rumors than
almost anyone else in the world, and believe fewer ;-)

  A number of magazines have reported that Compaq and IBM are pushing
Intel to get the 586 out in 91 because clone makers like AST are talking
about making RISC based clone PCs. Mars Microsystems makes a SPARC clone
with 386 added running DOS which certainly looks like a step in this
direction. Note what I said about belief, this is not written in stone,
but I hear it from a lot of people.

  As to the 586, the only consistent thing I hear is that there will be
on-chip support for windows. I don't know if that means MS or X, but I
find it hard to believe that Intel would be so stupid as to do anything
which wouldn't at least be highly useful for both. And I would also
suspect that the number of customers for the 586, at least initially,
will be greater for UNIX than DOS. That may be true of the 486 now, I
don't know.

  Assuming that the 586 does have support for windows in general, the
price of performance will go down again. One chip with FPU, MMU, and
windows hardware takes less {power, space, pins, glue chips} than any
multichip solution. That could lead to some killer workstation class
machines at PC prices.
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
    VMS is a text-only adventure game. If you win you can use unix.

msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) (12/08/90)

In <2339.275f7e44@iccgcc.decnet.ab.com> herrickd@iccgcc.decnet.ab.com writes:

>In article <1990Dec7.061826.28241@ux1.cso.uiuc.edu>, msp33327@uxa.cso.uiuc.edu (Michael S. Pereckas) writes:
>> In <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp> joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
>> 
>[Description of RISC architecture with CISC instructions microcoded
> deleted]
>> 
>> 
>> 
>> What about instruction decode?  RISC machines tend to have
>> fixed-format instructions that are easy to decode.  (i.e.  all
>> instructions are 32 bits, the first 6 are opcode, the next 15 specify
>> registers, the rest immediate data).  CISCs tend to have instructions
>> of varying length and format.  RISCs tend to have alignment
>> restrictions to a greater extent than CISCs.  You lose some of the
>> benefits of RISC if you have to deal with these things.
>> Does anyone know how the internals of the 80486 and 68040 compare to
>> this scheme?
>> --
>Cannot we preserve the intent of the original poster by adding
>one RISC instruction, "Nonsense Coming", that means the next
>few words of instruction memory contain data to be interpreted
>by the microcoded CISC program.  Constrain the length of the 
>nonsense to preserve the RISC program alignment requirements.

>Or, even, put the roms holding the microcode in the primary
>address space of the machine and invoke these CISC instruction
>subroutines the same way as any other subroutine.  With a RISC
>call.


But then it wouldn't be compatible anymore.  What's the point?  It
might be possible to design a system that allows you to automagically
translate a binary compiled for the old CISC, but I suspect that this
would be very hard to do and that it wouldn't work very well.  And
chances are that nobody would want to use it.  If the translator is
imperfect (likely) then end-users won't want to try it and probably
couldn't (no sources to work from).  The vendor might well decide it
would be easier to port it to a normal RISC.  This might make it
easier to port stuff written in assembly, but then, you wrote it in
assembly to get speed, right?  Rewrite it for the RISC and it will go
faster.  

--


Michael Pereckas               * InterNet: m-pereckas@uiuc.edu *
just another student...          (CI$: 72311,3246)
Jargon Dept.: Decoupled Architecture---sounds like the aftermath of a tornado

mike@cs.umn.edu (Mike Haertel) (12/08/90)

Certainly RISCizing a CISC processor has been done, in the 68040 and 80486.
The big problem I see with it is, why waste all that silicon space on the
hair necessary to pipeline a complex instruction set?  I think it would
be far more worthwhile to waste it on things like faster multipliers
or larger caches.
-- 
Mike Haertel <mike@ai.mit.edu>
"There are two ways of constructing a software design.  One way is to make it
 so simple that there are obviously no deficiencies, and the other is to make
 it so complicated that there are no obvious deficiencies." -- C. A. R. Hoare

suitti@ima.isc.com (Stephen Uitti) (12/08/90)

> What about instruction decode?  RISC machines tend to have
> fixed-format instructions that are easy to decode.  (i.e.  all

The 386 & 68K are easy compared to a VAX.  However, this sort
of thing has been done for VAXen.

> RISCs tend to have alignment
> restrictions to a greater extent than CISCs.

This is one of those things that must be dealt with.  It can
be done with traps - slowly.  If you have unaligned references
on a VAX 780, it is slower.  Most people won't notice.  Even
relatively dumb pcc-based C compilers attempted to make this
unlikely.  Some examples: static data is aligned, malloc returns
aligned pointers.  It still happens.

The uVAX II, for example, does not implement the entire VAX set.
There wasn't enough room, or something.  The OS gets traps, and
emulates the instructions.  This has been done for floating point
for years on all sorts of machines.

> If the translator is imperfect (likely) then end-users won't want
> to try it and probably couldn't (no sources to work from).

This isn't a new thing.  For example, Interactive's UNIX for the
386 emulates a 387 if one isn't there.  Think a 387 is simple to
emulate?  ...easy to test?  Intel doesn't always get the chips
right the first time either.  In fact, people who produce CPUs
often get it wrong for a while on their 2nd and later generations.
That doesn't mean it can't or won't be done.  And it doesn't mean
it isn't worth doing.

Actually, there are lots of timing problems in systems that get
shipped.  Sometimes customers find them.  I've had software that
normally ran properly show a couple of non-repeatable glitches here
and there after only three or four months of CPU time on what most of
us would call very reliable machines.  It happens.

The other thing you can do is design your hardware so that most
of the instructions (that are used) get run in a cycle, and the
CPU does the less used stuff in microcode.  It can still be on
chip.  You won't get the advantage of not using a big chip.  You
won't get the high speed instruction decode you'd get from RISC.
These are solvable - larger chips, more chips, multiple decoders,
etc.  It can be OK to spend money on the CPU for systems if the
CPU costs are low relative to the systems.

On the other end of the spectrum, people still produce weird 4
bit systems that are hard to program, that don't have lots of RAM
or ROM, that don't have expandability for RAM, ROM, I/O, or
anything, just because the CPU chip is the system, and thousands
or millions of them are to be produced.

Everybody wants a faster system.  Yet, there are lots of people
whose primary programming vehicle is the shell...

Stephen.
suitti@ima.isc.com
"We Americans want peace, and it is now evident that we must be
prepared to demand it.  For other peoples have wanted peace, and
the peace they received was the peace of death." - the Most Rev.
Francis J. Spellman, Archbishop of New York.  22 September, 1940

hrubin@pop.stat.purdue.edu (Herman Rubin) (12/08/90)

In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
> I would like some input on the following idea to extend the life of
> CISC processors.

			.......................

>                  instruction       dynamic frequency
>                     I[1]                22%
>                     I[2]                8%
>                      .                  .
>                      .                  .
>                      .                  .
>                     I[n - 1]            0.002%
>                     I[n]                0.001%
> 
> In a CISC chip, there is a certain redundancy.  In other words,
> some of the complex instructions can be written in terms of the
> simpler instructions.  An instruction to move a block of data
> from one place in memory to another place can be replaced
> by a loop of simpler LOAD and STORE instructions.

What about the operations which did not appear in the sample?  Calculations
using high precision arithmetic may not even be identified as such.  What
about conversion between integer and floating point?  On machines with
the possibility of unnormalized floating point, would they even be recognized?
On machines such as the IBM 360 series or the RS/6000, for which the
conversions are clumsy and already complex, would they be noticed?

Suppose that the CISC machine being analyzed has common integer and floating
registers.  Would the analysis catch the cases in which Boolean operations
are used on floats?  Suppose the machine has unaligned capabilities.  Would
the analysis catch those cases in which it was deliberately used in the
algorithm?

What we need is not the analysis of the current bad software for the needed
instructions, but to ask the few who can think up new ways of using the
natural capabilities of hardware what can be useful.  Even then, much will
be missed.

Also, what is a simple instruction?  Which is conceptually simpler, finding
the distance to the next one in a bit stream, with the attendant problems 
about running out of bits, etc., or the clumsy way this must be approached
on the so-called "efficient" architectures?
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

dlau@mipos2.intel.com (Dan Lau) (12/11/90)

In article <1200@dg.dg.com> uunet!dg!lewine writes:
>In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
>	***HOWEVER***, the advantage of RISC is moving work from 
>    runtime to compile time.  The big speedup comes from compiler
>    work not hardware. At Data General we have modified some of
>    the compilers for our CISC MV-series to compile simple code
>    instead of using instructions like WEDIT.  This has produced
>    major performance enhancements because a compiler can generate
>    special case code. 

I don't understand the comment above about the MV-series compilers.
Are you saying that after DG changed the MV-series compilers to generate
simple code, there was a major performance improvement (over the complex
code)?  Or are you saying that "because a compiler can generate special
case code" (i.e., very complex instructions like WEDIT), there was a
major performance enhancement over the simple code?

I am confused, can you please clarify the above.  Thanks.
	Dan Lau

hassey@matrix.rtp.dg.com (John Hassey) (12/11/90)

In article <1311@inews.intel.com> dlau@mipos2.UUCP (Dan Lau) writes:
>In article <1200@dg.dg.com> uunet!dg!lewine writes:
>>In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
>>	***HOWEVER***, the advantage of RISC is moving work from 
>>    runtime to compile time.  The big speedup comes from compiler
>>    work not hardware. At Data General we have modified some of
>>    the compilers for our CISC MV-series to compile simple code
>>    instead of using instructions like WEDIT.  This has produced
>>    major performance enhancements because a compiler can generate
>>    special case code. 
>
>I don't understand the comment above about the MV-series compilers.
>Are you saying that after DG changed the MV-series compilers to generate
>simple code, there was a major performance improvement (over the complex
>code)?  Or are you saying that "because a compiler can generate special
>case code" (i.e., very complex instructions like WEDIT), there was a
>major performance enhancement over the simple code?
>
>I am confused, can you please clarify the above.  Thanks.
>	Dan Lau
>

    While not the original poster, I think I can clarify the above
    statement.

    The DG Eclipse MV series has quite a few very complex instructions to
    handle things like Cobol data types (packed decimal etc...). WEDIT
    is used to do an "edited" store of a decimal number.  It takes a
    source and destination pointer and the address of an "edit" subprogram.

    Most of these instructions have a fairly high startup cost, and a
    per-byte cost equivalent to a load/store,  when they were implemented
    in micro-code.  However, the commercial instruction set used up a lot
    of micro-code space and did nothing to improve the typical Fortran
    benchmarks, and so they were often emulated by taking an instruction
    trap (making them very slow).

    By making the compilers smarter, and detecting special cases, it is
    possible to avoid the use of these expensive instructions (especially
    when emulated) and generate code that is quite a bit faster.

    When implemented in micro-code these instructions weren't all that bad
    and they sure made building the Cobol compiler easier.

    john hassey
    hassey@dg-rtp.dg.com

meissner@osf.org (Michael Meissner) (12/12/90)

In article <1311@inews.intel.com> dlau@mipos2.intel.com (Dan Lau)
writes:

| In article <1200@dg.dg.com> uunet!dg!lewine writes:
| >In article <9012070105.AA02416@hcrlgw.crl.hitachi.co.jp>, joe@hcrlgw.crl.hitachi.co.JP (Dwight Joe) writes:
| >	***HOWEVER***, the advantage of RISC is moving work from 
| >    runtime to compile time.  The big speedup comes from compiler
| >    work not hardware. At Data General we have modified some of
| >    the compilers for our CISC MV-series to compile simple code
| >    instead of using instructions like WEDIT.  This has produced
| >    major performance enhancements because a compiler can generate
| >    special case code. 
| 
| I don't understand the comment above about the MV-series compilers.
| Are you saying that after DG changed the MV-series compilers to generate
| simple code, there was a major performance improvement (over the complex
| code)?  Or are you saying that "because a compiler can generate special
| case code" (i.e., very complex instructions like WEDIT), there was a
| major performance enhancement over the simple code?
| 
| I am confused, can you please clarify the above.  Thanks.
| 	Dan Lau

Let me try to clarify some things.  Only certain compilers actually
generated WEDIT (notably Cobol and PL/1, possibly Basic).  The
{,W}EDIT instruction was actually a secondary instruction set that
read a bytestream to figure out how to convert a number to a stream of
bytes (I'm slightly fuzzy here, because in my ten years at Data
General, I never once used a WEDIT instruction).  Most programs do not
need the complex interpretation, since the format is known at compile
time.  On these programs, the code generator would issue multiple
simple instructions instead of WEDIT.  I believe for some machines at
least, WEDIT was removed, and the kernel would then simulate it if a
WEDIT was actually used (old program, etc.).

While I'm talking about the MV, let me expound on a successful way the
MV was extended, and an unsuccessful way.

For those of you who have never looked at the DG Nova/Eclipse/MV
instruction set, there are 4 integer registers (on all versions), and
4 floating point registers (on the Eclipse and MV/Eclipse).  Only two
of the integer registers can be used as index registers.  On the
MV/Eclipse, the 4 stack values (stack pointer, frame pointer, stack
base, and stack limit) are also held in registers, but there is no
direct addressing mode to use these registers.  The standard save
instruction puts the frame pointer in one of the index registers.
Needless to say, this put a crimp in code generation, particularly in
doing things like:

	p1->field1 = p2->field1;
	p1->field2 = auto_var;
	p1->field3 = p2->field3;

So we in Languages requested an addition to the instruction set that
would give frame pointer relative addressing (and possibly stack
pointer as well).  For existing machines in the field, there was a
slight penalty to the upgrade, but one of the machines (the MV/7800
if I remember correctly) that was under development but not yet
shipped could only do this instruction in 27 clocks (i.e., it would be
faster on that machine to do a push, load register, whatever, pop).
So, this feature had to be scrapped, because the hardware people
didn't/couldn't respin the silicon.  Sigh....

The more successful upgrade was how the sine, cosine, etc.
instructions were added.  For the high end machines (MV/10000 with
FPU, MV/20000, and presumably MV/40000), the machine would have a
hardware accelerator which would do the operation, but it was
important to have the same binaries run on the low end machines as
well with as little slowdown regarding the old method of calling
library functions.  The architect noticed that the standard long call
instruction had a left over bit that was easy for the microcode to
access, so the new instructions had the format:

	<16 bit opcode>
	<32 bit address of emulator>
	<16 bit subopcode>

(on the long call instruction, the <16 bit subopcode> field was the
argument count that was pushed on top of the stack, so the return
instruction could know how many words to pop off).  This way, you did
not have to trap to the kernel to implement the instructions, which
can be much too slow, but instead just called the emulator directly.
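A C sketch of that dispatch, modeling the emulator address carried in
the instruction as a function pointer; all encodings and names here are
invented, and the hardware accelerator is stubbed out:

```c
/* The instruction itself carries the address of an emulator routine,
   so a machine without the hardware accelerator calls the emulator
   directly instead of trapping to the kernel.  Illustrative only. */
typedef double (*emulator_fn)(int subop, double arg);

struct math_insn {
    unsigned short opcode;    /* the left-over long-call encoding */
    emulator_fn    emulator;  /* "<32 bit address of emulator>"   */
    unsigned short subop;     /* which operation (sin, cos, ...)  */
};

static int have_accelerator = 0;   /* a low-end machine in this sketch */

static double hw_accel(int subop, double arg)
{
    (void)subop;
    return arg;   /* stand-in for the FPU accelerator */
}

/* A sample software emulator: subop 0 negates, anything else is the
   identity (stand-ins for real sin/cos routines). */
static double soft_emul(int subop, double arg)
{
    return subop == 0 ? -arg : arg;
}

double execute_math(const struct math_insn *insn, double arg)
{
    if (have_accelerator)
        return hw_accel(insn->subop, arg);    /* high-end path        */
    return insn->emulator(insn->subop, arg);  /* direct call, no trap */
}
```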


--
Michael Meissner	email: meissner@osf.org		phone: 617-621-8861
Open Software Foundation, 11 Cambridge Center, Cambridge, MA, 02142

Considering the flames and intolerance, shouldn't USENET be spelled ABUSENET?

rst@cs.hull.ac.uk (Rob Turner) (12/14/90)

I first heard about this technique a few years ago when I was reading
the documentation for the Clipper microprocessor. From what I can
remember, the designers did *exactly* the thing you describe.

Rob