[comp.arch] RPM-40 microprocessor @ 40 MHz; data from ISSCC

mark@mips.COM (Mark G. Johnson) (02/22/88)

   
Several articles have recently appeared, alluding to a CMOS  uP
built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
<9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.

These USENET articles mention that the chip, called the "GE RPM-40",
runs a reduced instruction set, operates from 40 MHz clocks, and will
be described at ISSCC (International Solid-State Circuits Conference)
on February 17th.

The paper has now been delivered and published.  The authors were
David Lewis, Theodore Wyman, Mark French, and Frederic Boericke
(no acknowledgments were presented).

Here are a few items of interest on the RPM-40, obtained from the
oral presentation and the printed digest of technical papers.  No
analysis or critique is attempted; only a dump of raw data.  The
most noticeable unknowns are marked with a double asterisk **;
perhaps others can fill in these gaps (if the data isn't secret).
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

			GE RPM-40 CMOS MICROPROCESSOR

1.  The chip was built under a DOD contract.  It is one of several
    implementations under this contract.  There are at least three:
    General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
    Texas Instruments (GaAs Bipolar).  Interestingly, they have each
    chosen a different pipeline: GE == 4 stages, McDonnell == 5 stages,
    TI == 6 stages.


2.  The instruction set is "DARPA MIPS, core ISA (instruction set
    architecture)".  In the GE chip, instructions are 16 bits long.
    They are fetched from Instruction Memory two-at-a-time (making
    32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.

Here is the summary chart of the instruction set:
***************************************************************************
*             15  14  13  12  11  10  9   8   7   6   5   4   3   2   1   *
*           +-----------------------------------------------------------+ *
* ALU       | 0   0 | i |    opcode     |     src1/dest     | src2/imm  | *
*           +-----------------------------------------------------------+ *
* COND      | 0   1 | i |     test      |        src1       | src2/imm  | *
*           +-----------------------------------------------------------+ *
* LD        | 1   0   0 | s |     dest      |      base     |  offset   | *
*           +-----------------------------------------------------------+ *
* ST        | 1   0   1 | s |    source     |      base     |  offset   | *
*           +-----------------------------------------------------------+ *
* XPLD      | 1   1   0   0 |   xp-field    |      base     |  offset   | *
*           +-----------------------------------------------------------+ *
* BRA       | 1   1   0   1 |           branch displacement             | *
*           +-----------------------------------------------------------+ *
* PFX       | 1   1   1   0 |             prefix-immediate              | *
*           +-----------------------------------------------------------+ *
* XPINS     | 1   1   1   1 |         co-processor instruction          | *
*           +-----------------------------------------------------------+ *
***************************************************************************
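
The leading bits in the chart are enough to tell the eight formats apart.
A minimal Python sketch of that classification, assuming bit 15 is the most
significant bit of the 16-bit word.  The name `classify` is made up for
illustration, and only the leading-bit patterns shown in the chart are used;
the field boundaries further right are not decoded here.

```python
# Leading-bit patterns copied from the chart above.  The set is prefix-free,
# so the order of the entries does not matter.
FORMATS = [
    ("00", "ALU"), ("01", "COND"), ("100", "LD"), ("101", "ST"),
    ("1100", "XPLD"), ("1101", "BRA"), ("1110", "PFX"), ("1111", "XPINS"),
]

def classify(word):
    """Return the format name for a 16-bit instruction word."""
    if not 0 <= word <= 0xFFFF:
        raise ValueError("instruction words are 16 bits")
    bits = format(word, "016b")  # bit 15 first
    for prefix, name in FORMATS:
        if bits.startswith(prefix):
            return name
    raise AssertionError("unreachable: the prefixes cover every 16-bit word")
```

For example, any word whose top four bits are 1110 is classified as PFX,
whatever the remaining twelve bits hold.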


The ALU format has two register specifiers; presumably you can code
"R3 := R4 + R3"  but you cannot code "R3 := R4 + R1".

The Store format has a source register, a base register, and a 4-bit
offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

Branch instructions _seem_ to have only a 12-bit displacement field;
there doesn't appear to be a "Branch Register", "Branch And Link",
or "Conditional Branch" instruction.  Perhaps the "COND" instruction
is the conditional-skip instruction recently mentioned on the net**.

ALU ops can have a 4-bit immediate field.  If this is too small, the
"PREFIX" instruction contains a 12-bit prefix that can be concatenated
to the immediate, to create a 16-bit immediate value.  Perhaps the
PREFIX instruction can be used with loads, stores, and conditionals
too. **

There are 21 32-bit registers; I _believe_ these are arranged as
16 general-purpose registers, plus 5 hardware stacks/queues (used in
exception processing) that are mapped into the register space. **

8-bit and 16-bit external data are converted into the internal 32-bit
format by zero-fill (unsigned) or sign-extend (signed).  This is to
fulfill the DOD requirement for byte and halfword support.  With only
a single "s" bit in the opcode it is difficult to see how these
instructions are encoded (load byte, load halfword, load word) "cross"
(signed, unsigned). **


3.  A four-stage instruction pipeline is used (except for loads, see
    below): Instruction Fetch, Instruction Decode, ALU, Writeback.
    Address calculations (branch addresses or operand addresses) are
    performed in the ALU.


4.  Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
    per second.  For the DOD, they benchmarked on a standard US Air Force
    mix of instructions called the `DAIS Mix'.  "The most pessimistic
    value on that mix is 14 MIPS", the speaker said.


5.  The GE implementation uses a Harvard bus structure, with completely
    separate Instruction Memory and Operand Memory.  GE currently is
    using a total of 128Kbytes of memory: 16KWords of static RAM, each,
    for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns
    chips.  At present there is no way to increase the amount of physical
    memory (e.g. with dynamic RAM).  The speaker said that the CPU chip
    is intended for "embedded applications".
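
The memory and fetch-rate figures quoted so far are mutually consistent, as
a few lines of arithmetic show.  This sketch assumes a "word" is 32 bits
(4 bytes), which matches the 32-bit registers but is not stated outright
for the memories; the variable names are illustrative only.

```python
WORD_BYTES = 4                    # assumption: one memory word = 32 bits
words_per_memory = 16 * 1024      # 16KWords each for IMem and OMem
total_bytes = 2 * words_per_memory * WORD_BYTES

fetch_rate_hz = 20_000_000        # 32-bit instruction fetches at 20 MHz
instructions_per_fetch = 32 // 16 # two 16-bit instructions per fetch
peak_instructions_per_sec = fetch_rate_hz * instructions_per_fetch

print(total_bytes // 1024, "Kbytes total")
print(peak_instructions_per_sec // 1_000_000, "M instructions/sec peak")
```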


6.  There is a "branch target instruction cache" which consists of 32
    entries.  Each entry holds 5 instructions (10 bytes).  When a branch
    occurs, the chip looks (fully associatively) to see whether it holds
    the instruction at the branch target address in its cache.  If a hit
    (target instruction present) occurs, then the branch target
    instruction, and the next 4 instructions, are read from the on-chip
    cache.  Meanwhile the off-chip Imem is readying itself to begin
    delivering the 6th thru Nth instructions after the branch.  Claimed
    hit rates of the branch target instruction cache are > 90%.  On a
    miss there is a 3-cycle latency to get the Imem SRAM chips delivering
    instructions (and updating the b.t.i. cache).
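
A toy Python model may make the hit/miss behaviour of item 6 concrete.  It
is a sketch under stated assumptions, not GE's design: the replacement
policy was not described, so FIFO is assumed here; instruction memory is
modelled as a plain list indexed by instruction address; and the class and
method names are invented for illustration.

```python
from collections import OrderedDict

ENTRIES = 32            # cache entries, from the talk
INSTRS_PER_ENTRY = 5    # instructions held per entry, from the talk
MISS_LATENCY = 3        # cycles before Imem starts delivering, from the talk

class BranchTargetCache:
    def __init__(self):
        # target address -> tuple of the 5 instructions starting there
        self.entries = OrderedDict()

    def branch(self, target, imem):
        """Return (instructions, stall_cycles) for a taken branch."""
        if target in self.entries:            # fully associative lookup
            return self.entries[target], 0
        block = tuple(imem[target:target + INSTRS_PER_ENTRY])
        if len(self.entries) >= ENTRIES:
            self.entries.popitem(last=False)  # ASSUMED FIFO replacement
        self.entries[target] = block          # fill the cache on a miss
        return block, MISS_LATENCY
```

The first branch to a given target pays the 3-cycle miss latency; a repeat
branch to the same target is served from the cache with no stall, until the
entry is displaced.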


7.  The instruction memory contains a "lookahead counter".  This lessens
    traffic on the address bus; instruction addresses only squirt out of
    the CPU after a branch .... leisurely reloading the counter while the
    branch target instruction cache supplies the 5 instructions after a
    branch.


8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
    doesn't use the target register of a load until > 3 instructions after
    the load ("3 load delay slots" in some folks' parlance), then there
    is no interlock and instructions are issued one per cycle.  If you use
    the target register of a load <= 3 cycles later, there is a pipeline
    stall while waiting for the Operand Memory to supply the data.

Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
constraints that prevent 1 store per cycle always, nor did they compare
and contrast loads vs. stores. **


9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
    opcode plus 12 bit coprocessor instruction type) are passed through
    the CPU, and sent over the address bus to the coprocessor(s).  They
    can be stored in the branch target address cache.  So it _appears_
    that two cycles are required to do a coprocessor op, one to
    communicate it from the CPU to the coprocessor and one to do it **.
    GE didn't say whether there were architecturally-visible register
    files on the coprocessors **, but there _appears_ to be an "Xternal
    Processor Load" instruction **.  The Floating Point coprocessor is in
    fab now and is expected out this month.


10. The CPU chip contains 92,000 transistors and is housed in a 132 pin
    package.  The design style is fully static which is helpful for
    achieving radiation-hard parts.  40 pins are inputs, 46 pins are
    outputs, 32 pins are bidirectional (I/O), and there are 7 Power pins &
    7 Ground pins.  No mention was made of whether this package
    configuration had been "certified" to run at 40 MHz, nor what agency
    would perform such certifications. **  The fab process is 1.2 micron
    bulk CMOS.


11. A simple virtual memory scheme called "most significant bit replacement"
    is used.  A process-id is appended to the MSB's of an address before
    sending it out of the CPU.  A special case occurs when those bits
    are all-0's or all-1's.... ** **

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bcase@apple.UUCP (Brian Case) (02/23/88)

In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>
>   
>Several articles have recently appeared, alluding to a CMOS  uP
>built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
><9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.
>
>Branch instructions _seem_ to have only a 12-bit displacement field;
>there doesn't appear to be a "Branch Register", "Branch And Link",
>or "Conditional Branch" instruction.  Perhaps the "COND" instruction
>is the conditional-skip instruction recently mentioned on the net**.

Allen Baum (who attended the conference) told me that the single branch
instruction is only available in the conditional form.  Thus, for
an unconditional branch, you must make sure that you know the state of
the single boolean bit (compares test a condition and set the state of
the boolean bit).

>11. A simple virtual memory scheme called "most significant bit replacement"
>    is used.  A process-id is appended to the MSB's of an address before
>    sending it out of the CPU.  A special case occurs when those bits
>are all-0's or all-1's.... ** **

Isn't this the original Stanford MIPS scheme?

oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/23/88)

An article by mark@mips.COM (Mark G. Johnson) says:
] The paper has now been delivered and published.  The authors were
] David Lewis, Theodore Wyman, Mark French, and Frederic Boericke
] (no acknowledgments were presented).

ISSCC is a circuit-design conference : these are the three people
most responsible for the circuit design, I think.

] 			GE RPM-40 CMOS MICROPROCESSOR
] 
] 1.  The chip was built under a DOD contract.  It is one of several
]     implementations under this contract.  There are at least three:
]     General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
] Texas Instruments (GaAs Bipolar).  Interestingly, they have each chosen a
] different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.

TI is teamed with CDC on their GaAs effort, and Sperry (Unisys) had a
contract for a different CMOS version.

] 2.  The instruction set is "DARPA MIPS, core ISA (instruction set
]     architecture)".

The contract for all the processors specified efficient execution
AFTER TRANSLATION of the Core ISA. Core ISA is NOT the machine language.

]  In the GE chip, instructions are 16 bits long.
]  They are fetched from Instruction Memory two-at-a-time (making
]  32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.
 
] Here is the summary chart of the instruction set:
] ***************************************************************************
] *             15  14  13  12  11  10  9   8   7   6   5   4   3   2   1   *
] *           +-----------------------------------------------------------+ *
] * ALU       | 0   0 | i |    opcode     |     src1/dest     | src2/imm  | *
] *           +-----------------------------------------------------------+ *
] * COND      | 0   1 | i |     test      |        src1       | src2/imm  | *
] *           +-----------------------------------------------------------+ *
] * LD        | 1   0   0 | s |     dest      |      base     |  offset   | *
] *           +-----------------------------------------------------------+ *
] * ST        | 1   0   1 | s |    source     |      base     |  offset   | *
] *           +-----------------------------------------------------------+ *
] * XPLD      | 1   1   0   0 |   xp-field    |      base     |  offset   | *
] *           +-----------------------------------------------------------+ *
] * BRA       | 1   1   0   1 |           branch displacement             | *
] *           +-----------------------------------------------------------+ *
] * PFX       | 1   1   1   0 |             prefix-immediate              | *
] *           +-----------------------------------------------------------+ *
] * XPINS     | 1   1   1   1 |         co-processor instruction          | *
] *           +-----------------------------------------------------------+ *
] ***************************************************************************
] 
] The ALU format has two register specifiers; presumably you can code
] "R3 := R4 + R3"  but you cannot code "R3 := R4 + R1".

Correct, you need a two-instruction pair for three-address ops.

] The Store format has a source register, a base register, and a 4-bit
] offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

] Branch instructions _seem_ to have only a 12-bit displacement field;

... extendable to any length by PREFIX instructions ...

] there doesn't appear to be a "Branch Register", "Branch And Link",

... Branch Register is simply a MOV, B&L is a two-instruction pair ...

] or "Conditional Branch" instruction.  Perhaps the "COND" instruction
] is the conditional-skip instruction recently mentioned on the net**.

Yes, it can be applied to ANY instruction, not just branches.

] ALU ops can have a 4-bit immediate field.  If this is too small, the
] "PREFIX" instruction contains a 12-bit prefix that can be concatenated
] to the immediate, to create a 16-bit immediate value.  Perhaps the
] PREFIX instruction can be used with loads, stores, and conditionals
] too. **

PREFIXs can be prepended to ANY instruction that contains an immediate
field, including other PREFIX instructions, allowing immediates of any
size to be formed in the instruction stream without use of a g.p.
register or disruption of the pipeline flow.
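
How PREFIX words widen an immediate can be sketched in Python.  The thread
gives the 12-bit prefix and 4-bit immediate widths; everything else here is
an assumption made for illustration: that the earliest prefix supplies the
most significant bits, that the result is truncated to 32 bits, and that no
sign extension takes place.  The function names are hypothetical.

```python
PREFIX_BITS = 12  # payload of one PFX instruction, from the format chart

def build_immediate(prefixes, field, field_bits=4, width=32):
    """Concatenate prefix payloads (earliest first) with an immediate field."""
    value = 0
    for p in prefixes:
        if not 0 <= p < (1 << PREFIX_BITS):
            raise ValueError("prefix payload must fit in 12 bits")
        value = (value << PREFIX_BITS) | p   # ASSUMED: earliest = most significant
    if not 0 <= field < (1 << field_bits):
        raise ValueError("immediate field out of range")
    value = (value << field_bits) | field
    return value & ((1 << width) - 1)        # ASSUMED: truncate to register width

def prefixes_needed(value, field_bits=4):
    """Number of PFX words needed to express an unsigned constant."""
    extra = max(0, value.bit_length() - field_bits)
    return -(-extra // PREFIX_BITS)          # ceiling division

print(hex(build_immediate([0xABC], 0xD)))    # one prefix + 4-bit immediate
print(prefixes_needed(0xFFFFFFFF))           # full 32-bit constant
```

With a 4-bit immediate field, a full 32-bit constant needs three prefixes
under these assumptions; a 12-bit branch displacement
(`field_bits=12`) would need only two.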

] There are 21 32-bit registers; I _believe_ these are arranged as
] 16 general-purpose registers, plus 5 hardware stacks/queues (used in
] exception processing) that are mapped into the register space. **

There are 21 32-bit G.P. registers, plus various status registers, the
PCR, a TRAP register, and 5 hardware queues used for exception
processing. There are 32 register positions in the register map.

] 8-bit and 16-bit external data are converted into the internal 32-bit
] format by zero-fill (unsigned) or sign-extend (signed).  This is to
] fulfill the DOD requirement for byte and halfword support.  With only
] a single "s" bit in the opcode it is difficult to see how these
] instructions are encoded (load byte, load halfword, load word) "cross"
] (signed, unsigned). **

The one bit differentiates WORD and NON-WORD. What NON-WORD signifies
is determined by two bits (three for LD) in the user-accessible SR2 register.
 
] 3.  A four-stage instruction pipeline is used (except for loads, see
]     below): Instruction Fetch, Instruction Decode, ALU, Writeback.
]     Address calculations (branch addresses or operand addresses) are
]     performed in the ALU.

The "Instruction Fetch" (IF) stage doesn't really exist. The
instruction memory system is a look-ahead design.

] 5.  The GE implementation uses a Harvard bus structure, with completely
]     separate Instruction Memory and Operand Memory.  GE currently is
]     using a total of 128Kbytes of memory: 16KWords of static RAM, each,
] for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns chips.
] At present there is no way to increase the amount of physical memory
] (e.g. with dynamic RAM).  The speaker said that the CPU chip is intended
] for "embedded applications".

Dynamic RAM was not deemed applicable to the environments in which the
RPM40 is designed to function. The current limits on memory size are
NOT architectural, but a function of the size and speed of available
RAM and the drive capacity of the RPM40 address bus drivers. The RPM40
is architecturally able to address 4GBytes of instruction and 4GBytes
of operand memory.

] 6.  There is a "branch target instruction cache" which consists of 32
]     entries.  Each entry holds 5 instructions (10 bytes).  When a branch
]     occurs, the chip looks (fully associatively) to see whether it holds
] the instruction at the branch target address in its cache.  If a hit
] (target instruction present) occurs, then the branch target instruction,
] and the next 4 instructions, are read from the on-chip cache. Meanwhile
] the off-chip Imem is readying itself to begin delivering the 6th thru Nth
] instructions after the branch.  Claimed hit rates of the branch target
] instruction cache are > 90%.  On a miss there is a 3-cycle latency to get
] the Imem SRAM chips delivering instructions (and updating the b.t.i. cache).

Good luck on your patent application, AMD29000 people. This design
dates back to March 1986, was "published" by GE in October 1986, and is
first mentioned back in '75 or '76 in some SIGArch conference
proceedings. GE didn't think the architecture was patentable.
Various implementations, of course, may be.
 
] 7.  The instruction memory contains a "lookahead counter".  This lessens
]     traffic on the address bus; instruction addresses only squirt out of
]     the CPU after a branch .... leisurely reloading the counter while the
] branch target instruction cache supplies the 5 instructions after a branch.

"Leisurely" was a major part of RPM40 design : no splitting cycles on
external busses. 25ns just isn't long enough to multiplex, in CMOS.

] 8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
]     doesn't use the target register of a load until > 3 instructions after
]     the load ("3 load delay slots" in some folks' parlance), then there
] is no interlock and instructions are issued one per cycle.  If you use
] the target register of a load <= 3 cycles later, there is a pipeline stall
] while waiting for the Operand Memory to supply the data.
] 
] Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
] constraints that prevent 1 store per cycle always, nor did they compare
] and contrast loads vs. stores. **

You can do Stores every cycle, or Loads every cycle, if nothing else
interferes. And there's a lot that does. For instance, LD and ST don't
use the D bus during the same pipestage. This of course leads to a
pipeline hazard when a ST follows a LD by particular distances...

] 9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
]     opcode plus 12 bit coprocessor instruction type) are passed through
]     the CPU, and sent over the address bus to the coprocessor(s).  They
] can be stored in the branch target address cache.  So it _appears_ that
] two cycles are required to do a coprocessor op, one to communicate it
] from the CPU to the coprocessor and one to do it **.

It does take more than two cycles of LATENCY to do an XP op. However,
the RATE at which they can be done is one per cycle, as the
communication and execution of XP ops is pipelined. Everything in
RPM40 is pipelined : CPU, I-Cache, Memories, Coprocessors.

] GE didn't say
] whether there were architecturally-visible register files on the
] coprocessors **, but there _appears_ to be an "Xternal Processor Load"
] instruction **.  The Floating Point coprocessor is in fab now and is
] expected out this month.

XP architecture is transparent to the CPU. You want visible registers
in the XPs ? No problem. The FPU does have them. But the FPU has NOT
been "published" yet, so shouldn't be discussed.

] 10. The CPU chip contains 92,000 transistors and is housed in a 132 pin
]     package.  The design style is fully static which is helpful for
]     achieving radiation-hard parts.  40 pins are inputs, 46 pins are
] outputs, 32 pins are bidirectional (I/O), and there are 7 Power pins &
] 7 Ground pins.  No mention was made of whether this package configuration
] had been "certified" to run at 40 MHz, nor what agency would perform such
] certifications. **  The fab process is 1.2 micron bulk CMOS.

The package has run in excess of 75MHz. It's a leadless ceramic chip
carrier. It was chosen early in '86 because it had already been
certified (by someone GE trusts, I guess : maybe VHSIC ?) to run at
these speeds. The CPU IS running at 40MHz on a Mupac wire-wrap board,
executing the entire instruction set w/out problems ( once the two
clock phases were brought to the correct values )
 
] 11. A simple virtual memory scheme called "most significant bit replacement"
]     is used.  A process-id is appended to the MSB's of an address before
]     sending it out of the CPU.  A special case occurs when those bits
] are all-0's or all-1's.... ** **

0 to 23 of the MSb's are replaced, but if all the replaced bits aren't
all the same as the most significant NON-replaced bit, an exception occurs.
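
The check described here can be sketched in Python.  Assumptions, none of
them confirmed in the thread: addresses are 32 bits wide, the process-id is
right-aligned in the replaced field, and the exception is modelled as an
ordinary Python exception.  The names `replace_msbs` and `AddressException`
are invented for illustration.

```python
ADDRESS_BITS = 32  # assumption: 32-bit addresses

class AddressException(Exception):
    """Raised when the bits to be replaced are not a sign-extension."""

def replace_msbs(address, process_id, n):
    """Replace the top n bits (0..23) of address with process_id."""
    if not 0 <= n <= 23:
        raise ValueError("0 to 23 bits may be replaced")
    if not 0 <= address < (1 << ADDRESS_BITS):
        raise ValueError("address out of range")
    if n == 0:
        return address
    keep_bits = ADDRESS_BITS - n
    top = address >> keep_bits                 # the n bits about to be replaced
    guard = (address >> (keep_bits - 1)) & 1   # most significant NON-replaced bit
    expected = (1 << n) - 1 if guard else 0    # all-1's or all-0's
    if top != expected:
        raise AddressException(hex(address))
    low = address & ((1 << keep_bits) - 1)
    return ((process_id & ((1 << n) - 1)) << keep_bits) | low

print(hex(replace_msbs(0x00001234, 0xAB, 8)))
```

Under these assumptions an address such as 0x12001234 with n = 8 raises the
exception, because its top eight bits are neither all-0's nor all-1's.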

] -Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
] UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
] US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

This and the other DARPA MIPS processors are descendants,
philosophically anyway, of the original Stanford MIPS processor.
--
	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
    "Nuclear War is NOT the worst thing people can do to this planet."

jesup@pawl20.pawl.rpi.edu (Randell E. Jesup) (02/23/88)

In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
=These USENET articles mention that the chip, called the "GE RPM-40",
=runs a reduced instruction set, operates from 40 MHz clocks, and will
=be described at ISSCC (International Solid-State Circuits Conference)
=on February 17th.

=The
=most noticeable unknowns are marked with a double asterisk **;
=perhaps others can fill in these gaps (if the data isn't secret).

	To my knowledge, everything I say in this article is public
information.  (I was on the RPM-40 software team for 1 year, until July 87.)

=1.  The chip was built under a DOD contract.  It is one of several
=    implementations under this contract.  There are at least three:
=    General Electric (CMOS bulk), McDonnell-Douglas (GaAs MESFET), and
=Texas Instruments (GaAs Bipolar).  Interestingly, they have each chosen a
=different pipeline: GE == 4 stages, McDonnell == 5 stages, TI == 6 stages.

	Also there's Sperry/UniSys (also CMOS).  It's not surprising that the
GaAs people use longer pipelines: they can't do much in that time, and are
restricted on transistors.

=2.  The instruction set is "DARPA MIPS, core ISA (instruction set
=    architecture)".  In the GE chip, instructions are 16 bits long.
=    They are fetched from Instruction Memory two-at-a-time (making
=32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.

	All the machines listed above are designed so that 'Core ISA' (a
generic RISC assembly language, designed by Dr Gross of CMU) can be translated
to their native assembly languages.

=The ALU format has two register specifiers; presumably you can code
="R3 := R4 + R3"  but you cannot code "R3 := R4 + R1".

	Correct, r3 = r4 + r1 becomes r3 = r4; r3 = r3 + r1.
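
The rewrite of a three-address add into the two-address form can be
sketched as follows.  The mnemonics "mov" and "add" and the tuple layout
are assumptions made for illustration, not RPM-40 assembler syntax.  The
sketch also covers the case where the destination is the second source,
which the naive move-then-add sequence would get wrong.

```python
def lower_add(dest, src_a, src_b):
    """Lower dest := src_a + src_b into two-address (op, dest, src) tuples."""
    if dest == src_a:
        return [("add", dest, src_b)]       # already two-address
    if dest == src_b:
        # Addition is commutative, so accumulate src_a into dest instead
        # of overwriting dest with a move first.
        return [("add", dest, src_a)]
    return [("mov", dest, src_a), ("add", dest, src_b)]

print(lower_add("r3", "r4", "r1"))
```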

=The Store format has a source register, a base register, and a 4-bit
=offset field.  Loads have a dest reg, a base reg, and a 4-bit offset.

=Branch instructions _seem_ to have only a 12-bit displacement field;
=there doesn't appear to be a "Branch Register", "Branch And Link",
=or "Conditional Branch" instruction.  Perhaps the "COND" instruction
=is the conditional-skip instruction recently mentioned on the net**.

	Any of those displacements can be prefixed by PFX instruction(s)
to extend the displacement up to 32 bits.  Yes, Cond conditionally skips
the next instruction, they can be 'stacked' to provide complex conditionals.

=ALU ops can have a 4-bit immediate field.  If this is too small, the
="PREFIX" instruction contains a 12-bit prefix that can be concatenated
=to the immediate, to create a 16-bit immediate value.  Perhaps the
=PREFIX instruction can be used with loads, stores, and conditionals
=too. **

	Yes, but you can use up to 3 prefixes to get 32 bit constants (in
reality, 32 bits are not used very often.)

=There are 21 32-bit registers; I _believe_ these are arranged as
=16 general-purpose registers, plus 5 hardware stacks/queues (used in
=exception processing) that are mapped into the register space. **

	Minor error, there are 21 gp registers, plus a number of special
purpose registers, mostly reserved to supervisor mode.  Several are stacks
for internal state mapped into register slots.  User available registers
are the PC, Trap register, sr2 (has various flags), and the Size register
(determines the size of non-word LD/ST, allows some register remapping,
and a bit for doing 16-bit overflow detection instead of 32).

=8-bit and 16-bit external data are converted into the internal 32-bit
=format by zero-fill (unsigned) or sign-extend (signed).  This is to
=fulfill the DOD requirement for byte and halfword support.  With only
=a single "s" bit in the opcode it is difficult to see how these
=instructions are encoded (load byte, load halfword, load word) "cross"
=(signed, unsigned). **

	There are state bits in the size register that control some of
this.  The 's' bit specifies "load word" or "load not word" (type defined
by size bits, usually you're only playing with one non-word type).
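
The zero-fill and sign-extend conversions themselves are simple to state in
code.  A minimal Python sketch, assuming the result is held as an unsigned
32-bit pattern; how the size register selects `bits` and `signed` is not
modelled, and the function name is illustrative.

```python
def extend(raw, bits, signed):
    """Widen an 8- or 16-bit value to a 32-bit register pattern."""
    raw &= (1 << bits) - 1                 # keep only the external data bits
    if signed and raw >> (bits - 1):       # top bit set: negative value
        raw -= 1 << bits                   # sign-extend
    return raw & 0xFFFFFFFF                # zero-fill is the default case

print(hex(extend(0x80, 8, signed=True)), hex(extend(0x80, 8, signed=False)))
```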

=4.  Performance with 40 MHz clocks is 40 million native RPM-40 opcodes
=    per second.  For the DOD, they benchmarked on a standard US Air Force
=    mix of instructions called the `DAIS Mix'.  "The most pessimistic
=value on that mix is 14 MIPS", the speaker said.

	DAIS is a 1750a (Air Force Standard CPU) mix of instructions; the
DAIS timings are heavily FPU dependent and are in 1750a MIPS, not RPM-40!

=5.  The GE implementation uses a Harvard bus structure, with completely
=    separate Instruction Memory and Operand Memory.  GE currently is
=    using a total of 128Kbytes of memory: 16KWords of static RAM, each,
=for the IMem and OMem.  Imem needs 50ns chips and Omem needs 25ns chips.
=At present there is no way to increase the amount of physical memory
=(e.g. with dynamic RAM).  The speaker said that the CPU chip is intended
=for "embedded applications".

	Well.... The current board has 128K, but the CPU supports full
32-bit addressing.  Nothing says you can't put more than 128K on it, or use
some sort of external cache.  The only limits are the amount of capacitance
the CPU can drive at 40 MHz.

=8.  Loads take 7 cycles while ALU operations take 4 cycles.  If a program
=    doesn't use the target register of a load until > 3 instructions after
=    the load ("3 load delay slots" in some folks' parlance), then there
=is no interlock and instructions are issued one per cycle.  If you use
=the target register of a load <= 3 cycles later, there is a pipeline stall
=while waiting for the Operand Memory to supply the data.

	That is only a software stall, e.g. NOP-insertion.  Of course, the
reorganizer will try to fill it.  Note that the 7 & 4 cycle figures include
all pipe stages, including the illusory IF stage.
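
NOP-insertion for the three load delay slots can be sketched in Python.
This is the naive pass only (no attempt to fill the slots with useful
work, as a real reorganizer would), and the instruction representation,
an (op, dest, sources) tuple with "ld" and "nop" as mnemonics, is an
assumption made for illustration.

```python
LOAD_DELAY_SLOTS = 3  # a load's result is usable > 3 instructions later

def insert_nops(program):
    """Pad a list of (op, dest, srcs) tuples so no load result is used early."""
    out = []
    for op, dest, srcs in program:
        needed = 0
        # Look back over the last 3 emitted instructions; distance 1 is the
        # instruction immediately before the one about to be emitted.
        recent = reversed(out[-LOAD_DELAY_SLOTS:])
        for distance, (p_op, p_dest, _) in enumerate(recent, start=1):
            if p_op == "ld" and p_dest in srcs:
                needed = max(needed, LOAD_DELAY_SLOTS - distance + 1)
        out.extend([("nop", None, ())] * needed)
        out.append((op, dest, srcs))
    return out

prog = [("ld", "r1", ("r2",)), ("add", "r3", ("r1",))]
for instr in insert_nops(prog):
    print(instr)
```

A use immediately after the load gets three NOPs in front of it; a use
that already sits more than three instructions later is left alone.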

=Stores "can" operate at "up to" 1 per cycle.  GE didn't discuss the
=constraints that prevent 1 store per cycle always, nor did they compare
=and contrast loads vs. stores. **

	There are some interlocks with other address-bus using instructions.
You can string as many stores in a row as you want, or as many loads.

=9.  Coprocessor instructions (16 bits: 4 bit "Xternal Processor Instruction"
=    opcode plus 12 bit coprocessor instruction type) are passed through
=    the CPU, and sent over the address bus to the coprocessor(s).  They
=can be stored in the branch target address cache.  So it _appears_ that
=two cycles are required to do a coprocessor op, one to communicate it
=from the CPU to the coprocessor and one to do it **.  GE didn't say
=whether there were architecturally-visible register files on the
=coprocessors **, but there _appears_ to be an "Xternal Processor Load"
=instruction **.  The Floating Point coprocessor is in fab now and is
=expected out this month.

	The CPU doesn't have to wait, it just issues the instruction over
the address bus.  There is an XPLoad instruction, coprocessor dependent.

=11. A simple virtual memory scheme called "most significant bit replacement"
=    is used.  A process-id is appended to the MSB's of an address before
=    sending it out of the CPU.  A special case occurs when those bits
=are all-0's or all-1's.... ** **

	Tasks can be allocated memory under this scheme in power-of-two
sized chunks == 256 bytes.  Of course, instructions and data have different
mappings.

=++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
=-Mark Johnson	*** DISCLAIMER: The opinions above are personal. ***	
=UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mark   TEL: 408-991-0208
=US mail: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

	I hate to admit this, but it was decided that Core ISA mandated
little-endian memory layout, since several other Core ISA users had implemented
their CPUs that way already when we questioned it.  (Will little-endianism
dog our heels forever? :-)

	VERY rough figures are that 1 rpm-40 @ 40MHz is about equal to 7-9
16MHz 68020's with 0 wait-state memory and no MMU delay.  (Not your standard
unix box environment 68020.)

{ WARNING:  this is VERY ROUGH, and though I have calculations available that
            say this, they are very back-of-napkin style!  However, it's
	    probably not TOO far off.  Maybe we'll have real performance
	    figures at some point from GE (I don't work there anymore). }

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/24/88)

An article by bcase@apple.UUCP (Brian Case) says:
] In article <1642@mips.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
] >
] >   
] >Several articles have recently appeared, alluding to a CMOS  uP
] >built by General Electric, e.g. <9629@steinmetz.steinmetz.UUCP>,
] ><9631@steinmetz.steinmetz.UUCP>, and <375@imagine.PAWL.RPI.EDU>.
] >
] >Branch instructions _seem_ to have only a 12-bit displacement field;
] >there doesn't appear to be a "Branch Register", "Branch And Link",
] >or "Conditional Branch" instruction.  Perhaps the "COND" instruction
] >is the conditional-skip instruction recently mentioned on the net**.
] 
] Allen Baum (who attended the conference) told me that the single branch
] instruction is only available in the conditional form.  Thus, for
] an unconditional branch, you must make sure that you know the state of
] the single boolean bit (compares test a condition and set the state of
] the boolean bit).

Allen Baum has misinterpreted. Branches are conditional-ized just
like any other instruction (except PREFIX). If and only if the branch
(and its PREFIXes, if any) is preceded by one or more COND
instructions (and their PREFIXes, if any) is the branch conditional.
 
] >11. A simple virtual memory scheme called "most significant bit replacement"
] >    is used.  A process-id is appended to the MSB's of an address before
] >    sending it out of the CPU.  A special case occurs when those bits
] >are all-0's or all-1's.... ** **
] 
] Isn't this the original Stanford MIPS scheme?

It's an enhancement of the original Stanford scheme.


--
	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
    "Nuclear War is NOT the worst thing people can do to this planet."

mash@mips.COM (John Mashey) (02/24/88)

In article <409@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
...
>=or "Conditional Branch" instruction.  Perhaps the "COND" instruction
>=is the conditional-skip instruction recently mentioned on the net**.

>	Any of those displacements can be prefixed by PFX instruction(s)
>to extend the displacement up to 32 bits.  Yes, Cond conditionally skips
>the next instruction, they can be 'stacked' to provide complex conditionals.

I assume that cond skips the next instruction, including the PFX's??

>	Minor error, there are 21 gp registers, plus a number of special
>purpose registers, mostly reserved to supervisor mode.  Several are stacks
>for internal state mapped into register slots.  User available registers
>are the PC, Trap register, sr2 (has various flags), and the Size register
>(determines the size of non-word LD/ST, allows some register remapping,
>and a bit for doing 16-bit overflow detection instead of 32).

How do you address more than 16 gp regs, given the encoding?

>	VERY rough figures is 1 rpm-40 @ 40Mhz is about equal to 7-9
>16Mhz 68020's with 0 wait-state memory and no MMU delay.  (Not your standard
>unix box environment 68020.)

I.e., assuming that such 68020s are around 2 vax-mips, this sounds like
about 14-18 vax-mips, roughly.
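
The scaling above can be written out as a back-of-envelope sketch. Both
inputs are just the rough figures quoted in this thread (7-9 68020
equivalents, about 2 vax-mips per 68020); the function name is made up
for illustration and nothing here is a measured result.

```python
# Back-of-envelope only: both inputs are the rough figures quoted in the thread.
M68020_VAX_MIPS = 2          # assumed rating of a 0-wait-state 16 MHz 68020
RPM40_IN_68020S = (7, 9)     # "7-9" 68020 equivalents per RPM-40 at 40 MHz

def rpm40_vax_mips(ratio_range=RPM40_IN_68020S, per_68020=M68020_VAX_MIPS):
    """Scale the 68020-equivalent range into vax-mips."""
    low, high = ratio_range
    return (low * per_68020, high * per_68020)

print(rpm40_vax_mips())   # (14, 18)
```

Changing the assumed 68020 rating moves the whole range in proportion,
which is why the estimate is only as good as that one number.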

>{ WARNING:  this is VERY ROUGH, and though I have calculations available that
>            say this, they are very back-of-napkin style!  However, it's
>	    probably not TOO far off.  Maybe we'll have real performance
>	    figures at some point from GE (I don't work there anymore). }
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

zs01+@andrew.cmu.edu (Zalman Stern) (02/24/88)

There are a few things I don't quite get about the RPM-40 info posted here 
recently. Maybe somebody can clarify these for me.

First, in the "DARPA MIPS core ISA", which binding does "MIPS" take? Is this 
MIPS the corporation, MIPS the Stanford research project, or MIPS the vacuous 
abbreviation?

In some instructions, it appears that the second register argument is only 4 
bits long. Does this limit one to the first 16 registers or am I missing 
something? Can this be extended with a PREFIX instruction even though it isn't 
an immediate operand?

Could you post the ALU opcodes available? Notably do they include multiply and 
divide? (I know this is probably a stupid question.)

How does a coprocessor load work? Does the main processor calculate the 
address and then tell the coprocessor to pick up the data off the bus when it 
arrives? Also, how would one send a main processor register to the 
coprocessor?

I don't understand the virtual memory implementation at all. I would 
appreciate it if someone could elaborate on how this works.

I find this a very interesting architecture. My first impression is that the 
prefix instruction makes the 16 bit instructions reasonable. (A big win in my 
opinion.) It is rather like having an extra register for holding immediate 
operands only. I guess it might complicate exception handling though.

Sincerely,
Zalman Stern
Internet: zs01+@andrew.cmu.edu     Usenet: I'm soooo confused...
Information Technology Center, Carnegie Mellon, Pittsburgh, PA 15213-3890

oconnor@sunset.steinmetz (Dennis M. O'Connor) (02/25/88)

An article by zs01+@andrew.cmu.edu (Zalman Stern) says:
] There are a few things I don't quite get about the RPM-40 info posted here 
] recently. Maybe somebody can clarify these for me.
] 
] First, in the "DARPA MIPS core ISA", which binding does "MIPS" take? Is this 
] MIPS the corporation, MIPS the Stanford research project, or MIPS the
] vacuous abbreviation?

None of the above, sort of. Closest is the Stanford research project.
It has the same (overly cute and contorted :-) meaning here as it did
there :	Microprocessor (without) Interlocked PipestageS. It really is
a historical reference to the original Stanford work, which inspired
the DARPA program. And also birthed MIPS, Inc.

] In some instructions, it appears that the second register argument is only 4 
] bits long. Does this limit one to the first 16 registers or am I missing 
] something? Can this be extended with a PREFIX instruction even
] though it isn't an immediate operand?

Yes, this limits the non-destination register to be one of R0 through
R15. Usually. There are exceptions, mainly two instructions that reverse
which register is considered the destination.

] Could you post the ALU opcodes available? Notably do they include
] multiply and divide? (I know this is probably a stupid question.)

At this time, I can clarify anything presented at ISSCC. I am not sure
I can divulge new information. Sorry, not my decision to make.
But the multiply/divide support is NOT a stupid question. Believe me,
we thought of maybe four fundamentally different approaches before
picking the fastest one we could afford to implement in the system.
Maybe more details later ...

] How does a coprocessor load work? Does the main processor calculate the 
] address and then tell the coprocessor to pick up the data off the bus
] when it arrives? 

Yes, and also tells the coprocessor which register to put it in.

[ ... some stuff I'm unsure about responding to omitted ... ]

] I find this a very interesting architecture. My first impression is
] that the prefix instruction makes the 16 bit instructions reasonable.
] (A big win in my opinion.) It is rather like having an extra register
] for holding immediate operands only. I guess it might complicate
] exception handling though.

Thanks for the compliment. WE like it too. PREFIX instructions, BTW,
are adapted from the InMOS transputer concept of building instructions
a byte at a time. It's essentially a simplified version. An
interesting point : it makes the RPM40 instruction set totally
oblivious to data word size. The registers could be any number of
bits, and the instruction set would cope gracefully. It also makes the
instruction set "context independent", a feature of MANY if not ALL
RISC architectures : once you have an instruction in hand, you don't
need to know where it came from to know what it is. This is unlike,
say, the VAX and 68K families, which have immediate values imbedded
within the instruction stream which are only distinguishable if the
previous "instruction" word is known. Makes dis-assembly tougher.
Probably makes instruction decoding tougher too.

Back to PREFIX : using our PREFIX scheme, we handle commonly occurring
small constants quickly, and larger constants more slowly, but without
using additional user-visible resources. The cost of a large constant
is "step-wise proportional" to the constant's length in bits.  Given
what we could find in the research about constants, this is the
correct "RISC philosophy" thing to do.
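
To make "step-wise proportional" concrete, here is a minimal sketch that
counts how many PREFIX instructions a signed constant would need. The
field widths are NOT from the ISSCC material: base_bits (immediate bits
in the instruction itself) and pfx_bits (bits supplied per PREFIX) are
placeholder assumptions, as are the function names.

```python
def signed_bits(value):
    """Smallest two's-complement width that can hold value."""
    if value >= 0:
        return value.bit_length() + 1
    return (~value).bit_length() + 1

def prefixes_needed(value, base_bits=4, pfx_bits=12):
    """Count PREFIX instructions for value.

    base_bits and pfx_bits are hypothetical widths chosen for
    illustration, not the real RPM-40 encoding.
    """
    extra = signed_bits(value) - base_bits
    if extra <= 0:
        return 0                     # fits in the instruction's own field
    return -(-extra // pfx_bits)     # ceiling division

for constant in (5, -1, 1000, 0x7FFFFFFF):
    print(constant, prefixes_needed(constant))
```

With these assumed widths the cost rises by one instruction each time
the constant outgrows another pfx_bits bits, and small constants
(including small negative ones) cost nothing extra.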

All you 32-bit instruction advocates : how many of your 32-bits of
instruction are usually wasted ( like by leading zeroes or ones, or
unused register specifications ) ? If it sounds like I'd welcome a
debate on the merits of 16 vs 32 bit instructions : sure. Isn't that
what comp.arch is for ? And I said a DEBATE, not a fire-fight :-)

] Sincerely,
] Zalman Stern


--
	Dennis O'Connor 	oconnor@sunset.steinmetz.UUCP ??
				ARPA: OCONNORDM@ge-crd.arpa
    "Nuclear War is NOT the worst thing people can do to this planet."

jesup@pawl3.pawl.rpi.edu (Randell E. Jesup) (02/29/88)

In article <1666@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <409@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>>	Any of those displacements can be prefixed by PFX instruction(s)
>>to extend the displacement up to 32 bits.  Yes, Cond conditionally skips
>>the next instruction, they can be 'stacked' to provide complex conditionals.

>I assume that cond skips the next instruction, including the PFX's??

	Yup.  The COND instruction actually skips the next (non-PFX,non-COND)
instruction.  Essentially, it acts as though PFX is part of the instruction
after it.  Example:

		COND GT,.r1,.r2
		PFX  #$xxx
		COND GE,.r1,#$yy
		PFX  #$qqq
		ADD  .r1,#zz
		MOV  .r2,.r1

If either COND fails, control goes to the MOV instruction.  Of course,
you wouldn't write the PFX's in yourself; the assembler does them for you
automagically.
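
The skip rule just described can be modelled in a few lines. This is a
minimal sketch, not the hardware's mechanism: comparisons are passed in
as already-evaluated booleans, the tuple format and the names run() and
example() are invented for illustration, and it assumes stacked CONDs
simply AND together.

```python
def run(program):
    """Return the names of the ordinary instructions that execute.

    program is a list of tuples (a made-up format for this sketch):
      ("COND", passed)   passed is the already-evaluated comparison
      ("PFX",)           prefix; belongs to the instruction after it
      ("OP", name)       any ordinary instruction
    """
    executed = []
    skip_next = False            # set once any stacked COND has failed
    for instr in program:
        kind = instr[0]
        if kind == "PFX":
            continue             # treated as part of the following instruction
        if kind == "COND":
            if not instr[1]:
                skip_next = True
            continue
        # Ordinary instruction: it consumes the pending condition, if any.
        if not skip_next:
            executed.append(instr[1])
        skip_next = False
    return executed

def example(gt, ge):
    """The six-instruction sequence from the post, with COND outcomes given."""
    return run([("COND", gt), ("PFX",), ("COND", ge), ("PFX",),
                ("OP", "ADD"), ("OP", "MOV")])

print(example(True, True))    # ['ADD', 'MOV']
print(example(True, False))   # ['MOV']
```

Either failing COND suppresses only the ADD; the MOV always runs, which
matches "control goes to the MOV instruction".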

>>	Minor error, there are 21 gp registers, plus a number of special
>>purpose registers, mostly reserved to supervisor mode.  Several are stacks
>>for internal state mapped into register slots.  User available registers
>>are the PC, Trap register, sr2 (has various flags), and the Size register
>>(determines the size of non-word LD/ST, allows some register remapping,
>>and a bit for doing 16-bit overflow detection instead of 32).
>
>How do you address more than 16 gp regs, given the encoding?

	In general, the destination of ALU ops can be any register 0-31.
However, for most ALU ops the source must be in regs 0-15.  There are
two ways around this:
	1)  There are two instructions that reverse the meanings of "source"
	    and "destination".  These are RMOV (reverse move) and RADD (reverse
	    add).  These allow moving the higher registers to the lower
	    or adding them into the lower (two high-frequency ops).
	2)  There is a bit that allows swapping of the regs 8-13 and regs
	    16-21.
Note that loads and stores also must only use regs 0-15.
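
The operand rules above can be summarised as a small checker. This is a
sketch under assumptions: it only encodes what is stated here (a wide
0-31 field, a narrow 0-15 field, RMOV/RADD swapping their roles, and a
swap bit exchanging regs 8-13 with 16-21). How the swap bit is actually
applied in hardware is not described, so physical_reg() is a guess at
one plausible reading, and both function names are invented.

```python
REVERSED_OPS = {"RMOV", "RADD"}   # the two "reverse" instructions named above

def encodable(op, dest, src):
    """True if (dest, src) fits the wide (0-31) and narrow (0-15) fields.

    Normally the destination gets the wide field; for RMOV/RADD the
    source does.
    """
    wide, narrow = (src, dest) if op in REVERSED_OPS else (dest, src)
    return 0 <= wide <= 31 and 0 <= narrow <= 15

def physical_reg(field, swap_enabled):
    """Assumed remapping: with the swap bit set, 8-13 and 16-21 trade places."""
    if swap_enabled:
        if 8 <= field <= 13:
            return field + 8
        if 16 <= field <= 21:
            return field - 8
    return field

print(encodable("ADD", 20, 3))    # True:  high destination, low source
print(encodable("ADD", 3, 20))    # False: source does not fit in 4 bits
print(encodable("RADD", 3, 20))   # True:  reversed roles
print(physical_reg(9, True))      # 17
```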

	There is no guarantee the higher registers will be extremely useful,
but they are very useful for things like temps, or passing args, or
accumulators, etc.  The swap feature can make them much more useful, but
requires more work to use.

>>	VERY rough figures is 1 rpm-40 @ 40Mhz is about equal to 7-9
>>16Mhz 68020's with 0 wait-state memory and no MMU delay.  (Not your standard
>>unix box environment 68020.)
>
>I.e., assuming that such 68020s are around 2 vax-mips, this sounds like
>about 14-18 vax-mips, roughly.

	That seems to jibe fairly well.  Of course, only real benchmarks
will tell the story, and those depend on compiler tech quite a bit.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

jesup@pawl3.pawl.rpi.edu (Randell E. Jesup) (02/29/88)

In article <kW8iFty00Vs8EnE0ij@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
	    ^^^^  interesting user name

>First, in the "DARPA MIPS core ISA", which binding does "MIPS" take? Is this 
>MIPS the corporation, MIPS the Stanford research project, or MIPS the vacuous 
>abbreviation?

	The vacuous abbrev.

>Could you post the ALU opcodes available? Notably do they include multiply and 
>divide? (I know this is probably a stupid question.)

	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
course they are not there.  No direct support on CPU for them either.  I will
say more on this issue when the FPU is formally announced.  You can do them
in the CPU in software if you want, takes a few cycles though.
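
For readers wondering what "in software" looks like, here is a generic
shift-and-add multiply. It is the textbook technique, not RPM-40 code or
anything GE has described; the function name and the 32-bit wraparound
are assumptions made for the sketch.

```python
MASK32 = 0xFFFFFFFF

def shift_add_multiply(a, b):
    """Low 32 bits of a*b using only shifts, adds and bit tests."""
    a &= MASK32
    b &= MASK32
    product = 0
    while b:
        if b & 1:                          # add the multiplicand for each set bit
            product = (product + a) & MASK32
        a = (a << 1) & MASK32              # next bit is worth twice as much
        b >>= 1
    return product

print(shift_add_multiply(6, 7))   # 42
```

The loop runs once per significant bit of b, so up to 32 iterations of a
handful of simple instructions each.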

>I don't understand the virtual memory implementation at all. I would 
>appreciate it if someone could elaborate on how this works.

	It's not virtual memory, but memory protection and address translation.
It allows you to declare power-of-2 sized allocations for a task in both
instruction & data memories.  I really don't want to go into the bit-banging
here & now, however.
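
Without going into the real bit-banging, the general idea of "most
significant bit replacement" with power-of-2 regions can be sketched as
follows. Everything specific is an assumption: the 32-bit address
width, the function and parameter names, and above all the plain
replacement rule. The special handling of all-0's or all-1's upper bits
mentioned earlier in the thread is not described anywhere here, so it
is deliberately left out.

```python
ADDRESS_BITS = 32   # assumed width, for illustration only

def translate(task_addr, region_bits, task_id):
    """Keep the low region_bits bits of task_addr and replace the rest.

    The task sees a region of 2**region_bits addresses; the upper bits
    that leave the CPU come from task_id instead of from the task.
    The all-0's / all-1's special case is NOT modelled.
    """
    offset_mask = (1 << region_bits) - 1
    upper_width = ADDRESS_BITS - region_bits
    if not 0 <= task_id < (1 << upper_width):
        raise ValueError("task_id does not fit in the replaced bits")
    return (task_id << region_bits) | (task_addr & offset_mask)

print(hex(translate(0x1234, 16, 0xAB)))       # 0xab1234
print(hex(translate(0xFFFF1234, 16, 0xAB)))   # 0xab1234
```

Under this reading a task cannot reach outside its region whatever
upper bits it generates, which is where the protection comes from.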

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

dennisr@ncr-sd.SanDiego.NCR.COM (Dennis Russell) (03/01/88)

In article <443@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
>course they are not there.  No direct support on CPU for them either.  I will
>say more on this issue when the FPU is formally announced.  You can do them
>in the CPU in software if you want, takes a few cycles though.
>
The MIPS R2000 supports 32-bit integer signed and unsigned multiply and
divide.  Multiply and divide operations are performed by a separate,
autonomous execution unit within the R2000.  After a multiply or divide
operation is started, execution of other instructions may continue in
parallel.


-- 
Dennis Russell                               |      NCR Corp., M/S 4720
phone:    619-485-3214                       |      16550 W. Bernardo Dr.     
UUCP:  ...{ihnp4|pyramid}!ncr-sd!dennisr     |      San Diego, CA 92128       

bcase@Apple.COM (Brian Case) (03/02/88)

In article <443@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
<In article <kW8iFty00Vs8EnE0ij@andrew.cmu.edu> zs01+@andrew.cmu.edu (Zalman Stern) writes:
<	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
<course they are not there.  No direct support on CPU for them either.  I will
<say more on this issue when the FPU is formally announced.  You can do them
<in the CPU in software if you want, takes a few cycles though.

Well, I have seen such instructions (they only trap; they were included in
anticipation of on-chip multiply/divide resources).

<<I don't understand the virtual memory implementation at all. I would 
<<appreciate it if someone could elaborate on how this works.
<
<	It's not virtual memory, but memory protection and address translation.
<It allows you to declare power-of-2 sized allocations for a task in both
<instruction & data memories.  I really don't want to go into the bit-banging
<here & now, however.

Isn't the RPM40 scheme just the good-old Stanford MIPS scheme?  Sounds
like it to me.  If so, I can quote some references.

bs@linus.UUCP (Robert D. Silverman) (03/02/88)

In article <2065@ncr-sd.SanDiego.NCR.COM> dennisr@ncr-sd.SanDiego.NCR.COM (0000-Dennis Russell) writes:
:In article <443@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
:>	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
:>course they are not there.  No direct support on CPU for them either.  I will
:>say more on this issue when the FPU is formally announced.  You can do them
:>in the CPU in software if you want, takes a few cycles though.
:>
:The MIPS R2000 supports 32-bit integer signed and unsigned multiply and
:divide.  Multiply and divide operations are performed by a separate,
:autonomous execution unit within the R2000.  After a multiply or divide
:operation is started, execution of other instructions may continue in
:parallel.
:
:
:-- 
:Dennis Russell                               |      NCR Corp., M/S 4720
:phone:    619-485-3214                       |      16550 W. Bernardo Dr.     
:UUCP:  ...{ihnp4|pyramid}!ncr-sd!dennisr     |      San Diego, CA 92128       
 
RISC architectures are great for some things and uniformly lousy for others.
That's all I've been trying to point out.



What about 32 x 32 bit multiplies or 64 by 32 bit divides? emul and ediv
are marvelous for this. How else would one compute

A*B/C
A*B mod C

where A, B, C are all 32 bit quantities? The result always fits in 32 bits
but unless you can compute A*B without overflow you can not get a correct
answer to these computations.
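
A small sketch of the overflow problem, and of one software way around
it for the mod case. The function names are invented; muldiv_wide()
stands in for what an emul/ediv pair provides, and mulmod_no_wide() is
the standard shift-and-add method, valid here only under the stated
assumption that C is at most 2^31 so no intermediate exceeds 32 bits.

```python
MASK32 = 0xFFFFFFFF

def muldiv_truncated(a, b, c):
    """A*B/C and A*B mod C when the product is cut to 32 bits first."""
    p = (a * b) & MASK32
    return p // c, p % c

def muldiv_wide(a, b, c):
    """The same with the full 64-bit product kept (the emul/ediv result)."""
    p = a * b
    return p // c, p % c

def mulmod_no_wide(a, b, c):
    """A*B mod C by shift-and-add; assumes 0 < c <= 2**31.

    Every intermediate stays below 2*c, so on a real 32-bit machine each
    '% c' could be a compare and a conditional subtract.
    """
    assert 0 < c <= 1 << 31
    a %= c
    result = 0
    while b:
        if b & 1:
            result = (result + a) % c
        a = (a << 1) % c
        b >>= 1
    return result

print(muldiv_truncated(65536, 65536, 7))   # (0, 0)
print(muldiv_wide(65536, 65536, 7))        # (613566756, 4)
print(mulmod_no_wide(65536, 65536, 7))     # 4
```

The shift-and-add route recovers the remainder without a wide product,
but at a cost of one loop iteration per bit of B, and it does nothing
for the quotient A*B/C.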

Bob Silverman

kers@otter.hple.hp.com (Christopher Dollin) (03/04/88)

Version 2 of the Acorn Risc Machine has two multiply instructions (one with,
one without, accumulate), but no divide instruction.

At a seminar I attended, the designer* said that (a) they could fit it on the
chip, and (b) it afforded enough performance increase to be an acceptable
overhead (rather than having a multiply-step, or doing it with shift-and-add).

Mildly surprising, considering the shiftable-register-source in the data
manipulation instructions (gives you multiplies by constants of the form 2^n,
2^n+1, 2^n-1 in one instruction). Could it be something to do with having
interpreted BBC Basic as a principal language, so there isn't a compiler to
notice that E*K can be done speedily?
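
What a compiler would be noticing can be sketched like this, assuming
the constants meant are 2^n, 2^n + 1 and 2^n - 1 (a shift alone, a shift
plus an add, a shift plus a subtract). The function names are made up,
and this shows the arithmetic identity only, not ARM code.

```python
def times_pow2(x, n):
    """x * 2**n: a shift on its own."""
    return x << n

def times_pow2_plus1(x, n):
    """x * (2**n + 1): shifted operand added to the original."""
    return (x << n) + x

def times_pow2_minus1(x, n):
    """x * (2**n - 1): original subtracted from the shifted operand."""
    return (x << n) - x

print(times_pow2(7, 3), times_pow2_plus1(7, 3), times_pow2_minus1(7, 3))
# 56 63 49
```

An interpreter sees E*K only at run time, so it has no cheap chance to
pick one of these forms, whereas a compiler can when K is a known
constant.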

Regards,
Kers.

* well, one of the designers.