[comp.arch] RPM-40 microprocessor @ 40 MHz; dat

aglew@ccvaxa.UUCP (02/25/88)

>=2.  The instruction set is "DARPA MIPS, core ISA (instruction set
>=    architecture)".  In the GE chip, instructions are 16 bits long.
>=    They are fetched from Instruction Memory two-at-a-time (making
>=    32 bit xfrs) at a 20 MHz rate, totalling 40M instructions per sec.
>
>	All the machines listed above are designed so that 'Core ISA' (a
>generic RISC assembly language, designed by Dr Gross of CMU) can be translated
>to their native assembly languages.
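
The fetch figures quoted above can be checked with a little arithmetic. This is only a back-of-envelope sketch: the three constants come from the quoted text, while the function name is made up for illustration.

```python
# Figures taken from the quoted description of the GE chip.
INSTRUCTION_BITS = 16   # each instruction is 16 bits long
FETCH_WIDTH_BITS = 32   # two instructions arrive per 32-bit transfer
FETCH_RATE_MHZ = 20     # transfers happen at a 20 MHz rate


def peak_instruction_rate(fetch_bits, instr_bits, fetch_mhz):
    """Peak instruction supply, in millions of instructions per second."""
    instructions_per_fetch = fetch_bits // instr_bits
    return instructions_per_fetch * fetch_mhz


peak = peak_instruction_rate(FETCH_WIDTH_BITS, INSTRUCTION_BITS, FETCH_RATE_MHZ)
print(peak, "M instructions/sec peak")
```

This is a peak supply rate only; it says nothing about how many of those instructions do useful work.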

Okay, what about this MIPS-like ISA? Will it be assembly language only,
or binary? Will it be possible to run some form of program intermediate
between C and actual assembly through a translator to move between these
families - and will third party software vendors distribute that portable
form?

aglew@ccvaxa.UUCP (02/27/88)

..> Prefix instructions in the GE RPM-40

I like this idea.

(I should - I used it in a school project back in '84, before I knew details
of the Transputer - I think I got it from an earlier architecture, melded
with the 8088's PREFIX instructions.)

I particularly like how it begins to let the instruction set get independent
of the register size (so long as people do not expect 1<<32 == 0)

A question, though: how would you compare PREFIX to an instruction SHIFT and
OR --  SHOR r,lit ::== r := (r<<14)|lit? PREFIX always seems to eventually
require a specification for one of several literal fields it is extending,
plus it requires state to be saved on interrupts, which leans towards 
assembling the constant in a register.
    On the other hand, you can always build a decoder that never puts prefix
into a register at all, but takes prefix and the prefixed instruction as
one packet. This is nice, and makes it a pity to require the register write.

What do people (particularly the RPM-40 people) feel on this?
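
To make the SHOR proposal concrete, here is a minimal sketch of how a full-width constant would be assembled in a register. The 14-bit literal comes from the definition above; the 32-bit register width and the helper names are assumptions for illustration only.

```python
LIT_BITS = 14                    # literal width from the SHOR definition above
LIT_MASK = (1 << LIT_BITS) - 1
WORD_MASK = 0xFFFFFFFF           # assumption: 32-bit registers


def shor(r, lit):
    """SHOR r,lit ::= r := (r << 14) | lit, truncated to the register width."""
    return ((r << LIT_BITS) | (lit & LIT_MASK)) & WORD_MASK


def split_literals(value):
    """Split a 32-bit value into 14-bit chunks, most significant first."""
    return [(value >> shift) & LIT_MASK for shift in (28, 14, 0)]


def build_constant(value):
    """Assemble a constant with a chain of SHORs; each step needs the last result."""
    r = 0
    for lit in split_literals(value):
        r = shor(r, lit)
    return r


print(hex(build_constant(0xDEADBEEF)))
```

Note that every SHOR in the chain reads the register the previous one wrote, which is where the pipeline-latency question comes from.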



Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   
    aglew@gould.com     	- preferred, if you have nameserver
    aglew@gswd-vms.gould.com    - if you don't
    aglew@gswd-vms.arpa 	- if you use DoD hosttable
    aglew%mycroft@gswd-vms.arpa - domains are supposed to make things easier?
   
My opinions are my own, and are not the opinions of my employer, or any
other organisation. I indicate my company only so that the reader may
account for any possible bias I may have towards our products.

jesup@pawl3.pawl.rpi.edu (Randell E. Jesup) (02/29/88)

In article <28200110@ccvaxa> aglew@ccvaxa.UUCP writes:
>>	All the machines listed above are designed so that 'Core ISA' (a
>>generic RISC assembly language, designed by Dr Gross of CMU) can be translated
>>to their native assembly languages.
>
>Okay, what about this MIPS-like ISA? Will it be assembly language only,
>or binary? Will it be possible to run some form of program intermediate
>between C and actual assembly through a translator to move between these
>families - and will third party software vendors distribute that portable
>form?

	Core ISA is an assembly language for a nonexistent machine.  It
is fairly 'RISCy', but includes things like multiply (integer and FP) as
single ops, etc.  It has no relation to ANY existing hardware at all, and was
designed explicitly for the DARPA MIPS project.

	Anything distributed in Core ISA is portable (at least potentially).
All the machines mentioned have Core_ISA->their_assembler translators.
However, I suspect most stuff will be distributed in source (the compilers
produce Core ISA, that's the point of it).  Assembler modules should all be
written in Core as well.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

jesup@pawl3.pawl.rpi.edu (Randell E. Jesup) (02/29/88)

In article <28200112@ccvaxa> aglew@ccvaxa.UUCP writes:
>..> Prefix instructions in the GE RPM-40

>A question, though: how would you compare PREFIX to an instruction SHIFT and
>OR --  SHOR r,lit ::== r := (r<<14)|lit? PREFIX always seems to eventually
>require a specification for one of several literal fields it is extending,
>plus it requires state to be saved on interrupts, which leans towards 
>assembling the constant in a register.

	Pipelining!  You can't use the result of an op in the next
instruction!  So you'd have to devote both a register AND intersperse NOPs
between SHORs.  However, on a machine with loopback of ALU results (may
slow things down) it only costs a register, so it doesn't hurt TOO much
(if you have registers to spare, which you very well might not).
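
The NOP cost described above can be sketched as a toy scheduler. It assumes a result delay of exactly one instruction when there is no loopback (the post only says the result cannot be used "in the next instruction"); the function name is hypothetical.

```python
def schedule_chain(n_ops, result_delay):
    """Issue slots for a chain of n dependent SHORs.

    result_delay is the number of instructions that must separate a
    producer from its consumer: 1 models "can't use the result in the
    next instruction", 0 models loopback of ALU results.
    """
    slots = []
    for i in range(n_ops):
        slots.append("SHOR #%d" % (i + 1))
        if i < n_ops - 1:
            slots.extend(["NOP"] * result_delay)
    return slots


no_loopback = schedule_chain(3, 1)
with_loopback = schedule_chain(3, 0)
print(len(no_loopback), "slots without loopback,", len(with_loopback), "with")
```

A reorganizer could of course put useful, independent instructions in those slots instead of NOPs; the sketch shows the worst case.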

	What are these 'several fields' you refer to?  RPM-40 can only have
1 value that might be extended via prefix in any instruction (immediates
for ALU and COND ops, offset for load/store/branch, xp instruction field
for XPINST, etc.)

>    On the other hand, you can always build a decoder that never puts prefix
>into a register at all, but takes prefix and the prefixed instruction as
>one packet. This is nice, and makes it a pity to require the register write.

	RPM-40 does that now, but handles each prefix as it comes along
(there are some hidden resources being used).  What you imply would complicate
the decoder a lot.

>Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

oconnor@sunset.steinmetz (Dennis M. O'Connor) (03/01/88)

An article by aglew@ccvaxa.UUCP says:
] 
] ..> Prefix instructions in the GE RPM-40
] 
] I like this idea.
] [...]
] I particularly like how it begins to let the instruction set get independent
] of the register size (so long as people do not expect 1<<32 == 0)
] 
] A question, though: how would you compare PREFIX to an instruction SHIFT and
] OR --  SHOR r,lit ::== r := (r<<14)|lit?

PREFIX builds immediate values that can then be added, ORed,
subtracted or whatever to anything you like. It does not use
a user register to do this (minor win). And it does NOT access
the register file, or use the ALU. In a pipelined system
this is significant: PREFIX as implemented in RPM40 has no latency
problems (major win). SHOR would have latency problems.
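
A rough model of the idea, not the actual RPM40 encoding: the field widths below are hypothetical, as are the opcode names. The point is only that the literal is accumulated in hidden decoder state and handed to the next ordinary instruction, without touching the register file or the ALU.

```python
PREFIX_BITS = 12   # hypothetical literal width of a PREFIX instruction
IMM_BITS = 4       # hypothetical short-immediate width of an ordinary instruction


def decode(stream):
    """Expand (opcode, literal) pairs into (opcode, full_immediate) pairs.

    PREFIX entries are absorbed into a hidden accumulator. The next
    non-PREFIX instruction consumes it and the accumulator is cleared.
    """
    prefix = 0
    decoded = []
    for opcode, literal in stream:
        if opcode == "PREFIX":
            prefix = (prefix << PREFIX_BITS) | (literal & ((1 << PREFIX_BITS) - 1))
        else:
            imm = (prefix << IMM_BITS) | (literal & ((1 << IMM_BITS) - 1))
            decoded.append((opcode, imm))
            prefix = 0
    return decoded


program = [("PREFIX", 0xABC), ("ADDI", 0xD), ("ADDI", 0x3)]
print(decode(program))
```

In this model the accumulator is exactly the state that would have to be preserved if an interrupt arrived between a PREFIX and the instruction it extends.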

] PREFIX always seems to eventually
] require a specification for one of several literal fields it is extending,
] plus it requires state to be saved on interrupts, which leans towards 
] assembling the constant in a register.

RPM40 instructions only have one field that can possibly be an
immediate operand; why have more? Any operations on two constants should
be done at compile or load time, I think. Given you have a
reverse-subtract instruction ( normal = op1-op2, reverse = op2-op1 )
I don't see the need for two "immediate-able" fields.
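
The reverse-subtract point can be shown in a few lines. The function names are illustrative, not RPM40 mnemonics: with one immediate field and both operand orders available, either `x - 5` or `5 - x` is a single instruction.

```python
def sub(reg, imm):
    """Normal subtract: op1 - op2, with the immediate as op2."""
    return reg - imm


def rsub(reg, imm):
    """Reverse subtract: op2 - op1, so the immediate is the minuend."""
    return imm - reg


x = 12
print(sub(x, 5), rsub(x, 5))
```

An expression with two constants, such as `5 - 3`, never needs to reach the hardware at all, since the compiler or loader can fold it.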

Yes, the prefix register needs to be saved on a context switch, and in
fact has to have an old value saved. This is not really a big deal.

]     On the other hand, you can always build a decoder that never puts prefix
] into a register at all, but takes prefix and the prefixed instruction as
] one packet. This is nice, and makes it a pity to require the register write.

This is a good idea, especially if you can fetch instructions faster
than you can execute (non-PREFIX) instructions.

] Andy "Krazy" Glew. Gould CSD-Urbana.    1101 E. University, Urbana, IL 61801   


--
    Dennis O'Connor			      UUNET!steinmetz!sunset!oconnor
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

mash@mips.COM (John Mashey) (03/01/88)

In article <9727@steinmetz.steinmetz.UUCP> sunset!oconnor@steinmetz.UUCP writes:
...
>] A question, though: how would you compare PREFIX to an instruction SHIFT and
>] OR --  SHOR r,lit ::== r := (r<<14)|lit?
>
>PREFIX builds immediate values that can then be added, ORed,
>subtracted or whatever to anything you like. It does not use
>a user register to do this (minor win). And it does NOT access
>the register file, or use the ALU. In a pipelined system
>this is significant: PREFIX as implemented in RPM40 has no latency
>problems (major win). SHOR would have latency problems.

Why would it have latency problems? None of the popular RISCs have
latency problems with r = r op literal for the usual ops.
I.e., any high-performance system is likely to make use of
register-bypassing anyway, so that:
	r = r op literal
	r = r op r
has zero intervening latency (the performance penalty of a
cycle's latency for such things is large).
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

aglew@ccvaxa.UUCP (03/02/88)

>	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
>course they are not there.  No direct support on CPU for them either.  I will
>say more on this issue when the FPU is formally announced.  You can do them
>in the CPU in software if you want, takes a few cycles though.

If your customers spend time doing multiplies or divides, then your RISC
designer will put them in. Cray is the only "RISCy" machine that is widely
known with multiply that springs to mind, though. Same for floating point.

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (03/02/88)

In article <444@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
| [...]
| 	Core ISA is an assembly language for a nonexistent machine.  It
| is fairly 'RISCy', but includes things like multiply (integer and FP) as
| single ops, etc.  It has no relation to ANY existing hardware at all, and was
| designed explicitly for the DARPA MIPS project.
| 
| 	Anything distributed in Core ISA is portable (at least potentially).
| All the machines mentioned have Core_ISA->their_assembler translators.

If it clarifies the situation, ISA is functionally similar to the old
UCSD P-system, and I don't see any technical reason why it couldn't be
interpreted instead of translated and compiled.

For history buffs, the original "B" language compiler compiled to P-code,
which was then used to generate assembler. We had a P-code interpreter
on several machines.
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

tim@amdcad.AMD.COM (Tim Olson) (03/02/88)

In article <445@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
| 	Pipelining!  You can't use the result of an op in the next
| instruction!  So you'd have to devote both a register AND intersperse NOPs
| between SHORs.  However, on a machine with loopback of ALU results (may
| slow things down) it only costs a register, so it doesn't hurt TOO much
| (if you have registers to spare, which you very well might not).

Interesting... this is the first RISC processor I have heard of that did
not implement operand {forwarding/bypassing/other names?} around the
ALU.  What prompted the elimination of this feature?  Do you have any
statistics on how many additional nops/stalls are required?

Thanks for any info...


	-- Tim Olson
	Advanced Micro Devices
	(tim@amdcad.amd.com)

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/02/88)

An article by mash@winchester.UUCP (John Mashey) says:
] In article <...> sunset!oconnor@steinmetz.UUCP writes:
] ...
] >] [...] how would you compare PREFIX to an instruction SHIFT and
] >] OR --  SHOR r,lit ::== r := (r<<14)|lit?
] >
] > [...] PREFIX as implemented in RPM40 has no latency
] >problems (major win). SHOR would have latency problems.
] 
] Why would it have latency problems? None of the popular RISCs have
] latency problems with r = r op literal for the usual ops.

Then the RPM40 and its GaAs brethren aren't "popular RISCs".

] I.e., any high-performance system is likely to make use of
] register-bypassing anyway, so that:
] 	r = r op literal
] 	r = r op r
] has zero intervening latency (the performance penalty of a
] cycle's latency for such things is large).

Who said we don't use register bypassing ? But that's not
the point. "Popular RISCs" don't have any latency on
ALU ops because they ARE ( No Dennis don't say it, no, no ... )
SLOW SLOW SLOW ! (ARRGGHH he said it ! BAD DENNIS, BAD <whack>)
An explanation follows :

IMHO, a pipelined processor should run as fast as its ALU
lets it. Some RISC processors DO NOT do this. Instead, they
perform either the operand-read or the result-write for an
instruction in the same pipestage as the ALU op. This results
in a BIG increase in cycle time, and therefore a BIG decrease
in performance.

E.g.: say your ALU latency is 25ns, and your register read or write
takes 10ns. Combine a register access with the ALU operation and
you have a 35ns cycle, i.e. a 28MIPS machine. Separate them and you
have a 25ns cycle, i.e. a 40MIPS machine. But you have higher latency.
So which is the win ?
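
The arithmetic behind those two figures, plus one way to frame "which is the win", can be sketched as follows. The 25ns and 10ns values are from the example above; the stall model (extra latency shows up as a fixed number of wasted cycles per instruction) is a simplifying assumption.

```python
ALU_NS = 25.0          # ALU latency from the example above
REG_ACCESS_NS = 10.0   # register read or write time from the example above


def mips(cycle_ns):
    """Peak instruction rate for a one-instruction-per-cycle machine."""
    return 1000.0 / cycle_ns


def effective_mips(peak_mips, stalls_per_instruction):
    """Throughput once wasted cycles (NOPs or interlocks) are counted."""
    return peak_mips / (1.0 + stalls_per_instruction)


combined = mips(ALU_NS + REG_ACCESS_NS)   # register access shares the ALU stage
separate = mips(ALU_NS)                   # register access in its own stage

# Wasted cycles per instruction at which the faster clock stops paying off.
break_even_stalls = separate / combined - 1.0

print(round(combined, 1), round(separate, 1), round(break_even_stalls, 2))
```

Under this model the separated design stays ahead as long as the added latency costs fewer than about 0.4 wasted cycles per instruction; how often real code actually stalls is exactly the open question raised below.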

Even a simple bypass path adds to this delay. It means
that whatever the setup and delay times of this path,
it must be added to the basic machine cycle time, IF
that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
This is LESS of a penalty than adding a register access,
but still a penalty. So is it a win ?

To be honest, I don't know. Although I have read plenty of
research on BRANCH latency, I haven't seen much research on
how often ALU result latency would result in interlocks, or
even on how often LOAD latency would result in interlocks.
Perhaps John Mashey has. If so, I'd like to see the
references. Until then, I don't know what John means when he
says "any high-performance system" will "likely" have zero latency.
CRAYs don't. They're high performance. Aren't they ?

] -john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

Yes, I'm still smiling. Forgive my, uh, "SLOW" outburst : Sorry !


--
    Dennis O'Connor			      UUNET!steinmetz!sunset!oconnor
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/03/88)

An article by tim@amdcad.UUCP (Tim Olson) says:
] In article <445@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
] Interesting... this is the first RISC processor I have heard of that did
] not implement operand {forwarding/bypassing/other names?} around the
] ALU.  What prompted the elimination of this feature?  Do you have any
] statistics on how many additional nops/stalls are required?
] 
] Thanks for any info...
] 	-- Tim Olson

Well, CRAYs are KINDA "RISC"y, and they don't loop results directly back :-)

But seriously, this will be one of the things that the RPM40 team hope
to present in a paper submitted to ICCD, if it gets accepted. Until
then, our DARPA contract prohibits disclosure. It could be worse,
at least it's not ITAR-restricted. But believe me, it was NOT
a decision we made lightly. Or necessarily correctly :-)
--
    Dennis O'Connor			      oconnor%sungod@steinmetz.UUCP
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

bron@olympus.SGI.COM (Bron C. Nelson) (03/04/88)

In article <28200116@ccvaxa>, aglew@ccvaxa.UUCP writes:
> If your customers spend time doing multiplies or divides, then your RISC
> designer will put them in. Cray is the only "RISCy" machine that is widely
> known with multiply that springs to mind, though. Same for floating point.

FYI, The Cray XMP machines do NOT have hardware support for a general
integer (64bit) multiply.  They can do address length (24bit) integer
multiplies.  They have no hardware for integer divide (of any length).
If you need these operations, you have to convert to floating point.
------
Bron Nelson   bron@sgi.com
Don't blame my employers for my opinions.

bcase@Apple.COM (Brian Case) (03/04/88)

In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>"Popular RISCs" don't have any latency on
>ALU ops because they ARE ( No Dennis don't say it, no, no ... )
>SLOW SLOW SLOW ! (ARRGGHH he said it ! BAD DENNIS, BAD <whack>)

Boy, I must say I don't know what you are thinking.  Do you mean they are
slow because they don't have 40 MHz versions?  Or do you mean that they
are slow in terms of VAX-equivalent MIPS?  If the former, then just wait
a little while.  There are probably more 40 MHz RISC machines in most
other companies' labs than there are in yours (I strongly suspect the MIPS
guys have them, for example), but they won't let them out because of
characterization and specification limitations (that is, they may only
be 40 MHz (or even more) at room temperature).  If the latter, I think
you are wrong.  To be less opaque, I think that the RPM40 VAX-equivalent
MIPS is no better than, say, a 25 MHz Am29000 or a 16 MHz MIPS (both
with caches, you understand; and I am not saying that the 25 MHz 29000 is
the same as a 16 MHz MIPS).  We're talking integer here.

>IMHO, a pipelined processor should run as fast as its ALU
>lets it. Some RISC processors DO NOT do this. Instead, they
>perform either the operand-read or the result-write for an
>instruction in the same pipestage as the ALU op.

Er, which ones do this?  I don't know of any among MIPS, SPARC, Am29000,
ARM (but it does have a shifter in there, which could be bad), even
CLIPPER.  In fact, I do know of one, but no one else out there probably
does (it's still vaporware).

>Even a simple bypass path adds to this delay. It means
>that whatever the setup and delay times of this path,
>it must be added to the basic machine cycle time, IF
>that cycle time is determined by the ALU, as it SHOULD BE (IMHO).
>This is LESS of a penalty than adding a register access,
>but still a penalty. So is it a win ?

I still agree that the ALU should govern cycle time (but I would always
include bypassing; in my experience, there just isn't enough stuff to move
around to separate the computations from the uses with useful work a
significant fraction of the time), but I now know that a much more
probable cycle time determiner is cache cycle time.  This can be either
the instruction cache, or the TLB, or whatever.  I suspect that omitting
bypassing is a bad choice, but like you say, there isn't much "proof."

>To be honest, I don't know. Although I have read plenty of
>research on BRANCH latency, I haven't seen much research on
>how often ALU result latency would result in interlocks, or
>even on how often LOAD latency would result in interlocks.
>Perhaps John Mashey has. If so, I'd like to see the

The folklore to which I have been exposed goes like this:  First load
delay slot probability of being filled:  0.7; second load delay slot: 0.3;
third delay slot:  0.1; thereafter, not significant.
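
Taking those folklore numbers at face value, the expected number of wasted slots per load can be estimated. This is only a sketch: it assumes each figure is the independent probability that the given slot gets filled with useful work, and it treats any slot past the third as never filled.

```python
FILL_PROBABILITY = [0.7, 0.3, 0.1]   # first, second, third load delay slot


def expected_unfilled_slots(delay_slots):
    """Expected NOPs (or stall cycles) per load for a given load delay."""
    known = FILL_PROBABILITY[:delay_slots]
    unfilled = sum(1.0 - p for p in known)
    # Slots beyond the third: "not significant", modelled here as always empty.
    unfilled += max(0, delay_slots - len(FILL_PROBABILITY))
    return unfilled


for slots in (1, 2, 3):
    print(slots, "delay slot(s):", round(expected_unfilled_slots(slots), 2), "wasted")
```

On these assumptions a one-slot load delay wastes about 0.3 cycles per load, while a two-slot delay already wastes a full cycle on average.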

>references. Until then, I don't know what John means when he
>says "any high-performance system" will "likely" have zero latency.
>CRAYs don't. They're high performance. Aren't they ?

For single-thread, integer computations, they're not "high performance"
(or at least not "highest performance") by state-of-the-art RISC
standards (at least our CRAY XMP isn't).  Perhaps the CRAY 3 will be
quite a bit ahead when it comes out, I dunno.

wcs@ho95e.ATT.COM (Bill.Stewart.<ho95c>) (03/04/88)

In article <28200116@ccvaxa> aglew@ccvaxa.UUCP writes:
:
:>	Ever seen a multiply or divide as 1 instruction in a RISC?  No, of
:>course they are not there.  No direct support on CPU for them either.  I will
:>say more on this issue when the FPU is formally announced.  You can do them
:>in the CPU in software if you want, takes a few cycles though.
:
:If your customers spend time doing multiplies or divides, then your RISC
:designer will put them in. Cray is the only "RISCy" machine that is widely
:known with multiply that springs to mind, though. Same for floating point.

The AT&T Digital Signal Processor chips are RISCy, and do single-instruction
multiplies, because that's what the chips' customers do.  The DSP-32 does
32-bit floating point - each cycle does an add and a multiply if you want them,
and/or 16-bit integer ops; I think the pipeline is 4 deep for multiplies.
The original chip did 4 Million cycles/sec (16MHz clock?); the current version
does 6 Million.  The next generation will be faster.  The current chip also
includes serial and parallel I/O hardware, but only 64K address space;
the next will be more general.

The DSP-16 does 16-bit integers (multiplies into 36 bits); it's got very
limited memory (1-4K on chip), and has a more limited instruction set, but the
16 - 19 million cycles/sec do a multiply and/or add as well as separate integer
ops for address calculation.
-- 
#				Thanks;
# Bill Stewart, AT&T Bell Labs 2G218, Holmdel NJ 1-201-949-0705 ihnp4!ho95c!wcs

oconnor@sunset.steinmetz (Dennis M. O'Connor) (03/05/88)

An article by bcase@apple.UUCP (Brian Case) says:
] In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
] >"Popular RISCs" don't have any latency on
] >ALU ops because they ARE ( No Dennis don't say it, no, no ... )
] >SLOW SLOW SLOW ! (ARRGGHH he said it ! BAD DENNIS, BAD <whack>)
] 
] Boy, I must say I don't know what you are thinking.  Do you mean they are
] slow because they don't have 40 MHz versions?  Or do you mean that they
] are slow in terms of VAX-equivalent MIPS?

At least the former, and perhaps the latter, but I obviously mainly
MEANT it HUMOROUSLY. Couldn't you tell?

] If the former, then just wait a little while.
]  There are probably more 40 MHz RISC machines in most
] other companies' labs than there are in yours (I strongly suspect the MIPS
] guys have them, for example), but they won't let them out because of
] characterization and specification limitations (that is, they may only
] be 40 MHz (or even more) at room temperature).

The RPM40 runs 40MIPS, all the time, all instructions (even NOPS :-),
at up to 85C and down to 4.5V. It's currently running 40MIPS on
a wire-wrap board. We haven't said it won't or doesn't run
faster, 40MIPS is what it was designed to do, using conservative
design rules. What, do you think we designed it willy-nilly and
then cranked up the clock till it stopped working ? Sorry,
that trick doesn't cut it at GE.

Speculating about how fast other people are running in the labs is
hogwash : until I see it in ISSCC or another credible forum, I don't care.
So, name another 32-bit CMOS micro from ISSCC. Or anywhere.
Besides, how do you know WE're not running something faster in our lab ?
But we don't compare an existing, published device with vaporware.

]  If the latter, I think
] you are wrong.  To be less opaque, I think that the RPM40 VAX-equivalent
] MIPS is no better than, say, a 25 MHz Am29000 or a 16 MHz MIPS (both
] with caches, you understand; and I am not saying that the 25 MHz 29000 is
] the same as a 16 MHz MIPS).  We're talking integer here.

I think YOU'RE wrong, and I am in a better position to say so.
Don't you think you're being a little quick with your evaluation ?
How much do you know about RPM40 ?

Look, even I don't know how much real "VAX-equivalent" performance
this chip has, except that it has LOTS, so how can anyone else ?
If anyone knew, I'd know, either because I determined it or
because someone would tell me (hi Pete, hi Phil, if you're there :-)

I hope to have dhrystone numbers soon. Not the greatest benchmark,
but I'm doing it on my own time, so there's a limit on what I can
do. But at least we'll have something, so we'll see. Small benchmarks,
like acker, show performance as just over 4 times a Sun3/260, and
over 25 times the performance I got on our VAX11/785. I'M NOT IMPLYING THAT
ACKER IS A VALID MEASURE OF ANYTHING. Just the facts here, ma'am,
no conclusions. ACKER is probably NOT a fair evaluation of a VAX.

But please : I know "MIPS" is in disrepute as a measure of
performance, but please DON'T use MHz as a measure of instruction
execution rate. RPM40 @ 40MHz = 40MIPS. MC68020 @ 16MHz < 4MIPS.
MIPS R2000 @ 16MHz = 8MIPS. AMD29000 @ 25MHz = ....
Get the point ? MHz is a frequency, not an execution rate.
Why don't we all use MIPS ("average" or "peak" qualified as needed)
for instruction execution rate, MHz for clock speed, and
VIPS ( VAX-11/780-Indexed Performance Standard :-) or whatever
for some measure of general-purpose computing power ?
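
The point that MHz is not an execution rate can be put as instructions per clock. A small sketch using only the figures quoted above; the MC68020 entry is an upper bound because the post says "< 4MIPS", and the AMD29000 is left out because no figure was given for it.

```python
# (clock in MHz, native MIPS) as quoted in the post.
QUOTED = {
    "RPM40": (40.0, 40.0),
    "MC68020": (16.0, 4.0),      # upper bound: the post says "< 4MIPS"
    "MIPS R2000": (16.0, 8.0),
}


def instructions_per_clock(mhz, mips):
    """Native instructions completed per clock cycle."""
    return mips / mhz


for name, (mhz, native_mips) in QUOTED.items():
    print(name, instructions_per_clock(mhz, native_mips))
```

Even this ratio counts each machine's own instructions, so it says nothing yet about how much work one instruction does; that is what a VAX-relative measure is for.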
--
    Dennis O'Connor			      oconnor%sungod@steinmetz.UUCP
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

davidsen@steinmetz.steinmetz.UUCP (William E. Davidsen Jr) (03/05/88)

In article <9792@steinmetz.steinmetz.UUCP> oconnor%sungod@steinmetz.UUCP writes:
>Why don't we all use MIPS ("average" or "peak" qualified as needed)
>for instruction execution rate, MHz for clock speed, and
>VIPS ( VAX-11/780-Indexed Performance Standard :-) or whatever
>for some measure of general-purpose computing power ?

  I like it!! Now we have VIPS, which will make it a lot easier to tell
what is meant by the numbers. Way to go! Put that idea on your monthly
report ;^>
-- 
	bill davidsen		(wedu@ge-crd.arpa)
  {uunet | philabs | seismo}!steinmetz!crdos1!davidsen
"Stupidity, like virtue, is its own reward" -me

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/05/88)

An article by bcase@apple.UUCP (Brian Case) says:
] In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
] >IMHO, a pipelined processor should run as fast as its ALU
] >lets it. Some RISC processors DO NOT do this. Instead, they
] >perform either the operand-read or the result-write for an
] >instruction in the same pipestage as the ALU op.
] 
] Er, which ones do this?  I don't know of any among MIPS, SPARC, Am29000,
] ARM (but it does have a shifter in there, which could be bad), even
] CLIPPER.  In fact, I do know of one, but no one else out there probably
] does (it's still vaporware).

You may be right, I could be wrong. In fact, I think I was. Sorry..

] I still agree that the ALU should govern cycle time (but I would always
] include bypassing; in my experience, there just isn't enough stuff to move
] around to separate the computations from the uses with useful work a
] significant fraction of the time), but I now know that a much more
] probable cycle time determiner is cache cycle time.  This can be either
] the instruction cache, or the TLB, or whatever.  I suspect that omitting
] bypassing is a bad choice, but like you say, there isn't much "proof."

Smaller caches, like the RPM40 TIB cache, are faster. And can
be just as effective. But I'm not sure the details of the TIB
have been released. I'll expand on it if it has been. What happened
with RPM40 was : whenever anything was too slow to make 40MHz,
additional design effort was thrown at it until it was fast enough.
This means we have several near-critical paths. This would have been
impossible without a good process model and simulation tools, of course.
Did I ever mention that the ALU was laid out about a half-dozen times
so it would make speed ? 
--
    Dennis O'Connor			      oconnor%sungod@steinmetz.UUCP
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) (03/05/88)

In article <1729@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>I.e., any high-performance system is likely to make use of
>register-bypassing anyway, so that:
>	r = r op literal
>	r = r op r
>has zero intervening latency (the performance penalty of a
>cycle's latency for such things is large).

>-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>

	Two reasons why one might not have register bypassing:

1)  Slows down critical path.  Any finely tuned RISC CPU will most probably
have its cycle time determined by the latency through the ALU.  Using a
loopback of ALU results might result (depending on layout, tech, etc) in up
to a 20% slowdown in the ALU, plus increase the chip area and layout
problems.  This doesn't mean a loopback is a loss necessarily, but that it
does have a measurable cost which must be weighed against the benefits.

2)  In combination with (1) above, I'm not sure that having a one-cycle delay
in ALU results causes any large loss.  A good reorganizer can fill those
latencies, or move the ALU op forward into, for example, a load delay.  In
high-speed (> 15 MHz) RISCs (and maybe slower ones as well), load delays
are usually the determining factor, or a large part of it.  What studies do you
have that compare RISCs with 1-cycle ALU delays and 0-cycle?  I'd like to
see anything you can drag up.

3)  If one is doing much FP, the CPU is usually waiting on results from the
FPU anyway, so you may not lose anything.  (I know I said 2, but....)
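
Points (1) and (2) together amount to a break-even calculation, sketched below. The 20% figure is the "up to" slowdown from point (1); the model, in which the one-cycle ALU delay costs some average number of unfilled slots per instruction, is an assumption, and the function names are made up.

```python
def relative_time(cycle_stretch, stalls_per_instruction):
    """Time per instruction relative to a bypass-free, stall-free baseline."""
    return (1.0 + cycle_stretch) * (1.0 + stalls_per_instruction)


# Design A: loopback of ALU results, cycle stretched by up to 20%, no ALU stalls.
with_bypass = relative_time(0.20, 0.0)


def without_bypass(unfilled_stalls):
    """Design B: no loopback; pay for slots the reorganizer could not fill."""
    return relative_time(0.0, unfilled_stalls)


for unfilled in (0.1, 0.2, 0.3):
    winner = "no bypass" if without_bypass(unfilled) < with_bypass else "bypass (or tie)"
    print(unfilled, "unfilled slots/instruction ->", winner)
```

With the worst-case 20% stretch, leaving out the loopback wins only if the reorganizer keeps unfilled slots below about 0.2 per instruction; a smaller stretch moves the break-even point down accordingly.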

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

jesup@pawl23.pawl.rpi.edu (Randell E. Jesup) (03/05/88)

In article <9800@steinmetz.steinmetz.UUCP> oconnor%sungod@steinmetz.UUCP writes:
>An article by bcase@apple.UUCP (Brian Case) says:
>] I still agree that the ALU should govern cycle time (but I would always
>] include bypassing; in my experience, there just isn't enough stuff to move
>] around to separate the computations from the uses with useful work a
>] significant fraction of the time), but I now know that a much more
>] probable cycle time determiner is cache cycle time.  This can be either
>] the instruction cache, or the TLB, or whatever.  I suspect that omitting
>] bypassing is a bad choice, but like you say, there isn't much "proof."

	Another point: even without bypassing, if you're using the ALU for
address computation you can store the results of an ALU op in the next
instruction.  This removes a lot of whatever loss you have in not having
bypassing, since <modify;store> and <load;modify;store> are fairly
frequent operations.  Once again, you must look at your assumptions with
skepticism in RISC design: calculate what it will cost you to implement
a feature, then how much you gain.  Also, remember that other people's
figures/assumptions may not match yours, especially if they are focusing
on a specific part of performance (like integer-only, or FP-only, etc).

>Smaller caches, like the RPM40 TIB cache, are faster. And can
>be just as effective. But I'm not sure the details of the TIB
>have been released. I'll expand on it if it has been. What happened
>with RPM40 was : whenever anything was too slow to make 40MHz,
>additional design effort was thrown at it until it was fast enough.
>This means we have several near-critical paths. This would have been
>impossible without a good process model and simulation tools, of course.
>Did I ever mention that the ALU was laid out about a half-dozen times
>so it would make speed ? 

	Dennis, I think the title of the ISSCC talk was "40 MHz CMOS CPU with
instruction cache", so I think it's ok.  Not like it's a patentable idea,
anyway. :-)

	There were quite a few go-rounds on some of the paths, if I remember.
But running at 40 MHz in wirewrap shows we designed plenty conservatively.
I wonder how fast we could crank it up in the high speed board at room temp?

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

mash@mips.COM (John Mashey) (03/06/88)

In article <9794@steinmetz.steinmetz.UUCP> davidsen@kbsvax.steinmetz.UUCP (William E. Davidsen Jr) writes:
>In article <9792@steinmetz.steinmetz.UUCP> oconnor%sungod@steinmetz.UUCP writes:
>>Why don't we all use MIPS ("average" or "peak" qualified as needed)
>>for instruction execution rate, MHz for clock speed, and
>>VIPS ( VAX-11/780-Indexed Performance Standard :-) or whatever
>>for some measure of general-purpose computing power ?

>  I like it!! Now we have VIPS, which will make it a lot easier to tell
>what is meant by the numbers. Way to go! Put that idea on your monthly
>report ;^>

Some of us have been consistently careful to always say that we
mean vax-equivalent-mips when we use mips.  As I understand it, the
truly correct term (from DEC itself) is VUPS.  As far as I can tell,
that's normalized to
	VAX 11/780 = 1
	VAX/VMS compilers (i.e., good optimizers), on VMS or Ultrix
	includes mix of integer and floating-point
Note: that is NOT 4.3BSD compilers (C: VMS is better by 1.1-1.5,
				    FORTRAN: VMS often 2X better)
	and it is NOT MicroVAX II's (which are more like .8-.9 of the 780)

Perhaps someone from DEC can correct me if I have the wrong definition of VUPS.

I agree with Bill 100%: MIPS-ratings by themselves are vacuous, and the
only thing that really helps people understand relative performance is
relative performance numbers.  The only pain of it is that VUPS can well
be a moving target (it's like having the meter-bar grow on you).
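
A minimal sketch of the normalization described above, assuming a rating is
simply the ratio of elapsed times for the same program on the reference
machine (VAX 11/780 = 1) and on the machine under test.  The function name
and all timings are hypothetical; this is not DEC's definition and none of
the numbers are measurements.  The second call shows why the reference is a
moving target: if better compilers speed up the reference machine by 1.5X,
the same machine under test gets a lower rating.

```python
def vax_relative(t_reference_s: float, t_machine_s: float) -> float:
    """Speed of a machine relative to the reference (reference = 1.0)."""
    if t_reference_s <= 0 or t_machine_s <= 0:
        raise ValueError("elapsed times must be positive")
    return t_reference_s / t_machine_s


# Hypothetical timings: 120 s on the reference, 12 s on the machine under test.
rating = vax_relative(120.0, 12.0)

# Same machine under test, but the reference now runs 1.5X faster because
# it was compiled with a better optimizer (hypothetical factor).
rating_vs_better_compiler = vax_relative(120.0 / 1.5, 12.0)

print(rating, rating_vs_better_compiler)
```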
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

bcase@Apple.COM (Brian Case) (03/07/88)

In article <476@imagine.PAWL.RPI.EDU> beowulf!lunge!jesup@steinmetz.UUCP writes:
>In article <9800@steinmetz.steinmetz.UUCP> oconnor%sungod@steinmetz.UUCP writes:
>	Another point: even without bypassing, if you're using the ALU for
>address computation you can store the results of a ALU op in the next
>instruction.  This removes a lot of whatever loss you have in not having
>bypassing, since <modify;store> and <load;modify;store> are fairly
>frequent operations.

Yes, this works if the TLB or address bus is in the pipe stage following
the ALU.  In the Am29000, this is not the case:  the TLB is in the same
stage as the ALU.  Thus, without bypassing, things would be more difficult.
(The TLB is alongside the ALU to make simple pointer dereferences go fast.)
Also, it may not be the case that <modify;store> and <load;modify;store>
are as frequent as you might like (with fewer registers, they are more
frequent).

>Once again, you must look at your assumptions with
>skepticism in RISC design: calculate what it will cost you to implement
>a feature, then how much you gain.  Also, remember that other peoples 
>figures/assumptions may not match yours, especially if they are focusing
>on a specific part of performance (like integer-only, or FP-only, etc).

A very good point:  features/organizations are usually very interdependent
so that changing one thing can have significant effects on others.  Trivial
example:  change the instruction size on the Am29000 to 16 bits.

Re:  no bypassing.  Probably the most important thing is to get your compiler
to produce great code for inner loops.  If the lack of bypassing adds
a cycle to a 10 cycle loop, then you are hurt unless you have a 10% faster
cycle time because of no bypassing.  I looked at one inner loop (in sieve,
so this is probably not representative of everything else :-) and it
seemed that omitting bypassing was OK, i.e. it didn't force no-ops to be
added.  Gosh, there really ought to be some data somewhere on this....
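
The break-even arithmetic in the paragraph above can be sketched as follows.
This is only an illustration: the function names and the cycle times are
hypothetical, and the model assumes the loop's run time is just cycles times
cycle time.

```python
def loop_time_ns(cycles: int, cycle_ns: float) -> float:
    """Time for one loop iteration, in nanoseconds."""
    return cycles * cycle_ns


def breakeven_clock_speedup(loop_cycles: int, extra_cycles: int) -> float:
    """Clock-rate ratio the no-bypass design needs just to tie,
    if dropping bypassing adds `extra_cycles` to the loop."""
    return (loop_cycles + extra_cycles) / loop_cycles


# A 10-cycle loop that grows to 11 cycles without bypassing.
speedup_needed = breakeven_clock_speedup(10, 1)

# Hypothetical cycle times chosen so the two designs tie exactly:
with_bypass = loop_time_ns(10, 27.5)     # slower clock, no extra cycle
without_bypass = loop_time_ns(11, 25.0)  # 10% faster clock, one extra cycle

print(speedup_needed, with_bypass, without_bypass)
```

Any clock-rate gain below that ratio leaves the no-bypass design slower on
this loop; any gain above it comes out ahead.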

>>But I'm not sure the details of the TIB
>>have been released. I'll expand on it if it has been.
>
>	Dennis, I think the title of the ISSCC talk was "40 Mhz CMOS CPU with
>instruction cache", so I think it's ok.  Not like it's a patentable idea,
>anyway. :-)

(SIGH.  Yet another example of my foot in my mouth.  A patent can still be
issued for the implementation, I think.  And I didn't mention the patent
application to be antagonistic in any way; I was just trying to point out
that there were earlier incarnations.)

bcase@Apple.COM (Brian Case) (03/08/88)

In article <9792@steinmetz.steinmetz.UUCP> oconnor%sungod@steinmetz.UUCP writes:
>An article by bcase@apple.UUCP (Brian Case) says:
>] In article <9758@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>] >"Popular RISCs" don't have any latency on
>] >ALU ops because they ARE ( No Dennis don't say it, no, no ... )
>] >SLOW SLOW SLOW ! (ARRGGHH he said it ! BAD DENNIS, BAD <whack>)
>] 
>] Boy, I must say I don't know what you are thinking.  Do you mean they are
>] slow because they don't have 40 MHz versions?  Or do you mean that they
>] are slow in terms of VAX-equivalent MIPS?
>
>At least the former, and perhaps the latter, but I obviosly mainly
>MEANT it HUMOUROUSLY. Couldn't you tell.

No, really, I have to say I couldn't tell.  You have to remember that some
of us are judging your statements now by the standard that was set a while
ago; I didn't save a copy of those early postings, but I do remember you
(or someone there) saying things like:  "Just wait until ISSCC, then you'll
see how to do it right."  I can say that I am not the only one who took
offense at this and other remarks of yours (or someone there).  Those
postings don't make you a bad person and they don't make the RPM40 a bad
chip, BUT, they did set a rather bad tone and put the burden of proof on you.

As for my comments about who has 40 MHz in what quantity in what temperature
range, you are right:  there was no verifiable substance in my remarks
(you, justifiably, used the word "hogwash").  I certainly didn't mean to
imply any "willy-nilliness" of design on the part of GE.  I shouldn't have
said what I said.  I probably shouldn't have said anything here.  The RPM40
is a great achievement; no one is saying otherwise.

My complaints about the RPM40 are architectural:  having 16-bit
instructions may be a slight advantage now, but I predict it will come
back to haunt.  In my opinion, and according to the information I have
gotten from postings here and a friend who attended the ISSCC, I have
seen little to convince me that the RPM40 is showing us how to do it
right, in the architectural sense.  That, in a nutshell, is my beef.
Running UNIX or dhrystone quickly is not the main issue; this is a 
forum concerned with architectural issues.  Of course, architecture can
influence how fast dhrystone is run, and implementation can often mean
more than anything else.

I honestly think the absolute best thing you could do right now is to
post a bullet-list of "features" of your machine.  This will put an end
to questions like "well, how much do you really know about the RPM40?"

>The RPM40 runs 40MIPS, all the time, all instructions (even NOPS :-),

With the memory system you assume, the Am29000 and I guess the R2000 would
run MIPS at their clock rates as well.  The question is how long it takes
to get from start of program to finish of program.  If the RPM40 is
executing more loads and stores and more register to register moves
to make up for the relatively small number of registers and lack of
three-address instructions, etc., then you aren't getting all the bang
out of your 40 MHz.  On the other hand, if it *is perfect for your
application* then great.
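
To make the start-to-finish point concrete, here is a rough sketch assuming
both machines sustain their peak native issue rate.  The 40 and 25 MIPS
rates come from this thread; the instruction counts (and the 30% expansion)
are invented purely for illustration, since no instruction-count data for
either chip has been posted here.  The function names are hypothetical.

```python
def run_time_us(instructions: int, native_mips: float) -> float:
    """Elapsed time in microseconds, assuming the peak rate is sustained."""
    return instructions / native_mips


def breakeven_instruction_ratio(fast_mips: float, slow_mips: float) -> float:
    """How many times more instructions the higher-rate machine can
    execute before it merely ties the lower-rate one."""
    return fast_mips / slow_mips


# Hypothetical program: 1,000,000 instructions on the 25 MIPS machine,
# 1,300,000 on the 40 MIPS machine (the 30% expansion is invented).
t_25 = run_time_us(1_000_000, 25.0)
t_40 = run_time_us(1_300_000, 40.0)

print(t_25, t_40)
print(breakeven_instruction_ratio(40.0, 25.0))
```

Under these assumptions the higher clock rate still wins until the
instruction count grows by the break-even ratio; cache and memory stalls,
which this sketch ignores, would move that point.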

>]  If the latter, I think
>] you are wrong.  To be less opaque, I think that the RPM40 VAX-equivalent
>] MIPS is no better than, say, a 25 MHz Am29000 or a 16 MHz MIPS (both
>] with caches, you understand; and I am not saying that the 25 MHz 29000 is
>] the same as a 16 MHz MIPS).  We're talking integer here.
>
>I think YOUR wrong, and I am in a better position to say so.
>Don't you think you're being a little quick with your evaluation ?
>How much do you know about RPM40 ?

This is where I don't apologize for saying something.  I'll confine the
following discussion to the Am29000 since I know it well:  With similar
very fast memories (as the RPM40 assumes), the Am29000 will have the
advantage in fewer loads/stores, faster procedure calls, and fewer
instructions executed (three-address instructions and lots of registers).
The RPM40 has the advantage in clock rate.  Who wins? I don't know, but
I doubt the difference is tremendous, especially if the RPM40 is required
to have a TLB as the Am29000 does.

>Look, even I don't know how much real "VAX-equivalent" performance
>this chip has, except that it has LOTS, so how can anyone else ?
>If anyone knew, I'd know, either because I determined it or
>because someone would tell me (hi Pete, hi Phil, if you're there :-)

It isn't always necessary to know absolute numbers.  But you are right,
I certainly don't know the VAX-equivalent performance.  Architectural
issues aside, the RPM40 must be evaluated with a TLB in order to be
compared to most other chips.

Incidentally, I think MIPS would rather have the R2000 known as a 10 MIPS
machine at 16 MHz (not the 8 MIPS you quoted).  The Am29000 is designed to
work with more than one memory configuration, so its Vax MIPS at 25 MHz is
not a single number.

In your response to my response, you go on to say that we should not judge
performance by either peak native instructions per second or MHz.  I don't
know anyone here who would disagree with you (except marketing people:
what else can they say?).  In my claim above, I adhered to just that
philosophy.  This also is what most manufacturers of concern to us here
strive for (esp. MIPS Co.).

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/09/88)

An article by bcase@apple.UUCP (Brian Case) says:
] In article <...> oconnor%sungod@steinmetz.UUCP writes:
] >An article by bcase@apple.UUCP (Brian Case) says:
] My complaints about the RPM40 are architectural:  having 16-bit
] instructions may be a slight advantage now, but I predict it will come
] back to haunt.  In my opinion, and according to the information I have
] gotten from postings here and a friend who attended the ISSCC, I have
] seen little to convince me that the RPM40 is showing us how to do it
] right, in the architectural sense.  That, in a nutshell, is my beef.


] Running UNIX or dhrystone quickly is not the main issue; this is a 
] forum concerned with architectural issues.  Of course, architecture can
] influence how fast dhrystone is run, and implementation can often mean
] more than anything else.

Architecture limits implementation. Therefore, an architecture
that's relatively fast for 100-transistor-per-chip technology will
probably not be relatively fast for VLSI. But we all know this.

] I honestly think the absolute best thing you could do right now is to
] post a bullet-list of "features" of your machine.  This will put an end
] to questions like "well, how much do you really know about the RPM40?"

********** In My "Humble" Opinion *********************************
Things done right on RPM40, though not necessarily for the first time :

	Harvard architecture, but with a shared address bus
        that DOES NOT need to place more than one address
        on the bus per cycle.

	Sending only branch target addresses off chip, instead of
	sending EVERY instruction address off chip.

	Using a pipelined look-ahead to provide the instruction stream.

	Using the cache only to fill in for the look-ahead system's
	latency on branches.

	Forwarding coprocessor instructions from the CPU I-cache.

	Prefix instructions.

	COND (also called "SKIP") instructions.

	Pipelined Operand Memory system

	Fast interrupt handling

	Using only a two-phase clock.

	Using a shorter pipeline for non-load instructions.

        21 g.p. registers, up to 15 of which can be used as
        base registers at any one time.

	No time-division multiplexing of pins during a clock cycle.

] >The RPM40 runs 40MIPS, all the time, all instructions (even NOPS :-),
] 
] With the memory system you assume, the Am29000 and I guess the R2000 would
] run MIPS at their clock rates as well.

Well, you are incorrect. The MIPS chip, correct me if I am wrong,
needs a four-phase 32-MHz clock to execute 16MIPS (native,peak).
The Am29000, I believe, uses 25ns RAM just to make 25MHz,
I don't know how many phases, and therefore I believe 25MIPS.

Putting 25ns RAM on an R2000, it would still only execute at 16MIPS.
The processor is not fast enough to take advantage of it. The
Am29000 needs 25ns RAM just to run at 25MIPS. 

] The question is how long it takes to get from start of program to
] finish of program.  If the RPM40 is exeucting more loads and stores
] and more register to register moves to make up for the relatively
] small number of registers and lack of three-address instructions,
] etc., then you aren't getting all the bang out of your 40 MHz.  On the
] other hand, if it *is perfect for your application* then great.

"Small number of registers"?? 21 G.P. registers is small ? Says who ?
Talk to compiler writers : they tell us that 16 is just fine.

Or maybe your thinking of the Berkelly(sp?)-style register window concept ?
The R2000 doesn't have that. I think maybe the Am29000 does ??

] This is where I don't appologize for saying something.  I'll confine the
] following discussion to the Am29000 since I know it well:  With similar
] very fast memories (as the RPM40 assumes), the Am29000 will have the
] advantage in fewer loads/stores, faster procedure calls, and fewer
] instructions executed (three-address instructions and lots of registers).
] The RPM40 has the advantage in clock rate.  Who wins? I don't know, but
] I doubt the difference is tremendous, especially if the RPM40 is required
] to have a TLB as the Am29000 does.

Well, beyond arguing that a TLB may not slow it down, which contract
prevents me from discussing, I'll say this : applications that
don't need a TLB shouldn't pay for a TLB. 

] ... the RPM40 must be evaluated with a TLB in order to be
] compared to most other chips.

Like the MC680[012]0 family ??  1750A processors ?? AN/YUK-14's ??
None of these have TLBs.

] Incidentally, I think MIPS would rather have the R2000 known as a 10 MIPS
] machine at 16 MHz (not the 8 MIPS you quoted).

Actually, I think MIPS Inc. claims a 10 Vax-MIPS rating for
their 16-native-peak-MIPS processor, that uses a 32MHz clock. Which
places addresses on the address bus once every 30ns. THAT's why
"MHz" is TOTALLY inappropriate, WORSE than native-peak MIPS, even.
An RPM40 at 32MHz would also place addresses on the address bus once
every 30ns, but would execute 32-native-peak-MIPS.

What's the smallest signal interval on a 25MHz Am29000 ? In the RPM40,
NO signal ever assumes more than one valid state during a cycle.
This is not true of the R2000. Is it true of Am29000 ? 

] In your reponse to my response, you go on to say that we should not judge
] performance by either peak native instructions per second or MHz.  I don't
] know anyone here who would dissagree with you (except marketing people:
] what else can they say?).  In my claim above, I adhered to just that
] philosophy.  This also is what most manufacturers of concern to us here
] strive for (esp. MIPS Co.).

All three need to be paid attention to. They make big differences.
For instance, native-MIPS-per-MHz can range from 5 or less
in a CISC machine, to about 1 for a RISC, to 65K or more for
a big parallel machine. And there's only so fast any particular
technology will let you run the clock, so it DOES matter.



--
    Dennis O'Connor			      oconnor%sungod@steinmetz.UUCP
		   ARPA: OCONNORDM@ge-crd.arpa
   (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

bcase@Apple.COM (Brian Case) (03/10/88)

In article <9852@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>An article by bcase@apple.UUCP (Brian Case) says:
>********** In My "Humble" Opinion *********************************
>Things done right on RPM40, tho not neccesarily for the first time :

Thanks for the list.  I won't point out the startling similarities between
the RPM40 and the Am29000; most people will know I think.

>] >The RPM40 runs 40MIPS, all the time, all instructions (even NOPS :-),
>] 
>] With the memory system you assume, the Am29000 and I guess the R2000 would
>] run MIPS at their clock rates as well.
>
>Well, you are incorrect. The MIPS chip, correct me if I am wrong,
>needs a four-phase 32-MHz clock to execute 16MIPS (native,peak).
>The Am29000, I beleive, uses 25ns RAM just to make 25MHz,
>I don't know how many phases, and therefor I believe 25MIPS.
>
>Putting 25ns RAM on an R2000, it would still only execute at 16MIPS.
>The processor is not fast enough to take advantage of it. The
>Am29000 needs 25ns RAM just to run at 25MIPS. 

Using a four phase clock has nothing to do with my point.  The R2000 can
issue instructions continuously at a 16 MHz rate given the memory system
you assume (when I said clock rate, I didn't mean raw clock rate but
internal instruction issue rate; sorry for the confusion).  The Am29000
has a single-phase 25 MHz clock input (or 30 MHz if you buy that version).

You believe incorrectly.  The Am29000 can execute 25 native MIPS with
video DRAMs; 25 ns SRAMs everywhere would let it execute 25 MIPS all the
time regardless of other factors, but VDRAMs with proper scheduling of
loads and stores and sufficient reuse of jump targets will permit peak
performance (real programs don't run at peak but it's acceptable for some
people given the cost savings since the performance is still good).

>] The question is how long it takes to get from start of program to
>] finish of program.  If the RPM40 is exeucting more loads and stores
>] and more register to register moves to make up for the relatively
>] small number of registers and lack of three-address instructions,
>] etc., then you aren't getting all the bang out of your 40 MHz.  On the
>] other hand, if it *is perfect for your application* then great.
>
>"Small number of registers"?? 21 G.P. registers is small ? Says who ?
>Talk to compiler writers : they tell us that 16 is just fine.

Well, I am a compiler writer too.  I say 16 (or 21) is too few.  This
argument doesn't prove anything.  There is plenty of research (and even
a significant amount of practice; e.g. the MetaWare compiler for the
Am29000 does some pretty neat things!) describing how to use lots of
registers (see David Wall's (of DECWRL) research into register
allocation at link time, various stack cache implementations, papers
on procedure integration, interprocedural register allocation, etc. etc.).

>Or maybe your thinking of the Berkelly(sp?)-style register window concept ?
>The R2000 doesn't have that. I think maybe the Am29000 does ??

It's Berkeley (and "you're" not "your" but I misspell things too).

Yes, the Am29000 has a more general register window implementation, but,
as pointed out above, that is not the only way to quite profitably use
lots of registers.

>] [Me arguing that the RPM40 will lose some performance due to some
>] architectural things and that the lack of a TLB makes comparisons
>] slightly unfair.]
>WEll, beyond arguing that a TLB may not slow it down, which contract
>prevents me from discussing, I'll say this : applications that
>don't need a TLB shouldn't pay for a TLB. 

I fully agree.  However, you shouldn't then turn around and say that the
RPM 40 will make a fine UNIX box until you can prove that a TLB will not
cause performance loss.  Look, if I can't claim that your 40 MHz in the lab
is not special because I can't disclose what I know, then you can't sit
there and claim that you know something but can't disclose it.  Saying
that "contract prevents me" is not substantiation for your claim.  Contract
prevents me from saying what I know about other people's 40 MHz chips,
so what?

>] ... the RPM40 must be evaluated with a TLB in order to be
>] compared to most other chips.
>
>Like the MC680[012]0 family ??  1750A processors ?? AN/YUK-14's ??
>None of these have TLBs.

No, I meant the Am29000 and the R2000, but let's not forget the SPARC
(as in SUN 4s).  I really believe that the RPM40 is top dog in its
world (MC680[012] family, 1750A processors, AN/YUK-14s).  Maybe the
R2000 and the Am29000 wouldn't make it there, or maybe they would.  But
don't say the RPM 40 doesn't need a TLB because its world is 1750As and
AN/YUK-14s and then complain when John Mashey (for example) says that
it won't make the best UNIX box.

>] Incidentally, I think MIPS would rather have the R2000 known as a 10 MIPS
>] machine at 16 MHz (not the 8 MIPS you quoted).
>
>Actually, I think MIPS Inc. actually claims a 10 Vax-MIPS rating for
>their 16-native-peak-MIPS processor, that uses a 32MHz clock. Which

Right, that's the R2000 in the fastest version currently available.

>places addresses on the address bus once every 30ns. THAT's why
>"MHz" is TOTALLY inappropriate, WORSE than native-peak MIPS, even.
>An RPM40 at 32MHz would also place addresses on the address bus once
>every 30ns, but would execute 32-native-peak-MIPS.

Again, I always assume MHz to be the peak instruction issue rate.  I
think most people do too, but my assumption has caused confusion once
again.  Sorry.

Yes, I agree that the bus strategy used by MIPS is questionable at very
high clock rates (read:  instruction issue rates).  We've been through
that issue before.  But it buys them something too!  Since they are
willing to pay for the external cache, it means that they don't have
to put a branch target cache or other instruction cache on chip.  They
were betting (I guess) that clock rates wouldn't get astronomical before
density would let them put a decent sized instruction cache on chip.  It's
a tradeoff, that's all it is.  Sure, they pay a cost, but they get a
benefit too.  You assume SRAMs.  You pay a cost, you get a benefit.  An
Am29000 system can be built with VDRAMs (so could the RPM 40, I bet, but
not at 40 MHz unless someone makes 40 MHz VDRAMs that I don't know of
(the Am29000 will run into this wall soon too)): you pay a cost
(performance loss compared with the max.) but you get a benefit (lower
system cost when you want more memory than SRAMs will let you afford).

Now, as to who has better performance (which is the crux of this
argument, I think):  it can't be decided until we all agree on a system
environment:  if you want to use your SRAMs, then let us use them too.
If you want to talk about multi-tasking, then we should all have TLBs.

>What's the smallest signal interval on a 25MHz Am29000 ? In the RPM40,
>NO signal ever assumes more than one valid state during a cycle.
>This is not true of the R2000. Is it true of Am29000 ? 

I'm not sure I understand exactly what you mean; but I think the smallest
signal interval is one clock cycle (i.e., the channel is synchronized to
the rising clock edge).  If there is a signal that doesn't satisfy your
definition, then it would probably be the "bus invalid" signal which is
determined by the success or failure of address translation (which isn't
known until about half-way through the cycle, I think).

>] In your reponse to my response, you go on to say that we should not judge
>] performance by either peak native instructions per second or MHz.  I don't
>] know anyone here who would dissagree with you (except marketing people:
>] what else can they say?).  In my claim above, I adhered to just that
>] philosophy.  This also is what most manufacturers of concern to us here
>] strive for (esp. MIPS Co.).
>
>All three need to be paid attention to. They make big differences.
>For instance, native-MIPS-per-MHz can range from 5 or less
>in a CISC machine, to about 1 for a RISC, to 65K or more for
>a big parrallel machine. And there's only so fast any particular
>technology will let you run the clock, so it DOES matter.

I don't understand "so it DOES matter."  I thought you were, at first,
trying to say that we should compare based on VAX-equivalents (or
some other universal "meter bar").  I tried to say that everyone agrees.
So, now, I don't understand what is the "it" in "so it DOES matter."
I thought you were trying to say "just buy the one that runs my program
fastest" (and I would add "in my price range" but that's another matter).
I don't really need to care what the native-MIPS-per-MHz is ("if any word
is inappropriate at the end of a sentence, a linking verb is.").  On
the other hand, it'll tell you something about the machine, that's for
sure.

I am growing weary.  It is not my goal to slander the RPM40.  I am just
trying for accuracy.  I just want arguments to be well constructed.
We need to all be talking about reasonably similar system environments
and compiler generated code (or not, but we need to agree).  The problem
gets started when deficiencies, or call them "design decisions," are
pointed out and then blindly refuted.

For example, the Am29000 ain't no perfect being.  Features/design
decisions were reported and discussed here.  Much to my dismay, things
that I thought were great maybe aren't so great in every situation.
I, in my naive way, thought the compare-bytes instruction would make
every C string-handling program blazingly fast.  Oops, although I
fought it at first, some nice statistics, though not absolutely
conclusive, from John Mashey's simulation showed that really significant
improvements would be the exception rather than the rule (at least for
UNIX utilities).  That is the kind way to hold a discussion.  The
recent postings of stats about forwarding usage are also extremely
interesting.

earl@mips.COM (Earl Killian) (03/10/88)

In article <9852@steinmetz.steinmetz.UUCP> oconnor@sungoddess.steinmetz (Dennis M. O'Connor) writes:

   An article by bcase@apple.UUCP (Brian Case) says:
   ] Incidentally, I think MIPS would rather have the R2000 known as a 10 MIPS
   ] machine at 16 MHz (not the 8 MIPS you quoted).

   Actually, I think MIPS Inc. actually claims a 10 Vax-MIPS rating for
   their 16-native-peak-MIPS processor, that uses a 32MHz clock. Which
   places addresses on the address bus once every 30ns. THAT's why
   "MHz" is TOTALLY inappropriate, WORSE than native-peak MIPS, even.
   An RPM40 at 32MHz would also place addresses on the address bus once
   every 30ns, but would execute 32-native-peak-MIPS.

Actually MIPS claims that the M/1000 system product, which uses the
R2000 chip at 15.0MHz with an 8-cycle cache refill penalty, is 10 VUPS
(aka mips).  The R2000 cpu and R2010 fpu chips are 16.7MHz chips.  At
16.7MHz with a faster cache refill, they make a system with 12
VUPS of performance.  Just wanted to get the record straight.

The MHz figures are for cycle times, not the clock input.  We call the
R2000 16.7MHz because its cycle time is 60.0ns.  The 2x frequency of
the clock input is simply a convenience for generating two phase CMOS
clocking.  Call it a 33.3MHz machine if you want, but that seems
rather silly given that the phase clocks could theoretically be
generated internally from a 1x clock input.
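
The relation between the figures above is just the reciprocal of the cycle
time.  A small sketch (the function names are hypothetical; the 60.0ns and
15.0MHz values are the ones quoted above):

```python
def mhz_from_cycle_ns(cycle_ns: float) -> float:
    """Cycle rate in MHz for a given cycle time in nanoseconds."""
    return 1000.0 / cycle_ns


def cycle_ns_from_mhz(mhz: float) -> float:
    """Cycle time in nanoseconds for a given cycle rate in MHz."""
    return 1000.0 / mhz


cycle_rate = mhz_from_cycle_ns(60.0)   # the 60.0ns cycle time
clock_input = 2 * cycle_rate           # the 2x clock input frequency
m1000_cycle = cycle_ns_from_mhz(15.0)  # cycle time at 15.0MHz

print(round(cycle_rate, 1), round(clock_input, 1), round(m1000_cycle, 1))
```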

As for judging machines by the rate they put addresses on an address
bus: that's a new one to me.  I think we'll be better off sticking to
how fast machines execute programs.  By that metric, and judging from the
descriptions of the architecture (which are all there is to go on given
the lack of any hard data), the 2yr old, 2 micron R2000 may well today
outperform the new 1.2 micron RPM40.

phil@amdcad.AMD.COM (Phil Ngai) (03/10/88)

In article <7613@apple.Apple.Com> bcase@apple.UUCP (Brian Case) writes:
>In article <9852@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
<<What's the smallest signal interval on a 25MHz Am29000 ? In the RPM40,
<<NO signal ever assumes more than one valid state during a cycle.
<<This is not true of the R2000. Is it true of Am29000 ? 
<
<I'm not sure I understand exactly what you mean; but I think the smallest
<signal interval is one clock cycle (i.e., the channel is synchronized to

Brian, I believe you are correct in your understanding of Dennis'
question. Saying no signal ever assumes more than one valid state
during a cycle is the same as saying no signal makes more than one
transition per cycle. Except for the clock signal, this is true
for the 29000.
-- 
 300 Mb on a Sun-3/60 for $2,300, quantity 1!

I speak for myself, not the company.
Phil Ngai, {ucbvax,decwrl,allegra}!amdcad!phil or phil@amd.com

jesup@pawl10.pawl.rpi.edu (Randell E. Jesup) (03/10/88)

In article <7613@apple.Apple.Com> bcase@apple.UUCP (Brian Case) writes:
>In article <9852@steinmetz.steinmetz.UUCP> sungoddess!oconnor@steinmetz.UUCP writes:
>No, I meant the Am29000 and the R2000, but let's not forget the SPARC
>(as in SUN 4s).  I really believe that the RPM40 is top dog in its
>world (MC680[012] family, 1750A processors, AN/YUK-14s).  Maybe the
>R2000 and the Am29000 wouldn't make it there, or maybe they would.  But
>don't say the RPM 40 doesn't need a TLB because its world is 1750As and
>AN/YUK-14s and then complain when John Mashey (for example) says that
>it won't make the best UNIX box.

	I think Dennis and I have been saying that it can be used for Unix, not
that it's the fastest Unix processor around.  Whether or not it's the fastest
at Unix remains to be seen (if it ever will be at all), but it at least
stands a good chance.  It will not be, however, as fast at Unix as it would
be if that were the only design criterion.  Certain parts of it are optimized
for a different environment, but everything needed for unix is there.  For
example, it is optimized for a fair amount of FP stuff (in conjunction with
the not-formally-announced FP coprocessor), which is not the standard Unix
profile (except for things like Crays).
	No it doesn't have a TLB; yes, it can have an external one and has
been designed to work very well with such.

>benefit too.  You assume SRAMs.  You pay a cost, you get a benefit.  An
>Am29000 system can be built with VDRAMs (so could the RPM 40, I bet, but
>not at 40 MHz unless someone makes 40 MHz VDRAMs that I don't know of
>(the Am29000 will run into this wall soon too)): you pay a cost
>(performance loss compared with the max.) but you get a benefit (lower
>system cost when you want more memory than SRAMs will let you afford).

	Which is one of the reasons (along with bandwidth, etc) that we use
16-bit instructions, and fetch them 2 at a time at 20 MHz.  This means we
only need 50ns memory for the instruction side.
	When one designs a processor, one usually has to at least consider
system cost.  In some ways, we pay more attention to it than some others,
because one of our targets is embedded systems.  People building minis
usually have a lot more leeway on CPU and cache costs, and can throw hardware
(like custom multi-hundred Megabyte/sec busses, BIG SRAM caches, etc) at the
problems of max performance.
	The Am29000's ability to use VDRAMs is also a nice solution to the
problem of system cost.  As you said, I think the RPM-40 is closer to the
Am29000 philosophically than the R2000.  It seems the R2000 is meant for 1
main purpose: running Unix REAL FAST.  It does a pretty good job at it.  The
Am29000 and RPM-40 seem to have several purposes: classical micro applications,
embedded systems, running UNIX fast, etc.  So they might not beat the R2000
on a Unix-Mips (UIPS? :-) per MHz, but they probably win in other applications.
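
The 50ns instruction-memory figure mentioned earlier in this post follows
from the fetch width.  A back-of-the-envelope sketch, assuming the memory
only has to deliver one fetch per fetch period and ignoring setup, bus, and
decode margins (the function names are hypothetical):

```python
def fetch_period_ns(native_mips: float, instrs_per_fetch: int) -> float:
    """Time available per instruction-memory fetch, in nanoseconds."""
    return 1000.0 * instrs_per_fetch / native_mips


def fetch_bandwidth_mb_s(native_mips: float, instr_bytes: int) -> float:
    """Instruction bandwidth in megabytes per second."""
    return native_mips * instr_bytes


paired = fetch_period_ns(40.0, 2)       # 16-bit instructions, 2 per fetch
single = fetch_period_ns(40.0, 1)       # same rate, 1 instruction per fetch
bandwidth = fetch_bandwidth_mb_s(40.0, 2)  # 2 bytes per instruction

print(paired, single, bandwidth)
```

Fetching one instruction at a time at the same 40 MIPS would halve the time
available to the memory.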

>Now, as to who has better performance (which is the crux of this
>arguement, I think):  it can't be decided until we all agree on a system
>environment:  if you want to use your SRAMs, then let us use them too.
>If you want to talk about multi-tasking, then we should all have TLBs.

	To be most honest, it should be a system to do the application needed,
then measure performance/cost.  Exactly what goes into which system isn't
important, just overall system cost vs performance.  The parts exist, if you
pay for them, you use them.

	Folks, let's be careful to avoid the "my processor is better than
yours" type of wars, and let's be careful to realize no 2 CPUs are designed
for the same constraints/applications/environments.  It also helps to try to
avoid defensiveness about one's 'baby', whatever that might be.  The
discussions here can be enlightening for all involved (witness the stuff on
Am29000 compare bytes, or RPM-40 leaving out ALU bypassing), so let's
stick to architecture.

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

jesup@pawl10.pawl.rpi.edu (Randell E. Jesup) (03/10/88)

In article <9852@steinmetz.steinmetz.UUCP> oconnor@sungoddess.steinmetz (Dennis M. O'Connor) writes:
>   An RPM40 at 32MHz would also place addresses on the address bus once
>   every 30ns, but would execute 32-native-peak-MIPS.

	Actually, Dennis, the RPM-40 only puts an address out when it does
a branch.  And since it fetches two instructions at a time, ....  :-)

Note that one saves a lot of bus traffic by letting the I-mem feed you
sequentially until you do a branch.
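
To put a rough number on that saving, here is a minimal sketch. It assumes
the instruction memory keeps its own auto-incrementing address counter, so
the CPU only has to drive an address at the start and after each taken
branch. The trace and the function names are hypothetical, not taken from
any RPM-40 documentation.

```python
def addresses_every_fetch(redirects):
    """Addresses driven if the CPU sends an address for every fetch."""
    return len(redirects)


def addresses_on_branch_only(redirects):
    """Addresses driven if the I-mem streams sequentially between branches.

    redirects[i] is True when fetch i is the target of a taken branch,
    so it needs a fresh address. The very first fetch always needs one.
    """
    if not redirects:
        return 0
    return 1 + sum(1 for redirected in redirects[1:] if redirected)


# Hypothetical trace: 10 fetches, taken branches land on fetches 4 and 8.
trace = [False] * 10
trace[4] = True
trace[8] = True

print(addresses_every_fetch(trace))     # 10
print(addresses_on_branch_only(trace))  # 3
```

The actual ratio depends entirely on how often the code branches; the trace
above is only there to show the counting.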

     //	Randell Jesup			      Lunge Software Development
    //	Dedicated Amiga Programmer            13 Frear Ave, Troy, NY 12180
 \\//	beowulf!lunge!jesup@steinmetz.UUCP    (518) 272-2942
  \/    (uunet!steinmetz!beowulf!lunge!jesup) BIX: rjesup

(-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

oconnor@sungoddess.steinmetz (Dennis M. O'Connor) (03/10/88)

Brian Case wrote a very good article. I understand that
he is NOT trying to "slander" the RPM40. No problem.
I haven't taken it that way.

The only time, I think, I've gotten defensive was
when John Mashey claimed that the RPM40 would NOT
make a "good UNIX box". I don't believe RPM40 would
be the "best", or even a "better", UNIX box than
a MIPS Inc. R2000, or an Am29000. I just objected
to him saying it WOULD NOT make a good one.

In reference to my VUPS-MIPS-MHz comment, please
allow me to clarify :

I do NOT think peak-native-MIPS is a good
"marketing" figure, or a figure-of-merit. But that
doesn't mean it isn't an important consideration in
architecture. However, I'm not particularly sure
why I feel it is important.

Clock speed, in MHz, is an important factor. There's
only so fast any particular technology can be pushed,
so if two processors have the same performance, but
one is using a 25MHz single-phase clock, while the
other is using a 50MHz four-phase clock, well, which
one do you think will eventually run faster ? But is
this an architecture or an implementation question ?
Seems to me the influence goes both ways across the
architecture/implementation line.

More important than "clock speed" is, of course,
the width of the smallest valid state of any external
signal. CMOS, anyway, has a HELL of a hard time
driving external loads ( damn that 50pF load, anyway ),
so a CMOS chip that puts "n" state changes on some
external signal per cycle is gonna "hit the wall"
before a 1-state-change-per-cycle ( which includes
when the state change is out-of-phase ) chip.
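
A minimal sketch of that "wall", assuming the n state changes are evenly
spaced within the cycle and that the pad drivers have some minimum state
width they can deliver into the external load. The 20 ns driver limit is a
made-up number for illustration, as are the function names.

```python
def min_state_width_ns(cycle_mhz, changes_per_cycle):
    """Narrowest valid state on a pin that changes n times per cycle."""
    cycle_ns = 1000.0 / cycle_mhz
    return cycle_ns / changes_per_cycle


def max_cycle_mhz(driver_min_width_ns, changes_per_cycle):
    """Highest cycle rate before states get narrower than the driver allows."""
    return 1000.0 / (changes_per_cycle * driver_min_width_ns)


DRIVER_MIN_WIDTH_NS = 20.0  # hypothetical limit for driving the external load

print(max_cycle_mhz(DRIVER_MIN_WIDTH_NS, 1))  # 50.0 MHz with 1 change per cycle
print(max_cycle_mhz(DRIVER_MIN_WIDTH_NS, 2))  # 25.0 MHz with 2 changes per cycle
print(min_state_width_ns(25, 2))              # 20.0 ns, right at the limit
```

Under these assumptions the chip with n changes per cycle tops out at 1/n
of the cycle rate of the 1-change-per-cycle chip for the same pad drivers.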

As Brian Case correctly points out, as transistor
counts go up, that "external" signal with n-changes
per cycle may become internal with the next design.
This is true, but is a cop-out, I think. If the current
chip has this property, it has it. No drop-in replacement
for the current chip will ever be able to get around it.

Of course, with today's nearly-disposable systems ( "You
don't really want to keep all that SLOW 200ns RAM, anyway,
so replace the whole system!" ), the "drop-in replacement"
part may not be relevant anyway. After all, making money
may not be everything, but if you don't you'll go out
of business. IBM knows this well. Do you think they
care about technological leadership of released products ?

Look, how about we not worry so much about getting our
feelings hurt ( yeah, I ben guilty o 'dat me-self ) and
just have a GOOD TIME discussing/debating architecture ?
I'm not out to dump on anybody, I have no "chip" on my ...
well now that you mention it I do kinda ... well anyway,
I'd like to be able to discuss what is wrong with a
particular design, or right, without people taking it
personally. I try not to : you people don't even KNOW me !
Remember, on the NET, no one will ever comment on what
you say unless they think it's WRONG WRONG WRONG! :-)

BTW, I read that someone's proved Fermat's Last Theorem.
Well, actually, the proof's being checked now.
--
 Dennis O'Connor   oconnor%sungod@steinmetz.UUCP  ARPA: OCONNORDM@ge-crd.arpa
         ( I wish I could be civil all the time, like Eugene Miya )
  (-: The Few, The Proud, The Architects of the RPM40 40MIPS CMOS Micro :-)

aglew@ccvaxa.UUCP (03/12/88)

>Which is not the standard Unix
>profile (except for things like Crays).

Ayoi! You don't have to buy a multimillion dollar
supercomputer to get a floating point oriented
system that runs UNIX. Consider Gould (and, to be fair,
Alliant, Convex, etc.)

przemek@gondor.cs.psu.edu (Przemyslaw Klosowski) (03/12/88)

In article <1820@gumby.mips.COM> earl@mips.COM (Earl Killian) writes:
>In article <9852@steinmetz.steinmetz.UUCP> oconnor@sungoddess.steinmetz (Dennis M. O'Connor) writes:
>
>   Actually, I think MIPS Inc. actually claims a 10 Vax-MIPS rating for
>   their 16-native-peak-MIPS processor, that uses a 32MHz clock. Which
>   places addresses on the address bus once every 30ns. THAT's why
>The MHz figures are for cycle times, not the clock input.  We call the
>R2000 16.7MHz because its cycle time is 60.0ns.  The 2x frequency of
>the clock input is simply a convenience for generating two phase CMOS
>clocking.  Call it a 33.3MHz machine if you want, but that seems
>rather silly given that the phase clocks could theoretically be
>generated internally from a 1x clock input.

I strongly disagree. You could internally generate its clock from a 1 Hz
external clock ;^) The performance issue is how fast the signals in the chip
are running, and what is on the I/O pins. To be (over)precise: what is the
fundamental frequency inside the chip and at its edge.


				przemek@psuvaxg.bitnet
				psuvax1!gondor!przemek
