[comp.arch] r4000

ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/02/91)

I absolutely loathe reading press releases, so for those of you too
lazy to go to comp.sys.mips, here's the capsule summary.

- Expected to be available later this year.

- Will be available, in pin-compatible versions, from IDT, LSI Logic,
  NEC, Performance Semiconductor, and Siemens.

- Talk of a range of performances, with references to faster versions but
  not slower ones.

- Runs existing binaries.

- Single chip.

- 64-bit integer ALU, 64-bit FP ALU, 8K I-cache, 8K D-cache, MMU,
  primary and secondary cache control on-chip.  It is not clear to me
  whether the primary cache being controlled is the on-chip one, or
  whether they control two levels of off-chip cache.

- 64-bit addresses as well as data.  John Mashey has been talking about
  the address space crunch for a few years now, so this isn't a huge
  surprise, but it's still interesting.

- "Superpipelined", issuing two instructions per cycle.  I don't understand
  this, as they distinguish it from the usual superscalar implementation
  (i960CA, RS6000) so I quote from the press release:

> Superpipelining overlaps the execution of multiple instructions, so
> that while the first step of an instruction is performed, the second
> step of the previous instruction is also executed.

> Both the integer and floating point units are superpipelined.
> Superpipelining requires less circuitry than other multiple-instruction
> issue techniques, so it leaves room on the chip for other functions.
> Further, it provides greater integer processing than most other
> techniques, whose benefits are confined mainly to floating-point
> operations.  Superpipelining, therefore, is particularly important in
> commercial applications, where balanced integer and floating-point
> performance are desirable.

- "Foundation" multiprocessing support hardware.  No useful details.
  Lots of talk about flexibility.
-- 
	-Colin

cprice@mips.COM (Charlie Price) (02/05/91)

In article <1991Feb1.223326.18683@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>I absolutely loathe reading press releases, so for those of you too
>lazy to go to comp.sys.mips, here's the capsule summary.

Thanks.

>- "Superpipelined", issuing two instructions per cycle.  I don't understand
>  this, as they distinguish it from the usual superscalar implmentation
>  (i960CA, RS6000) so I quote from the press release:
>
>> Superpipelining overlaps the execution of multiple instructions, so
>> that while the first step of an instruction is performed, the second
>> step of the previous instruction is also executed.
>
>> Both the integer and floating point units are superpipelined.
>> Superpipelining requires less circuitry than other multiple-instruction
>> issue techniques, so it leaves room on the chip for other functions.
>> Further, it provides greater integer processing than most other
>> techniques, whose benefits are confined mainly to floating-point
>> operations.  Superpipelining, therefore, is particularly important in
>> commercial applications, where balanced integer and floating-point
>> performance are desirable.

A superscalar implementation has separate "units" and actually
issues two or more instructions at "the same time" to separate units.
The classic separation is an FP unit and an integer unit.
The stuff inside the chip runs at the same rate as the external clock.
This is logic-design intensive.

A superpipelined implementation issues one instruction at a time,
but issues two or more of them per "external clock".
To run the machine at a higher clock rate probably requires a
deeper pipeline (more stages) than a simple pipelined unit.
The insides of the chip run fast.
This is circuit-design intensive.

Both approaches require that the CPU fetch two-or-more
instructions per external clock and typically require that
the CPU be able to load/store more than one data item per
external clock.

A reasonable short article about this by Brian Case is:
"Design Issues for Next-Generation Processors",
Microprocessor Report, September 19, 1990.
-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

mh2f+@andrew.cmu.edu (Mark Hahn) (02/08/91)

isn't MIPS's "superpipelining" just the common trick 
of sticking in a clock doubler?

would anyone care to comment on the argument
that "the amount of instruction-level parallelism 
that is available limits the benefit" of superscalar?
It seems like I saw simulation papers claiming
superscalar speedups of 2-3x.  besides, wouldn't
MIPS still have to resolve the same dependencies,
and thus be subject to the same limits?  
(assuming the r4000 isn't just a clock speedup.)

regards, mark

mash@mips.COM (John Mashey) (02/12/91)

In article <obgVpm200VpeQ4VmIa@andrew.cmu.edu> mh2f+@andrew.cmu.edu (Mark Hahn) writes:
>isn't MIPS's "superpiplining" just the common trick 
>of sticking in a clock doubler?
No, it's more complicated than that.  The typical bottleneck is the
latency of the caches.  From the 5-stage R3000 pipeline, we went to
an 8-stage pipeline for the R4000, not 10.  The cache-access phases
got split into several phases; some others didn't because they
weren't bottlenecks.  The "superpipelined" comes from issuing
2 instructions in the cache-latency period, by issuing 1 every half-cycle,
rather than as in 2-way superscalar, where you fetch and issue 2 in parallel.
>
>would anyone care to comment on the argument
>that "the amount of instruction-level parallelism 
>that is available limits the benefit" of superscalar?
>It seems like I saw simulation papers claiming
>superscalar speedups of 2-3x.  besides, wouldn't
>MIPS still have to resolve the same dependencies,
>and thus be subject to the same limits?  
>(assuming the r4000 isn't just a clock speedup.)
As a simple example why they are different, suppose you have
a chunk of code that looks like:
	add	a,b,c	a = b+c
	sub	a,a,d	a = a-d
A typical 2-way superscalar would have a stall on the sub, because
it needs data from the first, IF the 2 instructions were fetched together,
and IF the compilers couldn't rearrange things.
A superpipelined design would not have that stall,
if the pipeline managed to get the ALU stage down to a half-cycle
(as the R4000 does).
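
A minimal sketch of the comparison above, in Python.  It is a toy model,
not a description of any real chip: both machines are in-order, only ALU
instructions exist, and the function names and the (dest, sources) tuple
format are made up for illustration.  The superscalar model assumes a
result is usable one external clock after issue; the superpipelined model
assumes it is usable one half-clock later, as described for the R4000.

```python
def superscalar_issue_times(instrs, width=2):
    """Issue time (in external clocks) of each instruction on an idealized
    in-order `width`-way superscalar with a one-clock ALU (assumption)."""
    ready = {}      # register -> first clock at which its value is usable
    times = []
    clock = 0
    i = 0
    while i < len(instrs):
        issued = 0
        while i < len(instrs) and issued < width:
            dest, srcs = instrs[i]
            if any(ready.get(s, 0) > clock for s in srcs):
                break               # in-order: wait for the next clock
            ready[dest] = clock + 1
            times.append(clock)
            i += 1
            issued += 1
        clock += 1
    return times


def superpipelined_issue_times(instrs, ticks_per_clock=2):
    """Issue time (in external clocks) of each instruction when one
    instruction issues per internal tick and the ALU takes one tick."""
    ready = {}      # register -> first internal tick at which it is usable
    times = []
    tick = 0
    for dest, srcs in instrs:
        while any(ready.get(s, 0) > tick for s in srcs):
            tick += 1               # stall until the operands are ready
        ready[dest] = tick + 1
        times.append(tick / ticks_per_clock)
        tick += 1
    return times


# add a,b,c ; sub a,a,d  (the dependent pair from the example above)
pair = [("a", ("b", "c")), ("a", ("a", "d"))]

if __name__ == "__main__":
    print(superscalar_issue_times(pair))      # sub waits a full clock
    print(superpipelined_issue_times(pair))   # sub issues a half-clock later
```

In this model the dependent sub issues a full external clock after the add
on the superscalar machine, but only half a clock later on the
superpipelined one.  For an independent pair the superscalar model issues
both in clock 0, which is where it wins.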
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086

cprice@mips.COM (Charlie Price) (02/12/91)

In article <obgVpm200VpeQ4VmIa@andrew.cmu.edu> mh2f+@andrew.cmu.edu (Mark Hahn) writes:
>isn't MIPS's "superpiplining" just the common trick 
>of sticking in a clock doubler?

Superscalar is pretty easy to define, but
what *does* superpipelining really mean?

At least one definition is that it is an implementation in which
extra stages are added to a "normal" pipeline simply to decrease
the clock interval and increase the issue rate.
The R4000 qualifies by this measure.

Another view is that a regular pipeline issues one instruction
per I-cache access latency period.
A superpipeline issues two or more instructions during the
cache access latency.
The R4000 also qualifies by this measure.

One superpipeline "feature" that the R4000 does NOT have,
is a multi-stage ALU.
The designers squeezed very hard to get the ALU into one clock.
This is a "good thing" and an important detail of the design.
It makes it possible for the result of an ALU operation
to be available (by bypassing) to the ALU stage of the following operation.
This means that the R4000 has no issue restrictions;
this instruction sequence can be issued in the same external cycle:
	sub	r1 from r2, result in r3
	or	r3 with r4, result in r5


Pipeline details for the curious ( "|" denotes parallel operation):

The R3000 pipeline has 5 stages:

IF	Instruction fetch from I-cache
RF	Register Fetch | instruction decode
ALU	ALU op or load/store address computation
MEM	D-cache access
WB	WriteBack results to register file

The cache access time is one clock period and
an instruction is issued in each clock period.
This is an incomplete description, and parts of the processor are
used twice per cycle in a first-half, second-half staggered fashion,
but note that the ALU occupies a whole clock.

The R4000 has an 8-stage pipeline that takes 4 EXTERNAL clocks:

IF	I-fetch, First cycle  | instr address translation
IS	I-fetch, Second cycle | instr address translation

RF	Register Fetch |  instruction decode | tag check of I-cache entry
EX	ALU or load/store address computation

DF	D-cache access, First cycle  | data address translation
DS	D-cache access, Second cycle | data address translation

TC	Tag Check of D-cache entry
WB	WriteBack to register file

Two instructions are issued per EXTERNAL clock,
which is the same period as the on-chip cache latency.
To do this, an internal clock runs at double the external clock and
one instruction is issued per internal clock.
This requires that cache access be pipelined.

This is much like the 3K pipeline except that the cache access
was chopped into two stages, and the D-cache tag check
needed a separate stage before writeback.
The RegisterFetch, EXecute, and WriteBack stages do roughly the same
work as before, just faster.
Squeezing the ALU into one clock required a faster adder.
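
The stage lists above can be turned into a small occupancy table.  This
is a sketch only: the stage names come from the listings above, while the
function names and the tick numbering are invented for illustration, and
it assumes one instruction enters the pipe per tick with no stalls.  For
the R3000 a tick is taken to be one external clock, which ignores the
half-cycle staggering mentioned above.

```python
R3000_STAGES = ["IF", "RF", "ALU", "MEM", "WB"]
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def stage_at(stages, instr, tick):
    """Stage occupied at pipeline tick `tick` by instruction number `instr`,
    or None if that instruction is not in the pipe at that tick."""
    pos = tick - instr
    return stages[pos] if 0 <= pos < len(stages) else None


def latency_in_external_clocks(stages, ticks_per_external_clock):
    """Pipeline length expressed in external clocks."""
    return len(stages) / ticks_per_external_clock


def diagram(stages, n_instrs):
    """One text row per instruction, one 4-character column per tick."""
    n_ticks = n_instrs + len(stages) - 1
    rows = []
    for i in range(n_instrs):
        cells = [(stage_at(stages, i, t) or "").ljust(4) for t in range(n_ticks)]
        rows.append(f"i{i}: " + "".join(cells).rstrip())
    return "\n".join(rows)


if __name__ == "__main__":
    print(diagram(R4000_STAGES, 3))
    # 8 stages at 2 internal ticks per external clock
    print(latency_in_external_clocks(R4000_STAGES, 2))
```

In the table, instruction i0 is in EX at tick 3 and i1 is in EX at tick 4,
so a one-tick ALU with bypassing lets back-to-back dependent operations
follow each other without a bubble.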

-- 
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

wolfe@vw.ece.cmu.edu (Andrew Wolfe) (02/13/91)

Aha!

As a purist I must insist that the R4000 is NOT Superpipelined.

I believe the term is defined by Jouppi (IEEE TC Dec. 1989) - A superpipelined
machine must have a multistage ALU.  This defines the critical distinguishing
characteristic of Superpipelined machines - Independence between successive ALU
operations (Unless hardware stalls are used).

The R4000 is an interesting heavily pipelined machine - but not
Superpipelined.  The pipeline stages are just defined differently than in a
traditional RISC architecture.


(I would expect that this is the correct pipelining solution for today's
technology - I only object to the use of the term Superpipelined  - marketing
hype) 


 

danw@pbird14.prime.com (Dan Westerberg) (02/13/91)

In article <WOLFE.91Feb12153257@vw.ece.cmu.edu>, wolfe@vw.ece.cmu.edu (Andrew Wolfe) writes:
|> 
|> 
|> Aha!
|> 
|> As an purist I must insist that the R4000 is NOT Superpipelined.
|> 
|> I believe the term is defined by Jouppi (IEEE TC Dec. 1989) - A superpipelined
|> machine must have a multistage ALU.  This defines the critical distinguishing
|> characteristic of Superpipelined machines - Independence between sucessive ALU
|> operations (Unless hardware stalls are used).
|> 
|> The R4000 is an interesting heavily pipelined machine - but not
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As a CPU development engineer here at Prime, and still a babe in the industry,
I find all this controversy and talk about pipeline architectures rather
interesting.  I mean, so the R4000 uses an 8 stage pipe, so what?  We at Prime
have been building pipelines in our proprietary machines for YEARS!  I grant
you that our pipes have nowhere near the (predicted) performance of the R4000,
but then again we don't have multiple foundries building silicon for us and we
don't use full-custom design.  Also, our architecture is *FAR* from RISC, more
like super-CISC.

I mean, c'mon, aren't pipelines old hat?  What I find rather impressive is a
64-bit single-cycle integer ALU!  Doing a full 64-bit operation in only 10ns
(100MHz) in CMOS is one hell of a feat!
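
The 10ns / 100MHz figure is just the period-frequency relation.  A quick
sketch (helper names are made up; the reading of 100MHz as the doubled
internal clock described earlier in the thread, and hence a 50MHz external
clock, is an inference and not something stated in the press release
summary):

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def internal_mhz(external_mhz, multiplier=2):
    """Internal clock rate, assuming it is a multiple of the external one."""
    return external_mhz * multiplier


if __name__ == "__main__":
    print(period_ns(100))                 # time budget for one ALU pass
    print(period_ns(internal_mhz(50)))    # same budget, from a 50MHz input
```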

dan

-- 
===============================================================================
|                                                                             |
| Daniel I. Westerberg                 email:  danw@tbird9.prime.com (or)     |
| Prime Computer Inc.                          danw@s49.prime.com             |
| MS 10-9                                                                     |
| 500 Old Connecticut Path             phone:  508-620-2800  x3644            |
| Framingham,  MA  10701                 fax:  508-879-9098                   |
|                                                                             |
===============================================================================

pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/14/91)

On 13 Feb 91 01:33:37 GMT, danw@pbird14.prime.com (Dan Westerberg) said:

danw> I mean, so the R4000 uses an 8 stage pipe, so what? [ ... ] I
danw> mean, c'mon, aren't pipelines old hat?

Very old indeed. In fact I doubt very much that the R4000 pipeline is
new, either in its 8 stages or in its issue rate being double the
external clock.

Rumour has it that the ill-fated Z80,000 had a very long pipeline and
used similar tricks.

In any case long pipelines have big problems. Average interjump distance
is often quite a bit smaller than 8 instructions. It is longer than that
in very straight numerical codes.

Long pipelines are poor man's vector processors, tailored for FIFO
access patterns. Non numerical codes normally benefit a bit more from
superscalar than from long pipelines, whether their issue rate is twice
the external clock or not.

My usual reference, Ibbett&Morris "The MU5 computer system", MacMillan,
comes in handy when thinking about long pipelines.

danw> What I find rather impressive is a 64-bit single-cycle integer
danw> ALU!  Doing a full 64-bit operation in only 10ns (100MHz) in CMOS
danw> is one hell of a feat!

Agreed...

As a final unrelated note: I don't find the R4000 architecture
announcement premature, and I don't think that MIPS can be compared in
any way to IBM as to ambitions to freeze their competitor's market.

I think that they should have published more details and in a neutral
scientific journal, say SIGARCH. As things stand, they have done press
releases over something fuzzy. Seems designed to whet the appetite
without committing oneself too much.

Well, if the intention was to give a statement of direction, a more
detailed paper in SIGARCH would have given it, and everybody would have
understood that it was not about product announcements.

All in all I find the fuzzy press release a very venial sin. Admirable
restraint, compared to some much more trumpeted vapourware we have seen
before.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk

carters@ajpo.sei.cmu.edu (Scott Carter) (02/20/91)

In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
<description of R4000 pipeline - a graphic description is in e.g. 2/6/91 uP
Report ...>

Looking at this pipeline, it looks as if a branch has three microcycle
delay slots.  Is this correct (assuming that branch condition is resolved
at the end of RF) ?

What about load delays?   When does the load bypass to the source register?
(assuming the stall case)?  Is the load aligner in the tag check or DS
phase (i.e. is the latency of partial or unaligned load the same as an integer
load [although with a 64-bit datapath there's a 2-way mux and sign extend (??)
even for an integer load]).
Are there bypasses for the weirder cases (like bypass after TC if the primary 
bypass is at DS, load pipe to store pipe bypass, etc.)?

How do stores work?  I can think of lots of possibilities ...
Is the pipelining of word stores and partial word stores the same?

uPReport (p 9, 2nd paragraph) says the O-cache freezes during write of a dirty
line.  Given the line-wide write buffer, why the freeze?

Minds with nothing better to do want to know :)

>Charlie Price    cprice@mips.mips.com        (408) 720-1700
>MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA   94086-23650

Scott Carter - McDonnell Douglas Electronic Systems Company
carter%csvax.decnet@mdcgwy.mdc.com (preferred and faster) - or -
carters@ajpo.sei.cmu.edu		 (714)-896-3097
The opinions expressed herein are solely those of the author, and are not
necessarily those of McDonnell Douglas.

mash@winchester.mips.com (John Mashey) (02/28/91)

In article <1991Feb25.202912.28140@oakhill.sps.mot.com>, mitch@oakhill.sps.mot.com (Mitch Alsup) writes:

|>    After gathering some numbers from Comp.Arch, I would like to guess at
|> the performance of the MIPS R4000 processor.  If any of these numbers are
|> incorrect, I would be grateful if someone at MIPS would correct and repo

|>     As a comparison, (from memory) R3000 CPI = 1.25 at 33 MHz
|> 
|> 	 33 MHz / 1.25 CPI  = 26.4 MIPS or nearly 19 SPECmarks.
|
Well, I don't know about the rest of that, but the RC3360 gets 26.5 SPECmarks
(27.1 SPECint, 26.2 SPECfloat) @ 33 MHz.   25MHz R3000s get around 18-19 SPECmarks.
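
For reference, the quoted arithmetic and the measured figures can be put
side by side.  This is a sketch only: the helper names are made up, the
1.25 CPI is the quoted poster's from-memory figure, and native MIPS
computed this way is not the same thing as a SPECmark rating.

```python
def native_mips(clock_mhz, cpi):
    """Millions of instructions per second from clock rate and average CPI."""
    return clock_mhz / cpi


def specmarks_per_mhz(specmarks, clock_mhz):
    """Reported SPECmark rating divided by clock rate."""
    return specmarks / clock_mhz


if __name__ == "__main__":
    print(native_mips(33, 1.25))          # the quoted 33 MHz / 1.25 CPI
    print(specmarks_per_mhz(26.5, 33))    # RC3360 figure given above
```

The first call reproduces the quoted 26.4 native MIPS; the second just
normalizes the reported RC3360 rating by its clock rate.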

-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash 
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086