ccplumb@rose.uwaterloo.ca (Colin Plumb) (02/02/91)
I absolutely loathe reading press releases, so for those of you too lazy to go to comp.sys.mips, here's the capsule summary.

- Expected to be available later this year.
- Will be available, in pin-compatible versions, from IDT, LSI Logic, NEC, Performance Semiconductor, and Siemens.
- Talk of a range of performances, with references to faster versions but not slower ones.
- Runs existing binaries.
- Single chip.
- 64-bit integer ALU, 64-bit FP ALU, 8K I-cache, 8K D-cache, MMU, primary and secondary cache control on-chip. It is not clear to me if the primary cache being controlled is the on-board one, or they control two levels of off-chip cache.
- 64-bit addresses as well as data. John Mashey has been talking about the address space crunch for a few years now, so this isn't a huge surprise, but it's still interesting.
- "Superpipelined", issuing two instructions per cycle. I don't understand this, as they distinguish it from the usual superscalar implementation (i960CA, RS6000) so I quote from the press release:

> Superpipelining overlaps the execution of multiple instructions, so
> that while the first step of an instruction is performed, the second
> step of the previous instruction is also executed.
> Both the integer and floating point units are superpipelined.
> Superpipelining requires less circuitry than other multiple-instruction
> issue techniques, so it leaves room on the chip for other functions.
> Further, it provides greater integer processing than most other
> techniques, whose benefits are confined mainly to floating-point
> operations. Superpipelining, therefore, is particularly important in
> commercial applications, where balanced integer and floating-point
> performance are desirable.

- "Foundation" multiprocessing support hardware. No useful details. Lots of talk about flexibility.
--
	-Colin
cprice@mips.COM (Charlie Price) (02/05/91)
In article <1991Feb1.223326.18683@watdragon.waterloo.edu> ccplumb@rose.uwaterloo.ca (Colin Plumb) writes:
>I absolutely loathe reading press releases, so for those of you too
>lazy to go to comp.sys.mips, here's the capsule summary.

Thanks.

>- "Superpipelined", issuing two instructions per cycle. I don't understand
>  this, as they distinguish it from the usual superscalar implementation
>  (i960CA, RS6000) so I quote from the press release:
>
>> Superpipelining overlaps the execution of multiple instructions, so
>> that while the first step of an instruction is performed, the second
>> step of the previous instruction is also executed.
>
>> Both the integer and floating point units are superpipelined.
>> Superpipelining requires less circuitry than other multiple-instruction
>> issue techniques, so it leaves room on the chip for other functions.
>> Further, it provides greater integer processing than most other
>> techniques, whose benefits are confined mainly to floating-point
>> operations. Superpipelining, therefore, is particularly important in
>> commercial applications, where balanced integer and floating-point
>> performance are desirable.

A superscalar implementation has separate "units" and actually issues two or more instructions at "the same time" to separate units. The classic separation is an FP unit and an integer unit. The stuff inside the chip runs at the same rate as the external clock. This is logic-design intensive.

A superpipelined implementation issues one instruction at a time, but issues two or more of them per "external clock". To run the machine at a higher clock rate probably requires a deeper pipeline (more stages) than a simple pipelined unit. The insides of the chip run fast. This is circuit-design intensive.

Both approaches require that the CPU fetch two or more instructions per external clock and typically require that the CPU be able to load/store more than one data item per external clock.

A reasonable short article about this by Brian Case is: "Design Issues for Next-Generation Processors", Microprocessor Report, September 19, 1990.
--
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
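The difference between the two issue disciplines described in this post can be put side by side in a small timing sketch. It is only an illustration under idealized assumptions (no stalls, no dependencies, perfect instruction fetch). The function names and the `width` and `degree` parameters are invented for the example and are not taken from any MIPS documentation. Times are measured in external clocks.

```python
from fractions import Fraction


def superscalar_issue_times(n, width=2):
    """External clock at which each of n instructions issues when
    `width` instructions issue together once per external clock."""
    return [Fraction(i // width) for i in range(n)]


def superpipelined_issue_times(n, degree=2):
    """External clock at which each of n instructions issues when one
    instruction issues per internal clock, with `degree` internal clocks
    per external clock."""
    return [Fraction(i, degree) for i in range(n)]


if __name__ == "__main__":
    for i, (a, b) in enumerate(zip(superscalar_issue_times(4),
                                   superpipelined_issue_times(4))):
        print(f"instr {i}: superscalar t={a}  superpipelined t={b}")
```

Both models reach the same peak rate of two instructions per external clock. They differ only in whether the two instructions start together or half a clock apart.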
mh2f+@andrew.cmu.edu (Mark Hahn) (02/08/91)
isn't MIPS's "superpipelining" just the common trick of sticking in a clock doubler?

would anyone care to comment on the argument that "the amount of instruction-level parallelism that is available limits the benefit" of superscalar? It seems like I saw simulation papers claiming superscalar speedups of 2-3x. besides, wouldn't MIPS still have to resolve the same dependencies, and thus be subject to the same limits? (assuming the r4000 isn't just a clock speedup.)

regards, mark
mash@mips.COM (John Mashey) (02/12/91)
In article <obgVpm200VpeQ4VmIa@andrew.cmu.edu> mh2f+@andrew.cmu.edu (Mark Hahn) writes:
>isn't MIPS's "superpipelining" just the common trick
>of sticking in a clock doubler?

No, it's more complicated than that. The typical bottleneck is the latency of the caches. From the 5-stage R3000 pipeline, we went to an 8-stage pipeline for the R4000, not 10. The cache-access phases got split into several phases; some others didn't because they weren't bottlenecks. The "superpipelined" comes from issuing 2 instructions in the cache-latency period, by issuing 1 every half-cycle, rather than in 2-way superscalar, where you fetch and issue 2 in parallel.

>
>would anyone care to comment on the argument
>that "the amount of instruction-level parallelism
>that is available limits the benefit" of superscalar?
>It seems like I saw simulation papers claiming
>superscalar speedups of 2-3x. besides, wouldn't
>MIPS still have to resolve the same dependencies,
>and thus be subject to the same limits?
>(assuming the r4000 isn't just a clock speedup.)

As a simple example of why they are different, suppose you have a chunk of code that looks like:

	add	a,b,c		a = b+c
	sub	a,a,d		a = a-d

A typical 2-way superscalar would have a stall on the sub, because it needs data from the add, IF the 2 instructions were fetched together, and IF the compilers couldn't rearrange things. A superpipelined design would not have that stall, if the pipeline managed to get the ALU stage to be a half-cycle (as the R4000).
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
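The add/sub example can be turned into a toy issue model. This is a sketch under stated assumptions, not a description of any real chip. The superscalar model fetches instructions in fixed pairs and holds the second instruction for one clock when it reads the register written by the first. The superpipelined model assumes, as the post says of the R4000, that the ALU result is ready within one half-cycle, so back-to-back dependent ALU operations never wait. Loads, branches, and compiler scheduling are not modeled, and the tuple encoding and function names are made up for the illustration.

```python
def superscalar_issue(program):
    """Issue time (external clocks) of each instruction on a simplified
    in-order 2-way superscalar. `program` is a list of (dest, sources)."""
    times = []
    clock = 0
    for i in range(0, len(program), 2):
        pair = program[i:i + 2]
        times.append(float(clock))
        if len(pair) == 2:
            first_dest, _ = pair[0]
            _, second_sources = pair[1]
            if first_dest in second_sources:
                clock += 1  # second instruction waits for the first's result
            times.append(float(clock))
        clock += 1
    return times


def superpipelined_issue(program):
    """Issue time (external clocks) with one issue per half-cycle and a
    half-cycle ALU with bypass, so dependent ALU ops do not stall."""
    return [i * 0.5 for i in range(len(program))]


# The two instructions from the post: add a,b,c ; sub a,a,d
PROGRAM = [("a", ("b", "c")), ("a", ("a", "d"))]

if __name__ == "__main__":
    print("superscalar:   ", superscalar_issue(PROGRAM))
    print("superpipelined:", superpipelined_issue(PROGRAM))
```

In this model the dependent sub issues a full external clock after the add on the superscalar machine, but only half a clock later on the superpipelined one.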
cprice@mips.COM (Charlie Price) (02/12/91)
In article <obgVpm200VpeQ4VmIa@andrew.cmu.edu> mh2f+@andrew.cmu.edu (Mark Hahn) writes:
>isn't MIPS's "superpipelining" just the common trick
>of sticking in a clock doubler?

Superscalar is pretty easy to define, but what *does* superpipelining really mean?

At least one definition is that it is an implementation in which extra stages are added to a "normal" pipeline simply to decrease the clock interval and increase the issue rate. The R4000 qualifies by this measure.

Another view is that a regular pipeline issues one instruction per I-cache access latency period. A superpipeline issues two or more instructions during the cache access latency. The R4000 also qualifies by this measure.

One superpipeline "feature" that the R4000 does NOT have is a multi-stage ALU. The designers squeezed very hard to get the ALU into one clock. This is a "good thing" and an important detail of the design. It makes it possible for the result of an ALU operation to be available (by bypassing) to the ALU stage of the following operation. This means that the R4000 has no issue restrictions; this instruction sequence can be issued in the same external cycle:

	sub r1 from r2, result in r3
	or  r3 with r4, result in r5

Pipeline details for the curious ("|" denotes parallel operation):

The R3000 pipeline has 5 stages:

	IF	Instruction fetch from I-cache
	RF	Register Fetch | instruction decode
	ALU	ALU op or load/store address computation
	MEM	D-cache access
	WB	WriteBack results to register file

The cache access time is one clock period and an instruction is issued in each clock period. This is an incomplete description, and parts of the processor are used twice per cycle in a first-half, second-half staggered fashion, but note that the ALU occupies a whole clock.

The R4000 has an 8-stage pipeline that takes 4 EXTERNAL clocks:

	IF	I-fetch, First cycle | instr address translation
	IS	I-fetch, Second cycle | instr address translation
	RF	Register Fetch | instruction decode | tag check of I-cache entry
	EX	ALU or load/store address computation
	DF	D-cache access, First cycle | data address translation
	DS	D-cache access, Second cycle | data address translation
	TC	Tag Check of D-cache entry
	WB	WriteBack to register file

Two instructions are issued per EXTERNAL clock, which is the same period as the on-chip cache latency. To do this, an internal clock runs at double the external clock and one instruction is issued per internal clock. This requires that cache access be pipelined.

This is much like the 3K pipeline except that the cache access was chopped into two stages, and the D-cache tag check needed a separate stage before writeback. The RegisterFetch, EXecute, and WriteBack stages do roughly the same work as before, just faster. Squeezing the ALU into one clock required a faster adder.
--
Charlie Price    cprice@mips.mips.com        (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650
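The two stage lists in this post can be laid out as a conventional pipeline occupancy diagram. The sketch below is only an illustration: the stage names are taken from the post, while the helper functions and the no-stall, one-issue-per-pipeline-clock assumption are additions for the example. A "pipeline clock" here means the external clock for the R3000 and the doubled internal clock for the R4000.

```python
R3000_STAGES = ["IF", "RF", "ALU", "MEM", "WB"]
R4000_STAGES = ["IF", "IS", "RF", "EX", "DF", "DS", "TC", "WB"]


def pipeline_diagram(stages, n_instructions):
    """One row per instruction; each row lists the stage occupied in every
    pipeline clock ('.' when the instruction is not in the pipe).
    Assumes one issue per pipeline clock and no stalls."""
    total = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        rows.append(["."] * i + stages + ["."] * (total - i - len(stages)))
    return rows


def latency_in_external_clocks(stages, pipeline_clocks_per_external):
    """Fetch-to-writeback latency expressed in external clocks."""
    return len(stages) / pipeline_clocks_per_external


if __name__ == "__main__":
    for name, stages, ratio in (("R3000", R3000_STAGES, 1),
                                ("R4000", R4000_STAGES, 2)):
        print(name, "latency:",
              latency_in_external_clocks(stages, ratio), "external clocks")
        for row in pipeline_diagram(stages, 3):
            print("  " + " ".join(f"{s:>3}" for s in row))
```

With two pipeline clocks per external clock, the 8 R4000 stages come to 4 external clocks, which matches the figure given in the post.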
wolfe@vw.ece.cmu.edu (Andrew Wolfe) (02/13/91)
Aha!

As a purist I must insist that the R4000 is NOT Superpipelined.

I believe the term is defined by Jouppi (IEEE TC Dec. 1989) - A superpipelined machine must have a multistage ALU. This defines the critical distinguishing characteristic of Superpipelined machines - Independence between successive ALU operations (Unless hardware stalls are used).

The R4000 is an interesting heavily pipelined machine - but not Superpipelined. The pipeline stages are just defined differently than in a traditional RISC architecture.

(I would expect that this is the correct pipelining solution for today's technology - I only object to the use of the term Superpipelined - marketing hype)
danw@pbird14.prime.com (Dan Westerberg) (02/13/91)
In article <WOLFE.91Feb12153257@vw.ece.cmu.edu>, wolfe@vw.ece.cmu.edu (Andrew Wolfe) writes:
|>
|>
|> Aha!
|>
|> As a purist I must insist that the R4000 is NOT Superpipelined.
|>
|> I believe the term is defined by Jouppi (IEEE TC Dec. 1989) - A superpipelined
|> machine must have a multistage ALU. This defines the critical distinguishing
|> characteristic of Superpipelined machines - Independence between successive ALU
|> operations (Unless hardware stalls are used).
|>
|> The R4000 is an interesting heavily pipelined machine - but not
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

As a CPU development engineer here at Prime, and still a babe in the industry, I find all this controversy and talk about pipeline architectures rather interesting.

I mean, so the R4000 uses an 8-stage pipe, so what? We at Prime have been building pipelines in our proprietary machines for YEARS! I grant you that our pipes have nowhere near the (predicted) performance of the R4000, but then again we don't have multiple foundries building silicon for us and we don't use full-custom design. Also, our architecture is *FAR* from RISC, more like super-CISC. I mean, c'mon, aren't pipelines old hat?

What I find rather impressive is a 64-bit single-cycle integer ALU! Doing a full 64-bit operation in only 10ns (100MHz) in CMOS is one hell of a feat!

dan
--
Daniel I. Westerberg          email: danw@tbird9.prime.com (or)
Prime Computer Inc.                  danw@s49.prime.com
MS 10-9
500 Old Connecticut Path      phone: 508-620-2800 x3644
Framingham, MA 10701          fax:   508-879-9098
pcg@cs.aber.ac.uk (Piercarlo Grandi) (02/14/91)
On 13 Feb 91 01:33:37 GMT, danw@pbird14.prime.com (Dan Westerberg) said:

danw> I mean, so the R4000 uses an 8 stage pipe, so what? [ ... ] I
danw> mean, c'mon, aren't pipelines old hat?

Very old indeed. In fact I doubt very much that the R4000 pipeline is new, either in having 8 stages or in having an issue rate double the external clock. Rumour has it that the ill-fated Z80,000 had a very long pipeline and used similar tricks.

In any case long pipelines have big problems. Average interjump distance is often quite a bit smaller than 8 instructions. It is longer than that in very straight numerical codes. Long pipelines are poor man's vector processors, tailored for FIFO access patterns. Non-numerical codes normally benefit a bit more from superscalar than from long pipelines, whether their issue rate is twice the external clock or not.

My usual reference, Ibbett&Morris "The MU5 computer system", MacMillan, comes in handy when thinking about long pipelines.

danw> What I find rather impressive is a 64-bit single-cycle integer
danw> ALU! Doing a full 64-bit operation in only 10ns (100MHz) in CMOS
danw> is one hell of a feat!

Agreed...

As a final unrelated note: I don't find the R4000 architecture announcement premature, and I don't think that MIPS can be compared in any way to IBM as to ambitions to freeze their competitor's market. I think that they should have published more details and in a neutral scientific journal, say SIGARCH.

As things stand, they have done press releases over something fuzzy. Seems designed to whet the appetite without committing oneself too much. Well, if the intention was to give a statement of direction, a more detailed paper in SIGARCH would have given it, and everybody would have understood that it was not about product announcements.

All in all I find the fuzzy press release a very venial sin. Admirable restraint, compared to some much more trumpeted vapourware we have seen before.
--
Piercarlo Grandi                   | ARPA: pcg%uk.ac.aber.cs@nsfnet-relay.ac.uk
Dept of CS, UCW Aberystwyth        | UUCP: ...!mcsun!ukc!aber-cs!pcg
Penglais, Aberystwyth SY23 3BZ, UK | INET: pcg@cs.aber.ac.uk
carters@ajpo.sei.cmu.edu (Scott Carter) (02/20/91)
References: <45448@mips.mips.COM> <1991Feb1.223326.18683@watdragon.waterloo.edu> <45525@mips.mips.COM> <obgVpm200VpeQ4VmIa@andrew.cmu.edu> <45792@mips.mips.COM>
Reply-To: carter%csvax.decnet@mdcgwy.mdc.com
Organization: McDonnell Douglas Electronic Systems Company

In article <45792@mips.mips.COM> cprice@mips.COM (Charlie Price) writes:
<description of R4000 pipeline - a graphic description is in e.g. 2/6/91 uP Report ...>

Looking at this pipeline, it looks as if a branch has three microcycle delay slots. Is this correct (assuming that branch condition is resolved at the end of RF)?

What about load delays? When does the load bypass to the source register (assuming the stall case)? Is the load aligner in the tag check or DS phase (i.e. is the latency of partial or unaligned load the same as an integer load [although with a 64-bit datapath there's a 2-way mux and sign extend (??) even for an integer load])? Are there bypasses for the weirder cases (like bypass after TC if the primary bypass is at DS, load pipe to store pipe bypass, etc.)?

How do stores work? I can think of lots of possibilities ... Is the pipelining of word stores and partial word stores the same?

uPReport (p 9, 2nd paragraph) says the O-cache freezes during write of a dirty line. Given the line-wide write buffer, why the freeze?

Minds with nothing better to do want to know :)

>Charlie Price    cprice@mips.mips.com        (408) 720-1700
>MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086-23650

Scott Carter - McDonnell Douglas Electronic Systems Company
carter%csvax.decnet@mdcgwy.mdc.com (preferred and faster) - or - carters@ajpo.sei.cmu.edu
(714)-896-3097
The opinions expressed herein are solely those of the author, and are not necessarily those of McDonnell Douglas.
mash@winchester.mips.com (John Mashey) (02/28/91)
In article <1991Feb25.202912.28140@oakhill.sps.mot.com>, mitch@oakhill.sps.mot.com (Mitch Alsup) writes:
|> After gathering some numbers from Comp.Arch, I would like to guess at
|> the performance of the MIPS R4000 processor. If any of these numbers are
|> incorrect, I would be grateful if someone at MIPS would correct and repo

|> As a comparison, (from memory) R3000 CPI = 1.25 at 33 MHz
|>
|> 33 MHz / 1.25 CPI = 26.4 MIPS or nearly 19 SPECmarks.
|>

Well, I don't know about the rest of that, but the RC3360 gets 26.5 SPECmarks (27.1 SPECint, 26.2 SPECfloat) @ 33 MHz. 25MHz R3000s get around 18-19 SPECmarks.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
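The arithmetic in the quoted estimate is the usual native-MIPS relation, clock rate divided by cycles per instruction. The sketch below just reproduces that calculation, along with the clock-period conversion behind the "10ns (100MHz)" remark earlier in the thread. The function names are invented for the example, and the CPI of 1.25 is the quoted poster's from-memory figure, not a measured value. SPECmarks are benchmark ratios and cannot be derived from this formula, which is the point of the correction above.

```python
def native_mips(clock_mhz, cpi):
    """Millions of instructions per second for a given clock and average CPI."""
    return clock_mhz / cpi


def cycle_time_ns(clock_mhz):
    """Clock period in nanoseconds for a clock rate in MHz."""
    return 1000.0 / clock_mhz


if __name__ == "__main__":
    print(f"33 MHz at CPI 1.25: {native_mips(33, 1.25):.1f} native MIPS")
    print(f"100 MHz clock period: {cycle_time_ns(100):.0f} ns")
```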