slackey@bbn.com (Stan Lackey) (07/11/89)
In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>But, RISC machines are easier to pipeline

I don't see how this can be true, other than possibly in the scoreboarding
logic.  If it has it.

>easier to speed up the clock for

Unless your cycle time is limited by memory chips (for the cache), in which
case it doesn't matter.

>easier to provide staged functional units for, etc..

I don't get this one.

>I don't
>know of any CISC machines with 'hardwired' instruction sets.  Micro-
>coding slows the machine down.

This is an interesting statement.  As I recall hearing, Cray started this
perception back in the 70's.  I thought it had been proven wrong.  For
example, the Alliant executes the instruction:

    add.d (an)+, fp0

in one cycle (yes, that's double precision memory-to-register add, auto
increment), and it's microcoded.  Are you saying that it would be done in
zero cycles if we got rid of the microcode?  Gee, and after spending so much
real estate on those microcode RAM's...  :-)

Stan's opinions only.
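To make the semantics of that instruction concrete, here is a minimal Python sketch of what the single microcoded CISC instruction does, next to the three-instruction load/store decomposition a RISC would use.  The memory contents, register names, and 8-byte increment are illustrative assumptions, not Alliant specifics:

```python
# Hypothetical machine state; addresses and values are made up for illustration.
mem = {100: 2.5, 108: 4.0}   # data memory, byte-addressed
an, fp0 = 100, 1.0           # address register an, FP register fp0

# CISC view: one instruction does all three steps.
#   add.d (an)+, fp0   ==   fp0 += mem[an]; an += 8
fp0 += mem[an]
an += 8

# RISC (load/store) view: the same work as three simple instructions.
an2, fp02 = 100, 1.0
r1 = mem[an2]          # load   r1, (an)
fp02 = fp02 + r1       # fadd   fp0, fp0, r1
an2 = an2 + 8          # addi   an, an, 8

assert (an, fp0) == (an2, fp02)   # both forms compute the same result
```

The thread's argument is not about the result (identical either way) but about which form issues, pipelines, and schedules better.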
petolino@joe.Sun.COM (Joe Petolino) (07/12/89)
>>>But, RISC machines are easier to pipeline
>I don't see how this can be true, other than possibly in the scoreboarding
>logic.  If it has it.

One of the most fundamental principles of the RISC philosophy (if there is
one) is to *not* include anything in the architecture specification if it
will make a pipelined implementation difficult.  Examples of this are
numerous: delayed branches, lack of interlocks in various cases, etc.  What
these have in common is that they relax the requirement for the processor to
follow a strictly sequential model of execution in cases where it would slow
down and/or complicate a pipelined implementation.

>>>I don't
>>>know of any CISC machines with 'hardwired' instruction sets.  Micro-
>>>coding slows the machine down.

>>I can think of a couple of old ones.  The first pdp11, the 11/20, was
>>hardwired.  (This accounts for some of the little irregularities in the
>>11 instruction set, in fact, like the way INC isn't quite the same as
>>ADD #1...)  And I seem to recall that the 360/75 was mostly hardwired,
>>for speed.

No need to go back that far.  None of the Amdahl 470-series machines (V6,
V7, V8) had microcode per se.  They couldn't find RAMs fast enough (yes,
mainframes often use RAM, not ROM, for microcode, since it's faster).
Instead, something that looked very much like microcode was transformed into
PLA equations and implemented in gate arrays.

-Joe
jlg@lanl.gov (Jim Giles) (07/12/89)
From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
< [...]
<>I don't
<>know of any CISC machines with 'hardwired' instruction sets.  Micro-
<>coding slows the machine down.
<
< This is an interesting statement.  As I recall hearing, Cray started
< this perception back in the 70's.  I thought it had been proven wrong.
< For example, the Alliant executes the instruction:
<
<    add.d (an)+, fp0
<
< in one cycle (yes, that's double precision memory-to-register add,
< auto increment), and it's microcoded.  Are you saying that it would be
< done in zero cycles if we got rid of the microcode?  Gee, and after
< spending so much real estate on those microcode RAM's...

And how many microcycles does 'one cycle' on the Alliant correspond to?
You don't suppose that a smaller instruction set would allow instructions
to run closer to the gate delay times rather than be multiple microcycles
long?  Seems to me that a RISC machine might have _cycle_ times equal to
the _microcycle_ of your CISC machine.  The real estate for your microcode
ROM could better be used as a high speed instruction buffer.  With the
instruction set hardwired, the individual instructions would operate at
gate delay speeds.  This could all be done for a machine with _fewer_
instructions.  And, as everyone seems to agree, compilers for CISCs don't
use all those extra instructions anyway.  Seems like a good idea to get
rid of them and speed up the machine!

Alliant is obviously fairly slow, since it can do something to an
arbitrary memory location in one cycle.  The cycle time is apparently
longer than the memory delay time.
dik@cwi.nl (Dik T. Winter) (07/12/89)
In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
 > From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
 > < For example, the Alliant executes the instruction:
 > <
 > <    add.d (an)+, fp0

Nonsense.  It is either
	addd (an)+,d0
or
	faddd (an)+,fp0

 > <
 > < in one cycle (yes, that's double precision memory-to-register add,
 > < auto increment), and it's microcoded.
 >
 > And, how many microcycles does 'one cycle' on the Alliant correspond
 > to?

I do not know about microcycles, but seeing that the cycle time on the
Alliant is 170 nsec. it is clear that one cycle execution is required to
get any performance.  And, yup, the Alliant will outperform the
SPARCstation 1 by a factor of up to 20 on some benchmarks, but if you try
to compile something the Alliant is clearly inferior.  (I have just been
doing some benchmarking, a SPARC: 1.5 Megaflop single precision; Alliant:
up to 30 (4 processor FX4).  Compilation: SPARC at least 2.5 times as
fast.)

Moral: use the tool you have at hand to do the task you have at hand.
--
dik t. winter, cwi, amsterdam, nederland
INTERNET : dik@cwi.nl
BITNET/EARN: dik@mcvax
hascall@atanasoff.cs.iastate.edu (John Hascall) (07/12/89)
In article <8263@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
> > From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
> > < For example, the Alliant executes the instruction:

>(I have just been doing some benchmarking, a SPARC: 1.5 Megaflop single
>precision; Alliant: up to 30 (4 processor FX4).  Compilation: SPARC at
>least 2.5 times as fast.)

>Moral: use the tool you have at hand to do the task you have at hand.

Moral2: if all you have is a hammer, everything looks like a nail.
        (use the *correct* tool for the job!!)

John Hascall
ISU Comp Center
Ames IA
slackey@bbn.com (Stan Lackey) (07/12/89)
In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42550@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>< [...]
><>I don't
><>know of any CISC machines with 'hardwired' instruction sets.  Micro-
><>coding slows the machine down.
><
>< This is an interesting statement.  As I recall hearing, Cray started
>< this perception back in the 70's.  I thought it had been proven wrong.
>
>And, how many microcycles does 'one cycle' on the Alliant correspond
>to?

One.  The reason many, even memory-to-register, operations take one
microcycle is because it has a scalar pipeline.  Even though pipelines
"can't-be-done" on CISC's.  The cycle time is fairly long, 170ns, but
that was typical for when the machine was designed, 1983.  Cycle time was
set by cache/memory/bus tradeoffs, and by the register read-modify-write
time you could get with CMOS gate arrays of that era.  It had nothing to
do with instruction decode, which is done in parallel with other
operations in the first and second pipeline stages.  Microcode access is
done in parallel with normal address calculation.  Note that CMOS has
gotten like 3 times faster since then.

>compilers for CISCs don't use all those extra
>instructions anyway.  Seems like a good idea to get rid of them and
>speed up the machine!

The Alliant compiler really really does use the memory-to-register
operations, auto-inc/dec addressing modes, vector instructions, and
concurrency instructions.  All to advantage.

>Alliant is obviously fairly slow, since it can do something to an
>arbitrary memory location in one cycle.  The cycle time is apparently
>longer than the memory delay time.

As I hope I clarified above, the pipeline allows a very long sequence of
operations, including a memory access, to consume effectively one cycle
of execution time.  Specifically, memory-to-register floating point takes
six cycles from front to back, but with the pipeline really consumes only
one cycle.

:-) Stan
jlg@lanl.gov (Jim Giles) (07/13/89)
From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
> In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>And, how many microcycles does 'one cycle' on the Alliant correspond
>>to?
>
> One.  The reason many, even memory-to-register, operations take one
> microcycle is because it has a scalar pipeline.  Even though pipelines
> "can't-be-done" on CISC's.

You are either using pipelines (in which case the instruction _issues_ in
one clock, but the result is not delivered for several more), or you
aren't (in which case, I don't believe your claim that the instruction
has no microcycles).  RISCs can also be pipelined (easier than CISCs),
and the several simple instructions may execute as fast or faster than
the one big one.  And (back to the original subject), it is easier to
_compile_ for a RISC machine.

Now that you've said that the Alliant is pipelined, you have to tell me
what the _real_ instruction timing for the given example is.  What is the
minimum number of clocks between issuing the given instruction and
issuing the next instruction which uses one of the results of the one
given?  Bet it ain't 1.
jlg@lanl.gov (Jim Giles) (07/13/89)
From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
> [...]
> As I hope I clarified above, the pipeline allows a very long sequence
> of operations, including a memory access, to consume effectively one
> cycle of execution time.  Specifically, memory-to-register floating
> point takes six cycles from front to back, but with the pipeline
> really consumes only one cycle.

Or it really consumes six!!  Depends upon whether there is anything
independent to do while this instruction runs.  If the next instruction
depends on the result of this one, the next gets delayed six clocks.
Period.

With a RISC instruction set, you can move the individual components of
this complex "instruction" around to get maximum overlap from your
pipeline.  Splitting the functionality of the instruction requires more
instruction issues, but it also allows better flexibility in instruction
scheduling optimizations.  It would require a _very_ smart compiler to
tell which way to go.  This is exactly one of the points I made
originally about CISCs being harder to compile for.
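The issue-versus-result distinction being argued here can be shown with a toy in-order issue model.  This is a sketch under assumed latencies (the 6-cycle figure is from the thread; the model itself, and names like `schedule`, are illustrative): each instruction issues one per cycle at best, but stalls until its source registers are ready.

```python
def schedule(instrs):
    """Toy in-order issue model.

    instrs: list of (dest, srcs, latency) tuples; latency is cycles from
    issue to the result being available.  Returns each issue cycle.
    """
    ready = {}          # register name -> cycle its value becomes available
    issue_cycles = []
    cycle = 0
    for dest, srcs, lat in instrs:
        # Stall until every source is ready (unseen registers count as ready).
        cycle = max([cycle] + [ready.get(s, 0) for s in srcs])
        issue_cycles.append(cycle)
        ready[dest] = cycle + lat
        cycle += 1      # next instruction can issue next cycle at best
    return issue_cycles

# Dependent chain: a 6-cycle memory-to-register FP op, then a consumer.
dep = schedule([("fp0", ["an"], 6), ("fp1", ["fp0"], 1)])
# Independent pair: the second instruction needs nothing from the first.
indep = schedule([("fp0", ["an"], 6), ("fp1", ["fp2"], 1)])

print(dep)    # [0, 6]: the consumer stalls until the result is written
print(indep)  # [0, 1]: back-to-back issue, one instruction per cycle
```

This captures both halves of the disagreement: throughput really is one per cycle when the work is independent, and the full six-cycle latency really is paid when it is not.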
hascall@atanasoff.cs.iastate.edu (John Hascall) (07/13/89)
In article <13985@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> [...]
>> Specifically, memory-to-register floating
>> point takes six cycles from front to back, but with the pipeline
>> really consumes only one cycle.

>Or it really consumes six!!  Depends upon whether there is anything
>independent to do while this instruction runs.  If the next instruction
>depends on the result of this one, the next gets delayed six clocks.
>Period.

I doubt it is delayed all six; surely the first part of the next
instruction can be done (at least the fetch and decode).

John Hascall
ISU Comp Center
sbf10@uts.amdahl.com (Samuel Fuller) (07/13/89)
In article <13985@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> [...]
>> As I hope I clarified above, the pipeline allows a very long sequence
>> of operations, including a memory access, to consume effectively one
>> cycle of execution time.  Specifically, memory-to-register floating
>> point takes six cycles from front to back, but with the pipeline
>> really consumes only one cycle.
>
>Or it really consumes six!!  Depends upon whether there is anything
>independent to do while this instruction runs.  If the next instruction
>depends on the result of this one, the next gets delayed six clocks.
>Period.

If a RISC has data dependencies then it's stuck too, right?

>With a RISC instruction set, you can move the individual components of
>this complex "instruction" around to get maximum overlap from your
>pipeline.

I hardly consider a memory-to-register multiply a complex instruction.
For an example of a complex instruction see the TRT instruction in the
IBM 370 POO.  These are the instructions that RISC rightfully throws out.

>Splitting the functionality of the instruction requires more instruction
>issues, but it also allows better flexibility in instruction scheduling
>optimizations.  It would require a _very_ smart compiler to tell which
>way to go.  This is exactly one of the points I made originally about
>CISCs being harder to compile for.

Look at it this way.  To perform a floating point multiply on two
operands which exist in memory this machine will take two slots down the
pipe to perform the operation.

  Prev Inst             DATBXW
  LOAD OP1 to REG1       DATBXW   load can be bypassed back into X for Mult
  Mult REG1 by OP(mem)    DATBXW  multiply is finished after the X
  Next Inst                DATBXW

All RISC machines that I know about are Load/Store machines.  So given
the same pipeline they would take at least three slots to perform the
operation.

  Prev Inst             DATBXW
  LOAD OP1 to REG1       DATBXW
  LOAD OP2 to REG2        DATBXW
  Mult REG1 by REG2        DATBXW  multiply is finished after the X
  Next Inst                 DATBXW

A pipeline is a pipeline.  The pipelines on our 370 machines have a
shorter cycle time than any RISC processor on the market.  370 is
definitely not RISC.  RISC is wonderful stuff.  But it is not necessary
to make a fast computer.  RISC just allows you to make a fast computer
quickly (read design time) and cheaply (read single chip CPU).  Our
machines are fast but they take forever to design and cost a fortune.
But people buy them :).

Sam Fuller / Amdahl System Performance Architecture
slackey@bbn.com (Stan Lackey) (07/13/89)
The discussion continues between jlg@lanl.gov (Jim Giles) and me.  If you
are bored with it, "Type 'n' now"

In article <13984@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>From article <42621@bbn.COM>, by slackey@bbn.com (Stan Lackey):
>> In article <13982@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>>And, how many microcycles does 'one cycle' on the Alliant correspond
>>>to?
>>
>> One.  The reason many, even memory-to-register, operations take one
>> microcycle is because it has a scalar pipeline.  Even though pipelines
>> "can't-be-done" on CISC's.

I apologize for the sarcasm.  I have seen too many "can't be done in a
CISC" or "is too hard to do in a CISC" statements, referring to things I
have done in a CISC.

>You are either using pipelines (in which case the instruction _issues_
>in one clock, but the result is not delivered for several more), or
>you aren't (in which case, I don't believe your claim that the
>instruction has no microcycles).

The basic clock to the Alliant CE is 170ns.  One new microword is
accessed every 170ns cycle.  Many instructions consume one 170ns cycle.
Some consume more.  FADD.D (ay)+, fp0 consumes one.  FDIV.D <ea>, fp0
consumes more than one, like 3 or 4.

>Now that you've said that the Alliant is pipelined, you have to tell
>me what the _real_ instruction timing for the given example is.  What
>is the minimum number of clocks between issuing the given instruction
>and issuing the next instruction which uses one of the results of the
>one given?  Bet it ain't 1.

Bet it is, for lots of cases.  The CE has a fixed six-stage pipeline.
The stages are:

1. Instruction cache access and instruction decode
2. Address calculation and microcode access
3. Address translation and passing the address through the crossbar
4. Cache access and returning the data through the crossbar (on a read;
   send data on a write)
5. Integer execution or pass operands to floating point unit
6. Floating point execution and writing of results

So, the full execution time of a FP instruction is 6*170.  A new
instruction can be started every 170.  Dependencies cause dead cycles to
be inserted.  These dependencies include an integer operation being used
as an address in the next instruction, but do not include integer or
floating point dependencies; we used the 50ns BIT (Bipolar Integrated
Technology) functional units, and wired the data paths efficiently so
that dependent operands could be routed fast enough.

In the implementation, only one microword is accessed for the entire
instruction.  It is a very wide microinstruction, and fields of it that
are destined to control operations later in the instruction are delayed
by "pipeline registers".  The technique was called "data stationary
control" in the textbook we got it out of.  Lore has it that IBM has used
this style in their mainframes, and calls it "delayed microinstructions"
or something similar.

Note: Because condition codes are not available to a branch instruction
following a compare, branch prediction is employed.

Also note: the above strategy seems to work real well.  Compare its Whets
with the 1989 50ns RISC machines.

My opinions, and not necessarily those of Alliant Computer Systems,
International Business Machines, BBN, the publisher of the textbook we
got "data stationary" out of, and anybody else, living or dead, whom I
may have mentioned.

:-) Stan
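The latency-versus-throughput claim in the post above can be checked with a tiny model of that six-stage pipe.  The stage list and 170ns cycle are from the post; the code and the function name `finish_times` are an illustrative sketch, not a description of the real hardware:

```python
# Six stages as described in the post; one instruction enters per cycle.
STAGES = ["I-cache/decode", "addr calc/ucode", "translate/crossbar",
          "cache access", "integer/FP handoff", "FP exec/writeback"]
CYCLE_NS = 170

def finish_times(n_instructions):
    """Completion cycle of each of n back-to-back, independent instructions.

    Instruction i enters the pipe at cycle i and leaves len(STAGES)
    cycles later; no stalls are modeled.
    """
    return [i + len(STAGES) for i in range(n_instructions)]

done = finish_times(4)
print(done)                # [6, 7, 8, 9]
print(done[0] * CYCLE_NS)  # 1020 ns front-to-back for the first instruction
```

So an instruction's front-to-back time is 6 x 170ns, yet once the pipe is full a result emerges every 170ns, which is the sense in which each instruction "consumes one cycle".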
seanf@sco.COM (Sean Fagan) (07/14/89)
In article <42550@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <13980@lanl.gov> jlg@lanl.gov (Jim Giles) writes:
>>But, RISC machines are easier to pipeline
>I don't see how this can be true, other than possibly in the scoreboarding
>logic.  If it has it.

"Most" "CISC" machines have those wonderful complex addressing modes we
all know and love, which oftimes means a variable length instruction
(best example is, of course, the VAX).  This is rather difficult to
pipeline, although not impossible.  One of the features most "RISC" chips
have in common is a limited number of instruction formats.  For example,
the Cyber 170 machines (knew I'd put this in somewhere, didn't you? 8-))
had two formats, 15 and 30 bits wide.  If I remember correctly, the i860
has 5 or so (but, sizewise, it only works out to 2 or 3), and the 88k
looks like it's similar.

>>Micro-
>>coding slows the machine down.
>This is an interesting statement.  As I recall hearing, Cray started
>this perception back in the 70's.  I thought it had been proven wrong.

Bad thing to say.  As far as I knew, it had been proven that microcoding
*did* slow the system down, since you could always, worst case, make your
microcode be your instruction set, and use a cache to execute the "real"
code normally, and execute the ucode directly when you wanted more speed.

>For example, the Alliant executes the instruction:
>    add.d (an)+, fp0
>in one cycle (yes, that's double precision memory-to-register add,
>auto increment), and it's microcoded.  Are you saying that it would be
>done in zero cycles if we got rid of the microcode?  Gee, and after
>spending so much real estate on those microcode RAM's...

As somebody else said, in regards to Elxsi: if your most complex
instructions take as much time as your simplest ones, then your cycle
time is too long (or something like that).  Usually, the reason cycle
times are chosen to be longer than they need to be is a) the hardware
isn't up to snuff (e.g., ETA), or b) you need a longer cycle to get more
work done.  RISC advocates (and I find myself in that group this time)
claim that, if b) is chosen, then you should simply make your cycle time
shorter and go with a different instruction set (or algorithm in the
hardware).
--
Sean Eric Fagan  | "Uhm, excuse me..."
seanf@sco.UUCP   |   -- James T. Kirk (William Shatner), ST V: TFF
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
jlg@lanl.gov (Jim Giles) (07/14/89)
> The discussion continues between slackey@bbn.com (Stan Lackey) and me.
> If you are bored with it, "Type 'n' now"
>
>>[...] What
>>is the minimum number of clocks between issuing the given instruction
>>and issuing the next instruction which uses one of the results of the
>>one given?  Bet it ain't 1.
>
> Bet it is, for lots of cases.

Obviously, it _never_ is.  I want the time from instruction issue to
writing of results.  Even by _your_ calculation, this is _always_ six
clocks (for the instruction at issue).

> The CE has a fixed six-stage pipeline.  The stages are:
> [...]
> 6. Floating point execution and writing of results
>
> So, the full execution time of a FP instruction is 6*170.  A new
> instruction can be started every 170.

This is just like any other pipelined machine (CRAY, for example).  I
would _never_ claim that the divide approximate on the Cray was one clock
(even though that's its issue time).  The instruction time is the number
of clocks from issue to completion - nothing else.  When someone says a
machine is pipelined, I _assume_ that issue time is shorter than the
whole instruction time (for most instructions).

> Dependencies cause dead cycles to be inserted. [...]

Finally, this discussion gets back to _my_ point about compiler
construction.  CISC machines usually have a superset of the instructions
found in a RISC machine.  The compiler must determine whether to use the
simpler instructions (in which case, you pay more for instruction issue -
but might find an improved scheduling) or whether to use the complex
instruction (which pays less for instruction issue, but might cause more
delays to be generated).  RISC doesn't have the choice, so it's
_OBVIOUSLY_ easier to compile for!

This still leaves the question of whether RISC or CISC is faster.  This
question is independent of the compiler discussion I am talking about.
My bet would be that RISC could be made faster if the compiler for the
CISC is assumed _not_ to do the optimizations I am talking about.  For
this reason I claim that CISC compilers _are_ more complex than RISC
compilers (at least if they generate competitive code).
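The fused-versus-split decision described above can be sketched as a crude cost comparison.  Everything here is a hypothetical model: the function `pick`, the latency numbers, and the idea that independent "filler" instructions can hide a split load's latency are illustrative assumptions, not any real compiler's heuristic:

```python
def pick(fused_latency, split_latencies, filler_available):
    """Return 'fused' or 'split' under a crude cycle-count estimate.

    fused_latency:    issue-to-result cycles of the one big instruction
    split_latencies:  (load_latency, op_latency) for the load/op pair
    filler_available: independent instructions the scheduler could place
                      between the load and the op if we split
    """
    load_lat, op_lat = split_latencies
    fused_cost = fused_latency
    # Splitting costs one extra issue slot, but fillers can hide up to
    # load_lat cycles of the load's latency.
    split_cost = max(load_lat - filler_available, 0) + 1 + op_lat
    return "fused" if fused_cost <= split_cost else "split"

print(pick(6, (3, 3), filler_available=0))  # fused: nothing hides the load
print(pick(6, (3, 3), filler_available=3))  # split: fillers hide the load
```

Even in this toy form the answer flips with the surrounding code, which is Giles's point: a CISC compiler that generates competitive code has to make this estimate at every instruction selection, while a load/store machine never faces the choice.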