Will@cup.portal.com (Will E Estes) (03/18/90)
What is the MIPS rating of these microprocessors:

386SX-15
386-20
386-25
386-33

Also, since the 80386 has a more complex instruction set and does more
work in a given instruction than does a typical RISC chip, does
comparing MIPS figures between RISC and non-RISC architectures really
tell you anything of worth?

Finally, why is everyone so excited about RISC?  Why the move to
simplicity in microprocessor instruction sets?  You would think that
the trend would be just the opposite - toward more and more complex
instruction sets - in order to increase the execution speed of very
high-level instructions by putting them in silicon, and in order to
make implementation of high-level language constructs easier.

Thanks,
Will (sun!portal!cup.portal.com!Will)
henry@utzoo.uucp (Henry Spencer) (03/21/90)
In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>What is the MIPS rating of these microprocessors:
>
>386SX-15
>386-20
>386-25
>386-33

With what memory systems, and running what workload?  And which flavor
of "MIPS" are you talking about?

>Also, since the 80386 has a more complex instruction set and does
>more work in a given instruction than does a typical RISC chip,
>does comparing MIPS figures between RISC and non-RISC
>architectures really tell you anything of worth?

Comparing MIPS figures tells you nothing of worth even when those
complications aren't present.  MIPS numbers are marketing nonsense,
not useful performance measures.

>Finally, why is everyone so excited about RISC?  Why the move to
>simplicity in microprocessor instruction sets?  You would think
>that the trend would be just the opposite - toward more and more
>complex instruction sets - in order to increase the execution
>speed of very high-level instructions by putting them in silicon
>and in order to make implementation of high-level language
>constructs easier.

Oh my, a newcomer to the group, I'd say...  RISC is exciting because
it generally leads to computers that run real workloads faster.  That
is the meaningful measure of performance.  The fact is, trying to
bundle zillions of instructions onto the chip usually makes them
slower, and compilers find it very difficult to effectively exploit
all the bizarre silliness that CISC designers throw in.  About a
decade ago, it started to become clear that executing simple
instructions very quickly works much better.
--
MSDOS, abbrev: Maybe SomeDay    | Henry Spencer at U of Toronto Zoology
an Operating System.            | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (03/21/90)
In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>...since the 80386 has a more complex instruction set and does
>more work in a given instruction than does a typical RISC chip,
>does comparing MIPS figures between RISC and non-RISC
>architectures really tell you anything of worth?

Yes, but only if you understand the basic situation.  A MIPS rating
should be treated as give-or-take about a factor of two.  So, if one
machine has twice the MIPS of another (on a given compute-bound task),
the machines could still be about equal (on that task).

This isn't true within a family, of course: different 386 boxes really
can be compared by their MIPS ratings.  Note, however, that I said
"box".  Different boxes containing the same chip, at the same clock
rate, can still have different MIPS ratings.  (This is because of
caches and buses and other significant non-CPU items.)  So, to ask for
the MIPS of a CPU chip is mostly to ask for an upper bound.

As for "complex" instructions, it is worth noting that the complexity
may be potential rather than actual.  Sometimes, a given machine runs
faster when the compilers avoid generating the more complex cases.  It
was this observation that led us to explore the RISC idea.

>Finally, why is everyone so excited about RISC?  Why the move to
>simplicity in microprocessor instruction sets?

The excitement is because the better RISC machines have genuinely high
throughput.  The simplicity is only relative - they're actually quite
complex machines.  The important point is that the designs are
carefully tuned, so that complexity is only used where it pays its
way.  So, rather than ask "Why simplicity?", it would be better to ask
about specific aspects, such as subroutine calling.
--
Don		D.C.Lindsay 	Carnegie Mellon Computer Science
shj@ultra.com (Steve Jay) (03/21/90)
henry@utzoo.uucp (Henry Spencer) writes: >About a decade ago, it started >to become clear that executing simple instructions very quickly works >much better. Apparently this was clear to Seymour Cray a lot earlier than that: the 6600 in 1964. Steve Jay shj@ultra.com ...ames!ultra!shj Ultra Network Technologies / 101 Dagget Drive / San Jose, CA 95134 / USA (408) 922-0100 x130 "Home of the 1 Gigabit/Second network"
seanf@sco.COM (Sean Fagan) (03/21/90)
In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>Finally, why is everyone so excited about RISC?  You would think
>that the trend would be just the opposite - toward more and more
>complex instruction sets - in order to increase the execution
>speed of very high-level instructions by putting them in silicon
>and in order to make implementation of high-level language
>constructs easier.

Well, first of all, that should be "I would think," as, obviously, not
everybody thinks like you do.  Second of all, my immediate reaction on
reading this was, "And thus the VAX is born."  I think that says it
all 8-).

Doubtless dozens of people will post and flood your mailbox about
this, but, if they don't, I'll be glad to 8-).
--
-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |     -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
seanf@sco.COM (Sean Fagan) (03/21/90)
In article <1990Mar21.004840.6473@ultra.com> shj@ultra.com (Steve Jay) writes: >Apparently this was clear to Seymour Cray a lot earlier than that: >the 6600 in 1964. Yes, but Seymour is God, and it took awhile for IBM to acknowledge that 8-). -- -----------------+ Sean Eric Fagan | "Time has little to do with infinity and jelly donuts." seanf@sco.COM | -- Thomas Magnum (Tom Selleck), _Magnum, P.I._ (408) 458-1422 | Any opinions expressed are my own, not my employers'.
seanf@sco.COM (Sean Fagan) (03/21/90)
In article <1990Mar20.175843.2612@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>The fact is, trying to bundle
>zillions of instructions onto the chip usually makes them slower, and
>compilers find it very difficult to effectively exploit all the bizarre
>silliness that CISC designers throw in.

Time to throw myself into the fray (and mention Seymour and CDC later,
too 8-)).

Simple, non-hardware-engineer reason why what Henry says is true (and,
for the most part, it *is* true): you have only a finite amount of
silicon space on a chip.  Given that, you have a couple of options:
you can make a small number of instructions *really* fast (through the
brute-force method of just throwing silicon at them), or you can
implement a large number of instructions (which might be fast, or
might not; since you now have less silicon per instruction, they will
probably be slower).

You can, within limits, make any instruction faster by throwing more
silicon at it.  For example, you can do a 32x32->64 (bit) multiply in
2 cycles if you use enough silicon, maybe even one cycle.  This will,
however, take up *lots* of chip space, so you might just keep it down
to somewhere between 2 and 5 cycles, or get rid of it entirely (since
you can do any multiply with shifts and adds, and a large share of the
multiplies in certain test sets involve constants).

If you've made all of your instructions execute as fast as possible,
and have more space available, you can add, oh, an on-board MMU,
on-board FPU, on-board cache, a second processor, etc.  With larger
instruction sets, you don't have that option as much.

After you've done all that, btw, you can throw in pipelining if you
don't already have it, multiple functional units, scoreboarding
(either full or the simpler kind most people use), etc.  Meanwhile,
the CISC chip is still trying to make the POLY instruction execute in
something less than 100 cycles...
>About a decade ago, it started
>to become clear that executing simple instructions very quickly works
>much better.

Well, I'd say more than that, about 25 years.  Seymour Cray and the
CDC 6600, a truly wonderful machine with fewer than 74 instructions, a
load-store architecture, and three-operand instructions.  Just
beautiful.
--
-----------------+
Sean Eric Fagan  | "Time has little to do with infinity and jelly donuts."
seanf@sco.COM    |     -- Thomas Magnum (Tom Selleck), _Magnum, P.I._
(408) 458-1422   | Any opinions expressed are my own, not my employers'.
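[Editorial sketch, not part of the original thread: the shift-and-add
decomposition Sean alludes to - any multiply by a constant can be
built from shifts and adds, one add per set bit of the constant - can
be written out in a few lines of Python.  Function name and structure
are illustrative only.]

```python
def multiply_by_constant(x, c):
    """Multiply x by a non-negative integer constant c using only
    shifts and adds -- the strength reduction a compiler might apply
    on a machine with no (or a slow) hardware multiplier.

    Each set bit in c contributes one shifted copy of x to the sum.
    """
    result = 0
    shift = 0
    while c:
        if c & 1:                  # this bit of the constant is set:
            result += x << shift   # add x shifted into that position
        c >>= 1
        shift += 1
    return result
```

Note that the number of adds grows with the popcount of the constant,
which is why (as the post says) there is a cutover point beyond which
a real multiply instruction wins.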
hrich@emdeng.Dayton.NCR.COM (George.H.Harry.Rich) (03/21/90)
In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>architectures really tell you anything of worth?
...
>Finally, why is everyone so excited about RISC?  Why the move to
>simplicity in microprocessor instruction sets?  You would think
>that the trend would be just the opposite - toward more and more
>complex instruction sets - in order to increase the execution
>speed of very high-level instructions by putting them in silicon
>and in order to make implementation of high-level language
>constructs easier.
>
>Thanks,
>Will (sun!portal!cup.portal.com!Will)

I want to state first that I'm not an expert on RISC architecture, but
the experts seem not to have replied, or have given oversimplified
explanations, so I'll make an attempt.  My area is software, not
hardware, so I hope those who are knowledgeable will be quick to
correct me if I'm wrong.

First of all, what you save with a complex instruction versus several
simple ones is fetch and decode time.  If the processor has good
prefetch and caching, what you are generally talking about is decode
time.  A really simple instruction set takes less time to decode, so
it is conceivable that you could have a net savings without taking
other factors into account.  The real point, however, is that you pay
the penalty for the complex decode even on very simple instructions,
where the extra decoding time buys you nothing.

While compilers may take advantage of complex instructions for such
things as stack-frame management, most complex instructions seem to be
designed for the convenience of assembler programmers rather than for
compiler code generation, and the bulk of the code generated by
compilers involves relatively simple instructions.  Even where
instructions are designed for compiler code generation, the designers
miss fairly often, and the instruction sits there in silicon, unused
by the compiler writers.
It might even be true that a complex instruction set designed ideally
for compiler code generation could beat RISC.  However, ideal
designers are very rare, and there are always some design flaws in a
complex system.  A RISC design has the benefit of targeting a simple
thing done well, and it removes from the designer the burden of
knowing some of the more arcane details of compiler writing.  It is
not really surprising that it is a good approach in practice.

Regards,

Harry Rich
colwell@mfci.UUCP (Robert Colwell) (03/22/90)
In article <5303@scolex.sco.COM> seanf@sco.COM (Sean Fagan) writes:
>In article <1990Mar20.175843.2612@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes:
>>The fact is, trying to bundle
>>zillions of instructions onto the chip usually makes them slower, and
>>compilers find it very difficult to effectively exploit all the bizarre
>>silliness that CISC designers throw in.
>
>...you have a couple of options: you can
>make a small number of instructions *really* fast (through the brute
>force method of just throwing silicon at it), or you can make a large
>number of instructions (which might be fast, or might not; since you
>now have less silicon, they will probably be slower).

[a moment please while I find my soapbox...oh here it is...]

And once again I claim that the NUMBER OF INSTRUCTIONS is very nearly
meaningless as a measure of RISCyness.  It isn't the number of
instructions that makes a VAX hard to speed up, it's their
architecturally-required semantic content.  Ok, so indirectly you have
a point, in that an architecture as complicated as that probably needs
microcode for its implementation.  But that's not what you said.  Do
you want to hear my arguments for why our VLIW with 2**1024 possible
instructions is nevertheless a RISC?

>After you've done all that, btw, you can throw in pipelining if you
>don't already have it, multiple functional units, scoreboarding
>(either full or the simpler one most people use), etc.  Meanwhile, the
>CISC chip is still trying to make the POLY instruction execute in
>something less than 100 cycles...

And those 100 cycles might well be considerably faster than the
equivalent RISC code sequence (including the icache misses the RISC
will incur in executing it).  But RISC probably still wins.  Why?
Because those RISC operations can be overlapped with other useful ops,
while nothing else can usefully proceed while the CISC is running its
POLY.
The pipelining, multiple FUs, and scoreboarding can be done on RISCs
or CISCs (and have been), so they don't seem especially relevant here.

>>About a decade ago, it started
>>to become clear that executing simple instructions very quickly works
>>much better.
>
>Well, I'd say more than that, about 25 years.  Seymour Cray and the
>CDC 6600, a truly wonderful machine with fewer than 74 instructions, a
>load-store architecture, and three-operand instructions.  Just beautiful.

An amazing machine.  But unless we know more about how it shared
responsibility for performance with its compiler, I refuse to call it
a RISC.  From what I've read, it was designed so that the hardware
would extract whatever parallelism it could use, and all the compiler
did was convert the high-level source into sequential machine ops.
Close, but not much different in principle from what drives the CISC
design philosophy.  (Oh yes, there was one.  The CISC design
philosophy was to "make the compiler writer's job easier"; yes, it
probably failed at that, too, but that's one of the reasons for all
those complicated instruction sets in the first place.)

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405     203-488-6090
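[Editorial sketch, not part of the original thread: the VAX POLY
instruction evaluates a polynomial by Horner's rule in microcode.  The
"equivalent RISC code sequence" Colwell refers to is just this
multiply/add loop, written out as individual operations the compiler
can then schedule around other useful work.  The Python below is an
illustration of the computation, not of either machine's actual code.]

```python
def poly(x, coeffs):
    """Evaluate a polynomial at x by Horner's rule.

    coeffs is ordered from the highest-degree coefficient down to the
    constant term.  Each loop iteration is one multiply and one add --
    the simple-instruction sequence a RISC would execute in place of a
    single monolithic POLY instruction.
    """
    acc = 0
    for c in coeffs:
        acc = acc * x + c
    return acc
```

The point of the surrounding argument: each of these multiplies and
adds is an independently schedulable instruction, so a pipelined RISC
can interleave them with unrelated work, whereas the microcoded POLY
occupies the machine until it retires.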
shj@ultra.com (Steve Jay) (03/23/90)
colwell@mfci.UUCP (Robert Colwell) writes:
>>Well, I'd say more than that, about 25 years.  Seymour Cray and the
>>CDC 6600, a truly wonderful machine with fewer than 74 instructions, a
>>load-store architecture, and three-operand instructions.  Just beautiful.
>
>An amazing machine.  But unless we know more about how it shared
>responsibility for performance with its compiler, I refuse to call
>it a RISC.  From what I've read, it was designed so that the hardware
>would extract whatever parallelism it could use, and all the compiler
>did was convert the high level source into sequential machine ops.

The original compiler for the 6600, called "RUN", made no attempt to
optimize instruction sequences.  By 1970, however, CDC had a new
compiler, FTN, which did rearrange instructions to optimize usage of
the multiple functional units.  The technology of both local and
global optimization in the FTN compiler was continuously improved, and
by the mid to late 70's it was difficult to beat the compiler even
with hand-tuned assembly language.

I don't think the unavailability of an optimizing compiler when the
6600 first came out in any way detracts from the RISCness of the
machine.  You can read articles written around 1965 which justify the
design decisions for the 6600 in terms almost identical to those used
today to justify RISC over CISC.

I suspect that experience with the 6600/7600 was important in teaching
later architects how important compiler technology is.

Steve Jay
shj@ultra.com  ...ames!ultra!shj
Ultra Network Technologies / 101 Dagget Drive / San Jose, CA 95134 / USA
(408) 922-0100 x130   "Home of the 1 Gigabit/Second network"
johnl@esegue.segue.boston.ma.us (John R. Levine) (03/23/90)
In article <289@emdeng.Dayton.NCR.COM> hrich@emdeng.UUCP (George.H.Harry.Rich) writes:
>It might even be true that a complex instruction set designed ideally for
>compiler code generation might beat RISC. ...

I doubt it.  The IBM 801 project included some of the best compiler
people around, and they came up with the original RISC machine, which
was quite stripped down, and an extremely fancy compiler named PL.8,
which generates fantastic code for it.  There are some slightly exotic
instructions, e.g. shift register N and put the result in register
N+1, but nothing very complicated.
--
John R. Levine, Segue Software, POB 349, Cambridge MA 02238, +1 617 864 9650
johnl@esegue.segue.boston.ma.us, {ima|lotus|spdcc}!esegue!johnl
"Now, we are all jelly doughnuts."
baum@Apple.COM (Allen J. Baum) (03/23/90)
[]
In article <289@emdeng.Dayton.NCR.COM> hrich@emdeng.UUCP (George.H.Harry.Rich) writes:
>In article <28012@cup.portal.com> Will@cup.portal.com (Will E Estes) writes:
>>architectures really tell you anything of worth?
>...
>>Finally, why is everyone so excited about RISC?  Why the move to
>>simplicity in microprocessor instruction sets?  You would think
>>that the trend would be just the opposite - in order to increase
>>the speed of very high-level instructions by putting them in silicon

Actually, the problem with complex stuff is that it isn't used, so why
put it in?  The higher the semantic content, the less often it is
used.  RISC attempts to put in the highest semantic content that gets
used a lot - which isn't very high, it turns out.

>First of all, what you save on a complex instruction versus several simple
>ones is the fetch and decode time.  If the processor has good prefetch and
>caching what you are generally talking about is decode time.  However,
>a really simple instruction set takes less time to decode,

Yes, but if your critical paths are not decode-related, then it just
doesn't matter.  What matters is reducing critical paths, both in
hardware (where they are generally load/store or branch related) and
in software (the number of instructions to perform some function).
CISCs attempt to reduce the second (software) factor.  Unfortunately,
they often do this by lengthening the first, and they can't do it
often enough to make up for that.

You can make one instruction that performs the same actions as a
series of simpler instructions, but there are exponentially many
variations of the latter and only a few of the former.  Experience has
shown that lots of variations get used, especially after optimization,
so it is impossible to pick a small set of complex instructions that
gets used often enough to be worthwhile.  Besides, these complex
instructions often get executed as a series of microsteps, and often
go no faster than the series of simple instructions.
Finally, it is possible to re-arrange the order of the simpler
instructions to avoid interlocks, which can't happen inside a complex
instruction.

On the flip side, complex instructions can run a deeper pipeline.  If
the instructions can truly be piped (a very big if, when interlocks
are taken into account), then this is equivalent to a cheap
'superscalar' implementation.  For example, a series of "Add Mem to
Reg" instructions, which can be piped at one per cycle, will run twice
as fast as the simpler "Load Mem to Reg", "Add Reg to Reg" series.
The pipeline is more complex, but is simpler than the full superscalar
implementation.  The question is, with good register allocation, does
this happen often enough to make it worthwhile?
--
baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
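[Editorial sketch, not part of the original thread: Baum's
factor-of-two claim is simple instruction-count arithmetic under his
idealized assumption that every instruction issues one per cycle with
no interlocks.  The toy model below just makes that arithmetic
explicit; it deliberately ignores the stalls that are his "very big
if".]

```python
def sum_from_memory_cycles(n, style):
    """Idealized cycle count for summing n values from memory on a
    one-instruction-per-cycle pipeline with no stalls.

    'cisc': n "Add Mem to Reg" instructions, one per element.
    'risc': n "Load Mem to Reg" / "Add Reg to Reg" pairs.

    Real interlocks (load-use delays, memory-port conflicts) would
    erode the CISC advantage; this model omits them on purpose.
    """
    if style == "cisc":
        return n        # one memory-operand add per element
    if style == "risc":
        return 2 * n    # a load plus a register add per element
    raise ValueError("style must be 'cisc' or 'risc'")
```

Under these assumptions the memory-operand form is exactly twice as
fast, which is the best case Baum describes before asking whether it
happens often enough in register-allocated code to matter.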
baum@Apple.COM (Allen J. Baum) (03/23/90)
[]
In article <1990Mar22.190941.1184@esegue.segue.boston.ma.us> johnl@esegue.segue.boston.ma.us (John R. Levine) writes:
>In article <289@emdeng.Dayton.NCR.COM> hrich@emdeng.UUCP (George.H.Harry.Rich) writes:
>>It might even be true that a complex instruction set designed ideally for
>>compiler code generation might beat RISC. ...
>
>I doubt it.  The IBM 801 project included some of the best compiler people
>around and they came up with the original RISC machine which was quite
>stripped down, and an extremely fancy compiler named PL.8 which generates
>fantastic code for it.

I'm afraid I no longer buy arguments of the form "x didn't do it, and
x is omnipotent, so it can't/shouldn't be done."  That work was done
10+ years ago; the state of the art has improved, and will continue to
improve.

In fact, the IBM patent suite includes one patent that describes how
to optimally choose instruction forms, including mem->reg
instructions.  For example, if something was to be added to a
register, and nothing else was to be done with it, then the "Add Mem
to Reg" form would be selected, not "Load Mem to Reg", "Add Reg to
Reg".  The latter might be used if the value was going to be used
again shortly, or be modified, etc.
--
baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
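[Editorial sketch, not part of the original thread or the IBM patent:
the selection rule Baum describes reduces to a one-bit decision - fold
the load into the add when the loaded value has no other use,
otherwise keep a separate load so the value stays live in a register.
The register names and mnemonics below are purely illustrative.]

```python
def select_add_form(value_reused_later):
    """Toy instruction selector for 'add a memory operand to r2'.

    If the loaded value has further uses, emit a separate load so it
    remains in a register (r1 here); if this add is its only use,
    fold the memory operand into the add itself.
    """
    if value_reused_later:
        return ["LOAD  r1, mem",      # keep the value live in r1
                "ADD   r2, r2, r1"]
    return ["ADD   r2, mem"]          # mem->reg form, one instruction
```

This is the direction Baum argues instruction selection was already
heading: the choice between forms is made per use-site from liveness
information, not fixed by the architecture.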
dik@cwi.nl (Dik T. Winter) (03/23/90)
In article <1990Mar22.184122.7917@ultra.com> shj@ultra.com (Steve Jay) writes:
> By 1970, however, CDC had a new
> compiler, FTN, which did rearrange instructions to optimize usage
> of the multiple functional units.  The technology of both local and
> global optimization in the FTN compiler was continuously improved,
> and by mid to late 70's, it was difficult to beat the compiler even
> with hand tuned assembly language.

And then came the problem.  CDC came out with newer versions of their
machines, and newer versions of their compiler.  The problem was that
different machines had different requirements with respect to
scheduling, so a program fully optimized for a 7600 was not optimal
for a 170/750.  There were switches in the compiler to tune for the
different models, but at least on the 170/750 it was possible to take
the compiler-generated assembler code, hand-tune it by simple
peephole optimization, and gain a factor of 2 (though of course not
for all programs).

This is in general a problem if the compiler has too much to do.
Newer models of the machine require a different compiler.  And not
only newer models: if you have a range of models differing only in
price and performance, you may have introduced different scheduling
requirements for the different models.  Although your architecture can
be such that object code compiled for one model is valid for another
model, it may be sub-optimal.  And think next about the hassle of
maintaining different versions of the compiler!

> I don't think the unavailability of an optimizing compiler when the
> 6600 first came out in any way detracts from the RISCness of the
> machine.  You can read articles written around 1965 which justify
> the design decisions for the 6600 in terms almost identical to those
> used today to justify RISC over CISC.

I agree here.  And do not take me wrong; I like the (60 bit) Cybers
and the Crays.
Although this belongs more to comp.compilers, it is also of
significance in this group, because there is a strong interaction
between compiler and machine.
--
dik t. winter, cwi, amsterdam, nederland
dik@cwi.nl
mash@mips.COM (John Mashey) (03/23/90)
In article <8912@boring.cwi.nl> dik@cwi.nl (Dik T. Winter) writes:
>In article <1990Mar22.184122.7917@ultra.com> shj@ultra.com (Steve Jay) writes:
>> By 1970, however, CDC had a new
>> compiler, FTN, which did rearrange instructions to optimize usage
>> of the multiple functional units.  The technology of both local and
>> global optimization in the FTN compiler was continuously improved,
>> and by mid to late 70's, it was difficult to beat the compiler even
>> with hand tuned assembly language.
>And then came the problem.  CDC came with newer versions of their machine,
>and newer versions of their compiler.  The problem was that different
>machines had different requirements with respect to scheduling.  So a
>program fully optimized for a 7600 was not optimal for a 170/750.  There
>were switches in the compiler to tune for the different models, but at...
>This is in general a problem if the compiler has too much to do.
>Newer models of the machine require a different compiler.  And not
>only newer models, but if you have a range of models differing only in
>price and performance, you may have introduced different scheduling
>requirements for the different models.  Although your architecture can
>be such that object code compiled for one model is valid for another
>model, it may be sub-optimal.  And think next about the hassle to
>maintain different versions of the compiler!

This issue, of course, is almost certainly true for every line of
computers that
a) Has multiple distinct implementations at the same time.
b) Evolves over time by anything but clock-rate changes to the same
   implementation.

Product families for which optimal code differs among models include
at least:
a) IBM S/360 and derivatives.  Even amongst the first round of S/360s,
   optimal code differed.  (Note that IBM compiler folks observed that
   pipeline scheduling was useful on some machines...)
b) DEC VAXen
c) Intel 80x86
d) Motorola 680x0
e) SPARC (different FPU timings already, for example, and if the next
   generation has multiple different styles of pipelines...)
f) MIPS Rx000 (R2000s always had 1-cycle writes; R3000s with approp.
   mode bit use 2-cycle write-partial-words; R6000s have different FP
   timings, etc).

Fortunately for the simpler architectures:
a) Integer instructions are fairly simple, understandable, and maybe
   even the same with regard to timing amongst different
   implementations.
b) Floating point operations are much more likely to vary, but they're
   probably less likely to be interchangeable, so you do what you can.
c) If you're lucky, the pipeline constraints may be such that you:
   1) Want to work harder for things with deeper pipelines, in terms
      of spreading operations apart to lessen stalls.
   2) Want to work harder for more aggressive machines that have more
      concurrency.

Fortunately, at least in some cases, there are optimizations for the
more aggressive machines that help them, but certainly don't hurt the
less aggressive machines much, if at all.  For instance, if machine
(n+1) has longer-latency loads than (n), trying harder to move
references to the data later probably won't hurt (n).

At least you don't have to fight with issues like:

- Model A has a (multi-cycle) serial shifter, where every shift
  position costs a cycle, but B has a barrel shifter, where the cost
  is constant regardless of shift count; and both have multipliers of
  differing speeds, so the optimal sequences for multiplies by
  constants are completely different, and the cutover from
  shifts+add/subtract to an actual multiply is completely different.

- On Model A, to copy 8 bytes from here to there, use move-character,
  because it has narrow data paths and microcode anyway; but on Model
  B, use load/store, because THOSE are hardwired and go faster than
  move-character, whose startup time dominates....
Anyway, CDC was hardly alone in this... it's a fact of life for
everybody that does multiple implementations.
--
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD: 	408-991-0253 or 408-720-1700, x253, or 408-524-7015
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
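[Editorial sketch, not part of the original thread: Mashey's point
that scheduling for a longer-latency machine "probably won't hurt" a
shorter-latency one can be seen in a toy in-order pipeline model.  A
load's result becomes available load_latency cycles after issue, and
a use of a not-yet-ready register stalls.  The instruction names and
register names below are illustrative only.]

```python
def total_cycles(schedule, load_latency):
    """Cycle count for a toy one-issue in-order machine.

    schedule is a list of ("load", reg) / ("use", reg) pairs.  A load
    makes reg's value available load_latency cycles after issue; a
    use of an unavailable register stalls until it is ready.
    """
    ready = {}   # register -> cycle its loaded value becomes available
    cycle = 0
    for op, reg in schedule:
        if op == "load":
            ready[reg] = cycle + load_latency
        elif op == "use":
            cycle = max(cycle, ready.get(reg, 0))  # stall if needed
        cycle += 1
    return cycle

# Back-to-back load/use pairs vs. loads hoisted ahead of their uses:
naive   = [("load", "r1"), ("use", "r1"), ("load", "r2"), ("use", "r2")]
hoisted = [("load", "r1"), ("load", "r2"), ("use", "r1"), ("use", "r2")]
```

With 1-cycle loads both schedules finish in the same time, so hoisting
costs the short-latency machine nothing; with 2-cycle loads only the
hoisted schedule avoids stalls - exactly the "helps the aggressive
machine, doesn't hurt the simple one" situation described above.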