greg@sce.carleton.ca (Greg Franks) (02/01/90)
In article <3300098@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>
>Here's a rhetorical question:
>
>When was the last time someone introduced a new CISC architecture?
>How many years has it been? New versions of old chips ('486, '040,
>etc) do not count as "new architectures".

The big players in the microprocessor wars are busy souping up their existing CISC processors all of the time, so why would they bother concocting new ones? People sure like having the latest CPU on their desk to run Lotus 1-2-3 or MacDraw. Furthermore, with the latest CISC processors reaching into the domain of the RISC processors in terms of performance (eg, 68040 @ 25 MHz being faster than SPARC @ 25 MHz according to Byte), who needs most of the RISC processors floating around these days? Just imagine, 100 MIPS and the ability to run an ancient version of WordPerfect all in one box!

Introducing *any* new architecture, be it RISC or CISC, is likely going to be exceptionally difficult these days unless that processor can demonstrate clear superiority over all others in some way or another (perhaps a cray-on-a-chip??).

Sign me - I want an upgrade :-)
--
Greg Franks (613) 788-5726 Carleton University, uunet!mitel!sce!greg (uucp) Ottawa, Ontario, Canada K1S 5B6. greg@sce.carleton.ca (bitnet) (we're on the internet too. (finally)) Overwhelm them with the small bugs so that they don't see the big ones.
mash@mips.COM (John Mashey) (02/04/90)
In article <771@sce.carleton.ca> greg@sce.UUCP (Greg Franks) writes:
>In article <3300098@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes:
>>
>>Here's a rhetorical question:
>>
>>When was the last time someone introduced a new CISC architecture?
>>How many years has it been? New versions of old chips ('486, '040,
>>etc) do not count as "new architectures".
>
>The big players in the microprocessor wars are busy souping up their
>existing CISC processors all of the time, so why would they bother
>concocting new ones. People sure like having the latest CPU on their
>desk to run lotus 123 or MacDraw. Furthermore, with the lastest CISC
>processors reaching into the domain of the RISC processors in terms of
>performance (eg, 68040 @ 25 MHz being faster than SPARC @ 25 MHz
>according to Byte),

My Bytes are in the middle of the input stack. Could someone please post the DATA that shows a 68040 @ 25MHz to be faster than a SPARC @ 25MHz? (Yes, I've seen the Motorola ads that show the 68040 to be 20 mips versus a SPARC's 18.... :-)

On the good side of reality, plaudits to UNIX/World, which has started publishing SPEC numbers (full charts) in some of its workstation comparisons.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jdudeck@polyslo.CalPoly.EDU (John R. Dudeck) (02/04/90)
In article <35456@mips.mips.COM> mash@mips.COM (John Mashey) writes: >a 68040 @ 25Mhz to be faster than a SPARC @ 25Mhz? >(yes, I've seen the Motorola ads that show the 68040 to be 20 mips versus >a SPARC's 18.... :-) In my understanding of RISC vs CISC, you can't directly compare RISC MIPS against CISC MIPS, because the risc instructions are simple, whereas the cisc instructions are complex. It may take several risc instructions to perform one cisc instruction. Originally the trick was to get the several risc instructions to execute in less time than the one complex instruction. Now the tables are turned, because the cpu designers have figured out how to get the cisc chips to perform the complex instruction in the same clock cycle that the risc chip takes to perform the simple instruction... In a DEC Professional editorial I saw the expression CRISCO (complex risc) architecture to refer to this. -- John Dudeck "You want to read the code closely..." jdudeck@Polyslo.CalPoly.Edu -- C. Staley, in OS course, teaching ESL: 62013975 Tel: 805-545-9549 Tanenbaum's MINIX operating system.
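The mismatch Dudeck describes can be put in back-of-the-envelope arithmetic. A minimal sketch (the work-per-instruction ratios below are assumptions for illustration, not measured data for any real chip): "native MIPS" counts instructions retired per second, but a CISC instruction may stand for several RISC-style operations.

```python
# Illustrative only: ops_per_instr values are assumed, not measured.
# "Native MIPS" counts instructions/second; a CISC instruction may do
# the work of more than one simple RISC-style operation.

def effective_ops_per_sec(native_mips, ops_per_instr):
    """Convert a native MIPS figure into RISC-equivalent simple ops/sec."""
    return native_mips * ops_per_instr

# Assume each CISC instruction averages 1.3 simple operations' worth of work.
cisc = effective_ops_per_sec(20.0, 1.3)  # a "20 MIPS" CISC
risc = effective_ops_per_sec(18.0, 1.0)  # an "18 MIPS" RISC; its own
                                         # instructions are the unit of measure
```

On these made-up numbers the "20 MIPS" CISC does more equivalent work than the "18 MIPS" RISC; with a smaller assumed ratio the ordering changes, which is exactly why raw MIPS comparisons by themselves settle nothing.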
lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) (02/04/90)
In article <25cb6b65.702c@polyslo.CalPoly.EDU> jdudeck@polyslo.CalPoly.EDU (John R. Dudeck) writes: >In my understanding of RISC vs CISC, you can't directly compare RISC MIPS >against CISC MIPS, because the risc instructions are simple, whereas the >cisc instructions are complex. It may take several risc instructions to >perform one cisc instruction. Originally the trick was to get the several >risc instructions to execute in less time than the one complex instruction. >Now the tables are turned, because the cpu designers have figured out >how to get the cisc chips to perform the complex instruction in the same clock >cycle that the risc chip takes to perform the simple instruction... I think that this overstates the advantage of CISC. The recent CISC chips aren't getting a complex instruction to run in one clock. More correctly, they are getting the most commonly used simple instructions to run in one clock. They are also getting the most commonly used complex instructions to run in fewer clocks. The whole RISC thing came about because several compiler people found that they could get better performance out of CISCs by ignoring many of the complex instructions, thus treating the machines as RISC. The hardware people responded by building machines that did only the simple things. To my surprise, the payoff was fairly big. RISC reduced the design time - an advantage that a fast CISC doesn't have. It also reduced the silicon area, but as all the players add onchip caches and whatnot, that matters little. Finally, RISC increased the clock rate, but advanced CISC should come close. So, is it a wash? More-or-less, yes - if RISC designs stand still. But they aren't. RISC is moving to ECL and GaAs, where transistors are scarce. They are also moving to superscalar designs, where the RISC/CISC difference is between incredible complexity and stupefying complexity. -- Don D.C.Lindsay Carnegie Mellon Computer Science
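Lindsay's point about where the CISC speedups actually come from can be sketched with weighted-CPI arithmetic. The instruction mix and cycle counts below are invented for illustration, not data for any particular chip:

```python
# Hypothetical dynamic instruction mix and cycle costs, for illustration.
mix = {
    "load":    0.25,
    "store":   0.10,
    "alu":     0.45,
    "branch":  0.15,
    "complex": 0.05,  # string ops, polynomial evaluation, etc.
}

def avg_cpi(cycles):
    """Average cycles per instruction for a given per-class cost table."""
    return sum(mix[i] * cycles[i] for i in mix)

# Older CISC: everything is slow.  Newer CISC: the common simple
# instructions run in one clock, but the complex ones stay slow.
old_cisc = avg_cpi({"load": 4, "store": 4, "alu": 2, "branch": 3, "complex": 20})
new_cisc = avg_cpi({"load": 1, "store": 1, "alu": 1, "branch": 2, "complex": 20})
```

Under these assumed numbers the average CPI drops from 3.75 to 2.10 purely by speeding up the 95% of instructions that are simple; the rarely-used complex instructions stay at 20 cycles, matching the observation that they are "still slow (and still rarely used)".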
pkr@maddog.sgi.com (Phil Ronzone) (02/04/90)
In article <7826@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>RISC reduced the design time - an advantage that a fast CISC doesn't
>have. It also reduced the silicon area, but as all the players add
>onchip caches and whatnot, that matters little. Finally, RISC
>increased the clock rate, but advanced CISC should come close.
>
>So, is it a wash? More-or-less, yes - if RISC designs stand still.
>But they aren't. RISC is moving to ECL and GaAs, where transistors
>are scarce. They are also moving to superscalar designs, where the
>RISC/CISC difference is between incredible complexity and stupefying
>complexity.

I see that as the ONLY large advantage that RISC has. It simply has been able to reduce the design time.

The second argument (gate scarcity) is interesting, but does it not also have a limit? If gates are "typical" in the 10,000-100,000 range, yes, but how about when gates are "typical" in the 1,000,000-10,000,000 range?

------Me and my dyslexic keyboard----------------------------------------------
Phil Ronzone Manager Secure UNIX pkr@sgi.COM {decwrl,sun}!sgi!pkr
Silicon Graphics, Inc. "I never vote, it only encourages 'em ..."
-----In honor of Minas, no spell checker was run on this posting---------------
preston@titan.rice.edu (Preston Briggs) (02/05/90)
In article <3562@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes:
>In article <7826@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>>So, is it a wash? More-or-less, yes - if RISC designs stand still.
>>But they aren't. RISC is moving to ECL and GaAs, where transistors
>>are scarce. They are also moving to superscalar designs, where the
>>RISC/CISC difference is between incredible complexity and stupefying
>>complexity.
>I see that as the ONLY large advantage that RISC has. It simply has
>been able to reduce the design time.

"Simply" is the correct word, but not just applied to design time. The "incredible complexity" vs. "stupefying complexity" also applies to the problem of generating code for the superscalar design. A chip like the 860 is hard enough; if somebody builds a similar machine with complex addressing modes, etc., it'll be really difficult to build a good compiler for.

Lindsay pointed out that RISC machines were a response to compilers that only used the simple instructions. I expect (don't know for sure) that the compilers for 80x86's and 680x0's are still mostly using the simple instructions. Speed isn't the only reason for avoiding complex instructions; you avoid them because they're difficult to generate, because they don't do what you want in the first place, and because the intermediate results aren't available for reuse.

For (an old, perhaps overworked) example: Suppose I want to load a value from memory and add it to a register. On most CISC's I can do it in one instruction. On most RISC's, I have to use 2 instructions. On the RISC machine, the value I loaded will still be in a register where I can reuse it later. Of course, we could also use 2 instructions on the CISC. How often does this case arise? That depends on your code and the strength of your optimizer. The RISC bet (supported by dynamic code measurements) is that it happens a lot.
So, no matter how fast the CISC people make that "add from memory" instruction run, it won't matter much because it isn't used much. Preston Briggs preston@titan.rice.edu
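Briggs's load-and-add example can be written out explicitly. The mnemonics below are invented pseudo-assembly, not any real ISA; the point is just to count instructions and memory references for "r1 += *p" when the loaded value is, or is not, reused later:

```python
# Invented pseudo-assembly, for counting only (not a real ISA).

cisc_no_reuse = ["add  r1, (p)"]                 # one fused load+add
risc_no_reuse = ["load r2, (p)", "add  r1, r2"]  # two simple instructions

cisc_reuse = ["add  r1, (p)",   # fused op discards the loaded value...
              "load r2, (p)",   # ...so reusing it means touching memory again
              "add  r3, r2"]
risc_reuse = ["load r2, (p)",   # value lands in a register,
              "add  r1, r2",    # so the later reuse is just a
              "add  r3, r2"]    # register-to-register add
```

Without reuse the CISC wins by one instruction; with reuse the counts are equal, but the RISC sequence makes one memory reference where the CISC sequence makes two. The RISC bet is that the reuse case is frequent enough for this to dominate.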
pkr@maddog.sgi.com (Phil Ronzone) (02/05/90)
In article <4537@brazos.Rice.edu> preston@titan.rice.edu (Preston Briggs) writes:
>So, no matter how fast the CISC people make that "add from memory"
>instruction run, it won't matter much because it isn't used much.

I was too brief. Perhaps in a sense, I am saying that RISC is a non-concept, or a transient "necessary evil". Clearly we have (at least) two technological thrusts for RISC:

1). Compiler technology
2). Highly automated design tools that cross the threshold of the minimal number of gates needed to do something useful.

I am of the belief that as our design tools progress, they will go far beyond what is needed for RISC. At some stage, we'll get into the range of handling large word width microcoded machines. Such microcoded machines can certainly execute many simple (i.e., RISC type) instructions in one cycle -- I know, I've done one. BUT - they also can implement many useful instructions that are not in some (or all) of the current RISC machines. From multiply to divide to handling the TLB context to entire context switches to (of course) FP instructions. Since what counts most is the HUMAN design time, if they are equivalent, then why not a CRISC? RISC instructions for typical code generation, and the messy many for the rest of the real world. Of course, there could be an order of magnitude difference between RISC and CRISP/CISC; however, how much the die area and hence cost will matter is, IMHO, not a big factor.

-------------

This came up for me because I microcoded a mainly RISC machine (stack oriented, non-virtual) that had to support unaligned data transfer. The microcode word kept getting wider and wider, but I had barrel shifters in front of the memory and in front/back of the ALU and hardware assist to keep the 16 top words of the stack in microcode registers.
It was a near joy to do unaligned word transfers -- if the next sequential instruction was a push/pop from/to memory and the word was unaligned 4 bytes on a 2 byte boundary, I never lost a cycle (pre-interrupts went off in the previous instruction so I could start the shift fetch shift in memory). ------Me and my dyslexic keyboard---------------------------------------------- Phil Ronzone Manager Secure UNIX pkr@sgi.COM {decwrl,sun}!sgi!pkr Silicon Graphics, Inc. "I never vote, it only encourages 'em ..." -----In honor of Minas, no spell checker was run on this posting---------------
henry@utzoo.uucp (Henry Spencer) (02/06/90)
In article <25cb6b65.702c@polyslo.CalPoly.EDU> jdudeck@polyslo.CalPoly.EDU (John R. Dudeck) writes: >Now the tables are turned, because the cpu designers have figured out >how to get the cisc chips to perform the complex instruction in the same clock >cycle that the risc chip takes to perform the simple instruction... No, they've figured out how to make tomorrow's CISC chips perform the simpler instructions in the same clock cycle that yesterday's RISC chips took to perform similarly simple operations. The complicated instructions are still slow (and still rarely used), the RISCs still have a built-in lead due to shorter design times, and the CISCs still have a built-in handicap due to the mass of instruction/decoding/exception baggage dragging along behind their RISC-like cores. -- SVR4: every feature you ever | Henry Spencer at U of Toronto Zoology wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu
pkr@maddog.sgi.com (Phil Ronzone) (02/07/90)
In article <1990Feb5.211208.15741@utzoo.uucp> henry@utzoo.uucp (Henry Spencer) writes: >No, they've figured out how to make tomorrow's CISC chips perform the >simpler instructions in the same clock cycle that yesterday's RISC chips >took to perform similarly simple operations. The complicated instructions >are still slow (and still rarely used), the RISCs still have a built-in >lead due to shorter design times, and the CISCs still have a built-in >handicap due to the mass of instruction/decoding/exception baggage >dragging along behind their RISC-like cores. Hmm, like automatic TLB loading, or that even more rarely used set known as MUL and DIV??? :-) ------Me and my dyslexic keyboard---------------------------------------------- Phil Ronzone Manager Secure UNIX pkr@sgi.COM {decwrl,sun}!sgi!pkr Silicon Graphics, Inc. "I never vote, it only encourages 'em ..." -----In honor of Minas, no spell checker was run on this posting---------------
chasm@attctc.Dallas.TX.US (Charles Marslett) (02/07/90)
In article <4537@brazos.Rice.edu>, preston@titan.rice.edu (Preston Briggs) writes:
> For (an old, perhaps overworked) example:
> Suppose I want to load a value from
> memory and add it to a register. On most CISC's I can do it in one
> instruction. On most RISC's, I have to use 2 instructions.
> On the RISC machine, the value I loaded will still be in a register
> where I can reuse it later. Of course, we could also use 2 instructions
> on the CISC. How often does this case arise? That depends on
> your code and the strength of your optimizer. The RISC bet (supported
> by dynamic code measurements) is that it happens a lot.

Are these dynamic code measurements of sloppy assembly code, or of good compiler-generated code, or . . . ? If the addend is used all that much, some form of strength reduction would be warranted. I suspect the measurements are due to the "cheapness" of the CISC address arithmetic [using PTR+6, PTR+9, PTR+12, etc., because it costs nothing in execution time]. Add immediate is used a lot because it is cheap, too.

> So, no matter how fast the CISC people make that "add from memory"
> instruction run, it won't matter much because it isn't used much.

If my memory is still holding up, the probability that you will reuse the addend is less than 50% with vast numbers of registers (64 or so), and then only if you spend an immense amount of computational resources doing rather good data flow analysis. That says if you have add-from-memory, you're better off most of the time using it. The more accurate statement might be that if you design a RISC box, it is better to design a register-to-register add than a memory-to-register add or a register-to-memory add, because the penalty for the other 50% where it is not optimal is much less.

Charles Marslett
chasm@attctc.dallas.tx.us
[I needed some hate mail anyway, ;^)]
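Marslett's break-even argument can be sketched as expected instruction counts. The model below is an assumption-laden illustration (it counts instructions only, ignoring cycle times and cache effects), with p the probability that the loaded addend is reused later:

```python
# Expected instruction counts for "r1 += *p", as a function of the
# probability p that the loaded value is reused.  Illustrative model:
# instruction counts only, no cycle times or memory latency.

def risc_cost(p):
    # load + add always; reusing the register later costs nothing extra
    return 2.0

def cisc_cost(p):
    # one fused add-from-memory; a reuse forces one extra reload
    return 1.0 + p
```

By raw instruction count the fused form wins whenever p < 1, which is the sense in which "you're better off most of the time using it" if reuse happens less than half the time. The RISC counterargument is that the two simple instructions are each cheaper, and that the CISC reload is an extra memory reference, not just an extra instruction.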
ccc_ldo@waikato.ac.nz (02/07/90)
How about the 65816? Wasn't that around 1986?
sgolson@pyrite.East.Sun.COM (Steve Golson) (02/07/90)
In article <3300098@m.cs.uiuc.edu> gillies@m.cs.uiuc.edu writes: > When was the last time someone introduced a new CISC architecture? > How many years has it been? New versions of old chips ('486, '040, > etc) do not count as "new architectures". Since no one else has mentioned it... what about TRON? Steve Golson sgolson@East.sun.com golson@cup.portal.com Trilobyte Systems -- 33 Sunset Road -- Carlisle MA 01741 -- 508/369-9669 (consultant for, but not employed by, Sun Microsystems) "As the people here grow colder, I turn to my computer..." -- Kate Bush
henry@utzoo.uucp (Henry Spencer) (02/08/90)
In article <3674@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes: >>... and the CISCs still have a built-in >>handicap due to the mass of instruction/decoding/exception baggage >>dragging along behind their RISC-like cores. > >Hmm, like automatic TLB loading, or that even more rarely used set known >as MUL and DIV??? :-) Automatic TLB loading is not worth the hardware needed to do it, as Mips (among others) has clearly demonstrated. And most RISCs do something about multiplication and division, although sometimes the "something" is a carefully-considered decision, based on extensive simulations, to leave it to software. (Of course, sometimes the same decision is made without the careful consideration and extensive simulation... :-( ) I haven't heard many complaints about having to live without TranslateAndTest or EvaluatePolynomial instructions. :-) -- SVR4: every feature you ever | Henry Spencer at U of Toronto Zoology wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu
mash@mips.COM (John Mashey) (02/08/90)
In article <3562@odin.SGI.COM> pkr@maddog.sgi.com (Phil Ronzone) writes:
>In article <7826@pt.cs.cmu.edu> lindsay@MATHOM.GANDALF.CS.CMU.EDU (Donald Lindsay) writes:
>>RISC reduced the design time - an advantage that a fast CISC doesn't
>>have. It also reduced the silicon area, but as all the players add
>>onchip caches and whatnot, that matters little. Finally, RISC
>>increased the clock rate, but advanced CISC should come close.
>>
>>So, is it a wash? More-or-less, yes - if RISC designs stand still.
>>But they aren't. RISC is moving to ECL and GaAs, where transistors
>>are scarce. They are also moving to superscalar designs, where the
>>RISC/CISC difference is between incredible complexity and stupefying
>>complexity.
>
>I see that as the ONLY large advantage that RISC has. It simply has
>been able to reduce the design time.
>
>The second argument (gate scarcity) is interesting, but does it not
>also have a limit? If gates are "typical" in the 10,000-100,000 range,
>yes, but how about when gates are "typical" in the 1,000,000-10,000,000.

Gate/transistor count can be misleading. On anything current, most of the transistors will be in the caches & MMU and register files. Certainly, with million-transistor chips, there are not enough to do everything you'd like, and even at 3-4M, although that gets you bigger caches, we'll still have compromises and arguments in the hallways. Some of the issues are:

1) In CMOS, the smaller size is less of an advantage for RISC than it used to be. Nevertheless, at a given technology level, for a few years, it often means you get a bigger cache, a more parallel FPU, a bigger MMU, or something else on the same die, or, that the die can be smaller and hence cheaper, or that the RISC gets an even more aggressive pipeline in the same space.

2) As everybody gets more aggressive, the pipelines and other critical paths get more complex, as more aggressive = more things in parallel.
CISCs may well take longer to design (or not), but the key issue is what happens in the critical paths on the chip. From past history (i.e., things like the 360/91), you can make any architecture go faster, but if not designed for smooth pipelining, the complexity can get very high. In addition, VLSI has design constraints that forbid some of the solutions used in the less integrated designs, i.e., lots of big busses and interconnects. (You certainly can have big busses, but you still only get a few layers of them, and as soon as your design exceeds what you can get on the chip, performance drops, whereas the performance cost of incremental complexity in a less integrated design is not necessarily so much.)

3) Exception-handling is always one of the most trouble-prone areas of a design, and anything that makes it more complex slows down the design process.

4) Finally, nothing will make the current CISC micro architectures have more registers available at once [they might get register sets, but the instruction encoding makes it pretty hard to increase the number available to the compiler at once.]

Note: this was not intended to be a RISC commercial, merely to point out that the transistor-count issue gets over-emphasized. I give a talk that ends with:

IF RISC IS SO GOOD, WILL CISC DISAPPEAR?
    No. 68Ks and X86s will be with us forever.
WILL CISCS GET FASTER?
    Yes, using RISC-like techniques, or in fact, the same techniques that mainframe and supermini people have been using for 20 years to speed existing CISC architectures.
WILL THEY CATCH UP?
    No:
    Intellectual complexity.
    Longer design cycles.
    Fewer registers than current global optimizers can use.
WHERE ARE THEY NOW?
It is hard to tell, as I've seen no real benchmarks for an 040 yet [they may exist, I haven't seen them], and I'm hoping to see SPEC ratios for the 486 fairly soon, which will really help get over apples-and-oranges comparisons. I have been collecting some numbers in advance of that, although unfortunately limited to *stones & such, and will post some soon. What I see so far says that 25 MHz 486s, in 32-bit mode (I think), have floating point that looks mostly like a MIPS M/500, and integer somewhat less than an M/800, even with external caches as well as internal. This is faster than an 8MHz R2000, and slower than a 12.5MHz one.... To be fair, they're somewhat faster under MS/DOS, and I'm not sure if that's compiler differences, or (more likely) 16-vs-32 bit model differences.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
gillies@p.cs.uiuc.edu (02/09/90)
> Since no one else has mentioned it... what about TRON?
TRON was the only response I received via email that fit the
requirements (a genuinely new CISC instruction set, not an enhancement
to some old product). How long ago was TRON conceived? I'm guessing
it may have been 5 years ago, but I'm not sure. It's very interesting
that TRON comes from Japan, while the U.S. is still considered to be the
leader in CPU design. Maybe I should have asked for the most recent
U.S. CISC CPU design.
pkr@maddog.sgi.com (Phil Ronzone) (02/09/90)
In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>WILL THEY [cisc] CATCH UP?
>	No:
>	Intellectual complexity.
>	Longer design cycles.
>	Less registers than match current global optimizers.

Hmmm -- maybe we should break RISC into RISC-the-instruction-set and RISC-the-el-cheapo-hardware-realization-of-an-instruction-set.

What would we call a chip with, say, an R3000 instruction set AND a microcoded 68XXX instruction set, with a mode bit to flip between the two? RISC? CISC? CRISP? :-) RIDICULOUS? We're going to be able to do it some day.

------Me and my dyslexic keyboard----------------------------------------------
Phil Ronzone Manager Secure UNIX pkr@sgi.COM {decwrl,sun}!sgi!pkr
Silicon Graphics, Inc. "I never vote, it only encourages 'em ..."
-----In honor of Minas, no spell checker was run on this posting---------------
slackey@bbn.com (Stan Lackey) (02/09/90)
In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes: >3) Exception-handling is always one of the most trouble-prone areas of >a design, and anything that makes it more complex slows down the design >process. Microcode is the way this problem is commonly dealt with. Microcode turns an intractable hardware control mechanism into a part of the design that many a computer hardware or software person can understand, design, debug, etc. Because exception handling must be dealt with in hardware in a RISC, one could make the claim that this makes RISCs more complex (from a certain point of view) than CISCs. >4) Finally, nothing will make the current CISC micro architectures >have more registers available at once [they might get register sets, >but the instruction encoding makes it pretty hard to increase the number >available to the compiler at once.] This issue keeps coming up. You know, I have seen so many compiled routines that contain inner loops that only use 5 or 6 registers, inner loops that determine the performance of the program, that I really wonder how many registers are needed (Oh no not that again!) Global register allocation and all, who cares if you save and restore a half dozen registers outside of a loop that is going to execute 100 times. Having more registers really does help some of the time, and when the industry starts making new CISC architectures I'll bet you will see more, now that program size is not so constraining. >IF RISC IS SO GOOD, WILL CISC DISAPPEAR? > No. > 68Ks and X86s will be with us forver. As will lots of other architectures, as long as RISCs are going to neglect certain functionality. >WILL THEY CATCH UP? > No: > Intellectual complexity. > Longer design cycles. > Less registers than match current global optimizers. Vector machines always run faster on vector problems than non vector machines. Even if the cycle time is a little slower. 
The shoe is moving to the other foot, so to speak; in order to match vector machines, RISCs will need to go to superscalar execution (assuming they don't add the large register sets or the instructions to do vectors). To do this they need to deal with variable length instructions (variability determined by register dependencies and stuff in the pipe, not to mention the surrounding instructions), register and opcode fields in variable places in the instruction word, complexity handling exceptions, and all the other CISC characteristics RISCers love to bash. -Stan
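Lackey's "who cares" about save/restore overhead around a 100-trip loop can be put in numbers. The counts below are assumptions for illustration:

```python
# Amortized cost of register save/restore around a loop, as a fraction
# of dynamic instructions.  All counts are illustrative assumptions.

def spill_overhead(regs_saved, loop_trips, instrs_per_trip):
    """Fraction of dynamic instructions spent saving/restoring registers."""
    spill = 2 * regs_saved               # one save + one restore per register
    work = loop_trips * instrs_per_trip  # instructions executed in the loop
    return spill / (spill + work)

# Half a dozen registers around a 100-trip loop of ~10 instructions:
overhead = spill_overhead(regs_saved=6, loop_trips=100, instrs_per_trip=10)
```

Under these assumptions the overhead is 12/1012, about 1.2% -- negligible, as claimed. The counterexamples are loops that trip only a few times, and register pressure *inside* the inner loop, which is where extra registers pay off regardless.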
baum@Apple.COM (Allen J. Baum) (02/09/90)
[]
>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>Gate/transistor count can be misleading.
>On anything current, most of the transistors will be in the caches & MMU
>and register files.

Actually, I thought that I'd heard that 3/4 of the area on the 88100 is the FPU (course, that doesn't include caches).

>Some of the issues are:
>2) AS everybody gets more aggressive, the pipelines and other critical paths
>get more complex, as more aggressive = more things in parallel.
>CISCs may well take longer to design (or not), but the key issue is what
>happens in the critical paths on the chip. From past history (i.e., things
>like 360/91), you can make any architecture go faster, but if not designed
>for smooth pipelining, the complexity can get very high.

Bingo! I believe you've said something I believe strongly, and the crux is the "designed for smooth pipelining" phrase. I feel that this is really the major distinguishing feature between "RISC" & "CISC". This is far more important than the silly RISC/CISC #of-regs/addressing-modes arguments. A "CISC" which is designed for pipelining should keep up with a "RISC". The tricks used to make "RISC"s go faster then work for "CISC"s as well. Most of the otherwise fundamental problems of "CISC"s (exception handling) go away (to the same extent they go away in "RISC"s anyway).
--
baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
baum@Apple.COM (Allen J. Baum) (02/09/90)
[]
>In article <51951@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>3) Exception-handling is always one of the most trouble-prone areas of
>>a design, and anything that makes it more complex slows down the design
>>process.
>
>Microcode is the way this problem is commonly dealt with. Microcode
>turns an intractable hardware control mechanism into a part of the
>design that many a computer hardware or software person can
>understand, design, debug, etc.

Well, you still need to be able to save the necessary info to unwind state, which may mean keeping around a lot more info than a RISC, and you may have to be very careful to do it in the right order, etc. Note that exception handling on some RISCs (i860) is a complete bitch as well.

>>4) Finally, nothing will make the current CISC micro architectures
>>have more registers available at once [they might get register sets,
>>but the instruction encoding makes it pretty hard to increase the number
>>available to the compiler at once.]
>
>This issue keeps coming up. You know, I have seen so many compiled
>routines that contain inner loops that only use 5 or 6 registers,
>inner loops that determine the performance of the program, that I
>really wonder how many registers are needed (Oh no not that again!)
>Global register allocation and all, who cares if you save and restore
>a half dozen registers outside of a loop that is going to execute 100
>times.
>
>Having more registers really does help some of the time, and when
>the industry starts making new CISC architectures I'll bet you will
>see more, now that program size is not so constraining.

Yes and no. Wall's paper about link-time register allocation with 52 regs showed a 10-29% speedup. The high end was with the Stanford benchmark, not a real workload. These are still relatively small numbers compared to what waiting one year will bring.
On the other hand, lots of registers let you perform optimizations (besides avoiding register spills) which can't be done otherwise, notably loop unrolling. (But, to counter the counter-argument, superscalar techniques might severely lessen the advantages of unrolling, since the overhead which is being saved might be done in parallel.)

>The shoe is moving to the other foot, so to speak; in order to match
>vector machines, RISCs will need to go to super scalar execution
>(assuming they don't add the large register sets or the instructions
>to do vectors). To do this they need to deal with variable length
>instructions (variability determined by register dependencies and
>stuff in the pipe, not to mention the surrounding instructions),
>register and opcode fields in variable places in the instruction word,
>complexity handling exceptions, and all the other CISC characteristics
>RISCers love to bash.

You betcha. RISCs probably don't need to be quite as aggressive as CISCs to take advantage of these techniques, but the complexity is going to be worse than current day CISCs.
--
baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
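The unrolling Baum mentions can be sketched at source level. A hypothetical dot product, unrolled by four with independent accumulators; each accumulator stands in for one of the extra registers a large register file provides, so partial products can proceed without a serial dependence on a single sum:

```python
def dot(a, b):
    # straightforward loop: one accumulator, one dependence chain
    s = 0
    for i in range(len(a)):
        s += a[i] * b[i]
    return s

def dot_unrolled4(a, b):
    # unrolled by 4: four independent accumulators stand in for the
    # extra registers a wider register file would provide
    s0 = s1 = s2 = s3 = 0
    n = len(a) - len(a) % 4
    for i in range(0, n, 4):
        s0 += a[i] * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
    s = s0 + s1 + s2 + s3
    for i in range(n, len(a)):  # cleanup for lengths not divisible by 4
        s += a[i] * b[i]
    return s
```

With too few registers the four accumulators spill to memory and the transformation backfires, which is the link between register count and unrolling; and as the parenthetical notes, a superscalar machine that overlaps loop overhead on its own captures much of the same win.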
colwell@mfci.UUCP (Robert Colwell) (02/09/90)
In article <51951@bbn.COM> slackey@BBN.COM (Stan Lackey) writes:
>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>3) Exception-handling is always one of the most trouble-prone areas of
>>a design, and anything that makes it more complex slows down the design
>>process.
>
>Microcode is the way this problem is commonly dealt with. Microcode
>turns an intractable hardware control mechanism into a part of the
>design that many a computer hardware or software person can
>understand, design, debug, etc. Because exception handling must be
>dealt with in hardware in a RISC, one could make the claim that this
>makes RISCs more complex (from a certain point of view) than CISCs.

One could also dispute that claim. Exception handling in normal machines (meaning those lacking the hard-real-time limit that incoming missiles pose) doesn't deserve special hardware attention. Give the software as much as it needs to clean up the mess. Anything more increases the hardware design time and the likelihood that something will have to be respun to fix bugs.

Sure, now I've moved that complexity into software, and there it will still have to be dealt with. But I don't know of any machines that were late because the software implementing their exception handlers wasn't ready, and I can think of lots of examples of complexity-related bugs delaying hardware.

>Having more registers really does help some of the time, and when
>the industry starts making new CISC architectures I'll bet you will
>see more, now that program size is not so constraining.

You need as many registers as it takes for spilling and restoring them to stay off your list of bottlenecks. This is a fairly complicated function of the number of functional units, their respective latencies, the bandwidth available (and needed) to and from memory, and the cleverness of the compiler. Not a RISC/CISC issue at all (which we first pointed out in 1983).

>>WILL THEY <CISC> CATCH UP?
>> No:
>> Intellectual complexity.
>> Longer design cycles.
>> Less registers than match current global optimizers.

>Vector machines always run faster on vector problems than non vector
>machines. Even if the cycle time is a little slower.

"Always" is a tad strong. If you're talking about 100% vectorizable code
I suppose you're right, but there isn't much of that around. It
certainly doesn't constitute the workloads of the customers and
benchmarks that we routinely run across. For anything less, I believe
vector machines are yesterday's answer to the problem.

>The shoe is moving to the other foot, so to speak; in order to match
>vector machines, RISCs will need to go to super scalar execution
>(assuming they don't add the large register sets or the instructions
>to do vectors). To do this they need to deal with variable length
>instructions (variability determined by register dependencies and
>stuff in the pipe, not to mention the surrounding instructions),
>register and opcode fields in variable places in the instruction word,
>complexity handling exceptions, and all the other CISC characteristics
>RISCers love to bash.

We solved this in Multiflow's machines without resorting to any of that.
Number of registers and memory bandwidth scale with the number of
functional units. Instruction variability is at the packet (32-bit
instruction word) level; a packet is present or it is not, and the cache
miss hardware looks at a "mask" word to decide. This allows us to do
cache refill at full memory bandwidth without the refill engine having
to even see any of the packets -- they just get blasted into icache
directly. And since they're fully decoded already, we get the RISC
benefit of simple, fast instruction decode.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405        203-488-6090
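[Editor's note: the mask-word refill scheme Colwell describes can be sketched in a few lines. Everything below is an invented illustration -- the slot count, the NOP encoding, and the function names are not Multiflow's actual format -- but it shows why the refill engine never has to inspect a packet: presence is entirely encoded in the mask.]

```python
# Hypothetical sketch of compressed-VLIW refill: a "mask" word says
# which fixed packet slots of an instruction line are present; absent
# slots are implicit NOPs. Expansion needs only the mask, never the
# contents of any packet, so packets can stream into icache untouched.

NOP = 0  # invented encoding for an absent packet

def expand_line(mask, packed, slots=8):
    """Expand one compressed line: bit i of `mask` set means slot i
    holds the next 32-bit packet from `packed`; clear means a NOP."""
    out, it = [], iter(packed)
    for i in range(slots):
        out.append(next(it) if (mask >> i) & 1 else NOP)
    return out

# mask 0b00100101: packets present only in slots 0, 2, and 5.
assert expand_line(0b00100101, [0xA, 0xB, 0xC]) == [0xA, 0, 0xB, 0, 0, 0xC, 0, 0]
```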
colwell@mfci.UUCP (Robert Colwell) (02/09/90)
In article <38462@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>>In article <35647@mips.mips.COM> mash@mips.COM (John Mashey) writes:
>>Some of the issues are:
>>2) As everybody gets more aggressive, the pipelines and other critical paths
>>get more complex, as more aggressive = more things in parallel.
>>CISCs may well take longer to design (or not), but the key issue is what
>>happens in the critical paths on the chip. From past history (i.e., things
>>like 360/91), you can make any architecture go faster, but if not designed
>>for smooth pipelining, the complexity can get very high.
>
>Bingo! I believe you've said something I believe strongly, and the
>crux is the "designed for smooth pipelining" phrase. I feel that this
>is really the major distinguishing feature between "RISC" & "CISC".
>This is far more important than the silly RISC/CISC
>#of-regs/addressing-modes arguments. A "CISC" which is designed for
>pipelining should keep up with a "RISC". The tricks used to make
>"RISC"s go faster then work for "CISC"s as well.
>Most of the otherwise fundamental problems of "CISC"s (exception handling)
>go away (to the same extent they go away in "RISC"s anyway).

But this is only true if you are setting out to design a CISC starting
from a clean slate. It's not clear to me why anyone would do that,
unless they had goals other than performance uppermost in their minds
(and I believe there are some). If your task is to implement the VAX,
say, with "smooth pipelining", and you want to keep up with a machine
designed to be a RISC-like compiler target, then I believe you're doomed
to failure. (See Doug Clark's description of his travails in
implementing the VAX-8600 in ASPLOS-II. I actually felt sorry for him;
it reads like an Edgar Allan Poe horror story.) There are just too many
architecturally required strands of spaghetti for you to end up with a
clean design.
And if you're starting from scratch to design a CISC, and you try to
implement this concept of "smooth pipelining" (which could stand a more
rigorous definition, by the way), and you try to yank exception handling
into software, and you make the frequent ops go fast, and you minimize
the side effects of ops, then what have you got left that qualifies your
design as a CISC? Maybe you've left yourself some really complicated
instructions for some special purposes. If so, good luck working those
into your exception-handling and pipelining schemes. I don't believe you
can do that in anything like the same amount of time a similar RISC
design would need.

Bob Colwell               ..!uunet!mfci!colwell
Multiflow Computer     or colwell@multiflow.com
31 Business Park Dr.
Branford, CT 06405        203-488-6090
oconnordm@CRD.GE.COM (Dennis M. O'Connor) (02/09/90)
baum@Apple (Allen J. Baum) writes:
] >CISCs may well take longer to design (or not), but the key issue is what
] >happens in the critical paths on the chip. From past history (i.e., things
] >like 360/91), you can make any architecture go faster, but if not designed
] >for smooth pipelining, the complexity can get very high.
]
] Bingo! I believe you've said something I believe strongly, and the
] crux is the "designed for smooth pipelining" phrase. I feel that this
] is really the major distinguishing feature between "RISC" & "CISC".
A major illustrative example of this was the MCF architecture, developed by
the military when DEC refused to license the VAX architecture to MIL-SPEC
computer manufacturers. ( MCF was known as Nebula, also )
MCF was very similar to a VAX, but more so. It had recursive addressing
modes; for instance, you could, in a single address specification,
specify something like ( M[x] = contents of memory location x )
[offset + M[ offset + M[ offset + M[ offset + register ] ] ] ]
I kid you not. And with no limit on the level of nesting. Just
think how easy (!?) this made compilation of high-level code
constructs like
rec_array( index_array( frame(2).index ).in_ptr ).rec_field( 2 )
;-)
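[Editor's note: the nesting above is easier to see in code. Here is a rough software model of resolving one such recursive address specifier -- all names and the sample memory contents are invented for illustration; the point is the one extra memory read per level of nesting.]

```python
# Resolve an MCF/Nebula-style recursive address:
#   [off_n + M[ ... M[off_2 + M[off_1 + register]] ... ]]
# Each additional level costs another serialized memory access.

def resolve(mem, offsets, reg):
    """offsets[0] is the innermost offset (added to the register);
    each later offset is added to the value fetched from memory."""
    addr = offsets[0] + reg
    for off in offsets[1:]:
        addr = off + mem[addr]   # one memory read per extra level
    return addr

# Three-level example: (2+10)=12 -> M[12]=30; (7+30)=37 -> M[37]=50;
# final effective address is 4+50 = 54.
mem = {12: 30, 37: 50}
assert resolve(mem, [2, 7, 4], 10) == 54
```

Note the loop is inherently serial -- which is exactly why the real pipeline needed a loop in the middle of it.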
Worse than this, the instruction set was byte-quantized and
variable length, and you couldn't tell how to decode a byte until all
the previous bytes had been decoded. ( One method of solving this was
to decode each byte all five possible ways and then select the correct
decoding. ) The (dynamic) average instruction length was five bytes, so to
achieve, say, 10 million instructions per second execution you had to decode
50 million bytes per second, one at a time. Yeesh.
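[Editor's note: the brute-force fix mentioned above -- decode every byte under every positional hypothesis, then select -- can be modeled in a few lines. The toy encoding below is invented purely for illustration (first byte's low bits give the length); the real MCF encoding was far messier, but the serial-boundary problem is the same.]

```python
# Toy variable-length decode: you can't start decoding instruction k+1
# until you know instruction k's length. The speculative version
# decodes *every* byte position eagerly (in hardware, in parallel) and
# then merely selects among the precomputed guesses.

def decode_at(stream, i):
    """Pretend decoder: low 3 bits of the first byte give length 1..5."""
    length = (stream[i] & 0x07) % 5 + 1
    return length, stream[i:i + length]

def sequential_decode(stream):
    out, i = [], 0
    while i < len(stream):
        length, insn = decode_at(stream, i)
        out.append(insn)
        i += length                 # serial dependency on prior lengths
    return out

def speculative_decode(stream):
    # Guess a decoding at every byte offset (some guesses run off the
    # end of the stream; they are simply never selected).
    guesses = {i: decode_at(stream, i) for i in range(len(stream))}
    out, i = [], 0
    while i < len(stream):          # selection walk over real boundaries
        length, insn = guesses[i]
        out.append(insn)
        i += length
    return out

s = [0x01, 0xAA, 0x00, 0x02, 0xBB, 0xCC]
assert sequential_decode(s) == speculative_decode(s)
```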
Designing a pipelined architecture for this beast was tough ( for
example, the pipeline had a loop in the middle of it to handle
the recursive addressing modes. ) A few changes to the architecture
would have allowed it to run much more quickly.
Apparently, this is what happens when a machine architecture is
designed by ONLY the compiler people ( I guess ) with no input
from the hardware people. The two must work together, IMHO :-)
--
Dennis O'Connor OCONNORDM@CRD.GE.COM UUNET!CRD.GE.COM!OCONNOR
Science and Religion have this in common : you must take care to
distinguish both from the people who claim to represent each of them.
bcase@cup.portal.com (Brian bcase Case) (02/10/90)
>Vector machines always run faster on vector problems than non vector
>machines. Even if the cycle time is a little slower.

I don't believe this. See the WM machine architecture proposed by
W. Wulf. This is a general-purpose architecture that can achieve
vector rates without actual vector hardware (well, the memory system
has to be done right, but there are no vector instructions).
andrew@frip.WV.TEK.COM (Andrew Klossner) (02/10/90)
> When was the last time someone introduced a new CISC architecture?
How about the i960? Object-oriented instructions with tags, and
"silicon operating system" features, though you'd never know it from
the externally released documentation.
-=- Andrew Klossner (uunet!tektronix!frip.WV.TEK!andrew) [UUCP]
(andrew%frip.wv.tek.com@relay.cs.net) [ARPA]
aglew@oberon.csg.uiuc.edu (Andy Glew) (02/13/90)
>>Vector machines always run faster on vector problems than non vector
>>machines. Even if the cycle time is a little slower.
>
>I don't believe this.

How about a bit of hand-waving? If instruction dispatch is your
bottleneck, vector machines are faster because they dispatch multiple
operations with one instruction. Usually the operations are simple, with
trivial dependencies. Really RISCy vector machines do not require
hardware to resolve the possible dependencies.

CISCs may dispatch multiple operations per instruction, but the
dependencies are typically more complicated and the operations
dispatched are less regular. Superscalar RISCs (or CISCs) dispatch
multiple operations per "instruction decode cycle", but the operations
dispatched are less regular and the dependencies are more general.

Of course, if instruction dispatch is not your bottleneck, and you are
limited by things like data memory access and dependency depth, then you
use the most powerful multiple-operation dispatch technique you can get
away with.
--
Andy Glew, aglew@uiuc.edu
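[Editor's note: Glew's dispatch argument reduces to simple arithmetic. The per-element operation counts and vector length below are invented round numbers, not measurements -- the point is only the ratio: one vector instruction amortizes its dispatch over a whole strip of elements.]

```python
# Back-of-envelope dispatch counts for an n-element a[i] += s * b[i]
# loop on a scalar machine vs. a vector machine with vector length vl.

def scalar_dispatches(n):
    # per element: load b, multiply, load a, add, store a,
    # plus ~2 instructions of loop bookkeeping (invented estimate)
    return n * 7

def vector_dispatches(n, vl=64):
    # per strip of vl elements: vload, vmul, vload, vadd, vstore
    strips = -(-n // vl)        # ceiling division
    return strips * 5

print(scalar_dispatches(1024))  # 7168 dispatches
print(vector_dispatches(1024))  # 80 dispatches -- ~90x fewer
```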
danh@halley.UUCP (Dan Hendrickson) (02/13/90)
In article <26765@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
]]Vector machines always run faster on vector problems than non vector
]]machines. Even if the cycle time is a little slower.
]
]
]I don't believe this. See the WM-machine architecture proposed by
]W. Wulf. This is a general-purpose architecture that can achieve
]vector rates without actual vector hardware (well, the memory system
]has to be done right, but there are no vector instructions).
Has anyone ever designed the hardware for this "proposed" architecture? Or
are you comparing real iron to paper? There are a lot of very interesting
architectures which have been proposed but never saw silicon because they
could not effectively be implemented, or if they were the penalties on the
cycle time for the hardware made them slower than the "inferior" architectures.
slackey@bbn.com (Stan Lackey) (02/13/90)
In article <26765@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>Vector machines always run faster on vector problems than non vector
>>machines. Even if the cycle time is a little slower.
>
>I don't believe this. See the WM-machine architecture proposed by
>W. Wulf. This is a general-purpose architecture that can achieve
>vector rates without actual vector hardware (well, the memory system
>has to be done right, but there are no vector instructions).

The proposed WM machine assumes a memory access unit that is programmed
to access a linear data structure with a base address, a stride, and a
length. That's a "vector move" instruction. The WM combines this with a
Multiflow-style RISC-VLIW instruction set and a huge register file. I
have no doubts this machine can match standard vector machine
performance; in my opinion, it also qualifies as a vector machine, as it
has all the key features. :-)
-Stan
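[Editor's note: Lackey's "that's a vector move" point is worth making concrete. A minimal sketch of a base/stride/length access pattern follows -- names and the sample memory are invented; this is the access pattern, not the WM's actual programming interface.]

```python
# A programmed memory unit that walks a linear structure given a base
# address, a stride, and a length -- functionally, a vector move.

def stream(mem, base, stride, length):
    """Yield `length` elements starting at `base`, `stride` apart."""
    for i in range(length):
        yield mem[base + i * stride]

mem = list(range(100))             # memory stand-in: mem[addr] == addr
assert list(stream(mem, 4, 3, 5)) == [4, 7, 10, 13, 16]
```

Whether this counts as "vector hardware" is exactly the terminological dispute in the thread: the instruction set has no vector opcodes, but the access unit does a vector machine's defining job.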
baum@Apple.COM (Allen J. Baum) (02/14/90)
[]
>In article <1229@m3.mfci.UUCP> colwell@mfci.UUCP (Robert Colwell) writes:
>>In article <38462@apple.Apple.COM> baum@apple.UUCP ( me! ) writes:
>> ....flames about smooth pipelining & CISC....
>But this is only true if you are setting out to design a CISC starting
>from a clean slate. It's not clear to me why anyone would do that,
>unless they had goals other than performance uppermost in their minds

Well, I think my point was that you would start with a clean slate;
after all, that's what RISC did. My further point was that some of the
CISCy features that people think of as awful (e.g. reg-mem instructions)
a) can be implemented without adverse effects, and b) can improve
performance (cut the number of cycles) of programs.

Another way to look at a reg-mem architecture is as a limited
superscalar architecture. If you look from a pipelining point of view,
you'll notice that the address calculation and execution (of different
instructions) occur in parallel; parallelism that does not occur in
load/store architectures. Note that you still have to use instruction
scheduling in order to avoid interlocks, but IBM talked about doing that
to improve performance at the first ASPLOS conference.

Current RISC design is based on some level of compiler technology. I
believe that current compiler technology can go further. In fact, one of
the series of patents that IBM got on the 801 was a method for choosing
which addressing mode to use, i.e. when to load into a register and
re-use the value, and when to use a reg-memory instruction because it
won't get re-used.
--
baum@apple.com  (408)974-3385  {decwrl,hplabs}!amdahl!apple!baum
sbw@naucse.UUCP (Steve Wampler) (02/22/90)
> This new "bat" machine is quite CISCy, and sounds pretty interesting,
> at least based on the tidbits from EE Times. A 64 bit MPU with
> instructions to support C-like things such as telling if one byte in
> a machine word is '\0' or '\n', or another which basically implements
> an 8-case switch statement. Of course, like most new things, there
> are incredible MIPS claims for it but nothing like SpecMark or even
> Dhrystone out yet.
>
> Dave Haynie  Commodore-Amiga (Systems Engineering)  "The Crew That Never Rests"

Well, I haven't seen the EE Times, but I know a little about the BAT.
The first model, the 6420, is, in my opinion, not a great
implementation. The 6430, however, is well implemented. The BAT series
does many of the things that RISCs have been doing (large caches, lots
of registers, etc.) in a CISC environment (18 different addressing
modes!). Most of the 'common' (i.e. RISC) instructions are single cycle
and typically one or two bytes (so you can load 8 or 4 in one 64-bit bus
access). C-style string functions and memory functions are encoded as
single instructions that operate on 8 bytes at a time. Function calls
are 3(?) cycles, and interrupts are equally fast. Also nice are the
256(?) data channels for nice fast I/O.

The MIPS claim is misleading - that is PEAK performance, for character
processing (where they should be *fast*, given the 8-byte parallelism).
The biggest win (aside from character processing) is the small impact
heavy I/O has on overall performance - it just doesn't slow down much
under heavy I/O. Given the slow clocks on the 6420 and 6430, they do a
pretty impressive job. I'm anxious to see what the faster clock versions
do. I'm supposed to be getting one next week - though the OS may follow
after that by another week or so - to play with. If there's anything
really interesting in the real machine (I don't expect that much from
the 6420) I'll be happy to post it. Oh, yes, it's a 48-bit address
space.

One last interesting feature is the ability to play with part of a
register without affecting the other parts. So you can put a 16-bit tag
on a 48-bit address and not have to unpack them into separate registers.
The machine *should* be able to run Icon and Lisp like, well, a bat out
of ....
--
Steve Wampler {....!arizona!naucse!sbw}
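[Editor's note: the zero-byte test Haynie describes as a BAT instruction has a well-known software counterpart, the SWAR "haszero" trick. The sketch below is that standard trick applied to 64-bit words, not the BAT's actual instruction; the masking to 64 bits is needed because Python integers are unbounded.]

```python
# (word - 0x01..01) & ~word & 0x80..80 is nonzero exactly when some
# byte of the 64-bit word is zero; XOR-masking generalizes the test to
# any byte value (e.g. b = 0x0A to look for '\n').

LOW  = 0x0101010101010101
HIGH = 0x8080808080808080
MASK = 0xFFFFFFFFFFFFFFFF   # confine Python's bignums to 64 bits

def has_zero_byte(word):
    return ((word - LOW) & ~word & HIGH & MASK) != 0

def has_byte(word, b):
    """True iff some byte of `word` equals `b`."""
    return has_zero_byte(word ^ (LOW * b))

assert has_zero_byte(0x1122330055667788)            # 0x00 byte present
assert not has_zero_byte(0x1122334455667788)        # no zero byte
assert has_byte(0x11220A4455667788, 0x0A)           # finds the '\n' byte
```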