baum@Apple.COM (Allen J. Baum) (02/08/90)
[]
>In article <160@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
>Has anyone seen a 68040?  I thought not.  You are comparing
>a chip that won't ship until this summer with one that is in
>a machine that has been in production for some time.  This
>occurs over and over in the RISC/CISC debate, but that doesn't
>seem to keep people from making these silly comparisons.

I'm afraid that I have to agree with this one. The '040 has hit
silicon, but you're still comparing a 1.2 million transistor chip,
with built-in caches, to a much smaller chip (how many transistors in
the Cypress SPARC, anyone know?)

It is still very significant that they are claiming to be faster AT
THE SAME CLOCK RATE. It also took them a few more years to build the
complex chip that would do that - not an easy task, even with the
extra time. The Moto folks appear to have done a very nice job on the
design of this chip.

We still need to wait for real benchmarks. Anyone planning SPECmarks
for the '040?
--
baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/08/90)
In article <38415@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:

| It is still very significant that they are claiming to be faster AT THE SAME
| CLOCK RATE. It also took them a few more years to build the complex chip
| that would do that - not an easy task, even with the extra time. The Moto
| folks appear to have done a very nice job on the design of this chip.

  This is very impressive. I would like to propose using MISC instead of
CISC, since the microcode which used to require many cycles per
instruction is now replaced by hard logic for virtually all of the
instructions, maybe all in the 040. I expect the 586 to have 1+
instructions per cycle average, too, indicating that traditional RISC
may have been the way to go when chips were small, and that richer
instruction sets may become possible in the next decade without giving
up any performance.
--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me
jskuskin@eleazar.dartmouth.edu (Jeffrey Kuskin) (02/08/90)
In article <2101@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
>In article <38415@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>
>| It is still very significant that they are claiming to be faster AT THE SAME
>| CLOCK RATE. It also took them a few more years to build the complex chip
>| that would do that - not an easy task, even with the extra time. The Moto
>| folks appear to have done a very nice job on the design of this chip.
>
> This is very impressive. I would like to propose using MISC instead of
>CISC, since the microcode which used to require many cycles per
>instruction is now replaced by hard logic for virtually all of the
>instructions, maybe all in the 040. I expect the 586 to have 1+
>instructions per cycle average, too, indicating that traditional RISC
>may have been the way to go when chips were small, and that richer
>instruction sets may become possible in the next decade without giving
>up any performance.

Yes, but how much do we benefit from the richer instruction sets, even
if all the instructions are hardwired and execute at 1
cycle/instruction? Isn't one of the RISC folks' main arguments for
simple instruction sets that current compilers don't effectively
exploit the complex addressing modes and instructions supported in
CISC chips? Perhaps someone would like to speculate on what progress
the next decade will bring in compiler technology...

-- Jeff Kuskin, Dartmouth College
   jskuskin@eleazar.dartmouth.edu
scott@bbxsda.UUCP (Scott Amspoker) (02/08/90)
In article <19233@dartvax.Dartmouth.EDU> jskuskin@eleazar.dartmouth.edu (Jeffrey Kuskin) writes:
>Yes, but how much do we benefit from the richer instruction sets, even
>if all the instructions are hardwired and execute at 1 cycle/instruction?
>Isn't one of the RISC folks' main arguments for simple instruction sets
>that current compilers don't effectively exploit the complex addressing
>modes and instructions supported in CISC chips?  Perhaps someone would
>like to speculate on what progress the next decade will bring in
>compiler technology...

Well, it doesn't take much to find instructions on a 680x0 that are
not used by a C compiler. However, my code tends to do a lot of
structure accesses with pointers such as "pointer->field". The 68020
double-indirect-with-offset addressing mode is a real life saver, and
I haven't seen it on the few RISC machines I've used. Admittedly, a
good optimizing compiler would not need such a mode.

--
Scott Amspoker
Basis International, Albuquerque, NM
(505) 345-5232
unmvax.cs.unm.edu!bbx!bbxsda!scott
davec@proton.amd.com (Dave Christie) (02/09/90)
In article <2101@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
|
| I would like to propose using MISC instead of
| CISC, since the microcode which used to require many cycles per
| instruction is now replaced by hard logic for virtually all of the
| instructions, maybe all in the 040. I expect the 586 to have 1+

According to an article on the '040 in this week's EE Times:

   "The IU integer pipeline has three different [control] mechanisms:
    an initial decode PLA for the EA [address formation] stage, a
    ROM-driven microcode engine for following stages and a finite
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    state machine to control the EU stage."

Just thought I'd pass that on....

--------
Dave Christie
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/09/90)
In article <19233@dartvax.Dartmouth.EDU> jskuskin@eleazar.dartmouth.edu (Jeffrey Kuskin) writes:
| Yes, but how much do we benefit from the richer instruction sets, even
| if all the instructions are hardwired and execute at 1 cycle/instruction?
| Isn't one of the RISC folks' main arguments for simple instruction sets
| that current compilers don't effectively exploit the complex addressing
| modes and instructions supported in CISC chips?

  A fair question, but hard to answer in the middle ground. There are
some instructions which make code generation and execution faster for
almost all applications, such as mpy and div. The question has always
been whether the gates could be better used to make something else
faster, not whether those instructions would be useful. At the other
end, there are instructions which are really special purpose, and I
don't think that anyone would argue for including them in a general
purpose CPU, such as the FFT instruction I discussed here a few weeks
ago.

  The answer is that instructions should be added if the sequence of
simple instructions to do the same thing is (a) common, and (b)
slower. If the sequence is more than a few instructions long, some
tradeoff comes in, because fewer instructions mean fewer hits on the
memory. The guide has got to be the overall speed of the CPU for a
general mix (assuming a g.p. CPU), rather than aiming for a single
benchmark. This compromise leaves room for lots of competition,
because performance is to some extent a function of load
characteristics.

  As long as adding the instructions and addressing modes don't slow
down other stuff, directly or by stealing gates, they can be a net
win. Another compromise is in register scoreboarding. By using a
complex instruction, part of the execution may be overlapped with
execution of following instructions. This rapidly gets into
interactions between the compiler quality and features.

  I am told that the 586 will have an SPU for the string operations.
While I would expect this to have very little effect on general
performance, kernel bitmap searches and bitblt *may* now be
overlappable with other things. Is this a better use of gates than
more cache? Is the rumor even true? I don't claim to have the
answers, but I have some programs which use strchr(), strcat(),
memcpy(), and such *very* heavily, and I would be willing to try
writing a few routines in assembler if I could get 20-30% better
performance. You have to take advantage of the hardware.

  Some address complexity, at least in the area of having autoincr on
things, is usually a win, but it may require a smart compiler or
scoreboarding to make best use of it. Operations directly to memory
are a favorite whipping boy of the RISC people, but they often save
use of a register and save two instructions, and if they allow fewer
registers to be implemented, or fewer to be saved on a context switch
with only dirty registers saved, they may be an overall win.

  Sorry for the long reply, but I said initially that the question was
complex.
--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me
yam@nttmhs.ntt.jp (Toshihiko YAMAKAMI) (02/09/90)
From article <2101@crdos1.crd.ge.COM>, by davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr):
> This is very impressive. I would like to propose using MISC instead of
> CISC, since the microcode which used to require many cycles per
> instruction is now replaced by hard logic for virtually all of the
> instructions, maybe all in the 040. I expect the 586 to have 1+
> instructions per cycle average, too, indicating that traditional RISC
> may have been the way to go when chips were small, and that richer
> instruction sets may become possible in the next decade without giving
> up any performance.

It is impressive. In the next design, we have to think about how we
can fill one chip with the 5,000,000 or more transistors that LSI
technology will offer us. I agree on this point.

However, how about exploiting hidden parallelism? As discussed in this
group, RISC technology has exploited hidden parallelism in high level
language descriptions. When one accesses a variable in memory, a RISC
chip loads it into a register, then does some operation on it. The
value remains in the register, so one can reuse it in another
operation. Current RISC optimizing compilers make use of this to a
certain degree.

I am very interested in another RISC/CISC war in this decade, the
1990s.
--
Toshihiko YAMAKAMI	NTT Telecommunication Networks Laboratories
 Telephone:	+81-468-59-3781	FAX: +81-468-59-2546
 junet:	yam@nttmhs.ntt.jp	CSNET: yam%nttmhs.ntt.jp@relay.cs.net
 snail-mail:	Take 1-2356-523A, Yokosuka, Kanagawa 238-03 JAPAN
tim@nucleus.amd.com (Tim Olson) (02/09/90)
In article <2105@crdos1.crd.ge.COM> davidsen@crdos1.crd.ge.com (bill davidsen) writes:
| As long as adding the instructions and addressing modes don't slow
| down other stuff, directly or by stealing gates, they can be a net win.
| Another compromise is in register scoreboarding. By using a complex
| instruction part of the execution may be overlapped with execution of
| following instructions. This rapidly gets into interactions between the
| compiler quality and features.

But the complex instruction typically binds many operations together,
*reducing* the ability to efficiently overlap subsequent operations.
However, if the complex instruction is split into its constituent
parts, there is much more opportunity for instruction scheduling.

| Some address complexity, at least in the area of having autoincr on
| things is usually a win, but it may require a smart compiler or
| scoreboarding to make best use of it.

Either this will take an extra cycle to write back the incremented
address register (in which case an explicit add is just as fast), or
an extra register file port just to write the incremented address at
the same time the load data is written. If more register file ports
are going to be added, I'd rather issue multiple, general-purpose
instructions, which have a much greater chance of being used than a
limited auto-increment mode.

	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
bcase@cup.portal.com (Brian bcase Case) (02/09/90)
>It is still very significant that the[y] (Moto about the 040)
>are claiming to be faster AT THE SAME
>CLOCK RATE. It also took them a few more years to build the complex chip
>that would do that - not an easy task, even with the extra time.

Well, it's not entirely clear what clock rate means here. It is
interesting to note that the 040 doubles the clock internally and uses
four edges. The "execute pipeline stage" does all of the following *in
one clock cycle*: register read, ALU, and register writeback. That
hardly sounds like a pipeline comparable to other optimized pipelines.
Question: could a 25 MHz 040 operate at 50 MHz with a better pipeline?
It seems the answer is yes. What does this say about RISC vs. CISC? I
don't know, and besides, I am speculating anyway.

REGARDLESS, the 040 is a really great chip and it will make some damn
nice Macintoshes. Add a graphics accelerator (using a RISC, let's
say), and WOW!
bcase@cup.portal.com (Brian bcase Case) (02/09/90)
> This is very impressive. I would like to propose using MISC instead of
>CISC, since the microcode which used to require many cycles per
>instruction is now replaced by hard logic for virtually all of the
>instructions, maybe all in the 040. I expect the 586 to have 1+
>instructions per cycle average, too, indicating that traditional RISC
>may have been the way to go when chips were small, and that richer
>instruction sets may become possible in the next decade without giving
>up any performance.

No, hard logic is there for the simple instructions; the complex ones
do a "hold everything for a few cycles until we can finish this
thing." Yes, the 586 may indeed have 1+ instructions per cycle on
average, but that will only be for programs that use the simple
instruction subset.

Hey guys, only certain kinds of instructions can be made to fit in a
reasonable-length, uniform, lock-step pipeline. That's a fact, and
it's one of the facts on which RISC is based. CISC chips are looking
good because they are making the simple instructions go fast and
because the compilers are changing. And that's a function of the $$
available for development efforts....
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (02/10/90)
In article <29099@amdcad.AMD.COM> tim@amd.com (Tim Olson) writes:
| But the complex instruction typically binds many operations together,
| *reducing* the ability to efficiently overlap subsequent operations.
| However, if the complex instruction is split into its constituent
| parts, there is much more opportunity for instruction scheduling.

  Performance depends on how it's done. If the CPU can't do anything
else when it starts a complex instruction, then the gains from
possible internal overlap of phases will have to outweigh the blocking
of the CPU. If the CPU can continue to execute at least some other
instructions, then a smart compiler can probably find instructions to
schedule. This isn't black and white, where all complex instructions
are a lose and all simple ones are a win. The volume of instructions
impacts memory bandwidth, too.

| Either this will take an extra cycle to write back the incremented
| address register (in which case an explicit add is just as fast), or
| an extra register file port just to write the incremented address at
| the same time the load data is written. If more register file ports
| are going to be added, I'd rather issue multiple, general-purpose
| instructions, which have a much greater chance of being used than a
| limited auto-increment mode.

  What I said about memory bandwidth applies here, but even more to
the point, a load or store through a pointer (address register)
usually has at least one cycle of overhead after the address is used,
even with cache. This can be used to do the increment without slowing
anything down, and without running another instruction decode. The
primary issue, I believe, is whether the added complexity slows down
the instruction decode. Given the number of gates available, I believe
the answer is "usually not."

  There are people who argue against having increment at all, stating
that it's not general purpose and that the increment should be two
discrete instructions, namely (1) load immediate value 1 to a 2nd
register, and (2) add the 2nd register to the register to be
incremented. I don't agree with this either, but I can see that it is
the ultimate extension of the RISC method.
--
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
            "Stupidity, like virtue, is its own reward" -me
csimmons@jewel.oracle.com (Charles Simmons) (02/11/90)
In article <38415@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes:
> From: baum@Apple.COM (Allen J. Baum)
> Subject: '040 vs. SPARC (was: Next computer...)
> Date: 7 Feb 90 18:02:06 GMT
>
> >In article <160@zds-ux.UUCP> gerry@zds-ux.UUCP (Gerry Gleason) writes:
> >Has anyone seen a 68040?  I thought not.  You are comparing
> >a chip that won't ship until this summer with one that is in
> >a machine that has been in production for some time.  This
> >occurs over and over in the RISC/CISC debate, but that doesn't
> >seem to keep people from making these silly comparisons.
>
> I'm afraid that I have to agree with this one. The '040 has hit
> silicon, but you're still comparing a 1.2 million transistor chip,
> with built-in caches, to a much smaller chip (how many transistors in
> the Cypress SPARC, anyone know?)
> --
> baum@apple.com		(408)974-3385
> {decwrl,hplabs}!amdahl!apple!baum

While I am firmly a RISC bigot, there is a good CISC argument here.
When comparing the '040 and SPARC, you are not comparing a 1.2M
transistor chip with a 120K transistor chip. In the 1.2M transistors
of the '040, there's an ALU, FPU, portions of a cache, and probably an
MMU. For an accurate comparison, you'd want to consider the SPARC chip
[ALU and portions of an MMU?], the FPU chip used with the SPARC, and
at least some of the transistors used to implement the off-chip SPARC
cache. It has been suggested that when looked at in this light, the
SPARC uses just about as many transistors as the '040.

-- Chuck
tom@.parcom.nl (Tom van Peer) (02/12/90)
yam@nttmhs.ntt.jp (Toshihiko YAMAKAMI) writes:
>From article <2101@crdos1.crd.ge.COM>, by davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr):

>However, how about exploiting hidden parallelism?
>As discussed in this group, RISC technology has exploited
>hidden parallelism in high level language descriptions.
>When one accesses a variable in memory, a RISC chip loads it
>into a register, then does some operation on it.
>The value remains in the register, so one can reuse it
>in another operation.

Big fun if you want to make a multiprocessor set-up.

--
Tom van Peer.
Parallel Computing, Amsterdam.
+31-20-233274
E-mail: tom@parcom.nl
jdarcy@pinocchio.encore.com (Jeff d'Arcy) (02/12/90)
Either yam@nttmhs.ntt.jp (Toshihiko YAMAKAMI) or davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr):
> However, how about exploiting hidden parallelism?
> As discussed in this group, RISC technology has exploited
> hidden parallelism in high level language descriptions.
> When one accesses a variable in memory, a RISC chip loads it
> into a register, then does some operation on it.
> The value remains in the register, so one can reuse it
> in another operation.

tom@.parcom.nl (Tom van Peer):
> Big fun if you want to make a multiprocessor set-up.

Quite right, Tom. In fact, it's big enough fun that doing this for any
non-local variables is probably too dangerous to try. I don't know of
any scheme by which bus-snooper logic could tell the CPU to invalidate
a value in a register that wouldn't involve truly hideous complexity.

Fortunately, access to shared variables is less frequent than access
to locals, and in many such cases you have to use more complex
mutual-exclusion mechanisms already. If you have to go through all
that anyway, the extra cost of not being able to keep the value in a
register is pretty negligible.

Disclaimer: I'm in OS, not compilers, so there may be issues here
beyond my ken.

Jeff d'Arcy		OS/Network Software Engineer	jdarcy@encore.com
  Encore has provided the medium, but the message remains my own
henry@utzoo.uucp (Henry Spencer) (02/13/90)
In article <604@bbxsda.UUCP> scott@bbxsda.UUCP (Scott Amspoker) writes:
>Well, it doesn't take much to find instructions on a 680x0 that are
>not used by a C compiler. However, my code tends to do a lot of
>structure accesses with pointers such as "pointer->field". The
>68020 double-indirect-with-offset addressing mode is a real life
>saver and I haven't seen that on the few RISC machines I've used.

Have you measured the costs of doing without it? Those fancy
addressing modes are usually quite slow. Just because it's one
addressing mode rather than an instruction or two doesn't mean it's
faster.
--
SVR4: every feature you ever |     Henry Spencer at U of Toronto Zoology
wanted, and plenty you didn't.| uunet!attcan!utzoo!henry henry@zoo.toronto.edu
henry@utzoo.uucp (Henry Spencer) (02/22/90)
In article <9755@cbmvax.commodore.com> jesup@cbmvax.cbm.commodore.com (Randell Jesup) writes:
> 2) The 68030 improves the speeds of many of the new addressing modes.
>I think some of them become useful.

And the 68040 improves them even more... except that it improves the
speed of the *simple* stuff by a much larger margin. Nobody is going
to optimize for the 030 and ignore the 040 at this point... and the
list of fast modes on the 040 is the original 68000 mode list minus
indexed. Not one of the new-on-020 modes is included; they all take
the slow path.
--
"The N in NFS stands for Not, |     Henry Spencer at U of Toronto Zoology
or Need, or perhaps Nightmare"| uunet!attcan!utzoo!henry henry@zoo.toronto.edu