dan@pyramid.UUCP (Danial Carl Sobotta) (03/04/86)
> This doesn't seem right. Does 'practical' in this sentence mean less
> bus contention?
>
> Since a RISC machine doesn't have the fancy microcoded instructions of
> a CISC machine, it takes more instructions to do the same job. Even
> though a RISC instruction typically requires fewer bits than a CISC
> instruction, a program for a RISC machine is generally said to be
> larger than the equivalent program for a CISC machine. With today's
> low memory prices, this is not a terrible thing.
>
> I was always taught that 80%-95% of the bus usage of a processor was
> for instruction fetches. Therefore if a RISC machine takes more bytes
> of instructions to run a program than a CISC machine would, the RISC
> processor will eat up MORE bus cycles, leaving fewer for displays,
> DMA, and co-processors.

knudsen@ihwpt.UUCP (mike knudsen) replies:

>I agree with you. Modern CISC processors are microcoded
>(nanocoded?) and fetch one CISC instruction from system RAM,
>then proceed to fetch many nano-instrs from internal ROM
>to perform it. Meanwhile, the bus is free.
>RISC machines essentially run "nano code" out of YOUR main
>RAM over YOUR bus. So yes, you seem right to me.
>Or are we both missing something?

Yup, you probably are. The space that is freed up on a RISC chip by
having little if any u-code ROM can be used for more cache. The
(*rare*) cases of needing a multiple-RISC-instruction routine to
'emulate' a CISC instruction can be handled by having that routine in
cache. This is done automatically by the caching algorithms in
hardware (simple!).

So, having the cache effectively reduces bus traffic not only to a
CISC level (because of the above explanation) but probably to LESS
traffic, because the cache can be used for ALL instructions (and
data).

Or am I missing something?
--
'Out of the inkwell comes Bozo the Clown ...'
DISCLAIMER: These opinions are neither mine nor my C-compiler's
sun!pyramid!dan
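Dan's bus-traffic argument can be put in rough numbers. The sketch below is a back-of-the-envelope model, not a measurement: the program size, the 1.5x RISC code expansion, and the 90% cache hit ratio are all invented, illustrative assumptions.

```python
# Back-of-the-envelope model of instruction-fetch bus traffic.
# All inputs are illustrative assumptions, not measurements.

def bus_fetch_bytes(program_bytes, code_expansion, hit_ratio):
    """Bytes of instruction fetch that actually reach the system bus.

    program_bytes  -- code size of the CISC version of the program
    code_expansion -- RISC code size relative to CISC (e.g. 1.5x)
    hit_ratio      -- fraction of fetches satisfied by the on-chip cache
    """
    return program_bytes * code_expansion * (1.0 - hit_ratio)

# A cacheless RISC with 1.5x code expansion puts 50% more fetch
# traffic on the bus than the cacheless CISC baseline...
risc_no_cache = bus_fetch_bytes(100_000, 1.5, 0.0)   # 150000.0
cisc_no_cache = bus_fetch_bytes(100_000, 1.0, 0.0)   # 100000.0

# ...but even a modest 90% hit ratio drops the RISC's traffic far
# below the cacheless CISC, which is Dan's point.
risc_cached = bus_fetch_bytes(100_000, 1.5, 0.9)     # ~15000
```

The interesting threshold falls out directly: with a code expansion of E, any hit ratio above 1 - 1/E leaves the cached RISC generating less fetch traffic than the cacheless CISC.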
billw@Navajo.ARPA (William E. Westfield) (03/06/86)
Hey, look everybody - no one is claiming that computers with simple
instruction sets are a new idea. Nearly all processors invented a
sufficiently long time ago have nice simple instruction schemes, so as
to make hardware implementation easier (remember, there wasn't always
microcode!). The original PDP8 (a 12 bit computer), PDP11s (that
everyone knows and loves), and the PDP6 (DEC's original 36 bit machine,
with essentially the same instruction set as a DEC10/20, built of
approximately 3000 gates) were all RISCy in their own ways. So are
8008s, 8080s, 1802s, 6800s, 6502s, and the rest of your "first
generation" microprocessors.

What the RISC people did is essentially say "look guys, it's all very
nice that transistors are smaller now, and you have microcode and
nanocode and all that and think that you can come close to implementing
a high level language right on the chip. Unfortunately, almost isn't
good enough, and trying to write a compiler that takes advantage of a
chip that "almost" implements something is much worse than writing a
compiler for something that doesn't do so at all. In fact, we aren't
sure we know enough about code generation to take advantage of even the
simple addressing modes on something like a PDP11. We'd really be happy
if the machine could just do this, that, and the other thing, only do
it real fast. Then our compiler could be simple, and easier to move to
faster processors, and this study here shows that it is likely to come
out working faster anyway..."

In short, "RISC" is a reaction to the fact that hardware is advancing
faster than software technology, and that hardware designers who
thought they understood software really didn't.

With respect to there being fewer available DMA cycles in a RISC
machine, this is true. On the other hand, they say, allocate a bunch of
memory dedicated to device communications, or just use faster memory.
Mass memory is about 10 times faster than it was 10 years ago, and 10^n
times cheaper.
Software has improved in that time too, but not anywhere near as much.

glacier!navajo!billw

[DEC should put the 20 on a chip, call it a risc machine, and sell
systems for 15K. They'd be very successful.]
berger@imag.UUCP (Gilles BERGER SABBATEL) (03/07/86)
In article <136@pyramid.UUCP> dan@.UUCP (Dan Sobottka) writes:
>... The space that is freed up on a RISC chip by having
>little if any u-code ROM can be used for more cache. ...
>... So, having the cache effectively reduces Bus traffic not only to a CISC
>level (because of above explanation) but also probably LESS traffic because
>the cache can be used for ALL instructions (and data).
>Or am I missing something?

OK, but what about when the system is multiprogrammed? Aren't frequent
swaps between processes likely to destroy the cache's efficiency? This
could cause a significant degradation of RISC performance in a
multiuser environment (cf. previous discussions about the Ridge).

... Or am I missing something?....
--
Gilles BERGER SABBATEL - IMAG-TIM3/INPG, GRENOBLE - FRANCE
berger@archi@imag.UUCP
rose@think.ARPA (John Rose) (03/07/86)
In article <136@pyramid.UUCP> dan@.UUCP (Dan Sobottka) writes:
>... The space that is freed up on a RISC chip by having
>little if any u-code ROM can be used for more cache. ...

In article <570@imag.UUCP> berger@imag.UUCP (Gilles BERGER SABBATEL) replies:
>OK, but what when the system is multiprogrammed? Frequent swaps between
>processes aren't likely to break the cache efficiency?

Interesting: To maintain the functional correspondence between CISC
ucode ROM and RISC fast memory, we'd need something like shared
libraries for low-level routines. The Unix practice of copying library
code into every executable image would cause gratuitous flushing of
cached "milli-code", only to bring in nearly-identical code for the
next process. That hard-to-change ROM or WCS has its good points :-).

What's needed for a RISC is a slowly-changing set of libraries at
well-known addresses, and, more broadly, software engineering practices
which discourage re-invention!

The CISCs have context-switch problems too: look at how the stack-frame
formats have grown in the 68k family (although they seem also to be
paying for early design bugs). (Co-)processor registers are a cached
data memory which must be flushed (saved) and reloaded (restored).
--
John R. Rose, Thinking Machines Corporation, Cambridge, MA
245 First St., Cambridge, MA 02142  (617) 876-1111 X270
rose@think.arpa  ihnp4!think!rose
reiter@harvard.UUCP (Ehud Reiter) (03/08/86)
The numerous articles on RISC machines have all assumed that such
machines have caches. However, the only commercial RISC machine that
I'm familiar with, the IBM PC/RT, does NOT have a cache, and seems to
suffer a factor of 3 performance degradation because of this (2 MIPS
instead of 6 MIPS). To quote from IBM RT PERSONAL COMPUTER TECHNOLOGY
(probably available from your friendly neighborhood IBM salesman), pg
48 - "The 801 minicomputer ... had exceptionally high performance.
However, much of its performance depended on its two caches, which can
deliver an instruction word and a data word on each CPU cycle. SINCE
SUCH CACHES WERE PROHIBITIVELY COSTLY FOR SMALL SYSTEMS ..." (emphasis
mine).

The point is that you can't assume that RISC machines have caches,
because some don't. And, as near as I can tell, an RT has much less
performance than a SUN 3 (lousy floating point and I/O as well as no
cache), but costs twice as much ($15K vs $8K for diskless systems
(??)). So, if RISC machines need caches to perform well, then CISC
machines win out, at least at the bottom end of the market.

Incidentally, the RT has an "overlapped load" feature, where
instructions that don't reference the loaded data can be executed
concurrently with a LOAD instruction. The only problem is, this feature
is disabled in virtual memory mode (presumably because of the
difficulty of saving the state of the machine when a page fault
occurs). A case of the "cruel real world" destroying a cute RISC idea?

One last point - the RT does NOT have a fancy subroutine call mechanism
(like Berkeley's RISC). So, even if it were true that an RT could
execute a MULTIPLY routine out of memory as fast as a CISC machine
could execute it out of microcode, the RISC multiply would still be
much more expensive because of the subroutine call overhead.

I highly recommend reading IBM RT PERSONAL COMPUTER TECHNOLOGY - it's
very well written, and it shows you what the pros and cons of a real
machine are.
Ehud Reiter harvard!reiter.UUCP reiter@harvard.ARPA
ark@alice.UUCP (Andrew Koenig) (03/09/86)
> OK, but what when the system is multiprogrammed? Frequent swaps between
> processes aren't likely to break the cache efficiency? This could be
> the cause of important degradation of RISC performance in multiuser
> environment (Cf previous discussions about the Ridge).
>
> ... Or am I missing something?....

The IBM 360/91 was one of the fastest machines of its time. Although it
took 360 nanoseconds to read from memory and 720 nanoseconds to write,
the machine was nevertheless capable of executing one instruction every
60-nanosecond cycle (when running a well-tuned program) because of
heavy memory interleaving and pipelining.

This raised the problem of how to synchronize the CPU with the clock,
which was apparently stored in memory and was updated every 60th of a
second. They did it by quiescing the entire machine for each clock
tick. The loss of performance was trivial compared with the expense of
handling the clock some other way.

The same may be true of context switching on RISC machines with caches.
If no more than 100 or so context switches occur per second, and the
machine executes tens of thousands of instructions between context
switches, it doesn't really matter that the cache is flushed each time.
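Koenig's amortization argument is easy to check with arithmetic. In the sketch below, the 10,000 instructions per quantum follow from his own figures (roughly a 1-MIPS machine switching 100 times a second); the 500-instruction refill penalty is an invented, deliberately pessimistic assumption.

```python
# Amortizing the cost of re-warming a flushed cache over one
# scheduling quantum. The refill penalty is an invented assumption.

def flush_overhead(instructions_per_quantum, refill_penalty):
    """Fraction of each quantum lost to re-filling a flushed cache."""
    return refill_penalty / instructions_per_quantum

# ~1 MIPS and 100 context switches/sec gives ~10,000 instructions
# per quantum; even a 500-instruction refill penalty costs only 5%.
overhead = flush_overhead(10_000, 500)   # 0.05
```

On a faster machine the quantum holds more instructions while the refill penalty stays roughly fixed, so the overhead shrinks further, which is why the argument favors ignoring the flush.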
aglew@ccvaxa.UUCP (03/09/86)
Responding to billw at navajo.ARPA, who was responding to...

I agree with your basic point, but there's another aspect to RISCs:
there is a big difference at the moment between hardware, where it is
easy to do things in parallel, and software, where it isn't. Microcode
is just software used to implement sequential operations. One of the
things we can do to increase speed is to make sequential operations
parallel, which usually comes down to implementing serial operations
combinatorically. Whenever you have a serial operation that cannot be
made parallel, there are usually enough special cases that can be
detected at compile time to make a standard library function suboptimal
- and this is just as true for microcode as it is for a matrix
mathematical library. (Just how many different forms of matrix
multiplication are there: block, upper triangular, band, sparse...)

Somebody else was talking about caches. Here're some random musings:
registers are just caches explicitly controlled by software. Register
windows are specially structured stack caches. We should have a special
cache for each frequently used data type, with a fetch/replacement
strategy optimized for that data type.

Instructions and data are different, so they need different caches. We
have both transparent and explicitly controlled (registers) data
caches; instruction caches are usually transparent, not explicitly
controlled. Could explicitly controlled instruction caches be useful?
(Ask MU5.) The likely bit on branches is a start. Overlays are an
explicitly controlled instruction cache mechanism.

An instruction cache should have automatic linear prefetch, and should
probably try to prefetch the heads of procedures. It should try to keep
return points in the cache. Heads of loops should be left in the cache
once fetched; backward branches can be used as a clue to finding heads
of loops, but are no good if the loop is long - which is exactly when
you want to keep the loop head in the cache.
What we need is a special mark for heads of loops - perhaps an explicit
instruction, perhaps just a bit in an instruction, perhaps branch
tables as in MU5. Perhaps this could be used to minimize loop overhead
for while loops (test at the top) rather than until loops (test at the
bottom): the branch back to the test at the top could automatically
fire off the head-of-loop instruction, so it might be possible to
execute them both in one cycle.
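The musing about loop heads staying resident can be made concrete with a toy transparent instruction cache. The sketch below is a minimal direct-mapped model; the cache size and the address trace are invented for illustration and do not correspond to any machine discussed here.

```python
# Toy direct-mapped instruction cache, one instruction per line.
# Sizes and the address trace are invented for illustration.

class DirectMappedICache:
    def __init__(self, n_lines):
        self.n_lines = n_lines
        self.tags = [None] * n_lines  # block currently held in each line
        self.hits = 0
        self.misses = 0

    def fetch(self, addr):
        index = addr % self.n_lines   # line this address maps onto
        tag = addr // self.n_lines    # identifies the block
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.tags[index] = tag    # evict whatever was there
            self.misses += 1

# A tight 8-instruction loop run 100 times: after 8 compulsory misses
# the whole body stays resident and every later fetch hits. For short
# loops the transparent cache keeps the loop head with no special mark;
# the head-of-loop mark matters for loops longer than the cache.
cache = DirectMappedICache(16)
for _ in range(100):
    for pc in range(8):
        cache.fetch(pc)
# cache.misses == 8, cache.hits == 792
```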
johnl@ima.UUCP (John R. Levine) (03/10/86)
In article <765@harvard.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>One last point - the RT does NOT have a fancy subroutine call mechanism (like
>Berkeley's RISC). So, even if it were true that an RT could execute a MULTIPLY
>routine out of memory as fast as a CISC machine could execute it out of
>microcode, the RISC multiply is much more expensive because of the subroutine
>call overhead.

Huh? Not true at all (and I should know, I wrote the multiply routine
for the AIX C compiler). There are two points. One is that the multiply
routine is what has been called "millicode", which does not go through
a full function linkage. In the AIX compiler, you just put the two
numbers to be multiplied in two registers and jump; the routine doesn't
save anything, so there is negligible linkage overhead.

The other point doesn't apply directly to the RT but is generally
important: with decent compilers you don't need the sliding register
window hack. Neither the MIPS chip nor the 801 had multiple register
sets, in both cases because the compiler technology made them
unnecessary. Even in the AIX compiler, which is a straightforward
version of PCC, careful choice of argument and scratch registers keeps
the number of registers saved per call to a minimum. I saw some of the
code generated by the PL.8 compiler, and it was pretty spectacular.
Register windows would have bought practically nothing.
--
John Levine, Javelin Software, Cambridge MA 617-494-1400
{ decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA
The opinions above are solely those of a 12 year old hacker who has
broken into my account, and not those of my employer or any other
organization.
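Levine's millicode multiply can be illustrated in miniature. The sketch below is not the actual AIX routine; it is a generic shift-and-add multiply of the kind such a routine might use on a machine without a multiply instruction, with the "two registers in, nothing saved" convention represented only in the comments.

```python
# Shift-and-add multiply: the classic software substitute for a
# hardware multiply instruction. A generic illustration, NOT the
# actual AIX millicode routine.

def millicode_multiply(a, b, width=32):
    """Multiply two unsigned integers, truncating to `width` bits.

    Conceptually a and b arrive in two agreed-upon registers and the
    routine clobbers nothing else, so (as Levine notes) there is no
    register save/restore overhead on entry or exit.
    """
    mask = (1 << width) - 1
    product = 0
    for _ in range(width):
        if b & 1:                           # low multiplier bit set:
            product = (product + a) & mask  # add shifted multiplicand
        a = (a << 1) & mask                 # shift multiplicand left
        b >>= 1                             # consume one multiplier bit
    return product
```

The loop body is a handful of register-to-register operations per bit, which is why a cached or prefetched version of such a routine can compete with a microcoded multiply stepping through essentially the same algorithm in control store.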
johnl@ima.UUCP (John R. Levine) (03/10/86)
In article <765@harvard.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>The numerous articles on RISC machines have all assumed that such machines
>have caches. However, the only commercial RISC machine that I'm familiar
>with, the IBM PC/RT, does NOT have a cache, and seems to suffer a factor of
>3 performance degradation because of this (2 MIPS instead of 6 MIPS). To
>quote from IBM RT PERSONAL COMPUTER TECHNOLOGY (probably available from
>your friendly neighborhood IBM salesman), pg 48 - "The 801 minicomputer ...
>had exceptionally high performance. However, much of its performance
>depended on its two caches, which can deliver an instruction word and a
>data word on each CPU cycle. SINCE SUCH CACHES WERE PROHIBITIVELY COSTLY
>FOR SMALL SYSTEMS ..." (emphasis mine).

Hmmn. If you continued reading a few pages past that quote, you'd find
that the ROMP has other architectural aspects that mitigate the effects
of having no cache. For one thing, the ROMP does have a four-word
instruction prefetch buffer, which gives the chip some latitude in when
it fetches its instructions. It also has an extremely fast bus, the
ROMP Storage Channel, which can handle a transfer every cycle and
allows several transfers to be outstanding, since each request has a
five-bit tag which the slave device passes back for matching up by the
master. Memory can be interleaved many ways to allow lots of cycles to
be going at once.

The technology book on p. 58 says that the chip only uses 60% - 70% of
the bus bandwidth, which suggests that adding a cache wouldn't help as
much as you'd think. Software can also be of some help here - for
example, there are instructions for unpacking the bytes in a register,
and I gather that the PL.8 compiler tries to fetch fullwords and unpack
them rather than fetching several adjacent bytes separately.
-- John Levine, Javelin Software, Cambridge MA +1 617 494 1400 { decvax | harvard | think | ihnp4 | cbosgd }!ima!johnl, Levine@YALE.ARPA The opinions above are solely those of a 12 year old hacker who has broken into my account, and not those of my employer or any other organization.
jlg@lanl.UUCP (03/10/86)
In article <765@harvard.UUCP> reiter@harvard.UUCP (Ehud Reiter) writes:
>One last point - the RT does NOT have a fancy subroutine call mechanism (like
>Berkeley's RISC). So, even if it were true that an RT could execute a MULTIPLY
>routine out of memory as fast as a CISC machine could execute it out of
>microcode, the RISC multiply is much more expensive because of the subroutine
>call overhead.

The 'RISC' machine I use doesn't call any subroutine to do multiplies.
The CRAY implements ADD, SUBTRACT, and MULTIPLY for both integers and
floats, and DIVIDE for floats only, in hardwired logic. It's expensive
in hardware (lots of chips), but not nearly as expensive as the slow
operation that CISC microcoding would impose.

The whole idea of RISC is to pick those instructions which are
important for the application for which the machine is used, and make
them FAST! So don't assume that certain instructions will not be found
in RISC machines - it depends on the target market for the device.

J. Giles
Los Alamos
reiter@harvard.UUCP (Ehud Reiter) (03/11/86)
It's true that the IBM PC/RT, through a combination of instruction
prefetch and interleaved memory, does not wait for instruction fetches
when executing a sequential instruction stream. However, other memory
references are quite expensive. The "bottom line" (quoting from page 49
of the RT book) is: "Although most ROMP [RT] instructions execute in
only one cycle, additional cycles are taken when it is necessary to
wait for data to be returned from memory for Loads and Branches. As a
result, the ROMP takes about three cycles on the average for each
instruction."

In short, a cacheless RISC machine does not come anywhere close to the
one-instruction-per-cycle goal.

Ehud Reiter
harvard!reiter.UUCP
reiter@harvard.ARPA
henry@utzoo.UUCP (Henry Spencer) (03/12/86)
> The point is that you can't assume that RISC machines have caches...

Only the well-designed ones do.

> ...as near as I can tell, an RT has much less performance than a
> SUN 3 (lousy floating point and I/O as well as no cache), but costs twice as
> much...

Nobody said the RT was particularly well designed.

> ...So, if RISC machines need caches to perform well, then CISC machines
> win out, at least at the bottom end of the market.

At the very bottom, maybe. Most CISC machines have caches nowadays. The
68020 inside the Sun 3 does, for example.

> I highly recommend reading IBM RT PERSONAL COMPUTER TECHNOLOGY - its very well
> written, and it shows you what the pros and cons of a real machine.

So long as you don't take it as saying much about the general merits of
RISCs...
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
jer@peora.UUCP (J. Eric Roskos) (03/12/86)
> I agree with your basic point, but there's another aspect to RISCs:
> there is a big difference at the moment between hardware, where it is
> easy to do things in parallel, and software, where it isn't. Microcode
> is just software used to implement sequential operations. One of the
> things we can do to increase speed is to make sequential operations
> parallel, which usually comes down to implementing serial operations
> combinatorically. Whenever you have a serial operation that cannot be
> made parallel, there are usually enough special cases that can be detected
> at compile time to make a standard library function suboptimal - and this
> is just as true for microcode as it is for a matrix mathematical library.

While I must admit that I have some difficulty understanding how some
of the statements above follow from one another, there's one basic idea
here that sort of bothers me: the idea that "microcode" is just like
RISC instructions. Someone in another recent posting stated it very
well - the RISC instructions are similar to *vertical* microcode. In
the case of vertical microcode, it's true that you have a limited
amount of parallelism. But in machines with a more "horizontal"
microcode, you can achieve a great deal of parallelism. Furthermore,
not all microcode looks anything at all like a conventional program; as
Mead & Conway point out, possibly the only reason so many
microprogrammed machines have microprograms that look like a
conventional "assembly language" program is that people are more used
to writing that kind of program than they are to writing microprograms
as state machines.

A while back I said that I think that, beneath all the politics, the
RISC and CISC goals are really fairly similar. Let me explain one
reason why. One of the ongoing areas of research in microprogramming
involves "vertical migration" - analyzing sequences of code to
determine things that can be migrated into the microcode, essentially
to produce new instructions.
From the RISC end you'd just go the other way. It's been argued that
the cache does that "automatically," but it's hard to believe that in
the long run, when the RISC approach has come to be seen as mundane,
someone won't start doing statistical analyses on RISC instruction
sequences, discover that some sequences commonly occur, and make new
instructions out of those. But that's essentially identical to the
vertical migration strategy. You'd probably come up with cleaner
sequences than you would from some arbitrary CISC instruction set
[although I suspect in the long run they would be self-refining
anyhow], but I think the underlying approach is more or less the same.
--
jer@peora.UUCP (J. Eric Roskos)
"It's a long way from Brooklyn to LA..."
aglew@ccvaxa.UUCP (03/13/86)
RISCs => caches => cache flushes => inefficiencies in multiprogramming. In addition to being able to neglect cache flushes if the interval between context switches is long enough (in terms of instructions processed), context switches will hopefully become less important on multi-microprocessor machines, where you can schedule more processes per second for the entire machine (an important number for real-time systems) but each processor will be handling fewer context switches.
reiter@harvard.UUCP (Ehud Reiter) (03/15/86)
I thought it would be nice to try to get some numbers to quantify RISC
and CISC performance. The numbers below are hopefully not too full of
bugs - I would be glad to send interested people references.

1) Cache, RISC, and MIPS - some figures for a VAX-11/780:

                               cycles/inst   MIPS
    cache disabled                 25         .2    (VAXes were not designed for this)
    normal cache                   10         .5
    perfect cache                   7         .7
    MIPS rating                               1    (what DEC marketing claims)
    "RISC" mode, perf. cache        2        2.5    (all reg-to-reg inst)

Note the effect a cache has, that the typical VAX instruction is a
complex 7-cycle one and NOT a simple 2-3 cycle one (as some have
claimed), and that a VAX pretending it's a RISC machine clocks in at 2
MIPS or so.

2) Complex instruction execution - the following figures compare a
"VLSI VAX" (presumably a microVAX), an M68020 (16 MHz, no wait states),
and an IBM PC/RT, all executing the operation R3=4(R2)+(R1). Cache-RT
is a guess at what an RT with a cache would do (assumes 2 cycles for
LOAD/STORE). All caches are assumed to be perfect.

                         uVAX   68020   real-RT   cache-RT
    time (us)            1.2     .94     1.83       .83
    cycles               6       15      11         5
    bytes                5       6       6          6
    cycle time (ns)      200     60      170        170
    instructions         1       2       3          3
    scratch registers    0       0       1          1

3) Simple instruction execution - for R3=R1+R2:

                         uVAX   68020   RT
    time (us)            .4     .25     .17
    cycles               2      4       1
    bytes                3      2       2
    instructions         1      1       1

microVAXes seem much more efficient (compared to an RT) at executing
complex instructions than at executing simple instructions - but that's
OK, since the data in (1) indicate that most VAX instructions are
indeed complex. This seems due to pipelining, incidentally: a microVAX
does little inter-instruction overlap (only instruction fetch/decode),
which hurts the small instructions but doesn't affect the complex ones
as much (so much for the claim that complex VAX instructions are less
efficient than simple ones). An RT, on the other hand, pipelines its
instructions.

4) Note on MIPS.
IBM has been very cautious to only say that an RT is a "2 RISC MIPS" machine, but one can imagine a more "enthusiastic" company (I am NOT accusing anyone, but merely pointing out the possibility) claiming that an RT-type machine with cache was a "6-MIPS" machine, although the above data indicates that such a machine would only be 50% faster than a "1 MIPS" microVAX, and a third the speed of a "4.5 MIPS" VAX 8600. Ehud Reiter harvard!reiter.UUCP reiter@harvard.ARPA
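All of Reiter's MIPS figures come from one relation between clock rate and average cycles per instruction, which is worth stating explicitly. In the sketch below the clock rates are derived from the cycle times in his tables (a 200 ns cycle is a 5 MHz clock; 170 ns is roughly 5.9 MHz); treat the inputs as rough, illustrative values rather than vendor specifications.

```python
# Native MIPS as clock rate divided by average cycles-per-instruction.
# Clock rates derived from the cycle times quoted in the tables above
# (200 ns ~ 5 MHz, 170 ns ~ 5.9 MHz); all inputs are rough.

def mips(clock_mhz, cycles_per_instruction):
    return clock_mhz / cycles_per_instruction

vax_780_cached  = mips(5.0, 10)   # 0.5: the "normal cache" row
vax_780_perfect = mips(5.0, 7)    # ~0.7: the "perfect cache" row
rt_no_cache     = mips(5.9, 3)    # ~2: IBM's cautious "2 RISC MIPS"
rt_one_cycle    = mips(5.9, 1)    # ~5.9: the one-cycle ideal
```

The "6 MIPS" temptation is just the last line: quoting the one-cycle ideal while shipping the three-cycle reality.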
chris@umcp-cs.UUCP (Chris Torek) (03/23/86)
In article <1128@unc.unc.UUCP> rentsch@unc.UUCP (Tim Rentsch) writes:
>>If no more than 100 or so context switches occur per
>>second, and the machine executes tens of thousands of instructions
>>between context switches, it doesn't really matter that the cache
>>is flushed each time.  [Andrew Koenig]
>It *can* matter, depending on how big the cache is and on how full it
>must be to achieve a good hit ratio.
[followed by some numbers to demonstrate this]

True enough, or at least from my software perspective (I know little
about cache design). However, one argument on the RISC side is that if
the processors are simple and cheap enough, you need never context
switch. Just assign one processor-plus-cache per process. This sounds
like a parallel computation engine, but it need not be. If it is easier
to design the rest of the system as a single-CPU-at-a-time machine, and
each CPU costs, say, $5, you can easily stick 100 CPUs into a $40K
machine. This cuts the context switch rate by a factor of 100. Of
course, there are cache contention problems if you have shared data.

Just a crazy idea, no doubt...?
--
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 1415)
UUCP: seismo!umcp-cs!chris
CSNet: chris@umcp-cs  ARPA: chris@mimsy.umd.edu
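Torek's cost arithmetic is simple enough to spell out. The sketch below just restates his hypothetical numbers ($5 per CPU, a $40K machine, 100 processes); none of them are real prices.

```python
# Torek's hypothetical: dedicate one cheap CPU-plus-cache per process.
# All figures are his invented examples, not real prices.

def cpu_budget_fraction(n_cpus, cost_per_cpu, machine_price):
    """Fraction of the machine's price spent on the CPU array."""
    return (n_cpus * cost_per_cpu) / machine_price

def per_cpu_switch_rate(total_switch_rate, n_cpus):
    """Context switches per second each CPU sees, spread evenly."""
    return total_switch_rate / n_cpus

fraction = cpu_budget_fraction(100, 5, 40_000)   # 0.0125 of the box
rate = per_cpu_switch_rate(100, 100)             # 1 switch/sec per CPU
```

At one switch per second per CPU, even a total cache flush on every switch is amortized over a full second of execution, which is what makes the "crazy idea" interesting.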