cory@gloom.UUCP (Cory Kempf) (10/17/88)
A while back, I was really hot on the idea of RISC.  Then a friend pointed
out a few things that set me straight...

First, there is no good reason that all of the cache and pipeline
enhancements cannot be put on to a CISC processor.

Second, to perform a complex task, a RISC chip will need more instructions
than a CISC chip.

Third, given the same level of technology for each (i.e. caches, pipelines,
etc.), a microcode fetch is faster than a memory fetch.

As an aside, the 68030 can do a 32 bit multiply in about (if I remember
correctly -- I don't have the book in front of me) 40 cycles.  A while
back, I tried to write a 32 bit multiply macro that would take fewer cycles
than the 40 or so that the '030 took.  I didn't even come close (even
assuming lots of registers and a 32 bit word size, which the 6502 doesn't
have).
-- 
Cory Kempf
UUCP: {decvax, bu-cs}!encore!gloom!cory
revised reality... available at a dealer near you.
usenet@cps3xx.UUCP (Usenet file owner) (10/18/88)
In article <156@gloom.uucp>, Cory Kempf (decvax!encore!gloom!cory) writes:
>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

This is definitely true.  Look at the caching on the 68030, or the
Z80,000, for instance.  The advantage a RISC gives you is more space for
caching logic, though--so you can have a bigger cache (or more registers,
or possibly both).

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

Right!  Unless you add special hardware to help it with the most common
complex tasks, in which case you're heading right back to CISC.

Nick Tredennick gives an interesting characterization of RISC in his paper
from the IEEE CompCon '86 panel on RISC vs. CISC:

   Cut a MC68000 in half across the middle just below the control store.
   Throw away the part with the instruction decoders, control store, state
   machine, clock phase generators, branch control, interrupt handler, and
   bus controller.  What you will have left is a RISC "microprocessor."
   All the instructions execute in one cycle.  The design is greatly
   simplified.  The chip is smaller.  And the apparent performance is
   vastly improved.  [stuff omitted]  ...try to build a system using this
   wonderful new chip.  You have to rebuild on the card the parts you just
   cut off.  Good luck trying to service the microcode interface at the
   'microprocessor' clock rate.

I think this is a great characterization of a particular segment of the
debate: the part that talks about chip complexity.  Now, instruction set
complexity is a bit different, and I'm not convinced one way or the other
on that yet (though I lean toward CISC).

The recent discussion of "the 68030 is RISCier than the 68020" and "a RISC
compatible with the 68020" doesn't have anything to do with the
instruction set--just the chip design.  Maybe there's a better term for it
than RISC....

Just my thoughts...

	Anton

Disclaimer: I'm into software, not hardware!
+----------------------------------+------------------------+ | Anton Rang (grad student) | "UNIX: Just Say No!" | | Michigan State University | rang@cpswh.cps.msu.edu | +----------------------------------+------------------------+
tim@crackle.amd.com (Tim Olson) (10/18/88)
In article <156@gloom.UUCP> cory@gloom.UUCP (Cory Kempf) writes:
| A while back, I was really hot on the idea of RISC. Then a friend
| pointed out a few things that set me straight...

I guess we are going to have to reset you straight, again! ;-)

| First, there is no good reason that all of the cache and pipeline
| enhancements cannot be put on to a CISC processor.

If it is a microcoded processor, then the CISC machine will have to
perform this pipelining at both the microinstruction and macroinstruction
level in order to execute simple instructions in a single cycle.  This
costs more than if the micro and macro levels were the same (RISC).

| Second, to perform a complex task, a RISC chip will need more
| instructions than a CISC chip.

This is true, although dynamic measurements typically show only about 30%
more, not the "3 to 5 times" that some people report.

| Third, given the same level of technology for each (ie caches, pipelines,
| etc), a microcode fetch is faster than a memory fetch.

Also true.  However, this only buys you anything if most of your
instructions take multiple cycles.  Unfortunately (?), most programs use
simple instructions which should execute in a single cycle.  If a CISC
processor is to compete effectively, it must also be able to execute the
most-used instructions in a single cycle.  Therefore, it must also have
the off-chip instruction bandwidth or on-chip cache bandwidth that RISC
requires.  With this requirement, it doesn't matter that microcode may be
slightly faster than a cache access -- the cache is the limiting factor.

| As an aside, the 68030 can do a 32 bit multiply in about (If I remember
| correctly -- I don't have the book in front of me) 40 cycles. A while
| back, I tried to write a 32 bit multiply macro that would take less
| than the 40 or so that the '030 took. I didn't even come close (even
| assuming lots of registers and a 32 bit word size (which the 6502
| doesn't have)).
Most (if not all) RISCs address this by a) using existing floating-point
multiply hardware (i.e. a 32x32 multiplier array) for integer multiply
(1 - 4 cycles), or b) having multiply sequencing or step operations that
perform 1-2 bits at a time (16 - 40 cycles), so they are no slower than
the current crop of CISC processors.  In addition, if step operations are
used, inexpensive "early-out" calculations will allow the average multiply
time to drop quite a bit (because the distribution of runtime multiplies
leans heavily towards multipliers of 8 bits or less).
-- 
	Tim Olson
	Advanced Micro Devices
	(tim@crackle.amd.com)
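[The step-operation idea Tim describes can be sketched in C.  This is an
illustrative one-bit-per-step shift-and-add loop, not any particular
chip's step instruction; the early-out is the loop condition, which stops
as soon as no multiplier bits remain, so an 8-bit multiplier costs at most
8 steps instead of 32.]

```c
#include <stdint.h>

/* Illustrative shift-and-add 32-bit multiply, one bit per step.
 * The "early-out" test (multiplier != 0) terminates as soon as the
 * remaining multiplier bits are all zero, so small multipliers --
 * the common case at runtime -- finish in a few steps. */
uint32_t mul_step(uint32_t multiplicand, uint32_t multiplier)
{
    uint32_t product = 0;

    while (multiplier != 0) {        /* early-out: no bits left */
        if (multiplier & 1)
            product += multiplicand; /* add step */
        multiplicand <<= 1;          /* shift step */
        multiplier >>= 1;
    }
    return product;
}
```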
bcase@cup.portal.com (Brian bcase Case) (10/20/88)
>A while back, I was really hot on the idea of RISC. Then a friend
>pointed out a few things that set me straight...
>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

True for the simple instructions in the CISC instruction set.  Not so true
for the ones with complex addressing modes, etc.  Fixed instruction size,
fixed format, and prevention of page-boundary crossings are very good
things to do.  This limits the CISCyness of an instruction set; otherwise
the instructions will need to be very long, or they will need to be
two-address instead of three-address, or worse, one-address, or....

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

This is simply an exaggeration.  Yeah, maybe 1.2 to 1.5 times as many, but
this is usually not a big deal.  If it is (it might be for some), then
CISC away.

>Third, given the same level of technology for each (ie caches, pipelines,
>etc), a microcode fetch is faster than a memory fetch.

But not much faster than a cache fetch.  And the cache will have the
"macros" that the program actually uses, not the ones that the instruction
set designers assumed the application would use.  The problem is that it
is presumptuous to think that you know exactly how the procedure linkage,
run-time addressing model, etc. are going to be implemented by the
language and operating system designers.  Once it's in uCode, it's there
for a long time.  And if the microcode routines are longer than one
instruction, you no longer have single-cycle instructions.  But this is a
complex issue.

>As an aside, the 68030 can do a 32 bit multiply in about (I don't have
>the book in front of me) 40 cycles. I tried to write a 32 bit multiply
>macro that would take less than the 40 or so that the '030 took. I
>didn't even come close (even assuming lots of registers and a 32 bit word
>size (which the 6502 doesn't have)).
First of all, the 6502 is simple, but it is very far from a RISC.  Maybe
you mistyped.  Second, if multiply is important, which it typically isn't
in systems code, implement it combinatorially, in parallel.  Third, you
probably failed to check for the reduced cases in your macro; by checking
for small operands, etc., you can get the cycle count down.  And
multiplies by small constants (the most frequent case in system code) can
be done in very few cycles using shifts, adds, and subtracts.

Disclaimer: Everyone is entitled to an opinion.
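[The shift/add/subtract trick for small constants looks like this in C.
These are standard decompositions a compiler might emit, shown here purely
for illustration -- not taken from any particular compiler's output.]

```c
#include <stdint.h>

/* Multiplies by small constants decomposed into shifts, adds, and
 * subtracts: two or three cheap operations instead of a ~40-cycle
 * general multiply.  Illustrative decompositions only. */
static uint32_t times7(uint32_t x)  { return (x << 3) - x; }        /* 8x - x   */
static uint32_t times10(uint32_t x) { return (x << 3) + (x << 1); } /* 8x + 2x  */
static uint32_t times12(uint32_t x) { return (x + (x << 1)) << 2; } /* (x+2x)*4 */
```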
bcase@cup.portal.com (Brian bcase Case) (10/20/88)
>>Second, to perform a complex task, a RISC chip will need more
>>instructions than a CISC chip.

For most purposes the difference is not important, maybe 20% more with the
top at 40%.  But the working set is the important issue where caches are
concerned.  Is the RISC's cache working set bigger than the CISC's?
Maybe, I don't know for sure.  I wrote a fairly big program, 40K lines (an
incrementally-compiling simulator for the 68000, fun!) last year.  The
code size was 15% bigger on the 29K than on the Vax.  Admittedly, the 29K
compiler is much better than the Vax's.  But does anyone ever ask *why* it
is much better?

Slightly paraphrased Nick Tredennick:
> Cut a MC68000 in half; Throw away the instruction decoders,
> control store, state machine, clock phase generators, branch
> control, interrupt handler, and bus controller. What you will
> have left is a RISC "microprocessor." All the instructions

Not true.  RISCs do not throw away the bus controller, the interrupt
handler, the instruction decoder, branch control, etc. etc.  He is just
griping because it is all so small on the chip that it looks like it has
been thrown out.  :-) :-)

> execute in one cycle. The design is greatly simplified. The
> chip is smaller. And the apparent performance is vastly
> improved. [stuff omitted] ...try to build a system using this
> wonderful new chip. You have to rebuild on the card the parts
> you just cut off. Good luck trying to service the microcode
> interface at the 'microprocessor' clock rate.

There are existence proofs: several systems are doing it.  What more could
he want?

>The recent discussion of "the 68030 is RISCier than the 68020" and "a RISC
>compatible with the 68020" doesn't have anything to do with the
>instruction set--just the chip design. Maybe there's a better term
>for it than RISC....

The 68030 core is *exactly the same* (maybe a Moto guy can comment?) as
the 68020's core.  They shrunk it and added the data cache.  The bus
controller now supports 4-word bursts.
The cache line size changed.

What has been left out of this discussion is the software side of the
issue.  The almighty Compiler can save us from our sins!  It is our
saviour!  Long live common subexpression elimination!  Hail to the code
reorganizer!  Praise the register allocator!  Jim Bakker, watch out!
aglew@urbsdc.Urbana.Gould.COM (10/20/88)
>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

Space.  It's less of a reason now, which is why the phase of RISC may
pass.

>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.

There aren't many complex tasks.  Code size inflation is usually due to
the lack of memory-to-register ops, not sophisticated instructions.

>Third, given the same level of technology for each (ie caches, pipelines,
>etc), a microcode fetch is faster than a memory fetch.

I used to work for a company where, for straight-line code, a microfetch
was the same speed as the memory fetch.  Plus, access to main memory from
microcode was 2 to 4 times *more* expensive than access to memory from an
instruction (dedicated hardware for instruction memory accesses, hiding
most loads/stores).

>As an aside, the 68030 can do a 32 bit multiply in about (If I remember
>correctly -- I don't have the book in front of me) 40 cycles. A while
>back, I tried to write a 32 bit multiply macro that would take less
>than the 40 or so that the '030 took. I didn't even come close (even
>assuming lots of registers and a 32 bit word size (which the 6502
>doesn't have)).

There do exist RISCs with multiply instructions.  In fact, real
multiplies, with full multiplier arrays taking lots of space that might
otherwise have had to be used for microcode.

>Cory Kempf

Andy Glew
grzm@zyx.SE (Gunnar Blomberg) (10/20/88)
In article <156@gloom.UUCP> cory@gloom.UUCP (Cory Kempf) writes:
>[...]
>
>Second, to perform a complex task, a RISC chip will need more
>instructions than a CISC chip.
>
>[...]

Is this really the widely accepted truth?  It seems to me that a typical
well-designed RISC chip should actually need *fewer* instructions
(statically and dynamically) to perform most tasks than your typical CISC
chip, for the following reasons:

	* The RISC chip has more registers.
	* The RISC chip has a more orthogonal instruction set.
	* The RISC chip has three-operand instructions.

I am assuming something like a 680x0 or an 80386 as the CISC here, i.e.
something that suffers heavily from non-orthogonality and lack of
registers.  A memory-to-memory CISC with a really orthogonal instruction
set is quite a different animal.

What this boils down to is that a well-designed orthogonal instruction set
should give fewer instructions for most tasks than your typical Complex
Instruction Set, even taking into account all the strange instructions for
function calls and other things.  I would *much* rather program an HP-PA
RISC than any CISC I have ever seen (with the possible exception of the
PDP-10), and the same is true for the SPARC chip, though less emphatically
so.  Thank heaven chip designers (finally) realized the value of a clean,
orthogonal instruction set!

On the other hand, since most RISC encodings use a fixed instruction size,
the program will probably be bigger.  Maybe this is what is meant above?
-- 
Disguised as a Dutch mathematician,  | Gunnar Blomberg
Brow [the alien] had advanced the    | ZYX Sweden AB, Bangardsg 13,
destructive mathematical philosophy  | S-753 20 Uppsala, Sweden
called intuitionism    --Rudy Rucker | email: grzm@zyx.SE
elg@killer.DALLAS.TX.US (Eric Green) (10/21/88)
>>A while back, I was really hot on the idea of RISC. Then a friend
>>pointed out a few things that set me straight...
>>First, there is no good reason that all of the cache and pipeline
>>enhancements cannot be put on to a CISC processor.
>
> True for the simple instructions in the CISC instruction set. Not so
> true for the ones with complex addressing modes, etc. Fixed instruction
> size, format, and prevention of page-boundary crossings are very good
> things to do. This limits the CISCy ness of an instruction set, or the

For example, the high-end Vaxen have a pipelined MICROARCHITECTURE.  It is
almost impossible to effectively pipeline the macroarchitecture of a Vax,
because of the multitude of instruction set formats (almost as bad as the
680x0).

What this seems to mean is that the difference between CISC and RISC lies
more in instruction format than in number of instructions (as someone else
on this list pointed out).  I suspect that you could have a "CISC" just as
fast as a "RISC" IF the instruction format is fairly regular (i.e., no
"expanding opcode" hyper-compressed formats need apply).  True, a compact
opcode takes less memory space.  BUT, it has to be UNcompacted before it's
used...

Someone else on this newsgroup mentioned that Seymour Cray's secret to a
fast computer was to put as few gates as possible in critical paths.
Anybody got a reference for where Cray said that?  In any event, on too
many CISCs, instruction decode is that critical path...
-- 
Eric Lee Green   ..!{ames,decwrl,mit-eddie,osu-cis}!killer!elg
Snail Mail P.O. Box 92191 Lafayette, LA 70509
It's understandable that Mike Dukakis thinks he can walk on water.  He's
used to walking on Boston harbor.
chris@mimsy.UUCP (Chris Torek) (10/22/88)
In article <5863@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) writes:
>For example, the high-end Vaxen have a pipelined MICROARCHITECTURE. It
>is almost impossible to effectively pipeline the macroarchitecture of
>a Vax, because of the multitude of instruction set formats (almost as
>bad as the 680x0).

While the 680x0 have a number of formats (and thus lengths), one of the
nice properties of the instruction set is that the first word tells you
the length of the entire instruction.  This is not true of the Vax
instruction set: on the Vax, the first byte is simply the opcode, and you
must read all of the operand bytes to discover the location of the next
instruction.  In other words, you must (almost) fully decode the current
instruction before you can begin decoding the next.
-- 
In-Real-Life: Chris Torek, Univ of MD Comp Sci Dept (+1 301 454 7163)
Domain: chris@mimsy.umd.edu    Path: uunet!mimsy!chris
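[The serial decode Chris describes can be sketched as a toy C function.
The operand-specifier modes and sizes below are invented for illustration
and are not the real VAX encodings; the point is only the data dependence:
each specifier's size comes from a byte you cannot see until you have
walked past the previous specifier.]

```c
#include <stddef.h>

/* Toy VAX-style length calculation: opcode byte, then N operand
 * specifiers whose sizes depend on their own leading mode byte.
 * (A 68k-style ISA could instead look the whole length up from the
 * first word alone, e.g. len = length_table[first_word], letting
 * fetch run ahead of full decode.) */
static size_t specifier_size(unsigned char mode)
{
    switch (mode >> 4) {     /* invented mode nibbles */
    case 0x8: return 1;      /* register */
    case 0xA: return 3;      /* word displacement */
    case 0xE: return 5;      /* longword displacement */
    default:  return 1;
    }
}

size_t vax_style_length(const unsigned char *insn, int noperands)
{
    size_t len = 1;          /* opcode byte */
    for (int i = 0; i < noperands; i++)
        len += specifier_size(insn[len]);  /* inherently serial walk */
    return len;
}
```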
robert@pvab.UUCP (Robert Claeson) (10/22/88)
In article <310@lynx.zyx.SE>, grzm@zyx.SE (Gunnar Blomberg) writes: > It seems to me that a typical > well-designed RISC chip should actually need *fewer* instructions > (statically and dynamically) to perform most tasks than your typical CISC > chip, for the following reasons: > > * The RISC chip has more registers. The more registers, the more to save at every context switch in a typical OS (such as UNIX). Which will slow things down if you have many processes running. -- Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden Tel: +46 758-202 50 Fax: +46 758-197 20 Email: robert@pvab.se (soon rclaeson@erbe.se)
bcase@cup.portal.com (Brian bcase Case) (10/25/88)
>The more registers, the more to save at every context switch in a typical
>OS (such as UNIX). Which will slow things down if you have many processes
>running.
>-- 
>Robert Claeson, ERBE DATA AB, P.O. Box 77, S-175 22 Jarfalla, Sweden

What data do you have to substantiate this claim?  This is another popular
misconception, I think.  I used to work at Pyramid Technology.  They make
a machine with 512 32-bit words of register windows (16, 16, 16
organization).  When we were porting UNIX to the machine, we wondered what
the register file save/restore on context switch was costing.  It turned
out that the average context switch spent 40 usec saving registers.  The
other 200 usec (this is a guess, I can't remember the total context switch
time) were taken by the implementation of the context-switch *mechanism*
inherent in UNIX.  And this is on top of the fact that some of the
critical loops, e.g., the one that decides which process is next to run,
were hand coded.  Pyramid's UNIX has (or at least had) an incredibly fast
response.  Customers noticed that this was so.

Two things: if the memory system and save/restore mechanism are designed
with some care, they can go fast; and, except for real-time systems, where
save/restore is indeed a critical factor, context-switch time is dominated
by everything but the register save/restore.

At 200 context switches per second (an unusually high number on a machine
like the 780), saving 128 registers on every switch takes, as a percentage
of total available processor time:

	(200/sec) * (128 registers) * (2 cycles/register) / (25 Mega cycles/sec)

which is 0.20 percent of the total available CPU time.  I don't think this
is significant.  For some implementations, it is more like 1 cycle per
register saved.  The other side of the equation is register restore: but
on machines with register windows (or work-alikes), only a small number of
registers, say 32, need to be restored (since any others will be faulted
in).  Thus, the total might be 0.30 percent.
It is even less on machines with flat files of 32 registers, e.g. MIPS. By speeding up the machine on general code, the registers more than make up for this cost.
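[Brian's back-of-the-envelope figure is easy to reproduce.  This one-liner
in C just restates his arithmetic; the numbers are the ones from his
posting, not measurements.]

```c
/* Fraction of CPU time spent saving registers on context switches,
 * using the figures quoted above: 200 switches/sec, 128 registers,
 * 2 cycles per register, 25 MHz clock. */
double save_overhead(double switches_per_sec, double nregs,
                     double cycles_per_reg, double clock_hz)
{
    return (switches_per_sec * nregs * cycles_per_reg) / clock_hz;
}
/* save_overhead(200, 128, 2, 25e6) gives 0.002048, i.e. about 0.20% */
```

At a 10-100 msec scheduling quantum the same division shows why the
overhead stays small even with generous assumptions.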
matloff@bizet.Berkeley.EDU (Norman Matloff) (10/25/88)
In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes:
>In article <310@lynx.zyx.SE>, grzm@zyx.SE (Gunnar Blomberg) writes:
>> It seems to me that a typical
>> well-designed RISC chip should actually need *fewer* instructions
>> (statically and dynamically) to perform most tasks than your typical CISC
>> chip, for the following reasons:
>>
>> * The RISC chip has more registers.
>
>The more registers, the more to save at every context switch in a typical
>OS (such as UNIX). Which will slow things down if you have many processes
>running.

Based on the parameters of Berkeley RISC I or II, the register saving
might take on the order of 0.1 msec.  If the quantum size is set in the
range claimed to be typical in the Peterson and Silberschatz OS book, i.e.
10 to 100 msec, then we see that the register-saving issue for a RISC with
lots of registers has probably been greatly overemphasized.

Comments?

   Norm Matloff
mash@mips.COM (John Mashey) (10/25/88)
ARGH!  I'm away for a week and comp.arch goes crazy! :-)  Rather than try
to post multiple replies to the hordes of RISC-CISC stuff, I've glommed
them together:

>Article 6913 of comp.arch:
>From: cory@gloom.UUCP (Cory Kempf)
>A while back, I was really hot on the idea of RISC. Then a friend
>pointed out a few things that set me straight...

[1] At least one of the things is pretty misleading:

>First, there is no good reason that all of the cache and pipeline
>enhancements cannot be put on to a CISC processor.

Cache and pipeline enhancements help CISCs also; people are always working
on making CISCs go faster by better (deeper) pipelining and caching.  (The
literature has plenty of examples of the efforts of CISC implementors to
make their existing architectures go faster.)  However, there are some
FUNDAMENTAL ways that most popular CISC architectures differ from the
higher-performance RISCs.  Here are a few, and why they matter:

		CISCs				RISCs

EFFICIENT	possible, but expensive		designed for this
(DEEP)		in hardware and/or
PIPELINE	design time
	example: variable-size instrs with sequential decode (VAX),
		 vs. 32-bit instrs

		complex side-effects		at most simple side-effects
	example: conditional branches (tricky, much hardware in a deep
		 pipeline), vs. delayed branches

SEPARATE	maybe, but sometimes must	usually, and don't support
I&D CACHE	support old code that does	store-into-instr-stream
		store-into-instr-stream		w/o explicit info
	(This is a royal pain, since you pay hardware in the fastest
	part of the machine for something that seldom happens.  Yes, I
	was bad too: a popular S/360 program I wrote almost 20 years
	ago used this "feature".  Sigh.)
	examples: comparators in Amdahls watching for I-stream stores

ADDRESSING	can be very complex,		usually just load/store with
MODES		including side-effects		at most indexing and
		and page-crossings		auto-increment/decrement,
						with no page-crossings
	examples: VAX; new modes in 68020

	Note that complex addressing modes can interact horribly with
	deep pipelining, because the very things you want to do to make
	it go fast add complexity and/or state in the fastest parts of
	the machine.

DESIGNED FOR	maybe, maybe not		yes
OPTIMIZERS
	examples: registers either insufficient, or split up in odd
	ways, vs. 32 or more GP registers available for allocation.

	example: When you count general-purpose regs available for
	general allocation, a 386 gives you about 5-6, I think, compared
	to maybe 26-28 on an R3000, SPARC, HP PA, etc.  No amount of
	caching and pipelining makes 5 look like 26 to an optimizer.
	(This is not to say a good optimizer won't HELP, and in fact,
	Prime bought our compilers because it will help them; it just
	doesn't help as much.)

EXPOSED		usually not			usually some
PIPELINE
	example: It helps to reorganize code on CISCs (like S/360s) to
	cover load latencies, and to spread settings of condition codes
	apart from the branches that test them (on some models), but
	RISCs usually cater to these.  Note that machines with complex
	address modes built into the instructions are hard to do this
	with, i.e., the compilers can't easily split instructions with
	memory-indirect loads, for example, to get a smoother pipeline
	flow.

EXCEPTION	can get complex			relatively simpler
HANDLING
	example: Exception handling in heavily-pipelined CISCs not
	designed for it can get very tricky, take a while to design and
	get right, or burn a lot of hardware, or all 3.

Note that hardware complexity is especially an issue in VLSI: it is
relatively easy to get dense regular structures on a given amount of
silicon (registers, MMUs, caches), but complex logic burns it up fast, and
routing can get tricky.
These are a few of the salient areas that illustrate a common principle:
there's hardly anything [except perhaps cleanly increasing the number of
simultaneously-available registers] that you can do in a RISC that you
can't also do in a CISC.  HOWEVER, IT MAY TAKE YOU SO LONG TO GET IT
RIGHT, OR COST YOU SO MUCH, THAT IT DOESN'T MAKE COMPETITIVE SENSE TO DO
IT!!!!  More than one large, competent computer company has discovered
this fact, which is why you often see multiprocessors being popular at
certain times, i.e., because it's easier to gang processors together than
it is to make them go faster.  The problems often show up in 2 places:

	bus interface (including MMU)
	exception-handling

I'm sure any OS person out there who's dealt with early samples of 32-bit
micros still has nightmares over some of these.  [How about some postings
on the chip bugs you remember worst!  I'll start with one: UNIX always
seems to find these $@!% things, which somehow have slipped thru diags.
Our 1973 PDP 11/45 had a bug which was only seen on UNIX, because it used
the MMU differently than DEC did, and the C compiler often used some
side-effectful addressing mode that DEC didn't use often: as I recall, if
you accessed the stack with a particular sequence, and a page boundary got
crossed, and a trap resulted, something bad happened.]

Making CISCs go faster is an interesting and worthy art in its own right,
and is certainly a good idea for anybody with a serious installed base.
However, it does get hard: one of the architects of a popular CISC system
once told me that making it go much faster (other than with circuit
speedups) seemed beyond human comprehension to do in a reasonable
timeframe.

>Article 6914 of comp.arch:
>Subject: Re: RISC v. CISC
>Reply-To: rang@cpswh.cps.msu.edu (Anton Rang)
>In article <156@gloom.uucp>, Cory Kempf (decvax!encore!gloom!cory) writes:
>>First, there is no good reason that all of the cache and pipeline....
> This is definitely true.
>Look at the caching on the 68030, or the
>Z80,000 for instance. The advantage a RISC gives you is more space
>for caching logic, though--so you can have a bigger cache (or more
>registers, or possibly both).

[2] Again, there is no good reason not to use caches, but there are good
reasons why deeper CISC pipelines sometimes get very expensive.

Re: Z80,000: is that a real chip?  [Real = actually shipping to people in
at least large sample quantities; it would be nice to see UNIX running,
etc.]  Note: you can find magazine articles describing it in detail, as
though it were imminently available....the problem is, some of those
articles are now 4 years old...If it doesn't really exist as a product,
how can it be cited as an example to prove anything?  (If it is really out
there in use, please post some more to that effect and this comment will
go away.)

>Article 6918 of comp.arch:
>From: baum@Apple.COM (Allen J. Baum)
>Subject: Re: RISC v. CISC --more misconceptions

[3] (Allen properly replies to many of the original misconceptions,
omitting only the discussion in [1] above on the difficulty of deep
pipelining on some CISCs.)

>Article 6920 of comp.arch:
>From: sbw@naucse.UUCP (Steve Wampler)
>Subject: CISCy RISC? RISCy CISC?
>Just what is it about RISC vs. CISC that really sets them apart?
>... Other than that, I doubt I would care
>whether my machine is RISC or CISC, if I can even tell them apart.

[4] ABSOLUTELY RIGHT!  Most people should care less whether it's RISC or
CISC, just whether it does the job needed, goes fast, and is cheap.

>A case in point. I know of a not-yet-announced machine (perhaps
>never to be announced machine) that has just about the largest
>instruction set I can imagine (not to mention the 15+ addressing
>modes)....
>The result is a 12.5MHz machine that runs 25000 (claimed)
>dhrystones using what I would call a 'throwaway' C compiler....

As you note, not-yet-announced.
On the other hand, MIPS R3000s do 42K Dhrystones, and they're already in
real machines, and vendors are quoting the CPUs at $10/mip, i.e., $200 for
25MHz parts.

>Now, I've missed most of the RISC/CISC wars, but these seem to
>me to be very fine numbers, at least compared with the
>uVAXen I've played with (all of which cost more).

But uVAXen are real...

>How do they compare to current RISCs?

I'd bet pretty much the same.

>I personally couldn't care which machine I'd own (not that I can
>afford any). When the really fast chips come in, I bet the RISC
>machines are the first to come out, but still, is there something
>that will keep CISC from catching up?

See the discussion in [1] above.  Also, note, in a time when the design
cycle is 12-18 months, and people double performance in that period, being
that far behind means a factor of 2X in performance....

>Article 6936 of comp.arch:
>From: daveh@cbmvax.UUCP (Dave Haynie)
>>>It seems that the NeXT machine may have a few problems:
>>>1) Outdated Processor Technology: NeXT just missed the wave of fast RISC
>>> processors. The 5 MIPS 68030 is completely out performed by the currently
>>> available RISC chips (Motorola, MIPS, Sparc) that run at approximately
>>> 20 VAX (they claim) MIPS. In a year or two, ECL versions of some of these
>>> RISC chips will be running at 40 to 50 MIPS.
>Priced the ~8 MIPS Sun 4 lately? Or the ~14 MIPS 88K chipset. How about
>an Apollo 10K? RISC machines are starting to get fast, and they're even
>starting to get down in price, but these two directions haven't met yet.

[5] Actually, this is the wrong reason: you can put together MIPS chipsets
at similar (or even slightly better) cost/performance levels (have you
priced a 68882 lately, for example?)  However, be fair to Jobs & co.: when
they started, none of the RISC chips was generally available; some of them
[88K] are not yet generally available in volume.
Try drawing a timeline sometime of a) when you get first specs on a chip,
b) when you can design it in, c) when you can make enough to get the
software act together, and d) when you can actually ship in volume.  IT
TAKES A WHILE!  (I've commented earlier on ECL desktop hovercraft.)  Also,
betting on a new architecture at the beginning of a cycle [i.e., in the
Z8000/68K/X86/X32 etc. wars in the early 80s, and the current BRAWL (Big
RISC Architecture War & Lunacy)] is very exciting, and probably not
something a startup should do.  Consider: choosing an architecture is like
an odd form of Russian Roulette: you pick a chip and pull the trigger,
then wait a year or two to see if you've blown your brains out.  (An awful
lot of workstation startups picked wrong the last time, and they're gone,
for example.)  Fortunately, the BRAWL will be over before the end of the
year, which will make life saner.

>Since the VAST majority of Suns sold to universities are Sun 3s (68020 based)
>and below (believe it or not, folks STILL use Sun 2s here and there), I don't
>think a 68030 based system, even NeXT's, which isn't an especially fast 68030
>system (they're running its memory at about 1/2 the possible speed), will
>have trouble competing with the installed 68020 systems. Or a $25,000-$50,000
>RISC based workstation.

I still think there's nothing wrong with NeXT using a 68030; there will
however be both SPARC and MIPS-based workstations a lot cheaper than
$25-50K, in volume, by the time the NeXT boxes are out in volume.

>Article 6964 of comp.arch:
>From: wkk@wayback.UUCP (W.Kapalow)
>Subject: RISC realities
>
>I have used, programmed, and evaluated most of the current crop of
>RISC chipsets....

[6]....some reasonable analysis, from somebody with fewer axes to grind
than most of us, thank goodness!

>Chips like the Amd29000 are trying to make things better by having
>an onboard branch-target cache and blockmode instruction fetches.
Try >getting 1-2 cycles/instruction with a R2000 with dynamic memory and no >cache, the 29000 does much better. Yep, although R3000s with some of the new cache-chip variants will get to be an interesting fight here, i.e., since the R2000/R3000 has all of the cache control on-chip, and there are new cheap, small FIFO parts that eliminate the write buffers. >.... Look at the AT&T CRISP processor, .... Worth doing: some interesting ideas, regardless of commercial issues. >Article 6968 of comp.arch: >From: peter@ficc.uu.net (Peter da Silva) >Subject: RISC/CISC and the wheel of life. >I have noticed one very interesting thing about RISCs lately... they are >getting quite sophisticated instruction sets. 3-address operations and >addressing modes aren't what I used to associate with RISC, but if you look >at them they turn out to be refinements of older RISCs. [7] This is very confusing. Most RISCs use 3-address operations, i.e., reg3 = reg1 OP reg2, rather than just 2-address ops: reg1 = reg1 OP reg2. Certainly, these include, but are not limited to: IBM 801, HP PA, MIPS R2000, SPARC, 29K, 88K. >What's happening, of course, is that the chips are so much faster than any >sort of affordable RAM that it's worthwhile to put more into the instructions. >The speed of the system as a whole goes up, since the chip can still handle >all three register references in one external clock. No point in fetching >instructions any faster than that... I think this obfuscates the issue. Any reasonable design has a register file that has at least 2 read-ports and 1 write-port, i.e., can do 2 register reads and 1 write per cycle. BOTH 3-address and 2-address forms need to do those 2 reads & 1 write; the only difference is that the 2-address form allows a denser instruction encoding, but the base hardware is rather similar. >Article 6970 of comp.arch: >From: guy@auspex.UUCP (Guy Harris) >Subject: Re: The NeXT Problem [8]....Guy gives some reasonable comments.... 
>>Not trying to start a flame war, but 030's are faster than Sun 4's. >To which '030 machine, and to which Sun-4, are you referring? At the >time the Sun-4/260 came out, no available '030 machine was faster >because there weren't any '030 machines..... >Also, you might compare '030s against MIPS-based machines; are they >faster than them, as well? No. >>I puke trying to write assembly on RISC machines. >Fortunately, I rarely had to do so, and these days fewer and fewer >people writing applications have to do so. >These days, "ease of writing assembly code" is less and less of a >figure of merit for an instruction set. 100% agree; however, most people who've used our RISCs think they're easier to deal with in assembler anyway, although they observe there's less opportunity for writing unbearably obscure/clever code.... >Article 6974 of comp.arch: >From: daveh@cbmvax.UUCP (Dave Haynie) >Subject: Re: "Compatible" (was Re: The NeXT Problem) >> In article <5941@winchester.mips.COM> John Mashey writes: >>> This defies all logic. >>> a) If it's compatible with an 030, it's not a RISC. > >> I agree with John, completely. > >For an example of an architecture that's 68000 compatible and RISCy to >the point of executing most instructions in a single clock cycle, look >no farther than the Edge computer. However, if you want this on a >single chip, instead of a bunch of gate arrays, you'll have to wait. This gets back to the point in [1]: you can throw an immense pile of hardware and design time at an architecture to make it go faster, but that doesn't make it a RISC architecture. Maybe it makes it a RISCier, or less CISCy, implementation [which is what I meant when I said the 030 was RISCier than the 020, which caused a lot of confusion. sorry.] 
Another example is the way that the MicroVAX chipset is a RISCier implementation of a VAX (and this is more true than the Edge example, i.e., the MicroVAX gets by with less hardware by moving some of the less frequent ops to software.) >> the MC680X0's instruction set would NOT be a RISC instruction set. >.... Consider that most >of the RISCy CPUs on the market have been done as little baby chips, >by ASIC houses (SPARC, MIPS). Wrong. The first SPARCs were gate arrays, but the Cypress SPARCs are coming. MIPS chips have NEVER been done in ASICs, although LSI Logic is working on ASIC cores of them. In our case, the CPU+MMU is about 100K transistors, which is NOT as large as a 386 or 030, but not a "little baby chip" either. AMD 29Ks are definitely not little baby chips either, and they're real, too. >Article 6975 of comp.arch: >From: daveh@cbmvax.UUCP (Dave Haynie) >Subject: Re: The NeXT Problem (AMD 29K prices from Tim Olson). >> 16MHz $174 >> 20MHz $230 >> 25MHz $349 > >> I'm sure that LSI Logic could also show you very low prices on their >> RISC chips. Last I heard, the 68030 was in the $300+ price range. >A lot of it depends on quantity. I'm sure NeXT and Apple are buying their >68030s more than 100 at a time. Many of the ASIC houses making RISCs are >output limited. And with most of the RISC designs, once you pay the >additional cost of caches and MMUs, you're way out of the 68030 league, >cost wise. Complete systems I've seen with both MIPS and 88k put you >at around $1000 for the CPU subsystem. All of these depend on quantity, and what it is you're trying to build. Admittedly, it's hard for us to build anything less than about 6 VUPs. I suspect you can build a CPU (+ FPU) subsystem like that for around $500, given large quantities, maybe $400-$500 as the new cache chips come out. 
>Article 6977 of comp.arch: >From: jsp@b.gp.cs.cmu.edu (John Pieper) >Actually, I heard a guy from Motorola talking about their n+1st generation >680X0 machine -- they run an internal clock at 2X the external clock, and >play some other tricks to get 14 MIPS effective, 25 MIPS max @ 25 MHz. Seems >to me that CISC designers could do this very effectively to get ahead of the >RISC types (modulo the design time). [10] But remember that existing RISCs, shipping now, get 20 MIPS @ 25 MHz, so it's hard to see how that's getting CISCs ahead. [It still is perfectly reasonable to do, i.e., a 68040. Plenty will get sold.] >BTW, as far as design time goes, you have to take the RISC argument with a >grain of salt. The 68030 is only a little different than the 68020, but with >technology advances and just a few man-years they more than tripled the >speed of the initial 68020 release (in 82?). The 68040 will take the same >basic ALU design, and add the FPU. This shouldn't require too much redesign. >The point is that a good CISC design can be modified (added to) as quickly >as a major redesign of a RISC chip. What really counts is who can sell their >instruction set. Starting from scratch in 1984, and getting the first systems in mid-1986, the high-performance VLSI RISC [i.e., MIPS as example] is: 1986 5 MIPS 1987 10 MIPS 1988 20 MIPS But the last comment is really right: what really counts is who sells the instruction set. That's why the battle is pretty ferocious over who gets to be the RISC standard (or standards), because everybody knows there can only be a few, at most. >Article 6987 of comp.arch: >From: rsexton@uceng.UC.EDU (robert sexton) >Subject: Re: The NeXT Problem >While RISC may be cheaper (smaller design, less silicon) what you are really >doing is shifting the cost burden onto the rest of the system. The high >memory bandwidth of the RISC design means more high speed memory, bigger >high-speed caches. 
With a CISC design, you put all of the high speed silicon >on one chip, lowering the cost of all the support circuitry and memory. [11] This is not a reasonable conclusion. You can put caches on-chip in either case. A fast machine, in either case, will need a lot of memory bandwidth: observe, for example, that the data-bandwidth should be about the same for both. Finally, note that people are generally adding external caches to X86s and 68Ks to push the performance up, for all the same reasons as RISCs. >Article 7005 of comp.arch: >From: phil@diablo.amd.com (Phil Ngai) >Subject: Re: RISC realities [12]....reasonable discussion about burst mode I-fetches, VRAMs, etc. >I don't think the R2000 or the MC88000 support this, but that's not >an inherent limitation of RISC architectures. Nope, we don't do this, or at least not exactly. R3000s support "instruction-streaming", whereby when you have an I-cache miss, you do multi-word refill into the cache, but you execute the relevant instructions as they go by. Typical designs use page-mode DRAM access. Note, of course, that in the next rounds of design across the industry, where almost everybody goes for on-chip I-cache with burst-mode refill (i.e., 486, >= 68030, etc), the distinction disappears. >Article 7013 of comp.arch: >From: malcolm@Apple.COM (Malcolm Slaney) >Subject: Re: CISCy RISC? RISCy CISC? >P.S. An interesting question is whether Symbolics/TI/LMI will fail because >the market is too small to support a processor designed for Lisp and GC or >because CISCs are a mistake. [13] The evidence so far is that neither reason is the most likely reason for potential failure. The more general reason is that special-purpose processors that don't get real serious volume get hurt sooner or later, for one of several reasons: a) A more general part ends up getting more volume, which keeps costs down. b) It's hard to stay on the technology curve without the volume. >Article 7033 of comp.arch: >From: eric@snark.UUCP (Eric S. 
Raymond) >Subject: Re: RISC/CISC and the wheel of life. >My understanding of RISC philosophy suggests that 3-address ops and fancy >addressing modes are only regarded as *symptoms* of the CISC problem -- poor >match of instructions to compiler code generator capabilities, excessive >microcode-interpretation overhead in both cycles and chip real estate. > >If your compiler can make effective use of three-address instructions, and >you've got CAD tools smart enough to gen logic for them onto an acceptably >small % of the chip area (so that you don't have to give up on more important >features like a big windowed register file and on-chip MMU), then I don't see >any problem with calling the result a RISC. [14] As noted in [7] above, 3-address instructions are NATURAL matches to typical register-file designs; people shouldn't be assuming that there is some big cost to having them (in terms of logic complexity). >Article 7040 of comp.arch: >From: doug@edge.UUCP (Doug Pardee) >Subject: Re: CISCy RISC? RISCy CISC? >Organization: Edge Computer Corporation, Scottsdale, AZ >The incorrect assumption here is that you would want to build a mainframe >using RISC technology -- that RISC technology has anything to offer at >that price/cost level. Well, M/2000s act like 5860s, and we think next year's M/xxxx will make 5990s sweat some. Why wouldn't we want to build RISC-based mainframes? Lots of people do. >As we at Edgcore have shown, it is both possible and practical to implement >CISC instruction sets at speeds faster than RISC. But -- it doesn't all fit >on one chip. Yet. Could you cite some benchmarks for the newest machines? [I don't believe that the current production ones are faster than 25MHz R3000s, but I could be convinced.] >In a mainframe design, who cares if it fits on one chip? Jeez, in our E2000 >system we need an entire triple-high VME card jam-packed with surface-mount >parts just to hold the *caches* that we need to have to keep from starving >the CPU. 
The complexity and board area of the CPU itself is insignificant >compared to that required by mainframe-sized multi-level memory systems. I sort-of agree, in the sense that if you're building a physically large/expensive box anyway, then the CPU is a small piece of the action. On the other hand: People who want to put mainframe (CPU performance) on desktop/deskside systems care; weirdly enough, a whole lot of people expect to do this. How big are the caches? It does surprise me they're a whole big VME card, unless they're absolutely immense. We get 20-VUPS performance with 128K cache, which fits with the CPU+FPU+write buffers on about a 6" x 6" square. >Article 7041 of comp.arch: >From: pardo@june.cs.washington.edu (David Keppel) >Subject: Re: LISPMs not RISC? - Re: CISCy RISC? RISCy CISC? >Oh, heck, there's some (relatively) new supercomputer being produced >by some subsidiary of CDC (I think?) that was written up in "digital [16] ETA is the reference. One could argue about whether to call it CISC or RISC, depending on what you generally think vector machines really are. >Also, while CISC is out of vogue in new industry designs at the >moment, there are plenty of Universities building microcoded >processors (read "CISC"?). Of course, this proves little about commercial reality [that is not good or bad; it is not the job of universities to do that.], but quite a few folks think there is more to RISC than "being in vogue". Whew! -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
chrisj@cup.portal.com (Christopher T Jewell) (10/25/88)
In <14112@mimsy.UUCP>, chris@mimsy.UUCP (Chris Torek) writes >In article <5863@killer.DALLAS.TX.US> elg@killer.DALLAS.TX.US (Eric Green) >writes: >>For example, the high-end Vaxen have a pipelined MICROARCHITECTURE. It >>is almost impossible to effectively pipeline the macroarchitecture of >>a Vax, because of the multitude of instruction set formats (almost as >>bad as the 680x0). > >While the 680x0 have a number of formats (and thus lengths), one of the >nice properties of its instruction set is that the first word tells you >the length of the entire instruction. Only if x <= 1. On the '020 and '030, an indexed addressing mode (specified by the opcode word) can require from 1 to 5 extension words (specified by the first extension word for that operand). An instruction whose opcode word specifies `MOVE (ix,An),(ix,An)' can be from 6 to 22 bytes long. Christopher T Jewell chrisj@cup.portal.com sun!cup.portal.com!chrisj "Sure I'm an egomaniac---like everyone else, I'm the only god there is." Spinrad, _Riding_the_Torch_
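[The length arithmetic in Chris's example can be sketched quickly; a toy calculation only, using the figures from his post: one 2-byte opcode word, plus 1 to 5 extension words per full-format indexed operand on the '020/'030.]

```python
# 68020/68030 MOVE (ix,An),(ix,An): opcode word plus, per indexed
# operand, 1 to 5 extension words (from the post above).
WORD = 2  # bytes per word

def move_length(ext_words_src, ext_words_dst):
    """Total instruction length in bytes for MOVE with two indexed operands."""
    assert 1 <= ext_words_src <= 5 and 1 <= ext_words_dst <= 5
    return WORD + WORD * ext_words_src + WORD * ext_words_dst

print(move_length(1, 1), move_length(5, 5))  # 6 22 -- "from 6 to 22 bytes"
```

So the first word alone no longer determines the total length; you need the first extension word of each operand as well.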
paul@unisoft.UUCP (n) (10/25/88)
In article <15964@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes: >In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes: > >*The more registers, the more to save at every context switch in a typical >*OS (such as UNIX). Which will slow things down if you have many processes >*running. > >Based on parameters of Berkeley RISC I or II, the register-saving >might take on the order of 0.1 msec. If the quantum size is set to >be in the range claimed to be typical in the Peterson and Silberschatz >OS book, i.e. 10 to 100 msec, then we see that the register-saving >issue for a RISC with lots of registers has probably been greatly >overemphasized. Actually, some modern chips run faster than this: the 29K, for example, has 192 registers which it can save with a single burst write to memory. A 25MHz part takes: 192 x 40 ns = 7.68 us Since the stack cache isn't always full, and because the OS uses some of these registers for itself, the total save time is usually actually less (and of course the 30MHz parts can save even faster). Of course, compared with a quantum size in the milliseconds range this is virtually nonexistent. In fact, compared with the normal Unix process switch overhead it's not really a big deal. Paul -- Paul Campbell, UniSoft Corp. 6121 Hollis, Emeryville, Ca ..ucbvax!unisoft!paul Nothing here represents the opinions of UniSoft or its employees (except me) "Where was George?" - Nudge, nudge say no more
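[Paul's figure is easy to check; a back-of-the-envelope sketch, taking the 192-register file and one 40 ns burst write per word from his post.]

```python
REGISTERS = 192   # Am29000 local register file (from the post)
WRITE_NS = 40     # one word per cycle at 25 MHz => 40 ns per word

def save_time_us(registers=REGISTERS, ns_per_word=WRITE_NS):
    """Time to dump the whole register file in one burst, in microseconds."""
    return registers * ns_per_word / 1000.0

# 7.68 us against even the low end of a 10-100 ms quantum:
fraction_of_quantum = save_time_us() / 10_000   # well under 0.1%
print(save_time_us(), fraction_of_quantum)
```

That is why the save itself is lost in the noise for ordinary timesharing; the interesting cases are the short-interval workloads discussed below.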
tim@crackle.amd.com (Tim Olson) (10/26/88)
In article <15964@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes: | Based on parameters of Berkeley RISC I or II, the register-saving | might take on the order of 0.1 msec. If the quantum size is set to | be in the range claimed to be typical in the Peterson and Silberschatz | OS book, i.e. 10 to 100 msec, then we see that the register-saving | issue for a RISC with lots of registers has probably been greatly | overemphasized. | | Comments? Actually, the register saving is more likely to be on the order of 10 to 20 microseconds (an order of magnitude less than the 0.1 msec you suggest). Comparing 100 context-switches per second to 350,000 procedure calls per second, it isn't hard to see where to concentrate your optimization efforts... -- Tim Olson Advanced Micro Devices (tim@crackle.amd.com)
peter@ficc.uu.net (Peter da Silva) (10/26/88)
In article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: > [7] This is very confusing. Most RISCs use 3-address operations, i.e., > reg3 = reg1 OP reg2. > rather than just 2-address ops: > reg1 = reg1 OP reg2 > Certainly, these include, but are not limited to: IBM 801, HP PA, > MIPS R2000, SPARC, 29K, 88K. I've been out of things for a while, but didn't RISCs use to use either stack or load-store architecture? Or was that just RISC-1? Anyway, I brought up two CISCy features I'd read about here recently. That was one. Addressing modes are the other. And addressing modes... even just indexing and autoincrement... are pretty CISCy. Just pointing out that RISC isn't a religion... it's a technique. -- Peter da Silva `-_-' Ferranti International Controls Corporation "Have you hugged U your wolf today?" uunet.uu.net!ficc!peter Disclaimer: My typos are my own damn business. peter@ficc.uu.net
mat@amdahl.uts.amdahl.com (Mike Taylor) (10/26/88)
In article <15964@agate.BERKELEY.EDU>, matloff@bizet.Berkeley.EDU (Norman Matloff) writes: > ^^^^^^^^ > > Based on parameters of Berkeley RISC I or II, the register-saving > might take on the order of 0.1 msec. If the quantum size is set to > be in the range claimed to be typical in the Peterson and Silberschatz > OS book, i.e. 10 to 100 msec, then we see that the register-saving > issue for a RISC with lots of registers has probably been greatly > overemphasized. > > Comments? > > Norm Matloff I have trouble with milliseconds, but it depends on the workload and the OS variant. How about transaction processing, where there may be as few as (say) 4K cycles between process switches in a message-oriented environment? (I know this has nothing to do with NeXT.) Then cache and register effects may be very significant - particularly if you dump a large register file into a cache. -- Mike Taylor ...!{hplabs,amdcad,sun}!amdahl!mat [ This may not reflect my opinion, let alone anyone else's. ]
fotland@hpihoah.HP.COM (Dave Fotland) (10/26/88)
>Based on parameters of Berkeley RISC I or II, the register-saving >might take on the order of 0.1 msec. If the quantum size is set to >be in the range claimed to be typical in the Peterson and Silberschatz >OS book, i.e. 10 to 100 msec, then we see that the register-saving >issue for a RISC with lots of registers has probably been greatly >overemphasized. >Comments? > Norm Matloff ---------- This assumes all your processes are compute bound and run for the whole time slice. In commercial applications there are very few instructions between system calls, and these frequently block, causing a context switch. If you only execute 10,000 instructions between context switches (about 1 msec) then a .1 msec overhead for saving and restoring the registers is a big deal. If you are only interested in workstations running mainly single compute-bound jobs then register windows don't cost very much performance, but if you want to build a general purpose architecture that can also be used for large commercial systems then you probably want to leave them out. Also, if you want to build a general purpose system that can be used for real time applications, that .1 msec in your interrupt latency could be a problem. -David Fotland
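[Fotland's contrast can be put in numbers; an illustrative sketch only, using his assumed 0.1 ms save/restore cost and a nominal 10 MIPS execution rate.]

```python
SAVE_MS = 0.1   # assumed register save/restore cost per switch (from the post)

def overhead_fraction(instructions, mips=10):
    """Fraction of each scheduling interval spent on register save/restore."""
    run_ms = instructions / (mips * 1000.0)   # useful work between switches
    return SAVE_MS / (SAVE_MS + run_ms)

compute_bound = overhead_fraction(1_000_000)  # full 100 ms slice: negligible
transactional = overhead_fraction(10_000)     # ~1 ms between switches: ~9%
print(compute_bound, transactional)
```

The same 0.1 ms is under 0.1% of a full quantum but roughly 9% of a transaction-sized interval, which is the whole disagreement in this thread.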
matloff@bizet.Berkeley.EDU (Norman Matloff) (10/26/88)
In article <23367@amdcad.AMD.COM> tim@crackle.amd.com (Tim Olson) writes: >Actually, the register saving is more likely to be on the order of 10 to >20 microseconds (order of magnitude less than the 0.1 msec you suggest). >Comparing 100 context-switches per second to 350,000 procedure calls per >second, it isn't hard to see where to concentrate your optimization >efforts... My computation was a conservative one, assuming (e.g.) the slow 400 ns cycle time on RISC I, and taking into account that LOAD/STOREs take an extra cycle. But the point is, and you seem to agree, that the often-voiced (and recently brought up in comp.arch) claim that context switches would make multiple-window-register-file-based RISC's unsuitable for timeshare applications is just simply not borne out by the data. Norm
mash@mips.COM (John Mashey) (10/26/88)
In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes: >In article <23367@amdcad.AMD.COM> tim@crackle.amd.com (Tim Olson) writes: > >But the point is, and you seem to agree, that the often-voiced (and >recently brought up in comp.arch) claim that context switches would >make multiple-window-register-file-based RISC's unsuitable for >timeshare applications is just simply not borne out by the data. In the following, it must be noted that I am NOT biased in favor of register windows: 1) It is CLEAR that in a typical UNIX environment, saving/restoring a SPARC or 29K register file is not, in and of itself, particularly important, compared with typical UNIX scheduling. [Other people got there first and ran the numbers, I see.] Of course it costs something, and even little things add up, but I doubt that this is a dominant effect. 2) It is CLEAR that one MIGHT care about this in a) Real-time applications that require guaranteed minimal latency. (Note that the real real-time folks would consider anathema the solution of keeping one window free and then faulting the others in as needed.) [These folks like things like locking things in caches, for example.] b) Heavy transaction-oriented systems (as Mike Taylor noted); these could either be big database systems or things like electronic switching systems, which have (effectively) numerous small processes. In both cases, one would have to run the numbers and see, and this is much more instance-specific than a general UNIX environment. 3) The UNIX kernel can sometimes be painful for register windows (as in SPARC, but NOT as in the non-window styles of 29K or CRISP) as follows: Register window design (as in UCB) used certain kinds of programs to create the statistics to support the design. User programs often bounce around in a fairly shallow window-count. UNIX kernels are worse. They often zoom up and down 10-12 levels very quickly, causing window faults like crazy. 
I have to believe the SunOS folks have been working hard to tune for this. 4) Finally, issues of multi-user performance on Sun-4s is a completely separate matter. As discussed in various USENIX papers, the kind of virtual cache used in Sun-[34]/2xx needs substantial cache-flushing on context switch [the UNIX u-area, particularly]. This can be gotten around, but takes a while to do. Of course, this effect has nothing at all to do with register windows. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (10/26/88)
In article <2005@ficc.uu.net> peter@ficc.uu.net (Peter da Silva) writes: >In article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) writes: >> [7] This is very confusing. Most RISCs use 3-address operations, i.e., >> reg3 = reg1 OP reg2. >> rather than just 2-address ops: >> reg1 = reg1 OP reg2 >I've been out of things for a while, but didn't RISCs use to use either >stack or load-store architecture? Or was that just RISC-1? RISCs are mostly load/store designs, but maybe I misread what you meant. Most RISCs use load/store designs, where a single load/store accesses 1 memory object, which generally can't cross page (or even naturally-aligned object) boundaries. Some of them allowed for simple indexed and/or auto-increment/decrement addressing. I don't know of any RISCs that have instructions that touch 3 addresses in memory, so I assume you were asking about the 3-operand forms (in registers), which are used by most RISCs. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
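[A toy model of the load/store split Mashey describes; a hypothetical mini-ISA, not any real chip: memory is touched only by explicit loads and stores, while ALU operations are 3-address and register-to-register.]

```python
# Toy load/store machine: ALU ops are 3-address, register-to-register;
# only LOAD/STORE touch memory, one naturally-aligned word at a time.
def run(program, memory):
    regs = [0] * 8
    for op, *args in program:
        if op == "load":          # load  rD, addr
            rd, addr = args; regs[rd] = memory[addr]
        elif op == "store":       # store rS, addr
            rs, addr = args; memory[addr] = regs[rs]
        elif op == "add":         # add   rD, rS1, rS2  (reg3 = reg1 OP reg2)
            rd, r1, r2 = args; regs[rd] = regs[r1] + regs[r2]
    return memory

# A CISC-style "add mem[0] into mem[1]" becomes load/load/add/store:
mem = run([("load", 1, 0), ("load", 2, 1),
           ("add", 3, 1, 2), ("store", 3, 1)], [5, 7])
print(mem)  # [5, 12]
```

The add itself names three registers but, as noted in [7], still needs only the same 2 read-ports and 1 write-port a 2-address form would use; the memory traffic is confined to the explicit load/store steps.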
csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (10/26/88)
In article <6865@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: >As you note, not-yet-announced. On the other hand, MIPS R3000s >do 42K Dhrystones, and they're already in real machines, and vendors >are quoting the CPUs at $10/mip, i.e., $200 for 25MHz parts. >Starting from scratch in 1984, and getting the first systems in mid-1986, >the high-performance VLSI RISC [i.e., MIPS as example] is: > 1986 5 MIPS > 1987 10 MIPS > 1988 20 MIPS The above two paragraphs aren't here for any good reason. I just liked them. (Remember that an Amdahl 5890 [the second fastest scalar processor in the world...:-] does on the order of 42 or 43K Dhrystones.) >>From: doug@edge.UUCP (Doug Pardee) >>The incorrect assumption here is that you would want to build a mainframe >>using RISC technology -- that RISC technology has anything to offer at >>that price/cost level. >Well, M/2000s act like 5860s, and we think next year's M/xxxx will >make 5990s sweat some. Why wouldn't we want to build RISC-based mainframes? >Lots of people do. A couple of things. At Amdahl, people do think about things like building a RISC based mainframe processor. The big problem that arises is in guaranteeing object-code compatibility for old COBOL binaries that do ugly things like use self-modifying code. But mainframe people are definitely interested in RISC technology, and are working on thinking up ways to take advantage of the technology. John Mashey brings up a point that I've never had a satisfactory answer to. If we assume that RISC-based manufacturers can build machines that outperform mainframes, where will companies like Amdahl make their money? When I asked this question around Amdahl, the answer was "I/O bandwidth. I/O bandwidth!" To what extent would next year's M/xxxx (40 Mips?) processor really make a 5990 sweat? I'll concede that on some programs, this processor-to-come will be as fast as a 5990. 
But let's look at the kinds of processing that are common on mainframes: database processing. A 5990 can be equipped with 256 Megabytes of 55-nanosecond static RAM. (That's its main memory, not its cache.) That kind of memory costs a whole lot, and if you need that kind of memory (for your huge database and 3000 users), it's going to cost, even on a RISC based mainframe. The 5990 also has lots of I/O bandwidth. (Anyone want to help me with the numbers here?) I believe that you can hook up something like 32 4.5-Megabyte (byte, not bit) per second channels to one of these beasties. That kind of I/O bandwidth costs. (For comparison, a diskless Sun has about 1.25 Megabytes per second of bandwidth [10 Megabit Ethernet]. A diskful Sun probably doesn't have much more than 4 Megabytes per second. So, a mainframe can do something like 30 times as much I/O as a workstation...) (People at Amdahl would also mention that when you build a mainframe, it has to be highly reliable and extremely serviceable. Apparently, there's a fair amount of hardware and money that go into increasing the reliability and serviceability of a mainframe.) So, the basic claim that I want to make, and that I'd like to hear counter-arguments to, is that if you build a RISC-based mainframe, it's still going to cost $10,000,000. (Random thoughts... People at Amdahl are starting to worry that the next generation of Amdahl mainframes might be able to support 64K concurrent processes, or at least enough processes to make pids wrap way too frequently. Has MIPS started worrying about the problem of 16-bit pids yet? Seems like MIPS might run into trouble in 1990 or 1991...) (16-bit major/minor device numbers are already too small for a 5890 [have you ever tried to configure 3000 terminal devices in an 8-bit field?] How much trouble is MIPS having with this 16-bit limit?) -- Chuck
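[Chuck's bandwidth ratio, spelled out with his own figures; the 4 MB/s workstation number is his rough estimate, not a measured value.]

```python
CHANNELS = 32
MB_PER_CHANNEL = 4.5                   # MB/s per channel (from the post)
mainframe = CHANNELS * MB_PER_CHANNEL  # 144 MB/s aggregate channel bandwidth
workstation = 4.0                      # diskful Sun estimate (from the post)
ratio = mainframe / workstation        # 36x -- "something like 30 times"
print(mainframe, ratio)
```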
csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (10/26/88)
In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes: >But the point is, and you seem to agree, that the often-voiced (and >recently brought up in comp.arch) claim that context switches would >make multiple-window-register-file-based RISC's unsuitable for >timeshare applications is just simply not borne out by the data. > > Norm If I remember the arguments from MIPS correctly (want to help me out John?), there's a stronger objection to multiple-window-register-files. I think it's something to the effect that register-windows cause the load/store access time to be slower. I think there also may be some argument that a good compiler makes multiple-windows relatively unnecessary. Could one of you nice people address something like the above and help me clarify my thinking? -- Thanks, Chuck
jack@cwi.nl (Jack Jansen) (10/26/88)
In article <15964@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes: > >Based on parameters of Berkeley RISC I or II, the register-saving >might take on the order of 0.1 msec. If the quantum size is set to >be in the range claimed to be typical in the Peterson and Silberschatz >OS book, i.e. 10 to 100 msec, then we see that the register-saving >issue for a RISC with lots of registers has probably been greatly >overemphasized. > >Comments? > > Norm Matloff Well, 100 usec might be fine for standard Unix, but it is definitely not fine for operating systems supporting light-weight threads. In Amoeba, our distributed system, thread-to-thread switch time is on the order of 20-50 usec, and on a fast machine like a R2000 it would probably be down to 5-20 usec, not counting the register save. What I would like is some help from the architecture, like dirty bits on groups of registers or something. Actually, I'm not *that* familiar with the R2000 (or the other RISC chips, for that matter); do any of them provide a feature for this? Also, does anyone know thread switch times for Mach or other systems that support light-weight threads, and how these would be affected on machines with large register files? -- Fight war, not wars | Jack Jansen, jack@cwi.nl Destroy power, not people! -- Crass | (or mcvax!jack)
daveh@cbmvax.UUCP (Dave Haynie) (10/26/88)
in article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) says: > Xref: cbmvax comp.arch:7151 alt.next:236 >>Priced the ~8 MIPS Sun 4 lately? Or the ~14 MIPS 88K chipset. How about >>an Apollo 10K? RISC machines are starting to get fast, and they're even >>starting to get down in price, but these two directions haven't met yet. > [5] Actually, this is the wrong reason: you can put together MIPS chipsets > at similar (or even slightly better) cost/performance levels (have you > priced a 68882 lately, for example?) They probably pay less than $50 for the 25MHz part, if they are able to convince Motorola to give them good volume pricing. > All of these depend on quantity, and what it is you're trying to build. > Admittedly, it's hard for us to build anything less than about 6 VUPs. > I suspect you can build a CPU (+ FPU) subsystem like that for around $500, > given large quantities, maybe $400-$500 as the new cache chips come out. Which is still more than twice what a 68030 based system's chips will cost, in reasonable quantities. Perhaps twice the performance, or better, if the caches work well and you can live with the same priced memory the 68030 system can use. If you're really trying to design a workstation, maybe you should consider RISC at this point, because all the PCs are starting to use '020s, '030s, and '386s. Still it's a problem of choosing the RISC of the week and betting that it will succeed. > -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> -- Dave Haynie "The 32 Bit Guy" Commodore-Amiga "The Crew That Never Rests" {uunet|pyramid|rutgers}!cbmvax!daveh PLINK: D-DAVE H BIX: hazy Amiga -- It's not just a job, it's an obsession
guy@auspex.UUCP (Guy Harris) (10/26/88)
>>The more registers, the more to save at every context switch in a typical
>>OS (such as UNIX). Which will slow things down if you have many processes
>>running.

>What data do you have to substantiate this claim? This is another popular
>misconception, I think.

There appears to be a belief out there that register windows slow context
switches down on Sun-4s; this may be the source of the claim. It isn't
true; what slows them down is the expense of flushing the entries for the
U area from the virtual address cache. The U area is at the same address
in kernel virtual space in all processes, and the context number is the
same for all those processes, so you can't rely on the context number to
distinguish between the virtual addresses of the U areas in different
processes. The fix is to put them at different virtual addresses....
mash@mips.COM (John Mashey) (10/27/88)
In article <468@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
>>>From: doug@edge.UUCP (Doug Pardee)
>>>The incorrect assumption here is that you would want to build a mainframe
>>>using RISC technology -- that RISC technology has anything to offer at
>>>that price/cost level.
>>Well, M/2000s act like 5860s, and we think next year's M/xxxx will
>>make 5990s sweat some. Why wouldn't we want to build RISC-based mainframes?
>>Lots of people do.

In general, I agree 100% with Chuck: CPU performance doesn't necessarily
imply I/O performance (which I've said numerous times), and if I'd not
been in catchup mode, I would have said "sweat some on uniprocessor CPU
performance".

Actually, in terms of market conflict, as far as I can tell, despite
managing to bump into lots of other people, Amdahl is one we don't, and
probably never will. [Why? 1) Most people who buy from Amdahl have
already chosen their architecture, based on existing applications.
2) They pick Amdahl over other PCMs or IBM for a variety of reasons,
including cost/performance or smart features like the multiple-domain
thing. 3) Their customers tend to be very loyal, as they appear to be
treated well. (These comments arise from having spoken at an Amdahl
User's Group meeting not long ago and spending a lot of time talking to
their customers.)]

>A couple of things. At Amdahl, people do think about things like building
>a RISC-based mainframe processor. The big problem that arises is in
>guaranteeing object-code compatibility for old COBOL binaries that do
>ugly things like use self-modifying code. But mainframe people are
>definitely interested in RISC technology, and are working on thinking
>up ways to take advantage of the technology.

As noted elsewhere, it makes perfect sense, once you have some base for
it, to keep pushing an architecture further. S/360 and its descendants
are clearly a fertile area for this.

>John Mashey brings up a point that I've never had a satisfactory
>answer to.
>If we assume that RISC-based manufacturers can build
>machines that outperform mainframes, where will companies like Amdahl
>make their money? When I asked this question around Amdahl, the
>answer was "I/O bandwidth. I/O bandwidth!"

This is a legitimate technical answer, as it certainly distinguishes
things with mainframe-class CPU performance from real, large mainframes.
(Actually, I think the other issues mentioned above are at least as
important.)

>To what extent would next year's M/xxxx (40 Mips?) processor really
>make a 5990 sweat? I'll concede that on some programs, this processor-
>to-come will be as fast as a 5990. But let's look at the kinds of
>processing that are common on mainframes: database processing.
>A 5990 can be equipped with 256 Megabytes of 55-nanosecond static RAM.
>(That's its main memory, not its cache.) That kind of memory costs
>a whole lot, and if you need that kind of memory (for your huge
>database and 3000 users), it's going to cost, even on a RISC-based
>mainframe.

Yep, absolutely. My guess is that it will be a while before people build
RISC-based systems that can capture these sorts of applications:
a) You do have to build memories with a lot of bandwidth.
b) You have to build I/O, spend a lot of $ on reliability & serviceability.
c) You have to move the applications. [IMS? CICS? hmmm.]
d) You have to be a company of such size and nature that those folks will
trust those applications to you....and some of those folks have only
recently noticed that companies like DEC or Amdahl are substantial enough
to consider :-)

On the other hand, some mainframe cycles go towards engineering
applications, or towards general time-sharing, and other less immediately
"mission-critical" applications, and some of those we actually get a
chance to fight for. (Actually, quite a few MIPS machines are used in
multi-user database environments, but not in the same ones that Amdahls
would be used in.)

>The 5990 also has lots of I/O bandwidth.
>(Anyone want to help me
>with the numbers here?) I believe that you can hook up something
>like 32 4.5-Megabyte (byte, not bit) per second channels to one of these
>beasties. That kind of I/O bandwidth costs. (For comparison,
>a diskless Sun has about 1.25 Megabytes of bandwidth [10-Megabit Ethernet].)
>A diskful Sun probably doesn't have much more than 4 Megabytes.
>So, a mainframe can do something like 30 times as much I/O as
>a workstation...)

Yes, I believe we won't have quite that bandwidth next year, although the
I/O will be quite respectable at the price. Of course, we worry about the
issue in general: CPU performance is going up so fast right now, it's
clearly leaving cost-equivalent I/O behind. On the other hand, there is
interesting work going on in the world towards, for example, farms of
small disks, which can get some good bandwidth rather cheaply.

>(People at Amdahl would also mention that when you build a mainframe,
>it has to be highly reliable and extremely serviceable. Apparently,
>there's a fair amount of hardware and money that go into increasing
>the reliability and serviceability of a mainframe.)

Yes. Note that here there is some edge for the RISCs, just because the
basic hardware is simpler in the first place; it's less work to air-cool
them, etc. The CPU+cache can be 1 board, etc. Again, this only applies
to the CPU subsystem, but that's certainly one of the more stressful
areas.

>So, the basic claim that I want to make, and that I'd like to hear
>counter-arguments to, is that if you build a RISC-based mainframe,
>it's still going to cost $10,000,000.

When you get to really large configurations, it's clear that very little
of the money is in the CPU any more. On the other hand, sometimes you
can trade CPU performance for some kinds of I/O gear (i.e., a small
example would be having cheaper serial-I/O support because you can afford
to have more CPU overhead per interrupt, because the CPU is faster).
I'll have to think about the number: it will be a long time, if ever,
before we build something that costs that much.

>(Random thoughts... People at Amdahl are starting to worry that
>the next generation of Amdahl mainframes might be able to support
>64K concurrent processes, or at least enough processes to make
>pid's wrap way too frequently. Has MIPS started worrying about
>the problem of 16-bit pid's yet? Seems like MIPS might run into
>trouble in 1990 or 1991...) (16-bit major/minor device numbers
>are already too small for a 5890 [have you ever tried to configure
>3000 terminal devices in an 8-bit field?] How much trouble is
>MIPS having with this 16-bit limit?)

We haven't done a lot in that direction, mainly because:
a) It's more likely to get solved as part of the general UNIX evolution,
I think. You guys have just run into it earlier than most people.
b) We don't need to quite yet. Although we have some M/1000s that have
60-100 users on them, and M/2000s that will have more, I suspect we won't
have 3000 for a while.

Among other things, people get greedy, if the cycles are cheap. (We now
have the spectacle of people having gotten used to having a 20-mips
machine by themselves, and wanting it all of the time. Of course, we
have people who'd barely be satisfied with Cray-YMPs on their desks, so
that's probably not surprising.)

Anyway, mainframe I/O is definitely in a different league right now.
In fact, it might be instructive for the newsgroup for somebody to post
a description of what a 5990 memory hierarchy looks like in more detail.

This newsgroup argues more about microprocessors than mainframes. If you
review computing history, you find that each wave [mainframe, mini,
micro] has tended to repeat much of the evolution of the earlier waves.
Now that VLSI micros are getting supermini & up performance, many of the
same old issues will arise.
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (10/27/88)
In article <469@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
>If I remember the arguments from MIPS correctly (want to help me out
>John?), there's a stronger objection to multiple-window-register-files.
>I think it's something to the effect that register-windows cause the
>load/store access time to be slower. I think there also may be some
>argument that a good compiler makes multiple-windows relatively
>unnecessary.

1) Register windows make load/store access slower: I don't particularly
believe this. I believe that people sometimes think that windows may
reduce the numbers of loads and stores enough that they can get away with
slower loads and stores. I believe that whether that's true or not
depends on the application; it certainly is not true for some
applications. As far as I know, the slower loads/stores on existing
SPARCs have nothing to do with having windows, but with the cache design,
and are not intrinsic to the architecture, but rather the implementation.

2) Good compilers + enough registers do reduce the benefits gained by
register windows; over many benchmarks, we find that the average number
of registers saved/restored is about 1.5-2.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
jjw@celerity.UUCP (Jim ) (10/27/88)
In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes:
>The more registers, the more to save at every context switch in a typical
>OS (such as UNIX). Which will slow things down if you have many processes
>running.

One solution to this is to have a "register cache" which holds the
register sets for several processes. Context switches among the loaded
processes are then very fast. The save/restore penalty need be paid only
when a new process is brought into the mix. Given that there is a
"locality" of processes (processes which just ran are likely to be run
again soon), this significantly reduces the context switch cost.
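A minimal sketch of the register-cache idea (purely hypothetical code, not from the poster): keep N hardware register-set slots, switch by slot when the incoming process is already resident, and pay the save/restore cost only on a miss, evicting the least recently used slot.

```c
#include <string.h>

#define NSLOTS 4   /* assumption: hardware holds 4 register sets */
#define NREGS  32

struct slot { int pid; int lru; long regs[NREGS]; };
static struct slot cache[NSLOTS];
static int clock_tick;
static int save_restores;   /* counts how often the slow path ran */

/* Switch to process `pid` (pids > 0); returns its register-set slot. */
struct slot *switch_to(int pid)
{
    int i, victim = 0;
    for (i = 0; i < NSLOTS; i++)
        if (cache[i].pid == pid) {          /* fast path: already loaded */
            cache[i].lru = ++clock_tick;
            return &cache[i];
        }
    for (i = 1; i < NSLOTS; i++)            /* miss: pick the LRU slot */
        if (cache[i].lru < cache[victim].lru)
            victim = i;
    save_restores++;                        /* save old set, load new one */
    cache[victim].pid = pid;
    cache[victim].lru = ++clock_tick;
    memset(cache[victim].regs, 0, sizeof cache[victim].regs);
    return &cache[victim];
}
```

Switching among up to NSLOTS recently-run processes never touches memory; only bringing in an extra one does, which is exactly the "locality of processes" bet.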
mash@mips.COM (John Mashey) (10/27/88)
In article <5112@cbmvax.UUCP> daveh@cbmvax.UUCP (Dave Haynie) writes:
>in article <6865@winchester.mips.COM>, mash@mips.COM (John Mashey) says:
>> [5] Actually, this is the wrong reason: you can put together MIPS chipsets
>> at similar (or even slightly better) cost/performance levels (have you
>> priced a 68882 lately, for example?)
>They probably pay less than $50 for the 25MHz part, if they are able to
>convince Motorola to give them good volume pricing.

Unfortunately, all of this depends on what things really cost, when they
cost what they cost, and at what volumes. I suspect it may be a while
till the 68882 costs $50, but then I haven't bought any lately, so I
could be wrong. It's often very hard to get real numbers.

>the 68030 system can use. If you're really trying to design a workstation,
>maybe you should consider RISC at this point, because all the PCs are
>starting to use '020s, '030s, and '386s. Still it's a problem of choosing
>the RISC of the week and betting that it will succeed.

Yep. The latter is the real issue. Mark Linimon suggested a paraphrase
of the Russian roulette analogy especially appropriate to RISC:
Russian roulette in the delay slot.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
henry@utzoo.uucp (Henry Spencer) (10/27/88)
In article <332@pvab.UUCP> robert@pvab.UUCP (Robert Claeson) writes:
>The more registers, the more to save at every context switch in a typical
>OS (such as UNIX). Which will slow things down if you have many processes
>running.

This one comes up regularly, sigh... Whether it gives you a net slowdown
or not depends on how much context-switching is going on, how long a
process runs between context switches (i.e. how much chance it has to
take advantage of having that data in registers), and how much you care
about interrupt latency. If context switches are not *too* common and
latency is not a big deal, lots of registers can be a huge net win even
if it does slow context switching. The same comment applies to
non-writethrough caches.
--
The dream *IS* alive... | Henry Spencer at U of Toronto Zoology
but not at NASA. | uunet!attcan!utzoo!henry henry@zoo.toronto.edu
matloff@bizet.Berkeley.EDU (Norman Matloff) (10/27/88)
In article <469@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
*In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes:
*>But the point is, and you seem to agree, that the often-voiced (and
*>recently brought up in comp.arch) claim that context switches would
*>make multiple-window-register-file-based RISC's unsuitable for
*>timeshare applications is just simply not borne out by the data.
*If I remember the arguments from MIPS correctly (want to help me out
*John?), there's a stronger objection to multiple-window-register-files.
*I think it's something to the effect that register-windows cause the
*load/store access time to be slower. I think there also may be some
*argument that a good compiler makes multiple-windows relatively
*unnecessary.

A compiler which automatically inlines procedure calls might be able to
do what you are saying. However, there may be some unpleasant side
effects, depending on the language. Actually, one of my grad students is
making a thesis out of this, so I'll try to report more on it later.

Norm
dharvey@wsccs.UUCP (David Harvey) (10/27/88)
In article <10194@cup.portal.com>, bcase@cup.portal.com (Brian bcase Case) writes:
>
> What has been left out of this discussion is the software side of the
> issue. The almighty Compiler can save us from our sins! It is our
> saviour! Long live common subexpression elimination! Hail to the code
> reorganizer! Praise the register allocator! Jim Bakker, watch out!

This view is typical of hardware types. By all means, let's pass the
buck to the next guy. So the compiler writer has his (her) share of
nightmares actually getting something to make the thing compile some
code. And then the systems programmer comes along and inserts a few more
kludges to make the machine purr. Did he document them? I hope so. Now
it is the application programmer's turn to s***w things up.

If my memory serves me correctly, it is much easier to get something up
and running on a Motorola 68000 than on an Intel 8086 (very nasty, those
beasty little segments). And miracle of miracles, we learn that over 70%
of computing costs are software. It seems like hardware types should be
designing their end of the deal to reduce costs at the other end.

dharvey@wsccs
What do I know...I don't design the d**n things, I just use them.
koopman@a.gp.cs.cmu.edu (Philip Koopman) (10/27/88)
In article <468@oracle.UUCP>, csimmons@hqpyr1.oracle.UUCP (Charles Simmons) writes: > In article <6865@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes: > >As you note, not-yet-announced. On the other hand, MIPS R3000s > >do 42K Dhrystones, and they're already in real machines, and vendors > >are quoting the CPUs at $10/mip, i.e., $200 for 25MHz parts. Hey, wait a minute. You can't just spec the price of the CPU itself, you need to include the cost of other required chips (like cache controllers, MMU's, or whatever) when you say how much the CPU costs. On some machines, you can't run without these extra components. Phil Koopman koopman@maxwell.ece.cmu.edu Arpanet 5551 Beacon St. Pittsburgh, PA 15217 PhD student at CMU and sometime consultant to Harris Semiconductor.
guy@auspex.UUCP (Guy Harris) (10/27/88)
>> [7] This is very confusing. Most RISCs use 3-address operations, i.e.,
>> reg3 = reg1 OP reg2.
>> rather than just 2-address ops:
>> reg1 = reg1 OP reg2
>I've been out of things for a while, but didn't RISCs use to use either
>stack or load-store architecture? Or was that just RISC-1?

They still do. Note that the 2-address and 3-address operations he lists
all have "regN" as the operands; RISCs tend to use load-store operations
as their only *memory-reference* operations, but (unless you have magic
"memory" locations that do arithmetic) you generally need arithmetic
operations as well to make a useful computer. RISCs tend to have only
register-to-register arithmetic operations, and they tend to be
3-"address" in the sense that they operate on two registers and stick the
result in a third, with none of them obliged to be the same register.

>Anyway, I brought up two CISCy features I'd read about here recently. That
>was one. Addressing modes are the other.
>
>And addressing modes... even just indexing and autoincrement... are pretty
>CISCy.

Umm, if indexing is "pretty CISCy", then just about every machine out
there is a CISC, which makes "CISCy" pretty much uninteresting as an
adjective, unless you can show an interesting machine that lacks
indexing. "Indexing" generally refers to forming an effective address by
adding the values in one or more registers to a constant offset, and both
the MIPS Rn000 (OK, John, what's the term you use to refer to the R2000
and R3000, or are they different enough that such a term wouldn't be
useful?) and SPARC, to name just two machines generally thought of as
RISCs, support indexing in that sense (register+offset on MIPS,
register+register+offset on SPARC).
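The 2-address/3-address distinction, and what "indexing" means here, can be shown in a toy form (a hypothetical illustration, not any real instruction set):

```c
#include <assert.h>

/* 3-address style: the destination need not be either source. */
#define ADD3(r, a, b)  ((r) = (a) + (b))

/* 2-address style: the destination is always the first source,
 * so the old value of `a` is destroyed. */
#define ADD2(a, b)     ((a) = (a) + (b))

/* Indexing, MIPS style: effective address = base register + offset.
 * (SPARC additionally allows base + index register + offset.) */
long load_indexed(const long *base, long offset_words)
{
    return base[offset_words];
}
```

With 3-address form a compiler keeps both source values live for later reuse; the 2-address form forces an extra copy whenever the first operand is still needed.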
guy@auspex.UUCP (Guy Harris) (10/27/88)
> Register window design (as in UCB) used certain kinds of programs > to create the statistics to support the design. User programs > often bounce around in a fairly shallow window-count. > UNIX kernels are worse. They often zoom up and down 10-12 levels > very quickly, causing window faults like crazy. I have to believe > the SunOS folks have been working hard to tune for this. While I was at Sun, I don't remember there ever having been any effort to reduce the depth of the kernel call stack in order to speed things up on SPARC-based Suns. (Remember, they have to make it run sufficiently fast on three architectures, not one - four, if you count 370/XA and compatibles, and even more, if you consider that a lot of Sun code is going into S5R4....)
mash@mips.COM (John Mashey) (10/28/88)
In article <7681@boring.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>Well, 100 usec might be fine for standard unix, it is definitely not
>fine for operating systems supporting light-weight threads.
>In Amoeba, our distributed system, thread-to-thread switch time
>is on the order of 20-50 usec, and on a fast machine like an R2000
>it would probably be down to 5-20 usec, not counting the register save.
>What I would like is some help from the architecture, like dirty bits on
>groups of registers or something.
>Actually, I'm not *that* familiar with the R2000 (or the other RISC
>chips, for that matter); do any of them provide a feature for this?

There are two styles of doing this, most typically associated with the
floating-point register file.
a) Keep a dirty bit.
b) Keep a "usable" bit, where you trap if somebody issues an FP
instruction.

In case a), on a context switch from task 1 to 2:
	if 1's registers are dirty, save them
	load 2's state into the registers
	switch

In case b), for the same context switch:
	maintain an "owner" for the FP regs, which is either a task (X), or empty
	note that 1 may well not own the FP regs at this point
	before switching to 2:
		if 2 is the owner of the FP regs, turn usability on
		if 2 is not the owner, turn usability off
	switch to 2
	if 2 uses an FP op, trap it:
		save the FP state into X's context
		load up 2's FP state into the registers
		owner = 2

There are variant strategies, depending on how fancy you want to get.
MIPS has a usability bit for each coprocessor; we also actually keep bits
in the executables that say which registers got used. [We put these in
just in case, although more for special-purpose environments. They turn
out not to be very useful: the optimizers are too good at grabbing every
register.] SPARC uses a similar technique, I think. Clipper uses a
dirty bit. Various other micros do one or the other.

BTW: it is not instantly obvious that one would add a bit in for just
this purpose.
On a 16.7MHz M/120, it takes something like 4-30 microseconds to save 32
registers and restore 32 registers [the 4 is all cache hit, the 30 is all
cache miss]. On a 25MHz M/2000, it takes 3-10 microseconds, even with a
large (i.e., inherently longer-latency) memory system: note that block
refill of the caches helps a lot in that case. I'd guess that "typical"
numbers, especially in a high context-switch environment, would be on the
order of 15 & 7 microseconds, respectively, in a general-purpose
environment. [In a real-time environment, one would gimmick some of the
things to avoid I-cache misses.]

Thus, a usability bit might save this for you, some of the time. We
actually put it in for several reasons:
a) Symmetry: we actually use a usability bit on coprocessor 0, which
subsumes what would otherwise be privileged ops.
b) Simplicity of handling systems without an FPU.
c) And, finally, the ability to avoid FP context switches as described.
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
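Mashey's case (b) is what is now usually called lazy FP context switching. A minimal sketch of the trap-driven version (hypothetical code modeling the scheme, not MIPS's actual kernel):

```c
#include <stddef.h>

#define NFPREGS 16

struct task {
    double fp_state[NFPREGS];   /* saved FP context */
};

static struct task *fp_owner;   /* task whose state is live in the FPU, or NULL */
static struct task *current;
static int fp_usable;           /* the "usable" bit */
static double fpu[NFPREGS];     /* the physical FP register file */
static int fp_saves;            /* counts trap-driven FP state reloads */

/* Context switch: no FP registers are touched at all. */
void switch_to(struct task *t)
{
    current = t;
    fp_usable = (fp_owner == t);
}

/* The first FP instruction after a switch traps here if fp_usable is off. */
void fp_trap(void)
{
    int i;
    if (fp_owner != NULL)                    /* save the owner's context */
        for (i = 0; i < NFPREGS; i++)
            fp_owner->fp_state[i] = fpu[i];
    for (i = 0; i < NFPREGS; i++)            /* load current's context */
        fpu[i] = current->fp_state[i];
    fp_owner = current;
    fp_usable = 1;
    fp_saves++;
}

/* What an FP instruction looks like to this model. */
void fp_op(void)
{
    if (!fp_usable)
        fp_trap();
    fpu[0] += 1.0;                           /* some FP work */
}
```

A task that never issues an FP op after a switch never pays for the FP registers at all; two FP-heavy tasks ping-ponging pay on every switch, which is why the bit only saves you "some of the time."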
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)
In article <10447@cup.portal.com> bcase@cup.portal.com (Brian bcase Case) writes:
>>The more registers, the more to save at every context switch in a typical
:
>What data do you have to substantiate this claim? This is another popular
>misconception, I think.
(Interesting data on Pyramid study omitted)
>which is 0.20 percent of the total available CPU time. I don't think
>this is significant. For some implementations, it is more like 1 cycle

I agree with this point and would like to add that there may be some
simple things which can be added to hardware to speed up context
switching. CDC has typically used a "save everything" approach, with the
complete save taking place with a single hardware instruction. This
instruction is easy to implement in a RISC machine as well; it trades
some extra memory bandwidth for a potential payoff in less code executed
to do a context switch. However, it may be true that picking the next
runnable process dominates by far the cost of a context switch. Is there
any hard data out there?
--
Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035 Phone: (415)694-6117
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)
In article <16003@agate.BERKELEY.EDU> matloff@iris.ucdavis.edu (Norm Matloff) writes: >But the point is, and you seem to agree, that the often-voiced (and >recently brought up in comp.arch) claim that context switches would >make multiple-window-register-file-based RISC's unsuitable for >timeshare applications is just simply not borne out by the data. Correct. Register windows seem to be a bad idea, but not because of increased context switch time. Rather, they seem to be marginally less productive of performance gain than other uses of equivalent silicon/gates/etc. -- Hugh LaMaster, m/s 233-9, UUCP ames!lamaster NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov Moffett Field, CA 94035 Phone: (415)694-6117
mash@mips.COM (John Mashey) (10/28/88)
In article <313@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes: >> Register window design (as in UCB) used certain kinds of programs >> to create the statistics to support the design. User programs >> often bounce around in a fairly shallow window-count. >> UNIX kernels are worse. They often zoom up and down 10-12 levels >> very quickly, causing window faults like crazy. I have to believe >> the SunOS folks have been working hard to tune for this. >While I was at Sun, I don't remember there ever having been any effort >to reduce the depth of the kernel call stack in order to speed things up >on SPARC-based Suns. (Remember, they have to make it run sufficiently >fast on three architectures, not one - four, if you count 370/XA and >compatibles, and even more, if you consider that a lot of Sun code is >going into S5R4....) I'm surprised that nobody was doing this. Note, however, that this is NOT the "standard UNIX optimized for SPARC" issue, i.e., although squishing the call tree wouldn't particularly help the other architectures, it wouldn't hurt them that much either, and it might help SPARC some. Of course, no sensible software engineer would do terrible distortions to the code to do this, but I can imagine where people might put some effort into (machine-independent) level-squishing. As usual, if Guy says it wasn't being done, it probably wasn't; certainly my comment was speculation. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mash@mips.COM (John Mashey) (10/28/88)
In article <3404@pt.cs.cmu.edu> koopman@a.gp.cs.cmu.edu (Philip Koopman) writes:
>In article <468@oracle.UUCP>, csimmons@hqpyr1.oracle.UUCP (Charles Simmons) writes:
>> In article <6865@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>> >As you note, not-yet-announced. On the other hand, MIPS R3000s
>> >do 42K Dhrystones, and they're already in real machines, and vendors
>> >are quoting the CPUs at $10/mip, i.e., $200 for 25MHz parts.
>Hey, wait a minute. You can't just spec the price of the CPU itself,
>you need to include the cost of other required chips (like cache controllers,
>MMU's, or whatever) when you say how much the CPU costs. On some machines,
>you can't run without these extra components.

100% agree: I just don't know the rest of the numbers offhand. In our
case, the MMU & cache control are on-chip; these days, you add an FPU
(probably costs 1.5-2X the corresponding CPU), and SRAM (which is where
most of the money is), plus some fairly cheap glue parts for the external
memory interface. There was no intention to make people think the entire
CPU core costs $10/mip, and if readers thought that, unthink it. I'd
guess a whole CPU core (CPU + FPU + MMU + cache control + cache +
external memory interface) currently costs about $70-$10/VUP for the kind
of things you build in systems [less for embedded systems, especially as
the 4Kx16 & similar-shaped SRAMs become more available.]
--
-john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com
DDD: 408-991-0253 or 408-720-1700, x253
USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
mat@amdahl.uts.amdahl.com (Mike Taylor) (10/28/88)
In article <7038@winchester.mips.COM>, mash@mips.COM (John Mashey) writes:
> In fact, it might be instructive for the newsgroup for somebody
> to post a description of what a 5990 memory hierarchy looks like in
> more detail.

Your wish is my command. Each processor has a 64K-byte instruction cache
and a 64K-byte operand cache, equipped with their own TLBs. Implemented
in 16K chips, 2.8 ns access (chips also have 1200 logic gates). From
memory, I think it is organized as 128-byte lines, 4-way set associative,
with about 15 cycles miss penalty.

Main storage is up to 512 megabytes of 55ns 256K SRAM, accessed as cache
lines and interleaved. Below main storage, there is expanded storage and
I/O. Expanded storage is up to 2GB of 1M DRAM, accessed as pages.

I/O consists of up to 128 channels. 2 are byte-multiplexor channels,
another 30 are either, and the remainder are only block-multiplexor
channels. Block mux channels go up to 4.5 MB/sec. A typical
configuration has about 5 GB/MIPS of DASD installed. So an average
5990-1400 at (say) 105 370 MVS MIPS would have about 500GB of DASD. Many
of our customers have DASD farms in the terabyte league, however, plus
tape, non-volatile electronic storage ("EDAS"), etc.
--
Mike Taylor ...!{hplabs,amdcad,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
walsh@endor.harvard.edu (Bob Walsh) (10/28/88)
I know that the TCP/IP in the (BSD-derivative) kernel has fewer call
levels than some others (the BBN TCP/IP) because it was felt that
subroutine call overhead was excessive (on the VAX). Though the original
design decision was not based on a RISC chip, I would not be surprised if
gprof were used on the (RISC) kernel by various implementors to see where
time is spent, with the side effect that various routines are brought
in-line or turned into macros. It is the sort of thing done in the
privacy of one's office; a product is not wholly the result of company
policy/non-policy.
peter@ficc.uu.net (Peter da Silva) (10/28/88)
In article <312@auspex.UUCP>, guy@auspex.UUCP (Guy Harris) writes:
> Umm, if indexing is "pretty CISCy", then just about every machine out
> there is a CISC, which makes "CISCy" pretty much uninteresting as an
> adjective, unless you can show an interesting machine that lacks
> indexing.

Well, blithely stepping over the autoincrement question, what about the
Cosmac 1802? The first CMOS microprocessor, the first micro with an
orthogonal instruction set, the first micro with a real-time operating
system. It had all sorts of RISCy features, such as a load-store
architecture, gobs of registers, and orthogonal instructions. The PC was
just a general register, pointed to by the 4-bit P register, as was the
SP, so for a microcontroller application you could do a context switch in
two instructions:

	SEX n
	SEP n

It was widely used in embedded controller applications where low power
was important, well into the early '80s. If RCA had been able to support
and expand it, it'd be a decent contender to the Intel chips today. It's
a much saner design than the 8080, and its successors wouldn't have been
the monstrosities that the 8080 has visited upon us.
--
Peter da Silva `-_-' Ferranti International Controls Corporation
"Have you hugged U your wolf today?" uunet.uu.net!ficc!peter
Disclaimer: My typos are my own damn business. peter@ficc.uu.net
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)
In article <7180@winchester.mips.COM> mash@winchester.UUCP (John Mashey) writes:
>In article <7681@boring.cwi.nl> jack@cwi.nl (Jack Jansen) writes:
>>Well, 100 usec might be fine for standard unix, it is definitely not
:
>BTW: it is not instantly obvious that one would add a bit in for just this
>purpose. On a 16.7MHz M/120, it takes something like 4-30 microseconds
>to save 32 registers and restore 32 registers [the 4 is all cache hit,
>the 30 is all cache miss]. On a 25MHz M/2000, it takes 3-10 microseconds,

On a 50MHz Cyber 205, it takes approximately 200 minor cycles = 4
microseconds to swap the entire processor context. 128 of those cycles
are used to swap all 256 general-purpose registers.

Sometimes people confuse procedure call time (which requires saving only
those registers which will be used; small for small procedures) with
context switching time (the time to switch to another user/process).
There is no need for context switching time to be tiny in an ordinary
(no fine-grained parallelism) system, since even on a very large machine
context switches shouldn't occur more frequently than several
thousand/second. Procedure call time must be very fast, obviously, since
a procedure call may occur several thousand times more often than a
context switch does.

The Cray-2 is the only machine that I know of that has a real problem
with context switching. The reason is that it has an extraordinarily
large user context in the form of "local memory".
--
Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035 Phone: (415)694-6117
lamaster@ames.arc.nasa.gov (Hugh LaMaster) (10/28/88)
In article <17208@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes:
>Correct. Register windows seem to be a bad idea, but not because of

I seem to have overstated my case. I should have said that I consider it unproven that register windows are a good idea relative to other equivalent uses of the same real estate. The only "problem" that I have with register windows is that, like "RISC" (whatever that is), some people make extraordinary unsubstantiated claims.
--
Hugh LaMaster, m/s 233-9, UUCP ames!lamaster
NASA Ames Research Center ARPA lamaster@ames.arc.nasa.gov
Moffett Field, CA 94035 Phone: (415)694-6117
henry@utzoo.uucp (Henry Spencer) (10/29/88)
In article <313@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes: >While I was at Sun, I don't remember there ever having been any effort >to reduce the depth of the kernel call stack in order to speed things up >on SPARC-based Suns... Um, Guy, a lot of us Sun customers concluded quite a while ago that performance (except on benchmarks) is not high on Sun's list of priorities. "Just buy a faster machine, we'll be happy to sell you one." -- The dream *IS* alive... | Henry Spencer at U of Toronto Zoology but not at NASA. |uunet!attcan!utzoo!henry henry@zoo.toronto.edu
alan@pdn.UUCP (Alan Lovejoy) (10/29/88)
In article <28200218@urbsdc> aglew@urbsdc.Urbana.Gould.COM writes:
<>As an aside, the 68030 can do a 32 bit multiply in about (If I remember
<>correctly -- I don't have the book in front of me) 40 cycles. A while
<>back, I tried to write a 32 bit multiply macro that would take less
<>than the 40 or so that the '030 took. I didn't even come close (even
<>assuming lots of registers and a 32 bit word size (which the 6502
<>doesn't have)).
<
<There do exist RISCs with multiply instructions. In fact, real
<multiplies, with full multiplier arrays taking lots of space that
<might otherwise have had to be used for microcode.
<
<>Cory Kempf
<
<Andy Glew
The 88k does a 32-bit integer multiply in 4 cycles (r3000 takes 13
cycles, I believe). A 32-bit integer divide takes the 88k 39 cycles
(r3000 takes 36 cycles, I believe). Of course, if either of the
division operands is negative (signed division opcode), the 88k has to
trap to a software routine to finish the division. In other words,
not all RISCs are wimps just because they don't have "complex
instructions".
--
Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida.
Disclaimer: Do not confuse my views with the official views of Paradyne
Corporation (regardless of how confusing those views may be).
Motto: Never put off to run-time what you can do at compile-time!
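The software multiply Cory tried to write can be sketched in C (an illustrative sketch, not the '030's microcode or any actual 6502 macro; the name mul32 is mine): one shift/test/add step per multiplier bit, up to 32 iterations, which is why a hardware multiplier array or a microcoded loop beats a general instruction sequence.

```c
#include <stdint.h>

/* Software 32-bit multiply by shift-and-add: the classic fallback on a
 * machine without a multiply instruction.  Worst case: 32 iterations of
 * test, add, and two shifts -- far more than 40 total instructions once
 * you count loop overhead. */
uint32_t mul32(uint32_t a, uint32_t b)
{
    uint32_t product = 0;
    while (b != 0) {
        if (b & 1)          /* low multiplier bit set? */
            product += a;   /* add the shifted multiplicand */
        a <<= 1;            /* next bit weight of the multiplicand */
        b >>= 1;            /* consume one multiplier bit */
    }
    return product;         /* low 32 bits of the product */
}
```

(Result is truncated to 32 bits, as a multiply macro on a 32-bit-register machine would be.)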
aglew@urbsdc.Urbana.Gould.COM (10/30/88)
> Umm, if indexing is "pretty CISCy", then just about every machine out > there is a CISC, which makes "CISCy" pretty much uninteresting as an > adjective, unless you can show an interesting machine that lacks > indexing. How about the AMD 29000? I'm sure that Brian Case will comment on this.
larus@paris.Berkeley.EDU.berkeley.edu (James Larus) (10/30/88)
In article <17260@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes: >In article <17208@ames.arc.nasa.gov> lamaster@ames.arc.nasa.gov.UUCP (Hugh LaMaster) writes: >>Correct. Register windows seem to be a bad idea, but not because of > >I seem to have overstated my case. I should have said, that I consider >that it is unproven that register windows are a good idea relative to >other equivalent uses of the same real estate. The only "problem" >that I have with register windows is that, like "RISC" (whatever that is), >some people make extraordinary unsubstantiated claims. You might be interested in David Wall's paper "Register Windows vs. Register Allocation," in the 1988 PLDI Conf. He found that register windows were generally as good as large register files (64 or more entry) together with interprocedural register allocation. However, Wall argued that load/store in a large register file should be faster so that the overall performance would be better without windows. Given that the last assumption is questionable and that windows have significant advantages for simpler compilers and for languages that don't allow interprocedural allocation (e.g., Lisp), it is surprising that more machines don't use them. /Jim
aglew@urbsdc.Urbana.Gould.COM (10/31/88)
>In article <313@auspex.UUCP> guy@auspex.UUCP (Guy Harris) writes: >>While I was at Sun, I don't remember there ever having been any effort >>to reduce the depth of the kernel call stack in order to speed things up >>on SPARC-based Suns... > >Um, Guy, a lot of us Sun customers concluded quite a while ago that >performance (except on benchmarks) is not high on Sun's list of priorities. >"Just buy a faster machine, we'll be happy to sell you one." >-- >The dream *IS* alive... | Henry Spencer at U of Toronto Zoology >but not at NASA. |uunet!attcan!utzoo!henry henry@zoo.toronto.edu Yep, SUN keeps making liars out of the people who say that the CPU has ceased to be a bottleneck. :-)
guy@auspex.UUCP (Guy Harris) (11/01/88)
>This view is typical of hardware types. By all means, lets pass the >buck to the next guy. So the compiler writer has his(her) share of >nightmares actually getting something to make the thing compile some >code. Umm, I think compiler writers for CISC have their own headaches.... >And then the systems programmer comes along and inserts a few >more kludges to make the machine purr. Ditto.... >Now it is the application programmer's turn to s***w things up. >If my memory serves me correctly, it is much easier to get something >up and running on a Motorola 68000 than on an Intel 8086 (very nasty, >those beasty little segments). Well, perhaps a better comparison there would be between the 68K and the 80386; in that case, you can avoid dealing with the segments. Given that comparison, I don't see why it matters to the application program - or, in a lot of cases, to the *systems* programmer; I've written OS code that runs on the 68K, SPARC, IBM 370, and 80386 (and that would probably run on a boatload of other architectures), and I didn't have to do any extra work to make it work on them all - the C compiler did the work for me. >And miracle of miracles, we learn that over 70% of computing costs are >software. A more interesting figure would be "for a given system, how much of the *design* costs are hardware and how much are software." I suspect a lot of the "expensive software and cheap hardware" types are comparing the *production* costs of the hardware with the *development* costs of the software, which doesn't yield interesting results - why not compare the production costs of the hardware and of the software, which would prove that software is cheap and hardware is expensive.... (Then again, there's the question of whether microcode is software, hardware, or both....) >It seems like hardware types should be designing their end of the deal >to reduce it at the other end. 
No, both types should be designing their ends to reduce it at the bottom line, which is, after all, what really counts.
guy@auspex.UUCP (Guy Harris) (11/01/88)
>Yep, SUN keeps making liars out of the people who say that the CPU >has ceased to be a bottleneck. :-) s/the CPU/memory - most of the slowness I've seen on Suns has been due to paging.
bcase@cup.portal.com (Brian bcase Case) (11/01/88)
I wrote:
>> The almighty Compiler can save us from our sins!
>This view is typical of hardware types. By all means, lets pass the
>buck to the next guy.

If that makes the system better (cheaper, faster, more reliable), then we should pass the buck.

>And miracle of miracles, we learn that over 70% of computing costs
>are software.

The architecture has little if anything to do with this. The compiler is a comparatively meager one-time investment, and getting a good one is very important. A good optimizing compiler is probably easier to write for a simple machine (e.g. RISC) than for a complex machine.

>It seems like hardware types should be designing their end of the
>deal to reduce it at the other end.

This is the kind of thinking that got us machines like the VAX: "Well, I just *know* those ol' compiler guys are gonna love me 'cause I'm giving them these bitchin' addressing modes and memory-to-memory operations. They won't mind that I'm squeezing out registers to save code size and fit in the microcode. Sure it's hard work, but I don't mind; besides, my boss keeps telling me I'm supposed to be designing my end of the deal to reduce their work." Never mind that the hardware guy is thwarting their efforts to write a good compiler: the next version of the machine has instruction timings that completely change the trade-offs of code generation. Compiler guy: "What do you mean that addressing mode is now twice as slow!? I spent the better part of six months making the compiler use it to save code space! I think I'm gonna go work on ol' Lazy Larry's machine; at least I know each instruction's execution time, and he won't change it next year!"
mph@praxis.co.uk (Martin Hanley) (11/01/88)
With the Acorn RISC Machine (ARM), *EVERY* instruction can be conditional, not just the jumps. If the condition is not satisfied by the flags, the instruction is ignored. Also, a bit in the instruction flags whether or not to update the condition codes when executing the instruction.

This setup has obvious advantages when it comes to preserving pipelines, since the major bugbear of said pipelines is that every jump causes them to be broken. This is circumvented to some extent by the provision of delayed jumps (which the ARM also has), but not entirely.

Does any other machine have this feature? Anybody have comments on it?

mph.
-----------------------------------------------------------------------------
"I'm not a god, I was misquoted" - Lister, Red Dwarf
These are, of course, my opinions. Who else would want them?
My home: mph@praxis.co.uk
-----------------------------------------------------------------------------
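The pipeline win from predicated execution can be modeled in C (a sketch of the idea, not actual ARM code; the function name is mine): instead of branching around an assignment, the result is selected with straight-line operations, just as a compare followed by a conditional MOV avoids a jump.

```c
#include <stdint.h>

/* Branch-free select, modeling what conditional execution buys: the
 * condition (a < b) is turned into an all-ones or all-zeros mask and
 * used to blend the two candidates.  On the ARM this would simply be
 * CMP plus a conditional MOV -- no jump, so the pipeline never breaks. */
uint32_t select_if_less(uint32_t a, uint32_t b,
                        uint32_t if_true, uint32_t if_false)
{
    uint32_t mask = (uint32_t)-(int32_t)(a < b);  /* 0xFFFFFFFF or 0 */
    return (if_true & mask) | (if_false & ~mask);
}
```

The same transformation is what a compiler for a predicated machine performs on short if/else bodies.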
cprice@mips.COM (Charlie Price) (11/02/88)
In article <4759@pdn.UUCP> alan@pdn.UUCP (0000-Alan Lovejoy) writes: > >The 88k does a 32-bit integer multiply in 4 cycles (r3000 takes 13 >cycles, I believe). A 32-bit integer divide takes the 88k 39 cycles >(r3000 takes 36 cycles, I believe). Of course, if either of the >division operands is negative (signed division opcode), the 88k has to >trap to a software routine to finish the division. In other words, >not all RISCs are wimps just because they don't have "complex >instructions". >Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida. The MIPS R2000 and R3000 have integer multiply/divide instructions, but they are unlike the other main CPU instructions. The source operands are in general purpose registers and the result (64-bit product or 32-bit quotient and 32-bit remainder) is written to a special pair of registers named HI and LOW. There are instructions (MFHI MFLO) to move from HI and LOW to a general register. So why do it this (seemingly odd) way? From the architecture spec: Multiply and divide operations are performed by a separate, autonomous execution unit. After a multiply or divide operation is started, execution of other instructions may continue in parallel. The multiply/divide unit continues to operate during cache miss and other delaying cycles in which no instructions are executed. The number of cycles required for multiply/divide operations is implementation-dependent. The MFHI and MFLO instructions are interlocked so that any attempt to read them before operations have completed will cause execution of instructions to be delayed until the operations finishes. The table below gives the number of cycles required between a MULT, MULTU, DIV or DIVU operation and a subsequent MFHI or MFLO operation, in order that no interlock or stall occurs. 
		MULT	MULTU	DIV	DIVU
	R2000	 12	 12	 33	 33
	R3000	 12	 12	 33	 33

Clearly, in order to do something useful you need to pick up at least one 32-bit portion of the result, so in the best case you get a 13-cycle multiply and a 34-cycle divide. If a stall occurs, it may complicate restarting the pipeline and add an additional cycle. By the way, it is worth noting that the 88000 4-cycle multiply mentioned above only generates a 32-bit result...

The "why" of the MIPS architecture is that integer multiply/divide is a sort-of coprocessor. When a full multiply is necessary, it can be done faster than with software alone, and it may be possible to get other useful work done while waiting for the result. All of the work determining that this was a worthwhile feature to add to the architecture was done long before I came to MIPS, so I can't comment on the basis for this decision (perhaps mash will comment on that).

In practice, many things that seem to require multiply instructions get turned into some sequence of inline shifts and adds. Obviously the compiler makes some sort of decision about which is "better" to use.
--
Charlie Price cprice@mips.com (408) 720-1700
MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086
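The HI/LO split Charlie describes can be modeled in C (a toy model of the unsigned MULTU case; the struct and function names are mine, not MIPS mnemonics): the multiply unit forms the full 64-bit product off to the side, and the program then fetches each 32-bit half separately, as MFHI and MFLO do.

```c
#include <stdint.h>

/* Model of the MIPS HI/LO result registers for an unsigned multiply:
 * the autonomous multiply unit produces a 64-bit product; MFHI and
 * MFLO each move one 32-bit half into a general register. */
struct hilo { uint32_t hi; uint32_t lo; };

struct hilo multu_model(uint32_t rs, uint32_t rt)
{
    uint64_t product = (uint64_t)rs * (uint64_t)rt;
    struct hilo r;
    r.hi = (uint32_t)(product >> 32);  /* what MFHI would read */
    r.lo = (uint32_t)product;          /* what MFLO would read */
    return r;
}
```

One consequence, noted in the follow-up below, is that no general-register pair is ever consumed by the 64-bit result.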
jkrueger@daitc.daitc.mil (Jonathan Krueger) (11/02/88)
In article <359@auspex.UUCP>, guy@auspex (Guy Harris) writes: >I suspect a lot of the "expensive software and cheap hardware" types >are comparing the *production* costs of the hardware with the >*development* costs of the software This is a good point. However, a complete analysis adds up life cycle costs and divides by number of runs or copies sold, arriving at each unit's amortized costs. This makes either custom hardware or software look very expensive indeed. Yet it may cost no more than general purpose hardware or software, where the R&D is amortized over many sales or runs. More generally, I suspect that many "expensive software and cheap hardware" types are comparing the per-unit costs of the hardware with the unamortized costs of the software. But since most of us customize software rather than hardware, this may be a valid comparison. If a vendor can sell you a workstation for $10K, including its payback on his R&D, but it costs you $20K to develop the software that will run on it, it seems valid to say that the software cost more than the hardware. If you run the same software on ten such workstations, of course the opposite is true. -- Jon --
guy@auspex.UUCP (Guy Harris) (11/02/88)
>But since most of us customize software rather than hardware, this may >be a valid comparison. Although I doubt it is in this particular case. Compiler and OS development costs tend to be amortized over a base close in size to the base over which hardware development costs are amortized, and that's what the original poster was talking about. (Arguably, the base is even larger, since many hardware development costs are amortized over an implementation of an architecture, rather than over the entire architecture.) I've not seen any good evidence that any added software development/production/maintenance cost, over the lifecycle of an architecture, of a simpler architecture outweighs any added hardware development/production/maintenance cost, over the life cycle of an architecture, of a more complex architecture. I'm also not convinced that there's a major added application development cost for developing in sufficiently high-level languages on RISC machines (I've heard some flames that there is, but many of those problems appear to be due to sloppy coding habits combined with luck, on the part of the developers, in the choice of CISC machines they ported to). I know that I had no great extra trouble getting C code, whether kernel-mode or user-mode, working on SPARC as well as 68K; it tended to run on the 80386 and 370 as well. In cases where code failed on SPARC, it often failed for reasons that would cause it to fail on some CISC-based systems as well....
khb%chiba@Sun.COM (Keith Bierman - Sun Tactical Engineering) (11/03/88)
In article <3264@newton.praxis.co.uk> mph@praxis.co.uk (Martin Hanley) writes:
>
>With the Acorn RISC Machine (ARM), *EVERY* instruction can be
>conditional, not just the jumps. If the condition flags are not set,
>then the instruction is ignored. Also, a bit in the instruction flags
>whether or not to update the condition codes when executing the
>instruction.
>
>This setup has obvious advantages when it comes to preserving
>pipelines, since the major bugbear of said pipelines is that every
>jump causes them to be broken. This is circumvented to some extent by
>the provision of delayed jumps (which the ARM also has), but not
>entirely.
>
>Does any other machine have this feature? Anybody have comments on it?

The late Cydra 5 had 'em. The condition bits could be (and often were) set based on the data (thus the name "directed dataflow"). Since the Cydra 5 had multiple instructions (6/7 depending on who counted) per clock, and very long pipes (26 for memory fetch), these conditional execution features were very valuable.

Keith H. Bierman It's Not My Fault ---- I Voted for Bill & Opus
csimmons@hqpyr1.oracle.UUCP (Charles Simmons) (11/03/88)
In article <754@wsccs.UUCP> dharvey@wsccs.UUCP (David Harvey) writes:
>In article <10194@cup.portal.com>, bcase@cup.portal.com (Brian bcase Case) writes:
>> The almighty Compiler can save us from our sins!
>
>If my memory serves me correctly, it is much easier to get something
>up and running on a Motorola 68000 than on an Intel 8086 (very nasty,
>those beasty little segments). And miracle of miracles, we learn that
>over 70% of computing costs are software. It seems like hardware types
>should be designing their end of the deal to reduce it at the other end.
>
>dharvey@wsccs

Hmmm... Bad example. The 8086 is an extremely unorthogonal architecture. The 68000 isn't very orthogonal (there are two different kinds of registers). RISC chips tend to be extremely orthogonal. Thus, this example would suggest that RISC designers are reducing the software complexity, and that it would be easier to get something up and running on a RISC than on a 68000 (much less an Intel chip).

-- Chuck
grow@druhi.ATT.COM (Gary Oblock) (11/04/88)
In article <7472@winchester.mips.COM>, cprice@mips.COM (Charlie Price) writes: > In article <4759@pdn.UUCP> alan@pdn.UUCP (0000-Alan Lovejoy) writes: > > > >The 88k does a 32-bit integer multiply in 4 cycles (r3000 takes 13 > >cycles, I believe). A 32-bit integer divide takes the 88k 39 cycles > >(r3000 takes 36 cycles, I believe). Of course, if either of the > >division operands is negative (signed division opcode), the 88k has to > >trap to a software routine to finish the division. In other words, > >not all RISCs are wimps just because they don't have "complex > >instructions". > >Alan Lovejoy; alan@pdn; 813-530-8241; Paradyne Corporation: Largo, Florida. > > The MIPS R2000 and R3000 have integer multiply/divide instructions, > but they are unlike the other main CPU instructions. > The source operands are in general purpose registers and > the result (64-bit product or 32-bit quotient and 32-bit remainder) > is written to a special pair of registers named HI and LOW. > There are instructions (MFHI MFLO) to move from HI and LOW to a > general register. > So why do it this (seemingly odd) way? > : Deleted (a bunch of hardware reasons) > > In practice, many things that seem to require multiply instructions > get turned into some sequence of inline shifts and adds. > Obviously the compiler makes some sort of decision about which > is "better" to use. > -- > Charlie Price cprice@mips.com (408) 720-1700 > MIPS Computer Systems / 928 Arques Ave. / Sunnyvale, CA 94086 Another very good reason to do things this way is that your register allocation scheme does not have to deal with allocating register pairs. When register allocation takes place at the intermediate code level (e.g. the Stanford U-Code compiling systems) the problem is especially evident. 
As Fred Chow said about this in his section on the limitations of allocating at the intermediate level [page 71; A Portable Machine-Independent Global Optimizer -- Design and Measurement; Chow, Frederick Chi-Tak; PhD 1984; Stanford University]:

	The requirements and effects of individual machine instructions
	cannot be taken into account. Such uses of registers arising out
	of instruction selection by code generators are not necessarily
	related to the register allocation decisions. When registers are
	globally allocated by the optimizer, intermixing of registers
	used by the optimizer and registers used by the code generator is
	not possible.
	....stuff about redundant copies being introduced...

In plain old-fashioned register allocators on machines that require the use of register pairs for multiplication and division, you have to treat these registers as special cases. This is done by reserving a pair of scratch registers for these operations and/or using clumsy heuristics that attempt to keep register pair(s) free when needed for these operations. All things considered, allocating registers on a machine with a special pair of result registers for multiplication and division (i.e. the MIPS) should be much easier and much more effective.

Gary Oblock -- Compiler consultant to Bell Laboratories -- Denver, CO (303)538-4169 -- att!druhi!grow
Disclaimer -- I'm pretty sure they'll agree with me about this at MIPS. But,... they're not my employers!
mash@mips.COM (John Mashey) (11/08/88)
In article <473@oracle.UUCP> csimmons@oracle.UUCP (Charles Simmons) writes:
>In article <754@wsccs.UUCP> dharvey@wsccs.UUCP (David Harvey) writes:
>>...... And miracle of miracles, we learn that
>>over 70% of computing costs are software. It seems like hardware types
>>should be designing their end of the deal to reduce it at the other end.
>Hmmm... Bad example. The 8086 is an extremely unorthogonal
>architecture. The 68000 isn't very orthogonal (there are two
>different kinds of registers). RISC chips tend to be extremely
>orthogonal. Thus, this example would suggest that RISC designers
>are reducing the software complexity, and that it would be easier
>to get something up and running on a RISC than on a 68000 (much
>less an Intel chip).

Yes, for sure: it's absolutely weird that people have decided that you must work much harder in software for RISCs: IT'S NOT THAT YOU NEED MORE COMPLICATED SOFTWARE FOR A RISC THAN A CISC, IT'S THAT WELL-DESIGNED RISCS OFFER MORE AND EASIER OPPORTUNITIES FOR OPTIMIZATION THAN SOME CISCs:

a) You spend less time on weird-case selection and analysis.
   (should I do an add, or should I use weird-address-mode XX?)
   (registers x, y, z, and Q are needed for funny-instruction ZZ)

b) You usually have more registers, and more orthogonal ones, hence global register allocation has more to work with, and you can think about things like interprocedural allocation (because you have a useful number of registers), whereas if you've only got half a dozen, you don't even think about it.

c) It is usually easier to do pipeline reorganization, given the above, plus the lack of things like condition codes. You don't HAVE to do much of this, but you can. At MIPS, the first reorganizer was written in a couple of weeks.
We had C/Pascal compilers generating reasonable code BEFORE the architecture was completely frozen (work really got started December 84/January 85, and we could run generated+linked executables through our MIPS->VAX object code converter around mid-year. A compiled UNIX was running on a simulator 11/85, and the compilers bootstrapped thru themselves successfully about the same time, including lots of optimization. Admittedly, MIPS didn't start from scratch, but used the work from Stanford. Nevertheless, reasonable code was generated pretty early.) As a datapoint, consider that an 8MHz R2000 (in a "5-mips" M/500), with global optimization omitted, and NoRegs, yields 8,800 1.1 Dhrystones, which is nowhere near the 13,000 it gets with -O3....but still faster than most CISC micro implementations. So, good optimization is worth having, but what you get without it isn't bad. (You also get 5900 DP Kwhets on a 12.5MHz R2000, which likewise is not awful.) In addition, let me observe, that certainly on machines of the MIPS, HP PA, 88K, SPARC ilk: a) It is probably simpler to write assembler code (at least compared with S/360s, PDP-11s, Vaxen, 3Bs, 68Ks , in my experience, and other people, not at MIPS, who've had experience with both MIPS and other architectures) b) The simpler machines often help things like: debuggers the object-code-to-object-code translators like we or Ardent use. (The profiling and architectural analysis of these are incredibly productive; they may exist on CISCs, but if so, not many.) object formats (like dense line number tables, not so easy to do with variable-length instructions) Anyway, the bottom line is: In many RISCs, some hardware functionality has moved to software, but the REQUIRED software is mostly straightforward and modular (like doing * or / in software), although it can get pretty tricky (like the more complex versions of doing these), but it's still modular. 
Clean RISCs give optimizing compilers more leverage, but it is pretty easy to generate at least reasonable code for them. -- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
martin@minster.york.ac.uk (11/09/88)
In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes:
>In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>...
>where n=[1,5]. To store a value from X<n>, you load the address into A<n>,
         ^^^^^
actually it was n=[3,4,5] A1 and A2 were also general purpose.
>where n=[6,7]. A0 has no special values, and B0 is a hardwired 0.
>
>Ok, a few from the Wonderful World of the Cyber:
>
> Count the set bits in a word (register or memory, it doesn't
>matter). Very useful for some trivial applications (such as playing
>Othello), but I haven't seen much else done with it. It was put in, it
>looks like, because the hardware was already there (in the form of parity
>checking), but I could be wrong.
>
>Sean Eric Fagan | "Engineering without management is *ART*"
>seanf@sco.UUCP | Jeff Johnson (jeffj@sco)
>(408) 458-1422 | Any opinions expressed are my own, not my employers'.

The history of the population instruction is interesting: I was told (by someone from CDC, whom I have every reason to believe knew what he was talking about) that when Seymour Cray first designed the 6600 it did not have the population instruction; when they showed the machine to the people at Los Alamos they said ``Great! We'll have 6, but only if you add this instruction we need!''. Every machine that Seymour Cray has designed since has included the population instruction. (That was up to the time I was told the story - can anyone verify this for the Cray-2, etc.?)

It is interesting to note that the instruction set is quite nice and regular (at the bit level), but the population instruction does not fit into the pattern, also suggesting that it was an afterthought. Note that the Cyber 170 is compatible with the 6600. It was an interesting machine to program in assembler, as can probably be guessed from the above description.
However PP (Peripheral Processor) programming was much more fun, since you have unlimited access to the main memory of the Central Processors - the base/limit memory protection only affected CP programs. (I'm speaking in the past tense, not because there are no more of these machines, but because, fortunately, I don't have to program one any more! I've still got my Compass manuals though!!) Martin C Atkins ...!ukc!minster!martin PS I leave it as an exercise for the reader (is this wise?) to guess what Los Alamos wanted the population instruction for! But they weren't interested in Hamming distances, rather in making bits of Plutonium go bang!!
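On a machine without the 6600's population instruction, counting set bits falls to software. A common bit-parallel routine (a sketch of the standard technique, not from any CDC manual; the name is mine) looks like this:

```c
#include <stdint.h>

/* Software population count for a 32-bit word: sum adjacent bit pairs,
 * then nibbles, then bytes, then add the byte counts together with a
 * multiply.  A naive loop would instead take one iteration per bit --
 * which is exactly why a hardware instruction is attractive. */
unsigned popcount32(uint32_t x)
{
    x = x - ((x >> 1) & 0x55555555u);                 /* 2-bit pair counts  */
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u); /* 4-bit nibble counts */
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;                 /* 8-bit byte counts   */
    return (x * 0x01010101u) >> 24;                   /* sum of all bytes    */
}
```

Hamming distance between two words is then just the popcount of their XOR.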
seanf@sco.COM (Sean Fagan) (11/20/88)
In article <595030314.2944@minster.york.ac.uk> martin@minster.york.ac.uk writes: >In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes: >>where n=[1,5]. To store a value from X<n>, you load the address into A<n>, > ^^^^^ >actually it was n=[3,4,5] A1 and A2 were also general purpose. Nope, sorry. 'SA1 X1' would load the first argument into X1 (assuming you were using FORTRAN calling conventions). Let's hope, however, that you only had one argument 8-). It would, btw, also load the address of the first argument into A1, so that, when you were done, you do: BX6 X1 SA6 A1 and the argument was stored. Nice. (also explains why a 5 can become a 3 and totally screw people up 8-).) >>where n=[6,7]. A0 has no special values, and B0 is a hardwired 0. >> >Note that the Cyber 170 is compatible with the 6600. It was an interesting >machine to program in assembler, as can probably be guessed from the above >description. However PP (Peripheral Processor) programming was much more >fun, since you have unlimited access to the main memory of the Central >Processors - the base/limit memory protection only affected CP programs. >(I'm speaking in the past tense, not because there are no more of these >machines, but because, fortunately, I don't have to program one any more! >I've still got my Compass manuals though!!) True, PP's do (note the present tense 8-)) have unlimited access to the main memory. However, you had to have System Access (or some such) bit set in your Protection word (NOS has, I believe, several 60-bit words for permissions. Many permissions are in the form of one-bit values). Then, you rebuild the libraries (aka the system), and your PP program could be loaded. However: PP's are 12-bit machines, with an Accumulator only. To address any arbitrary word, you a) had to build the address using the P register and the K register (P was 12 bits, K was 6; I think the names are right), and b) had to make five passes, since you only had 12-bit words in the PP. 
Lastly, why wouldn't you want to program these machines? They are *wonderful*. Anybody who has to learn assembly language should learn these machines (they're about as regular as a PDP, but are much faster, and prepare people for RISC)! -- Sean Eric Fagan | "Engineering without management is *ART*" seanf@sco.UUCP | Jeff Johnson (jeffj@sco) (408) 458-1422 | Any opinions expressed are my own, not my employers'.
rik@june.cs.washington.edu (Rik Littlefield) (11/21/88)
In article <1762@scolex>, seanf@sco.COM (Sean Fagan) writes: > In article <595030314.2944@minster.york.ac.uk> martin@minster.york.ac.uk writes: > >In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes: > >>where n=[1,5]. To store a value from X<n>, you load the address into A<n>, > > ^^^^^ > >actually it was n=[3,4,5] A1 and A2 were also general purpose. > > Nope, sorry. 'SA1 X1' would load the first argument into X1 (assuming you > were using FORTRAN calling conventions). Let's hope, however, that you only > had one argument 8-). It would, btw, also load the address of the first > argument into A1, so that, when you were done, you do: > BX6 X1 > SA6 A1 > and the argument was stored. Nice. (also explains why a 5 can become a 3 > and totally screw people up 8-).) Not quite. Sean is quite correct about A1-A5 being used for fetches. But the sequence SA1 X1 / BX6 X1 / SA6 A1 is a no-op as far as memory is concerned -- it simply fetches a value from the address originally stored in X1, then stores that value back *into the same address*. Perhaps Sean meant to imply that some other things happened to X1 in between the SA1 and the BX6. That would be inefficient, since the result could have been left in X6 directly, but I've seen it done. BTW, that's the "FTN" calling sequence. The earlier "RUN" Fortran compiler was completely different and used mostly the B-registers. > Lastly, why wouldn't you want to program these machines? They are > *wonderful*. Anybody who has to learn assembly language should learn these > machines (they're about as regular as a PDP, but are much faster, and > prepare people for RISC)! I agree completely -- no complicated addressing modes, a simple regular instruction set, and (for their day) they ran like scalded dogs. Having just 18-bit addresses did get in the way, though. Especially since only 17 of them were actually usable, and most machines weren't even that big! 
(Also, as an aside, it's a bit amusing to see the number of posters who revel in the simplicity of the instruction set ... and then get their examples wrong ;-) --Rik
martin@minster.york.ac.uk (11/21/88)
I wrote: > In article <1622@scolex> seanf@sco.COM (Sean Fagan) writes: > >In article <998@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes: > >>... > >where n=[1,5]. To store a value from X<n>, you load the address into A<n>, > ^^^^^ > actually it was n=[3,4,5] A1 and A2 were also general purpose. I'm sorry - I wrote this before checking the manual (always a bad move!). The original poster was correct: A1 and A2 also load the X registers. Many apologies to all concerned, and thanks to those who were kind enough to point out my mistake by mail. Martin
eriks@cadnetix.COM (Eriks Ziemelis) (11/22/88)
In article <6475@june.cs.washington.edu> rik@june.cs.washington.edu (Rik Littlefield) writes: >In article <1762@scolex>, seanf@sco.COM (Sean Fagan) writes: > >> Lastly, why wouldn't you want to program these machines? They are >> *wonderful*. Anybody who has to learn assembly language should learn these >> machines (they're about as regular as a PDP, but are much faster, and >> prepare people for RISC)! > >I agree completely -- no complicated addressing modes, a simple regular >instruction set, and (for their day) they ran like scalded dogs. Having >just 18-bit addresses did get in the way, though. Especially since only 17 >of them were actually usable, and most machines weren't even that big! > >--Rik Hear, hear! Even though my experience with the Cyber family was in college (two 6500s and one 6600; one of the systems had ECS) I loved it. Still have copies of the manuals, and after having worked with Vaxen, 68K, PDP, et al., I still want to program a Cyber. Almost took a job out of college doing just that for an SDI (Star Wars) research company. Oh well, we all make mistakes. To anyone at Purdue: I heard a rumor that the assembly language programming course (CS 300) is no longer taught in COMPASS. Is this true? Eriks A. Ziemelis Internet: eriks@cadnetix.com UUCP: ...!{uunet,boulder}!cadnetix!eriks U.S. Snail: Cadnetix Corp. 5775 Flatiron Pkwy Boulder, CO 80301 Baby Bell: (303) 444-8075 X221
smryan@garth.UUCP (Steven Ryan) (11/24/88)
today's trivia question: what does compass abbreviate? -- -- s m ryan -------------------------------------------------------------------------------- As loners, Ramdoves are ineffective in making intelligent decisions, but in groups or wings or squadrons or whatever term is used, they respond with an esprit de corps, precision, and, above all, a ruthlessness...not hatefulness, that implies a wide ranging emotional pattern, just a blind, unemotional devotion to doing the job.....
dik@cwi.nl (Dik T. Winter) (11/24/88)
In article <1973@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes: > today's trivia question: > > what does compass abbreviate? > -- COMPrehensive ASSembler? -- dik t. winter, cwi, amsterdam, nederland INTERNET : dik@cwi.nl BITNET/EARN: dik@mcvax
seanf@sco.COM (Sean Fagan) (11/30/88)
In article <1973@garth.UUCP> smryan@garth.UUCP (Steven Ryan) writes: >today's trivia question: >what does compass abbreviate? COMPrehensive ASSembler. Truly trivial. Next trivia question: What happens if you unplug the wire at row 13, column 43, in a Cyber 170/760? (btw: 8-)) -- Sean Eric Fagan | "Engineering without management is *ART*" seanf@sco.UUCP | Jeff Johnson (jeffj@sco) (408) 458-1422 | Any opinions expressed are my own, not my employers'.
smryan@garth.UUCP (Steven Ryan) (11/30/88)
> > today's trivia question: > > > > what does compass abbreviate? > > -- >COMPrehensive ASSembler? COMPrehensive ASsembly System, which is why the product ident is CPn instead of CAn. (Product ident is CDC's 3 letter name for a product.) -- -- s m ryan +------------------------------------------------------------------------------+ |Good day-eh. Je me souviens.| +------------------------------------------------------------------------------+
bcase@cup.portal.com (Brian bcase Case) (12/01/88)
>What happens if you unplug the wire at row 13, column 43, in a Cyber >170/760? Uh, it runs twice as fast? Er, it emulates a 370? Hmm, several city blocks may be affected? Let's see, all student jobs fail? I know! The console displays "Greetings Seymour, it has been a long time."
phil@aimt.UUCP (Phil Gustafson) (12/03/88)
In article <1795@scolex>, seanf@sco.COM (Sean Fagan) writes: > What happens if you unplug the wire at row 13, column 43, in a Cyber > 170/760? Well, you remove a disable line from a gate. You enable a feature whose name I forget (Environment Swap or something), gain tens of thousands of 1965 dollars worth of feature, and guarantee that your service contract is null and void. phil -- Opinions outside attributed quotations are mine alone. Satirical material may not be labeled as such. -- -- Phil Gustafson, Graphics/UN*X Consultant {uunet,ames!coherent}!aimt!phil phil@aimt.uu.net 1550 Martin Ave, San Jose, Ca 95126