alfred@dutesta.UUCP (Herman Adriani & Alfred Kayser) (10/27/88)
I only see discussions about established firms and their hardware. Am I
the only one who has ever heard of the Acorn RISC chip and their computer,
the 'Archimedes', or do all of you ignore this machine just because it is a
small (British) company which manufactures strange computers (the BBC)?
I do not own an Archimedes, but I like the machine: it is cheap and fast,
with hi-res graphics and lots of memory.
What I'd like to know: what does this newsgroup think about this machine (chip)?

			Herman Adriani.
P.S.
I don't want to start a new war about RISC vs. CISC; I like both architectures,
and they should both exist! They each have their own territory in which
they outperform the other.

Don't flame me for starting something, I just wanna know (everything).
--
_____________________________________________________________________________
/ \
| Herman Adriani & Alfred Kayser: Computer fans especially from 24 pm to 7 am |
\_____________________________________________________________________________/
bcase@cup.portal.com (Brian bcase Case) (10/29/88)
>Am I the only one who ever heard about the Acorn Risc chip and their computer
>'Archimedes', or do all of you ignore this machine just because it is a
>small (British) company which manufactures strange computers (BBC)?
>I do not own an Archimedes, but I like the machine -> it is cheap, fast,
>hi-res, lots of memory.
>What I like to know: How does this newsgroup think about this machine (chip)?

I can only give my opinion; my views do not necessarily represent those of
this group.  No warranty, either expressed or implied, ....

The ARM is a very interesting, and in some ways clever, architecture.  The
ability to do an ALU and SHIFT op in one cycle is surprisingly useful in
some kinds of code, especially the embedded-control, bit-twiddling,
graphics-handling kinds of applications.  The conditional execution
facility (each instruction has a 4-bit condition field that must be
satisfied for the instruction to execute) is also quite nice, sorta like a
skip.  It does have a three-address architecture.

Its biggest problems are implementation and support.  The address bus is
only 26 bits, fault handling and memory management are totally weird, very
fast implementations do not exist, and no one is providing comprehensive
support (maybe VLSI Tech is doing a better job these days, but where are
the compilers?  UNIX implementations?  etc.), etc.  If I remember
correctly, it also has only 25, or some weird number of, registers, and
only some are available to user code.  One of them is the PC, which is not
good for really high-performance implementations.  The ALU and SHIFT
instructions take 2 cycles.  It has only an address bus and a combined
data/instruction bus.  For best performance, you need more bandwidth.  At
high clock rates, the bus protocols will not work well.

I have seen small graphics kernels on which the ARM does better than any
vanilla RISC in terms of cycle count.  But vanilla RISCs win in total time
(shorter cycle).
Most of the implementation problems could be solved, but who is solving them?
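[A toy model of why the conditional-execution facility described above can
beat branches on short if-then sequences.  This is a sketch with assumed,
illustrative cycle costs (branch_cost, alu_cost), not measured ARM timings;
the function names are hypothetical.]

```python
# Toy cycle-count model of ARM-style conditional execution vs. branching
# around code, for an absolute-value idiom.  All costs are assumptions.

def branchy_abs_cycles(taken_fraction, branch_cost=3, alu_cost=1):
    """Average cycles for: CMP; Bxx skip; RSB (negate); skip: ...
    A taken branch is assumed to cost branch_cost cycles (pipeline refill);
    a not-taken branch costs one cycle plus the negate."""
    return alu_cost + (taken_fraction * branch_cost
                       + (1 - taken_fraction) * (alu_cost + alu_cost))

def predicated_abs_cycles(alu_cost=1):
    """Average cycles for: CMP; RSBLT (conditionally negate).
    A condition-failed instruction still occupies one cycle."""
    return alu_cost + alu_cost

if __name__ == "__main__":
    for frac in (0.2, 0.5, 0.8):
        b = branchy_abs_cycles(frac)
        p = predicated_abs_cycles()
        print(f"taken {frac:.0%}: branchy {b:.2f} vs predicated {p:.2f} cycles")
```

Under these assumptions the predicated form is never worse, which matches
Brian's observation that conditional execution pays off in bit-twiddling,
branchy code.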
aglew@urbsdc.Urbana.Gould.COM (10/30/88)
>/* ---------- "RISC vs CISC" ---------- */
>I only see discussions about established firms and their hardware. Am I
>the only one who ever heard about the Acorn Risc chip and their computer
>'Archimedes', or do all of you ignore this machine just because it is a
>small (British) company which manufactures strange computers (BBC)?
>I do not own an Archimedes, but I like the machine -> it is cheap, fast,
>hi-res, lots of memory.
>What I like to know: How does this newsgroup think about this machine (chip)?
>
> 			Herman Adriani.
>P.S.
>I don't want to start a new war about RISC or CISC, I like both architectures
>and they should both exist! They just have their own territory in which
>they outperform the other one.
>
>Don't flame me about starting something, I just wanna know (everything).

Most of us simply don't know enough about the ARM to say much useful.  I've
looked at its instruction set, and it appears to be fairly clean, although
with the typical British penchant for gimmicks (don't flame me, I'm
(almost) (dual citizen) a Brit too).

I am interested in the ARM because of its commodity market nature - I think
it's VLSI Technology that's selling it as an embedded controller(?) - which
means that it has significantly different tradeoffs than most of the
workstation chip wars we hear about most often.  But apart from quoting the
marketing literature (7 MIPS, listing the instruction set) I don't have any
meaningful data.

If anyone connected with the ARM wishes to start a meaningful discussion, I
would probably join in, with questions like: did you intend to sell it into
the low-end market, or did you design the chip and have it just happen that
way?  Did you make the right tradeoffs?  Etc.  Is there anyone out there
who can lead this?
In general, I start discussions on things that I am interested in.  I
provide information on things that I know about, when that information is
already in the public domain.  And I'll comment about things other folks
say.  So, Herman, what can you tell us about the ARM and Archimedes?

Andy "Krazy" Glew.
at: Motorola Microcomputer Division, Champaign-Urbana Development Center
    (formerly Gould CSD Urbana Software Development Center).
mail: 1101 E. University, Urbana, Illinois 61801, USA.
email: (Gould addresses will persist for a while)
    aglew@gould.com - preferred, if you have MX records
    aglew@fang.gould.com - if you don't
    ...!uunet!uiucuxc!mcdurb!aglew - paths may still be the only way
My opinions are my own, and are not the opinions of my employer, or any
other organisation.  I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
PS. I promise to shorten this .signature soon.
kers@otter.hple.hp.com (Christopher Dollin) (10/31/88)
Some remarks about the ARM, following Brian Case's response to the
basenote.  I speak solely as an owner of an Acorn Archimedes with some
experience in compiler writing; my employers won't buy me one for my desk
[are you surprised?]

| Its biggest problems are implementation and support. The address bus is
| only 26 bits, fault handling and memory management are totally

Memory management is not part of the chip, but has to be provided by an
external memory manager.  The current version of this (MEMC) does seem to
be a little odd, but then, I'm not really familiar with MMU devices.

| weird, very fast implementations do not exist, no one is providing
| comprehensive support (maybe VLSI tech is doing a better job these days,
| but where are the compilers? UNIX implementations? etc.), etc. If I

Acorn is *supposed* to be working on a Unix implementation.  Last I heard
it was due out toward the end of this year.  Silence is golden ....

| remember correctly, it also has only 25, or some weird number of, registers,
| and only some are available to user code.

25 on-chip registers.  Each operating mode sees 16; in the three non-user
modes, some of the user registers are shadowed out: in SVC and MI (maskable
interrupt) mode, R15 (PC) and R14 (return link) are shadowed, and in NMI
(non-MI) mode, R11-R15 are shadowed.  This is to allow fast context
switching, especially in NMI code (where the NMI owner can set up the NMI
registers and return to user code; NMIs can then operate with *no* register
save-restore).  Acorn operating systems are *heavily* interrupt-driven.

| One of them is the PC, which is not good for really high-performance
| implementations. The ALU and SHIFT instructions take 2 cycles.

No, one cycle.  An ALU instruction with one operand shifted *by an amount
held in a register* takes an additional cycle.
Incidentally, one should be careful to distinguish *sequential* cycles from
*non-sequential* cycles, as the instruction fetch is in burst mode (I think
that's the right term).

| It has only an address bus and a combined data/instruction bus. For best
| performance, you need more bandwidth. At high clock rates, the bus protocols
| will not work. I have seen small graphics kernels on which the ARM does better
| than any vanilla RISC in terms of cycle count. But vanilla RISCs win in total
| time (shorter cycle). Most of the implementation problems could be solved,
| but who is solving them?

Well, presumably Acorn.  If not, I can imagine I'll be very upset in a few
years' time ....

Regards,
Kers.    "If anything anyone lacks, they'll find it all ready in stacks."
hjb@otter.hple.hp.com (Harry Barman) (10/31/88)
A friend of mine saw 4.3 running on an Archimedes w/ X windows.  If Acorn's
famed marketing/shipping depts. can get their act together it may be
possible to buy it!

The ARM was started as a project in mid-83.  Initially, the main reason for
the project was (don't laugh) to provide a successor to the 6502 processor
that was used in Acorn's previous machines.  I believe this background
influenced the design of the MMU, which meant huge page sizes, and so it
wasn't very suitable for Unix implementations.  Acorn is in the business of
building small-PC-board personal (cheap) computers, and in that context I
think the ARM fits in reasonably well.

Harry
cik@l.cc.purdue.edu (Herman Rubin) (06/25/89)
In article <1989Jun24.230056.27774@utzoo.uucp>, henry@utzoo.uucp (Henry Spencer) writes:
> In article <57125@linus.UUCP> bs@gauss.UUCP (Robert D. Silverman) writes:
> >This, in my opinion is one of the major faults of RISC processors. They
> >do not provide basic arithmetic instructions.
>
> When the list of "basic" arithmetic instructions is pages long, one starts
> to wonder how many of them are really "basic". The instruction you ask
> for -- divide double length by single length yielding single-length result --
> is not exactly frequently needed. Just how much silicon is it worth to make
> it run faster than an implementation as a subroutine?

The list of arithmetic instructions is pages long only if the documentation
is stupid enough not to put lots of instructions on one page.  What is used
heavily in his work, as reported in sci.math, is division of a double by a
single, getting the quotient and remainder simultaneously.  This takes
about 6 or 7 instructions inline if some provisions are made.  If they are
not, it may very well take 20 instructions.

One might reduce the number of division instructions by only having
double/single yielding quotient and remainder for integer division, signed
and unsigned, and floating-point division similarly, signed only.  This
would give three division instructions with different argument sizes, and
the only "waste" would be in the moving of unwanted results to an unused
register.  Even this could be avoided if there were read-only registers,
which many machines have.  Instead, many machines have separate quotient
and remainder operations.

Now Bob Silverman is one of the mathematicians who knows how to use the
machine instructions efficiently, and absent the instruction, will modify
the algorithm considerably to get around it.  This is one of the things
which does not show up on benchmarks.  The presence or absence of a few
"minor" instructions can make an algorithm efficient or horribly
inefficient.
Also, the cost of the entire ALU is typically dwarfed by the costs of
memory, IO, etc.  Silicon is cheap.

Mathematicians have been rightly accused of not making sufficient use of
computers.  It is not the case that fixed-length floating-point operations
are what is needed.  Multiple-precision arithmetic requires good integer
arithmetic.  There are operation counts and there are operation counts.  Is
there essentially one integer sum, or are there several?  I have advocated
an integer quotient and remainder with the choice of truncation depending
on the signs of the arguments.  Is this one instruction or 2^n, where n is
between 8 and 16?  I consider it one; the tag field can be decoded by the
arithmetic unit, not by the control unit.  We can get much more at little
extra cost, and flexibility is cheap.  The same holds for languages.  But
once the instruction is omitted, it can be expensive to do anything about
it.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
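[A small sketch of the operation Rubin wants from one instruction: divide a
double-length value by a single-length one and get quotient and remainder
together.  Python's built-in divmod happens to produce both at once; the
function name and the 32-bit word size are assumptions for illustration.]

```python
# Model "single length" as 32 bits and "double length" as 64 bits.
BITS = 32
BASE = 1 << BITS

def div_double_by_single(hi, lo, d):
    """Divide the double-length value hi*BASE + lo by single-length d,
    returning (quotient, remainder) together -- the one-instruction form
    Rubin advocates.  Separate quotient and remainder operations would
    compute n // d and n % d independently, the wasteful case the post
    complains about."""
    n = hi * BASE + lo
    return divmod(n, d)

q, r = div_double_by_single(0x1, 0x00000005, 7)   # n = 2**32 + 5
assert q * 7 + r == (1 << 32) + 5 and 0 <= r < 7
```

On hardware without such an instruction, this one line becomes the 6-to-20
instruction inline sequence the post describes.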
mash@mips.COM (John Mashey) (11/10/89)
In article <28942@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
...
>continuing speed advantages).  It seems to me that the
>real issue is what would the extra die space be used for.
>With a deeper pipeline, one could use additional gates
>without slowing the clock down.  With a CISCy enough
>CISC, one might be able to keep the pipeline full.  So,
>if we were to double the die size tomorrow, what would
>go on the chip?  Just to throw sand in our eyes, why not
>put 2 RISCS on the chip?  Big research area - should the

I don't think we're anywhere near this yet, and this can be seen by
analyzing the layout and nature of million-transistor chips [like i860s].
If you look at the i860 die, you find that:
	a) Most of the transistors are in the caches.
	b) Most of the space is the FPU, registers, integer datapath, etc.
	   Some of this stuff is wires, and it doesn't shrink as well as
	   transistors do.
	c) At the top speed claimed for it, eventually [50MHz], 12KB of
	   cache is NOWHERE near big enough for efficiency, by itself.
	1) As the CPU gets faster, the cache miss cost goes up, and the
	   cache miss ratio must go down enough to maintain a constant
	   amount of memory-system degradation.
	2) Although 8-16K of cache on a million-transistor chip is
	   certainly useful, serious cache simulations say that it just
	   isn't enough for a well-balanced machine at the higher clock
	   rates [50MHz or so] that one would naturally use with the kind
	   of technology that gives you a million transistors.
	3) Thus, you still end up with secondary cache being needed in
	   many configurations and application environments.

So that says that when you get up to 4M transistors, maybe you get close to
having big enough caches on the chip to balance the CPU+FPU that are
there.... except that now you'll want to boost the clock rate some more,
which means the caches are not as improved as you'd think [although getting
close].
Well, maybe if somebody wants to build 100MHz parts, with about 8M
transistors, 128K caches, that's a sort-of balanced thing.  Maybe at the
16M-transistor point, if you still can't think of anything else to do with
more silicon [and note that the current million-transistor chips on the
market or coming soon have not run out of interesting things to do with
more silicon], you put 2 CPUs on one chip, if you can figure out a sensible
cache hierarchy, and a package with few enough pins that people can use,
because, as usual, the issue is not so much in making the CPUs run fast,
it's getting the data in and out, and packaging technology will be
"interesting".

Of course, some of the numbers change if you build chips with different
mixtures.  Specifically, if you didn't care about FP, you could omit the
FPU, which is inherently a big space hog.  If you didn't need an MMU, that
would save space also.  However, I think this only moves the potential
switch point from 1 CPU to 2 around a little.
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
frazier@oahu.cs.ucla.edu (Greg Frazier) (11/10/89)
In article <31097@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>In article <28942@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
>... [me proposing multiple CPU chips]
>I don't think we're anywhere near this yet, and this can be seen by
>analyzing the layout and nature of million-transistor chips [like i860s].
>If you look at the i860 die, you find that:
>	a) Most of the transistors are in the caches.
>	b) Most of the space is the FPU, registers, integer datapath, etc.
>	   Some of this stuff is wires, and it doesn't shrink as well as
>	   transistors do.
>	c) At the top speed claimed for it, eventually [50Mhz], 12KB
>	   of cache is NOWHERE near big enough for efficiency, by itself.
[ further discussion of how the $ is still too small ]

Yeah, I've been wondering why people are bothering with on-chip caches.
Admittedly, a 2-chip CPU would probably be significantly more expensive
than a single-chip CPU, but I think the product would be more flexible, and
achieve better performance, if, instead of putting the $ on the CPU chip,
the $ was provided on a companion chip.  This should only increase the
latency to the cache by a single clock cycle, while significantly boosting
the hit ratio.  Particularly if this were a data $ (leave the instruction $
on chip - it belongs there).  With a separate cache chip, one could a) put
some intelligence on it and/or b) make it expandable, such that a user
could choose to provide two or three $ chips.  Of course, this is not a
terribly new idea - but I don't know why it isn't being used in the newer
chips.  Realtime people would love it - they would put NO $ chips on, and
simply populate the board with fast local memory (almost the same thing,
but more predictable behavior).  This scheme would provide more space on
chip for... two CPUs, or multiple FPUs, or single-cycle FPUs (that's an
attractive one!), or any other of a host of performance-boosting schemes.
A disadvantage is that the CPU chip would have to have multiple ports - a
chip with inst. $ and data $ on chip can have both $'s share a single port
to memory, since (presumably) neither is using it very often.  However,
unless one wants 128-bit paths to memory (which one might), I don't think
having 2 ports is a big deal.

Quick performance analysis hack:
on-chip $,  assume 85% hit ratio, 1 cycle delay on hit, 14 cycle delay on miss
off-chip $, assume 95% hit ratio, 2 cycle delay on hit, 14 cycle delay on miss
            (a smart $ will not incur extra miss delay)

on-chip memory speed:  .85*1 + .15*14 = 2.95 cycles/reference
off-chip memory speed: .95*2 + .05*14 = 2.60 cycles/reference - a win!

With the hit ratios I have assumed, the break-even point is a memory delay
of 10 cycles - below that, the on-chip cache becomes a win.  Of course,
change the hit ratios, and one changes the break-even point, so your
mileage will vary.  As a final note, if the $ chip is closely married to
the CPU chip, there is no reason why the 2-cycle delay can't be achieved, I
think.

Greg Frazier
"They thought to use and shame me but I win out by nature, because a true
freak cannot be made.  A true freak must be born." - Geek Love

Greg Frazier	frazier@CS.UCLA.EDU	!{ucbvax,rutgers}!ucla-cs!frazier
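[Greg's "quick performance analysis hack" redone as a small script.  The
hit ratios and delays are his assumptions from the post, not measurements;
solving his equality exactly gives a break-even miss delay of 10.5 cycles,
i.e. the "10 cycles" he quotes.]

```python
# Expected cycles per reference for a simple one-level cache model.

def avg_cycles(hit_ratio, hit_delay, miss_delay):
    """hit_ratio of references cost hit_delay; the rest cost miss_delay."""
    return hit_ratio * hit_delay + (1 - hit_ratio) * miss_delay

on_chip  = avg_cycles(0.85, 1, 14)   # .85*1 + .15*14 = 2.95 cycles/reference
off_chip = avg_cycles(0.95, 2, 14)   # .95*2 + .05*14 = 2.60 cycles/reference

# Break-even miss delay m: 0.85*1 + 0.15*m == 0.95*2 + 0.05*m
#   => 0.10*m = 1.05  =>  m = 10.5 cycles
break_even = (0.95 * 2 - 0.85 * 1) / (0.15 - 0.05)

print(on_chip, off_chip, break_even)
```

Below that break-even miss delay, the on-chip cache's 1-cycle hit wins;
above it, the off-chip cache's higher hit ratio wins.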
gerry@zds-ux.UUCP (Gerry Gleason) (11/10/89)
In article <31027@obiwan.mips.COM> mark@mips.COM (Mark G. Johnson) writes:
>In article <1579@crdos1.crd.ge.COM> davidsen@crdos1.UUCP (bill davidsen) writes:
>>  I don't think any major chips are being designed in 6 months, or 18.
>>At least not CPUs.  I believe Intel said that the 486 design cycle was
>>started about five years ago.  Feel free to correct this if you have a
>>better source than _Info World_.
>Fujitsu SPARC . . .
>Or how about the Cypress full-custom SPARC?  Both of these were designed
>and taped-out in well under 18 months.
> . . . R2000 . . . in 9 months.

I assume you are only stating the design cycles for these RISC processors,
and not disputing his guess about the 486.  Even if it is more like three
years, that's still a big win in design cycle time for RISCs - and what
about the man-years invested in these processors?  These wins in design
cycle are probably more significant than the performance issue (not
insignificant itself).

Gerry Gleason
davidsen@crdos1.crd.ge.COM (Wm E Davidsen Jr) (11/10/89)
In article <31027@obiwan.mips.COM>, mark@mips.COM (Mark G. Johnson) writes:
| Just to name a couple from our esteemed colleagues in the SPARC camp,
| how about the original Fujitsu SPARC (the one in the 4/260)?  Its
| design schedule & history was published in _High_Performance_Systems.
| Or how about the Cypress full-custom SPARC?  Both of these were designed
| and taped-out in well under 18 months.

I think we're talking about different things as design time here; I was
talking about the time from "let's build a CPU" to a working part.  Taking
a part with known word size, register layout and instruction set is a
subset of that.  A SPARC port looks to me like "how do we do it" without
the "what do we do" phase.

Since you mentioned the R2000, can you determine the elapsed time for the
whole process?
-- 
bill davidsen	(davidsen@crdos1.crd.GE.COM -or- uunet!crdgw1!crdos1!davidsen)
"The world is filled with fools.  They blindly follow their so-called
'reason' in the face of the church and common sense.  Any fool can see
that the world is flat!" - anon
baum@Apple.COM (Allen J. Baum) (11/10/89)
[]
>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>ARGUMENT 1: RISC is better because it's smaller, for new technologies
>When die size is a limit, RISC is better, because you can do it,

Agreed.  However, current trends appear to indicate that new technologies
mature reasonably quickly.  This argument works for just a few (<10?) years.

>ARGUMENT 2: RISC is better for cost reasons, because it's smaller

When you get to 1M transistors on a chip, the space for some extra decoding
logic is negligible.  This argument works only during the initial (e.g.
'new technology') phase.

>ARGUMENT 3: RISC is better, because even if there is enough space on a die
>	to put a whole CPU plus other things, the RISC can afford more space
>	for caches and other good things, and so it will be faster.

See argument above.  What is the performance difference between a 32K cache
and a 31K cache?

>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>	time to design and test the chips.
>	COMMENT: maybe, maybe not.

Agreed.  Notice that as we get more and more transistors, we start to do
more complex (superscalar, hi-perf FP, graphics) things, not just add more
cache.  Superscalar may double your performance.  It would be literally
impossible to add enough cache to do that, so simple, regular hardware is
probably not where the transistors will go.  (For disbelievers: if your
cache miss penalty * miss ratio < 1, then even if every access hits, you'll
save less than a cycle, so CPI approaches 1.  Superscalar does better.)

>ARGUMENT 4: But when you can get zillions** of transistors on a die, it
>	doesn't matter.
>	... More transistors will help everything, however, it may be
>	that the "limiting factor" will be not die space,
>	but COMPLEXITY in critical paths and exception-handling.
>
>This last point is illustrated by the recent i486 bugs, and also, by the
>errata list carried for a long time by the i386

Ah, yes.
I think it's safe to say that our simple RISCs are going to start to get
fairly complex, as we start to play all the little hardware tricks we've
known from the supercomputer world, and some that we'll invent.  Note that
a lot of errors come from exception-handling kinds of problems, and while
CISC machines have them, superscalar, out-of-order execution RISC machines
will have them in spades.  I'm beginning to believe Nick Tredennick when he
says that RISCs aren't better, just newer.  In a few years, I think we may
find that CISCs will be back, to some extent.  The difference is that these
'CISC's will be a bit more carefully tuned to allow the hardware techniques
now being pioneered by RISCs to work on them.  Certainly, the compiler
technology will permit them to be used efficiently; as noted in an earlier
posting this week, RISC vs. CISC compilers seem to just trade off the
problems of instruction scheduling for instruction selection.  Both are
do-able.

To give a flavor of what I mean, I'll summarize the RT/2 papers from IBM.
Note that one paper referenced the other as having the title "An IBM
post-RISC Processor Architecture", although that wasn't the title it was
given.

Superscalar w/ 1 fixed, 1 float, 1 branch, & 1 cond. code op simultaneously
Branch instructions:
    branch on any bit of CC reg
    each has a bit that enables storing of PC+1 into -->dedicated<-- link reg.
    -->no<-- delayed branching
    taken conditional branch: 0 to 3 cycles (depends when CC is set)
    -->dedicated<-- counter reg, for decr & branch-if-0 ops, which can be
	combined with a test of any bit in the CC reg.
Cond. code ops:
    Any boolean operation on any two of the 32 bits in the CC reg.  Useful
    for generating compound Boolean expressions.  Frequently used booleans
    can be kept in the CC reg.
Fixed ops:
    Multiply & divide included, with dedicated MQ reg.
    Support for min, max, & abs
    Support for arbitrarily aligned byte-string compare & move, both
	length-specified & null-terminated.
    Hardware dedicated byte count & comparison register included in state.
    String instructions defined to permit max. theoretical bus bandwidth
	to be used, w/ very low overhead for short strings.
    Auto-incr & -decr address modes
    Hardware handling for load/store of misaligned data (as long as it's
	within a cache line).  Optional fault if it crosses cache lines.
Floating point ops:
    Multiply & Add w/ only one round; takes same time as either add or mult.
    Reg. renaming
Overall: all interrupts/traps are precise
Icache: 8K byte, 64-byte line, 32-entry 2-way set-assoc. TLB
Dcache: 64K byte, 4-way set-assoc, 128-byte line, 128-entry 2-way
	set-assoc. TLB
    -->hardware<-- table walking
    Dcache has load & store buffers (store buffers so a load can be
	performed before cache writeback, load buffers so loads can
	proceed during filling).
Mem system has ECC & bit steering (allows a spare bit to be substituted
	for a failing bit).  4-bit DRAMs scattered across ECC groups so a
	chip failure is detectable.
    4-deep 'pending store queue' permits address translation & checking
	even if data is still being calculated.
Memory addressing: 52-bit virtual, 32-bit physical
    upper 4 bits of the 32-bit address select one of 16 24-bit seg. regs.
	(24+28=52)
    Seg. regs. have an I/O bit & lock-enable bits
    Lock enable turns on the hardware lock & transaction ID hardware
	(801 & RT style.)
    Hardware can use low 20 bits of virtual address for translation lookup.
    Software must ensure that aliasing is avoided
-- 
baum@apple.com		(408)974-3385
{decwrl,hplabs}!amdahl!apple!baum
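[The "disbelievers" arithmetic in Allen's post (miss penalty * miss ratio < 1
means a perfect cache saves less than one cycle) can be sketched numerically.
All numbers below are illustrative assumptions, not measurements of any
real chip; the ~0.6 CPI for a 2-wide superscalar is a hypothetical figure.]

```python
# CPI = base CPI + memory stalls per instruction, the standard model.

def cpi(base_cpi, miss_ratio, miss_penalty, refs_per_instr=1.0):
    """Cycles per instruction with cache stalls folded in."""
    return base_cpi + refs_per_instr * miss_ratio * miss_penalty

current       = cpi(1.0, miss_ratio=0.05, miss_penalty=10)  # 1.5 CPI
perfect_cache = cpi(1.0, miss_ratio=0.0,  miss_penalty=10)  # 1.0 CPI

# Growing the cache can at best remove the 0.5 stall cycles (a 1.5x gain),
# while a 2-wide superscalar might push CPI below 1.0 - something no amount
# of cache can do once the base CPI is 1.
superscalar = 0.6  # assumed, for illustration

print(current, perfect_cache, superscalar)
```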
marc@oahu.cs.ucla.edu (Marc Tremblay) (11/10/89)
In article <28985@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
> (Some stuff on why an on-chip cache may not be a good idea)
>Quick performance analysis hack:
>on-chip $,  assume 85% hit ratio, 1 cycle delay on hit, 14 cycle delay on miss
>off-chip $, assume 95% hit ratio, 2 cycle delay on hit, 14 cycle delay on miss
>            (a smart $ will not incur extra miss delay)
>
>on-chip memory speed:  .85*1 + .15*14 = 2.95 cycles/reference
>off-chip memory speed: .95*2 + .05*14 = 2.60 cycles/reference - a win!
>
>With the hit ratios I have assumed, the break-even point is a memory
>delay of 10 cycles - below that, the on-chip cache becomes a win.

Regarding data caches, most implementations with a 2-cycle load delay (on a
hit) allow overlapping of instructions, so that a decent compiler can
schedule an instruction in the delay slot.  If we assume that the load
delay can be filled, let's say, 50% of the time, we obtain a break-even
point of around 6 for the miss delay, which makes the on-chip cache even
more questionable if *only* this factor is considered.

To really evaluate the impact of an on-chip cache, though, we have to look
at other factors such as:

1) With an on-chip cache it is a lot easier to implement a wide datapath
   (64 or 128 bits) between the cache and the register file than it is
   with an off-chip cache, which requires lots of pins and lots of
   routing.  A wide datapath allows the use of instructions that take
   advantage of the extra bandwidth to i) save/restore the register file
   quicker (for example on calls/returns) and ii) load and store
   double-precision operands in one cycle (two double-precision operands
   can be loaded with 128 bits).

2) What applications is the processor-cache pair targeted for?  For
   example, if the chip is used mostly for applications showing lots of
   relatively small loops with heavy floating-point computations, then an
   on-chip instruction cache makes a lot of sense, since the hit ratio
   will be high.

3) Cost of flushing the cache on a context switch.
4) Cost of maintaining cache coherency in a multiprocessor environment,
   etc...

					Marc Tremblay
					marc@CS.UCLA.EDU
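[Marc's refinement of Greg's break-even arithmetic, as a sketch.  If the
compiler fills the off-chip cache's load-delay slot 50% of the time, the
effective off-chip hit delay drops from 2 to 1.5 cycles; solving the same
equality then gives 5.75 cycles, Marc's "around 6".  The ratios and delays
are the assumptions from the thread, not measurements.]

```python
def break_even_miss_delay(on_hit_ratio, on_hit_delay,
                          off_hit_ratio, off_hit_delay):
    """Miss delay m at which on-chip and off-chip average costs are equal:
       on_hit_ratio*on_hit_delay + (1-on_hit_ratio)*m
         == off_hit_ratio*off_hit_delay + (1-off_hit_ratio)*m"""
    num = off_hit_ratio * off_hit_delay - on_hit_ratio * on_hit_delay
    den = (1 - on_hit_ratio) - (1 - off_hit_ratio)
    return num / den

plain  = break_even_miss_delay(0.85, 1, 0.95, 2.0)  # ~10.5: Greg's "10"
filled = break_even_miss_delay(0.85, 1, 0.95, 1.5)  # ~5.75: Marc's "around 6"
print(plain, filled)
```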
mash@mips.COM (John Mashey) (11/10/89)
In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>ARGUMENT 1: RISC is better because it's smaller, for new technologies
>>When die size is a limit, RISC is better, because you can do it,
> Agreed. However, current trends appear to indicate that new technologies
> mature reasonably quickly. This argument works for just a few (<10?) years.

Agreed, although 10 years is almost an eternity, especially given the
structure and trends in the computer business, i.e., it's getting harder
and harder to introduce new architectures successfully....

>>ARGUMENT 2: RISC is better for cost reasons, because it's smaller
> When you get to 1M transistors on a chip, the space for some extra decoding
> logic is negligible. This argument works only during the initial (e.g.
> 'new technology') phase.

Sorry, I should have been more specific: I was assuming architectures
upward-compatible with any of the current prevalent CISCs, i.e., that would
typically use large microcode ROMs for some instructions, even if the most
frequent ones were hardwired as in 486s, etc.

>>ARGUMENT 3: RISC is better, because even if there is enough space on a die
>>	to put a whole CPU plus other things, the RISC can afford more space
>>	for caches and other good things, and so it will be faster.
> See argument above. What is the performance difference between a 32K cache
> and a 31K cache?

Of course, if there is that little difference, it doesn't make very much
difference, EXCEPT if, to fit the size of chip you can actually make, it
makes the difference between 16K and 32K, which can happen quite easily.

>>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>>	time to design and test the chips.
>>	COMMENT: maybe, maybe not.
> Agreed. Notice as we get more and more transistors, we start to do more
> complex (superscalar, hi-perf FP, graphics) things, not just add more
> cache.
> Superscalar may double your performance. It would be literally
> impossible to add enough cache to do that, so simple, regular hardware is
> probably not where the transistors will go. (For disbelievers, if your
> cache miss penalty * miss ratio < 1, then even if every access hits, you'll
> save less than a cycle, so CPI approaches 1. Superscalar does better.)
>
>>ARGUMENT 4: But when you can get zillions** of transistors on a die, it
>>	doesn't matter.
>>	... More transistors will help everything, however, it may be
>>	that the "limiting factor" will be not die space,
>>	but COMPLEXITY in critical paths and exception-handling.
>>
>>This last point is illustrated by the recent i486 bugs, and also, by the
>>errata list carried for a long time by the i386
>
>Ah, yes. I think it's safe to say that our simple RISCs are going to start
>to get fairly complex, as we start to play all the little hardware tricks
>we've known from the supercomputer world, and some that we'll invent. Note
>that a lot of errors come from exception handling kinds of problems, and
>while CISC machines have them, superscalar, out-of-order execution RISC
>machines will have them in spades.

Yes, but superscalar, out-of-order CISCs would have them in spades & hearts
:-)  The new IBM stuff looks interesting....
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	{ames,decwrl,prls,pyramid}!mips!mash  OR  mash@mips.com
DDD:  	408-991-0253 or 408-720-1700, x253
USPS: 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
kolding@cs.washington.edu (Eric Koldinger) (11/10/89)
In article <28985@shemp.CS.UCLA.EDU> frazier@oahu.UUCP (Greg Frazier) writes:
>Yeah, I've been wondering why people are bothering with on-chip
>caches. Admittedly, a 2-chip CPU would probably be significantly
>more expensive than a single-chip CPU, but I think the product
>would be more flexible, and achieve better performance, if, instead
>of putting the $ on the CPU chip, the $ was provided on a companion
>chip. This should only increase the latency to the cache by a
>single clock cycle, while significantly boosting the hit ratio.
>Particularly if this were a data $ (leave the instruction $ on
>chip - it belongs there). With a separate cache chip, one could
[Greg goes on to explain his logic.]

Why not provide both on-chip and off-chip cache ($)? In the chip of the
future, a 50-100MHz part, keeping the processor fed from an off-chip cache
will be quite a bear. The off-chip communication speeds are probably going
to be substantially slower than what you can achieve on-chip, so feeding an
instruction per cycle into the chip might prove almost impossible,
especially if you put two CPUs on the chip (as Greg proposes somewhere in
the article).

Instead, why not use a multi-level caching scheme? Put a "small" cache
on-chip, keep it simple, possibly I-cache only, and have a larger backing
cache off-chip. This would allow you to build the off-chip cache as large
as you like, and still keep the single-cycle cache access most of the time,
with a low (2-4 cycle) delay on most cache misses, and a larger delay
(multiple bus cycles) when the second-level cache misses. You can also
place cache-coherency mechanisms in the second-level cache so that snooping
will never lock the CPU out of the cache, except in the exceptional event
of a cache-coherency action, in which case the second-level cache can
"interrupt" the first-level cache and invalidate any necessary entries.
The second level cache could easily service two processors, allowing more versatility than the 2 CPU chip that Greg proposes. -- _ /| Eric Koldinger \`o_O' University of Washington ( ) "Gag Ack Barf" Department of Computer Science U kolding@cs.washington.edu
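Eric's numbers (single-cycle on-chip cache, 2-4 cycle backing cache, multiple bus cycles on a second-level miss) drop straight into the standard average-memory-access-time formula. A sketch with made-up hit rates:

```python
# Average memory access time (AMAT) for the two-level scheme described
# above: 1-cycle on-chip cache, a 3-cycle off-chip second-level cache,
# and a slow bus on a second-level miss. Hit rates are invented for
# illustration only.

def amat(l1_time, l1_miss_rate, l2_time, l2_miss_rate, mem_time):
    # Misses in L1 pay the L2 access; misses in L2 additionally pay memory.
    return l1_time + l1_miss_rate * (l2_time + l2_miss_rate * mem_time)

one_level = 1 + 0.10 * 20               # small on-chip cache straight to memory
two_level = amat(1, 0.10, 3, 0.20, 20)  # big off-chip cache catches most misses

print(one_level, two_level)   # 3.0 vs 1.7 cycles per reference
```

With these (hypothetical) numbers the backing cache cuts the average reference cost nearly in half without touching the single-cycle common case.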
baum@Apple.COM (Allen J. Baum) (11/11/89)
[]
>In article <31149@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
> In article <blah blah> Allen Baum says:
>>Ah, yes. I think it's safe to say that our simple RISCs are going to start
>>to get fairly complex, as we start to play all the little hardware tricks
>>we've known from the supercomputer world, and some that we'll invent. Note
>>that a lot of errors come from exception-handling kinds of problems, and
>>while CISC machines have them, superscalar, out-of-order execution RISC
>>machines will have them in spades.
>Yes, but superscalar, out-of-order CISCs would have them in spades&hearts :-)
>
>The new IBM stuff looks interesting....

I guess what I was trying to say is that I think it's possible to have an
architecture that we might today consider "CISC" (for any number of reasons:
Reg-Mem instructions, 2-word instructions, fancier address modes, etc.) that
would be architected (I hate verbing nouns) to permit superscalar and
out-of-order execution, unlike current CISCs that gave no thought to the
issues at all (but I love run-on sentences). This being the case, I believe
that compiler technology can take advantage of CISCy features, and that
we'll see a resurgence of CISCs. But they won't be compatible with the ones
we are familiar with.
--
baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
schow@bcarh61.bnr.ca (Stanley T.H. Chow) (11/11/89)
In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>
>The die area issue has been widely misinterpreted. Let me summarize some
>of the various arguments for RISC vs CISC and die size:
> [...]
>ARGUMENT 4: But when you can get zillions** of transistors on a die,
> it doesn't matter.
> COMMENT 1: We haven't gotten enough transistors to make anybody
> happy yet ("happy" means VLSI designers wandering around
> saying "I have so much space I don't know what to do with it")
> and we're not likely to get enough any time real soon.
> COMMENT 2: Of course, this all remains to be seen, but:
> COMMENT 3: More transistors will help everything, however, it may be
> that the "limiting factor" will be not die space,
> but COMPLEXITY in critical paths and exception-handling.
     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Please note that "complexity" is essentially independent of the RISC/CISC
debate. I defy anyone to call the M88k simple :-)

It is certainly true that the i86 (or any other micro) has had its share of
problems. However, I would chalk the problems up to the "Venerable
Architecture" instead. It is by no means clear that a CISC architecture
must always be nasty in this respect.

The "zillion transistor" question is essentially another design trade-off:
is it better to use the transistors for a bigger cache? more pipeline
stages? fatter pipelines? more execution units? .... Of course, once the
architecture is chosen, many of the trade-offs become obvious. IMHO, CISC
leaves many of the choices open, allowing more paths to be determined
later. [Structured-software people think delaying choices is a good
strategy. You may or may not agree. :-)]

>
>Do weird and occasional effects matter? Maybe, maybe not. Observe that
>a 6-month difference is a long time these days, and if it takes 6 months
>more to get a design really right enough that reputable companies will
>ship it, that's a noticeable difference.
>

Certainly 6 months make a difference.
It is also true that sometimes even 2 years make little difference. It all
depends on the market and the perspective. How many RISC chips beat the 486
to market? What difference did it make? These days, software is KING. In
the commercial marketplace, compatibility is also very important. Even in
new designs that don't care about old software, I would think designers (at
least the design managers) should look at the long-term evolution of the
architecture before settling on a particular CPU chip.

The only occasions when 6 months *really* matter are races to open up a new
area. It's been a long time since we saw that kind of excitement in the CPU
& memory market. Examples are the *first* one-chip CPU (the 4004/8008), the
first DRAM, the first EEPROM, etc. When I can also buy the same
functionality elsewhere, 6 months is only cost-optimization.

Stanley Chow BitNet: schow@BNR.CA
BNR UUCP: ..!psuvax1!BNR.CA.bitnet!schow
(613) 763-2831 ..!utgpu!bnr-vpa!bnr-rsc!schow%bcarh61
Me? Represent other people? Don't make them laugh so hard.
gerry@zds-ux.UUCP (Gerry Gleason) (11/11/89)
In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>[]
>>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes:
>>ARGUMENT 4: RISC is better, because it's simpler, and hence there is faster
>> time to design and test the chips.
>> COMMENT: maybe, maybe not.
> Agreed. Notice as we get more and more transistors, we start to do more
> complex (superscalar, hi-perf FP, graphics) things, not just add more
> cache. Superscalar may double your performance. It would be literally
> impossible to add enough cache to do that, so simple, regular hardware is
> probably not where the transistors will go. (for disbelievers, if your
> cache miss penalty * miss ratio <1, then even if every access hits, you'll
> save less than a cycle, so CPI approaches 1. Superscalar does better.)

I think part of the problem is that not everyone means the same thing by
RISC. As I said in an earlier posting, it's not RISC vs CISC, but simply
applying the KISS principle to every aspect of design, or maybe more to the
point: before you put in a feature, you had better be sure it is a big win.
Small wins are not worth the headaches that the complexity brings. In fact,
if you can throw out a large chunk of complexity (from the instruction set
or addressing modes, for example), you will have capacity for more hot-rod
features (i.e. pipelines, and some of the things you mentioned).

Another aspect of RISC is that it is a move to more modular, less ad-hoc
architectures. The architects that are pressing the RISC debate are
applying the same ideas that help software engineers deal with million-line
programs to help hardware engineers deal with million-transistor chips.
The issues are the same.

Gerry Gleason

My cohorts here at Zenith think I need a disclaimer to say this is my
opinion and not Zenith's. IMHO, anyone who assumes a posting to the net
represents the organization the poster works for is nuts. I know it
happens, but how often is the poster's opinion even represented?
So I will just waste net bandwidth this once with a disclaimer: "No company
will let me speak for them anyway, and I reserve the right to change my
opinion without notice."
mash@mips.COM (John Mashey) (11/11/89)
In article <9769@june.cs.washington.edu> kolding@june.cs.washington.edu (Eric Koldinger) writes:
> Why not provide both on-chip and off-chip cache ($). In the chip of the
> future, a 50-100MHz part, keeping the processor fed from an off-chip cache
> will be quite a bear. The off-chip communication speeds are probably going
> to be substantially slower than what you can achieve on-chip, so feeding an
> instruction per cycle into the chip might prove almost impossible, ....

Some relevant recent papers include:

Wen-Hann Wang, Jean-Loup Baer, Henry M. Levy, "Organization and Performance
of a Two-Level Virtual-Real Cache Hierarchy", 16th Ann. Int. Symposium on
Computer Architecture, May-June 1989, Jerusalem, Israel. ACM SIGARCH 17, 3
(June 1989), 140-148. [Univ. of Washington people, using virtual
first-level and real second-level caches: get a fast cycle from the first
level, a high hit-rate from the second. Grossly similar scheme to the MIPS
RC6280, for the same reasons.]

Steven Przybylski, Mark Horowitz, John Hennessy, "Characteristics of
Performance-Optimal Multi-Level Cache Hierarchies", (same as above),
114-121. A good quote from the abstract: "The increasing speed of new
generation processors will exacerbate the already large difference between
CPU cycle times and main memory access times. As this difference grows, it
will be increasingly difficult to build single-level caches that are both
fast enough to match these fast cycle times and large enough to effectively
hide the main memory access times.... This change in relative importance of
cycle time and miss rate makes associativity more attractive and increases
the optimal cache size for second-level caches over what they would be for
an equivalent single-level cache system."

Note, of course, that many 68020 systems used external caches along with
the internal ones, and various superminis and mainframes have used such
things for some time.
-- -john mashey DISCLAIMER: <generic disclaimer, I speak for me only, etc> UUCP: {ames,decwrl,prls,pyramid}!mips!mash OR mash@mips.com DDD: 408-991-0253 or 408-720-1700, x253 USPS: MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (11/12/89)
1) With an on-chip cache it is a lot easier to implement a wide datapath
(64 or 128 bits) between the cache and the register file than it is with an
off-chip cache, which requires lots of pins and lots of routing. A wide
datapath allows the use of instructions that take advantage of the extra
bandwidth to i) save/restore the register file quicker (for example on
calls/returns) and ii) load and store double-precision operands in one
cycle (two double-precision operands can be loaded with 128 bits).

I'm a wide-datapath proponent myself, but some of my contacts have
responded that so many signals simultaneously changing state => huge
instantaneous power demand. How much of a problem is this *really*? Can
techniques such as slightly delaying groups of lines help? E.g., provide
the LSBs earlier so that carries can propagate while waiting for the MSBs
to arrive on different lines?
--
Andy "Krazy" Glew, Motorola MCD, aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA. {uunet!,}uiucuxc!udc!aglew
My opinions are my own; I indicate my company only so that the reader may
account for any possible bias I may have towards our products.
cs4g6ag@maccs.dcss.mcmaster.ca (Stephen M. Dunn) (11/13/89)
In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes: $>In article <31031@winchester.mips.COM> mash@mips.COM (John Mashey) writes: $>ARGUMENT 2: RISC is better for cost reasons, because it's smaller $ When you get to 1M transistors on a chip, the space for some extra decoding $ logic is neglible. This argument works only during the initial (e.g. $ 'new technology' phase. $>ARGUMENT 3: RISC is better, because even if there is enough space on a die $> to put a whole CPU plus other things, the RISC can afford more space $> for caches and other good things, and so it will be faster. $ See argument above. What is the performance difference between a 32k cache $ and a 31k cache? I have beside me an article stating that on the Berkeley RISC I and II chips, between 6 and 10 percent of the chip area was used for decoding and control logic. Compare this with a figure of 50-60% on a 68000 or Z8000 ... now, I must admit that I've never designed an on-chip cache for a RISC chip, but somehow I think it would be quite easy to get more than a 1k cache out of 40-54% of the chip size. After all, if each 1k of cache size takes even 40% of the chip's area, how would you get a 32k (or, for that matter, 31k) cache on a chip in the first place? -- Stephen M. Dunn cs4g6ag@maccs.dcss.mcmaster.ca <std_disclaimer.h> = "\nI'm only an undergraduate!!!\n"; **************************************************************************** They say the best in life is free // but if you don't pay then you don't eat
baum@Apple.COM (Allen J. Baum) (11/14/89)
[]
>In article <255E15BB.12770@maccs.dcss.mcmaster.ca> cs4g6ag@maccs.dcss.mcmaster.ca (Stephen M. Dunn) writes:
>In article <36340@apple.Apple.COM> baum@apple.UUCP (Allen Baum) writes:
>$ What is the performance difference between a 32k cache and a 31k cache?
>
>I have beside me an article stating that on the Berkeley RISC I and II chips,
>between 6 and 10 percent of the chip area was used for decoding and control
>logic. Compare this with a figure of 50-60% on a 68000 or Z8000 ... now, I
>must admit that I've never designed an on-chip cache for a RISC chip, but
>somehow I think it would be quite easy to get more than a 1k cache out of
>40-54% of the chip size. After all, if each 1k of cache size takes even 40%
>of the chip's area, how would you get a 32k (or, for that matter, 31k) cache
>on a chip in the first place?

It is not clear that the percentage of transistors used by control logic is
constant as the total number of transistors increases. That is, 50-60% of
the 68,000 transistors in a 68000 may have been control, but it's a lot
less (percentage-wise) of the 250k (?) transistors of a 68030. By the time
you start throwing large caches on board, the control percentage may be
negligible, even for a CISC. It was my contention that when you reach this
point, the cost of CISC control hardware is negligible, and the supposed
superiority of RISC (in this "area" {heh, heh} only!! :-) is in serious
question.
--
baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
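Allen's amortization argument can be put in rough numbers. All the transistor counts below are invented for illustration (roughly a 68000-class control/datapath split, and ~6 transistors per SRAM bit):

```python
# Sketch of the point above: if control logic is a roughly fixed
# transistor budget, its *share* of the die collapses as on-chip
# cache grows. Every count here is hypothetical.

CONTROL = 40_000                    # say, ~60% of a 68k-transistor design
DATAPATH = 28_000
TRANSISTORS_PER_KB = 8 * 1024 * 6   # ~6 transistors per SRAM bit

fractions = {}
for cache_kb in (0, 4, 32):
    total = CONTROL + DATAPATH + cache_kb * TRANSISTORS_PER_KB
    fractions[cache_kb] = CONTROL / total
    print(cache_kb, "KB cache ->", round(100 * fractions[cache_kb], 1), "% control")
```

With no cache the control share is nearly 60%; with a 32 KB on-chip cache the same control logic is down around 2-3% of the transistors, which is the "negligible, even for a CISC" point.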
toms@omews44.intel.com (Tom Shott) (11/16/89)
To throw some more gas on this fire: as we top 50 MHz for chip speed, the
biggest problem becomes getting data on and off chip. You start needing
delay cycles between reads and writes on an I/O pin to turn the bus around,
or a chip w/ lots of pins to get data in and out faster. Putting a cache on
the die has been beaten into the ground.

One solution that has not been discussed is flip-chip technology. In
flip-chip technology many die are mounted directly on a ceramic carrier.
(This is used in IBM mainframes.) The result is lower interconnect
capacitance (smaller feature size for lines, no pins) and the ability to
match an exact SRAM to an exact CPU. I/Os (not pins) are cheap w/ flip-chip
technology. You get higher bandwidth to the cache (wider and faster lines)
and the ability to optimize your process for digital logic on one die and
RAM cells on the other.

Other ways to deal w/ the interconnect speed problem are architectural.
Delayed loads have been spoken about; it just takes more smarts in the
compiler to use those slots. A longer pipeline w/ the load at the start
will also hide off-chip delays.

A novel architecture from the Computer Systems Group at UIUC published by
Dave Archer, et al. used multiple tasks running on one CPU to hide delays.
For example, w/ a 4-stage pipeline, the CPU chip would run four tasks at
once. I don't remember the details, but it worked out that each task
executed at 1/4 of full speed. (I think dummy pipeline stages were used
between the stages.) But during that delay time, memory fetch latency was
hidden. (Also data dependencies.) Realistically I might expect this
technique only to be used for large systems aimed at multiuser
applications. You need four tasks always ready to run.
--
-----------------------------------------------------------------------------
Tom Shott INTeL, 2111 NE 25th Ave., Hillsboro, OR 97123, (503) 696-4520
toms@omews44.intel.com OR toms%omews44.intel.com@csnet.relay.com
INTeL.. Designers of the 960 Superscalar uP and other uP's
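The interleaving scheme Tom describes (what a later post in this thread identifies as the Denelcor HEP's "barrel" approach) amounts to strict round-robin issue from N tasks, so each task's previous instruction has cleared the N-stage pipe before it issues again. A toy scheduling sketch, with invented task and instruction names:

```python
# Barrel-processor issue sketch: one instruction per cycle, rotating
# through the ready tasks. With 4 tasks and a 4-stage pipe, each task
# runs at 1/4 speed but its latencies are hidden by the other tasks.

from collections import deque

def barrel_schedule(tasks, cycles):
    """tasks: dict of name -> instruction list; returns the issue order."""
    ring = deque(tasks.keys())
    pcs = {name: 0 for name in tasks}   # per-task program counters
    issued = []
    for _ in range(cycles):
        name = ring[0]
        ring.rotate(-1)                 # strict round robin, one per cycle
        if pcs[name] < len(tasks[name]):
            issued.append((name, tasks[name][pcs[name]]))
            pcs[name] += 1
    return issued

order = barrel_schedule({"T0": ["load", "add"], "T1": ["load", "add"],
                         "T2": ["load", "add"], "T3": ["load", "add"]}, 8)
print(order)
```

Each task issues exactly once every 4 cycles, matching the "1/4 of full speed" observation; by the time `T0` issues its `add`, its `load` has had 3 intervening cycles to complete.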
lee@iris.davis.EDU (Peng Lee) (11/17/89)
In article <TOMS.89Nov15114003@omews44.intel.com>, toms@omews44.intel.com (Tom Shott) writes:
>
> A novel architecture from the Computer Systems Group at UIUC published by
> Dave Archer, et al. used multiple tasks running on one CPU to hide delays.
> For example w/ a 4 stage pipeline, the CPU chip would run four tasks at
> once. I don't remember the details but it worked out that each task
> executed at 1/4 of full speed. (I think dummy pipeline stages were used
> between the stages). But during that delay time memory fetch latency was
> hidden. (Also data dependencies). Realistically I might expect this
> technique only to be used for large systems aimed at multiuser
> applications. You need four tasks always ready to run.
>

I have been looking at the implementation of this kind of design. By
implementing a task-controller unit that swaps to a new register set when
there is a data dependency, a branch, or a load (delay), you don't need 4
tasks to fully utilize the execution pipe. At most two tasks should be
enough to fill the pipe. And if standard RISC optimizing techniques
(delayed branches, etc.) are applied to these tasks, single-task execution
time shouldn't be slower than a regular RISC when there is only one task.
This design has the advantage of guaranteeing full utilization of the
execution pipeline when there are two tasks. One very interesting aspect I
am currently looking at is the possibility of implementing semaphores in
these register sets.

-Peng (lee@iris.ucdavis.edu)
ingoldsb@ctycal.UUCP (Terry Ingoldsby) (11/17/89)
In article <15126@haddock.ima.isc.com>, news@haddock.ima.isc.com (overhead) writes: > In article <503@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes: > >It also seems to me that the current practice of throwing away computers > >every 3 years or so cannot continue forever; it is evidence of an > >immature technology that experiences quantum performance leaps every 3 > >years (that's what makes it fun :^). Eventually, performance improvements > >may taper off a bit, so the advantage of being able to design a chip in > >6 months instead of 18 will not be as great as it currently is. > > I used to think these things. I even worked for a company that > had a 25 year old CDC 6600 in production. The problem is this: > the maintenance on the old machines is higher than the cost of > new machines. It is still true, only less so. Why keep the > current machine on contract when you could just let the machine > break in six months (with some risk), and use the contract money > to buy a newer (and maybe) faster machine? > I *know* that it is cheaper to throw away old technology and bring in the latest and greatest. I have a three year old VAX that I would *love* to get rid of, if I could find anyone foolish enough to buy it. And you're right, the maintenance costs are a major portion of the justification. So is increased productivity and functionality on the part of the users. > We keep hearing how one physical limit or another will soon be > reached. I make this prediction: computers will continue getting > faster by the same expontial growth for the next ten years. Some > of the speed will come from raw hardware. Much of it will come > from more parallelism. RAM density will continue to increase. > Disk speeds and densities will get better. Backup media will get My point exactly, exponential growth for another 10 years. Maybe 15. At some point, *all* curves flatten. 
Besides the economic reasons for getting rid of old technology, there are
economic penalties for constantly throwing away working systems. Aside from
the obvious hardware waste (I've heard of working mainframes dismantled
with hammers and thrown in garbage trucks), there are very substantial
software porting costs. My point is that there will come a time when
economics will dictate that machines not be thrown out every 3 years.
Whether this will be in 5, 10 or 20 years I'm not sure. The technology
curve will then remain flat until a quantum breakthrough takes place
(analogous to the vacuum tube -> semiconductor revolution). I'm placing my
bets on optical gates. When that breakthrough takes place, the whole thing
will take off again and the added power will allow previously undreamed-of
applications to be written.
--
Terry Ingoldsby ctycal!ingoldsb@calgary.UUCP
Land Information Systems or
The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
ingoldsb@ctycal.UUCP (Terry Ingoldsby) (11/17/89)
In article <2@zds-ux.UUCP>, gerry@zds-ux.UUCP (Gerry Gleason) writes:
> But really, I'm beginning to think that the simplicity and speed of design
> and testing is the really big win with RISC. Also, product quality goes up
> because there are fewer ways for things to go wrong, fewer tests to write
> and run, and fewer places to make errors in design, layout, etc.

Exactly. As long as getting to market 6 months faster than the other guy is
the critical factor in marketing a design, RISC will win. Once the curve
flattens, timing will be far less important. Consider the early automobile
industry. In the beginning, radical changes were made to engines,
suspensions, and drivetrains every few years. Compare a 1960 car to a 1980
car, and the improvements have become much more evolutionary than
revolutionary. Does a car manufacturer really have a significant advantage
over his competitor because he introduces a new feature a year before the
competition? Probably not. Why? Because the performance gain is a few
percent, not orders of magnitude.

Anyway, I'm drifting further from comp.arch with each posting. Also, I'm
not sure I believe my own arguments, if for no other reason than that I
can't picture a stagnated computer industry.
--
Terry Ingoldsby ctycal!ingoldsb@calgary.UUCP
Land Information Systems or
The City of Calgary ...{alberta,ubc-cs,utai}!calgary!ctycal!ingoldsb
baum@Apple.COM (Allen J. Baum) (11/17/89)
[]
>In article <5952@ucdavis.ucdavis.edu> lee@iris.davis.EDU (Peng Lee) writes:
>In article <TOMS.89Nov15114003@omews44.intel.com>,
>toms@omews44.intel.com (Tom Shott) writes:
>>
>> A novel architecture from the Computer Systems Group at UIUC published by
>> Dave Archer, et al. used multiple tasks running on one CPU to hide delays.
>> For example w/ a 4 stage pipeline, the CPU chip would run four tasks at
>> once.
>One very interesting aspect I am currently looking at is the possibility
>of implementing semaphores in these register sets.

Not so novel; this scheme was used in the Denelcor HEP, and by Stellar in
their new machine. The Stellar had some kind of semaphore feature so that
the parallel running tasks could communicate with each other. The HEP had a
semaphore bit per memory location!
--
baum@apple.com (408)974-3385 {decwrl,hplabs}!amdahl!apple!baum
mac@ra.cs.Virginia.EDU (Alex Colvin) (11/17/89)
In article <36564@apple.Apple.COM>, baum@Apple.COM (Allen J. Baum) writes: > Not so novel; this scheme was used in the Denelcor HEP, and by Stellar in > their new machine. The Stellar had some kind of semphore feature so that the > parallel running tasks could communicate with each other. The HEP had a > semophore bit/memory location! wasn't this called a barrel processor? I believe this was discussed here a year or two ago.
casey@gauss.llnl.gov (Casey Leedom) (11/21/89)
| From: ingoldsb@ctycal.UUCP (Terry Ingoldsby)
|
| Aside from the obvious hardware waste (I've heard of working mainframes
| dismantled with hammers and thrown in garbage trucks) there are very
| substantial software porting costs.

Nothing against the rest of your argument, but I think you overestimate the
cost of porting software in the contemporary world. Companies are tending
to write less and less of their software in a machine-dependent manner
(excepting companies intent on destroying themselves). There is
machine-dependent code being written, but it's being isolated from the main
line code in well-defined abstractions.

Casey
rcd@ico.isc.com (Dick Dunn) (11/21/89)
ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes: > gerry@zds-ux.UUCP (Gerry Gleason) writes: > > But really, I'm beginning to think that the simplicity and speed of design > > and testing is the really big win with RISC... [other advantages to simplicity...] > Exactly. As long as getting to market 6 months faster than the other guy is > the critical factor in marketing a design, then RISC will win. Once the curve > flattens, then timing will be far less important... Depends on (a) how and (b) whether the curve flattens. Even if we get off the current roughly exponential curve (e.g., the MIPS "double per year" goal) and drop way back to a linear growth, will that really let CISC come close enough to say it's "caught up"? Another way to look at it is to ask whether, once RISC has held the lead for a while, there's any reason to go back to CISC? What does CISC have to offer, if performance somehow manages to become a secondary consideration (which I have a hard time imagining)? Will it become, as one might extrapolate from Terry's analogy to cars, a competition over tailfins and chrome? (This is less sarcastic and more serious than it might seem at first.) -- Dick Dunn rcd@ico.isc.com uucp: {ncar,nbires}!ico!rcd (303)449-2870 ...`Just say no' to mindless dogma.
peter@ficc.uu.net (Peter da Silva) (11/22/89)
In article <508@ctycal.UUCP> ingoldsb@ctycal.UUCP (Terry Ingoldsby) writes: > Does a car > manufacturer really have a significant advantage over his competitor because > he introduces a new feature a year before the competition? Depends on the feature and how well they can sell it. More efficient engines, no. More powerful engines, no. Fuel injection, turbochargers, etc..? No. But the early anti-lock braking systems did rather well, as did Oldsmobile's marketing of what's really a fairly ordinary engine: the Quad-4. And then there's the Ford Taurus effect. -- `-_-' Peter da Silva <peter@ficc.uu.net> <peter@sugar.lonestar.org>. 'U` -------------- +1 713 274 5180. "I agree 0bNNNNN would have been nice, however, and I sure wish X3J11 had taken time off from rabbinical hairsplitting to add it." -- Tom Neff <tneff@bfmny0>
lm@slovax.Eng.Sun.COM (Larry McVoy) (04/08/91)
In article <1991Apr7.064855.25469@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >Or for a *really* straight comparison, due to John Mashey I think, compare >the i860 to the i486: same tools, same process, same chip size, roughly >the same release time... and the RISC machine is faster, much faster, in >every way. I respect Henry, despite his annoying signatures, but I can't agree with this comparison. The i860 is, to the best of my knowledge, a clean start. The i486 is carrying baggage from all the way back to 8080's. (I personally think the i486 is a cool chip, if you look closely, it is quite RISC like in the most common instruction uses. First the 386, then the 486. Hmm. If Intel keeps this up, they might make a decent CPU one day :-) Anyway, it is not a fair comparison. Not by a long stretch. Let's see how the Nth generation SPARC, MIPS, and 88K's do (assuming they last) compared to some new design from scratch. --- Larry McVoy, Sun Microsystems (415) 336-7627 ...!sun!lm or lm@sun.com
mash@mips.com (John Mashey) (04/18/91)
WARNING: you may want to print this one to read it... In article <537@appserv.Eng.Sun.COM> lm@slovax.Eng.Sun.COM (Larry McVoy) writes: >In article <1991Apr7.064855.25469@zoo.toronto.edu> henry@zoo.toronto.edu (Henry Spencer) writes: >>Or for a *really* straight comparison, due to John Mashey I think, compare >>the i860 to the i486: same tools, same process, same chip size, roughly >>the same release time... and the RISC machine is faster, much faster, in >>every way. > >I respect Henry, despite his annoying signatures, but I can't agree >with this comparison. The i860 is, to the best of my knowledge, a >clean start. The i486 is carrying baggage from all the way back to >8080's. (I personally think the i486 is a cool chip, if you look >closely, it is quite RISC like in the most common instruction uses. >First the 386, then the 486. Hmm. If Intel keeps this up, they might >make a decent CPU one day :-) > >Anyway, it is not a fair comparison. Not by a long stretch. Let's see >how the Nth generation SPARC, MIPS, and 88K's do (assuming they last) >compared to some new design from scratch. Well, there is baggage and there is BAGGAGE. One must be careful to distinguish between ARCHITECTURE and IMPLEMENTATION: a) Architectures persist longer than implementations, especially user-level Instruction-Set Architecture. b) The first member of an architecture family is usually designed with the current implementation constraints in mind, and if you're lucky, software people had some input. c) If you're really lucky, you anticipate 5-10 years of technology trends, and that modifies your idea of the ISA you commit to. d) It's pretty hard to delete anything from an ISA, except where: 1) You can find that NO ONE uses a feature (the 68020->68030 deletions mentioned by someone else). 2) You believe that you can trap and emulate the feature "fast enough". i.e., microVAX support for decimal ops, 68040 support for transcendentals. 
Now, one might claim that the i486 and 68040 are RISC implementations of
CISC architectures .... and I think there is some truth to this, but I also
think that it can confuse things badly. Anyone who has studied the history
of computer design knows that high-performance designs have used many of
the same techniques for years, for all of the natural reasons, that is:
a) They use as much pipelining as they can; in some cases, if this means a
   high gate count, then so be it.
b) They use caches (separate I & D if convenient).
c) They use hardware rather than microcode for the simpler operations.
(For instance, look at the evolution of the S/360 products. Recall that the
360/85 used caches, back around 1969, and within a few years, so did any
mainframe or supermini.)

So, what difference is there among machines if similar implementation ideas
are used? A: there is a very specific set of characteristics shared by most
machines labeled RISCs, most of which are not shared by most CISCs. The
RISC characteristics:
a) Are aimed at more performance from current compiler technology (i.e.,
   enough registers),
OR
b) Are aimed at fast pipelining in a virtual-memory environment with the
   ability to still survive exceptions without inextricably increasing the
   number of gate delays (notice that I say gate delays, NOT just how many
   gates).

Even though various RISCs have made various decisions, most of them have
been very careful to omit those things that CPU designers have found
difficult and/or expensive to implement, and especially, things that are
painful for relatively little gain. I would claim that even as RISCs
evolve, they may have certain baggage that they'd wish weren't there ....
but not very much. In particular, there are a bunch of objective
characteristics shared by RISC ARCHITECTURES that clearly distinguish them
from CISC architectures.
I'll give a few examples, followed by the detailed analysis from an
upcoming talk.  MOST RISCs:

3a) Have 1 size of instruction in an instruction stream
3b) And that size is 4 bytes
3c) Have a handful (1-4) of addressing modes (* it is VERY hard to
    count these things; will discuss later)
3d) Have NO indirect addressing in any form (i.e., where you need one
    memory access to get the address of another operand in memory)
4a) Have NO operations that combine load/store with arithmetic, i.e.,
    like add from memory, or add to memory
4b) Have no more than 1 memory-addressed operand per instruction
5a) Do NOT support arbitrary alignment of data for loads/stores
5b) Use an MMU for a data address no more than once per instruction
6a) Have >= 5 bits per integer register specifier
6b) Have >= 4 bits per FP register specifier

These rules provide a rather distinct dividing line among architectures,
and I think there are rather strong technical reasons for this, such
that there is one more interesting attribute: almost every architecture
whose first instance appeared on the market from 1986 onward obeys the
rules above...  Note that I didn't say anything about counting the
number of instructions...

So, here's a table:

Age: number of years since the first implementation sold in this family
     (or the first thing with which this is binary compatible)
3a:  # instruction sizes
3b:  maximum instruction size in bytes
3c:  number of distinct addressing modes for accessing data (not
     jumps).  I didn't count register or literal, but only ones that
     referenced memory, and I counted different formats with different
     offset sizes separately.  This was hard work...  Also, even when a
     machine had different modes for register-relative and PC-relative
     addressing, I counted them only once.
3d:  indirect addressing: 0: no, 1: yes
4a:  load/store combined with arithmetic: 0: no, 1: yes
4b:  maximum number of memory operands
5a:  unaligned addressing of memory references allowed in load/store,
     without specific instructions
     0: no, never (MIPS, SPARC, etc.)
     1: sometimes (as in RS/6000)
     2: just about any time
5b:  maximum number of MMU uses for data operands in an instruction
6a:  number of bits for integer register specifier
6b:  number of bits for 64-bit-or-more FP register specifier, distinct
     from integer registers

Note that all of these are ARCHITECTURE issues, and it is usually quite
difficult to either delete a feature (3a-5b) or increase the number of
real registers (6a-6b) given an initial instruction set design.  (Yes,
register renaming can help, but...)

Now: items 3a, 3b, and 3c are an indication of the decode complexity.
Items 3d-5b hint at the ease or difficulty of pipelining, especially in
the presence of virtual-memory requirements, and the need to go fast
while still taking exceptions sanely.  Items 6a and 6b are more related
to the ability to take good advantage of current compilers.

There are some other attributes that can be useful, but I couldn't
imagine how to create metrics for them without being very subjective;
for example "degree of sequential decode", "number of writebacks that
you might want to do in the middle of an instruction, but can't,
because you have to wait to make sure you see all of the instruction
before committing any state, because the last part might cause a page
fault", or "irregularity/asymmetry of register use", or
"irregularity/complexity of instruction formats".  I'd love to use
those, but just don't know how to measure them.  Also, I'd be happy to
hear corrections for some of these.

So, here's a table of 12 implementations of various architectures, one
per architecture, with the attributes above.  Just for fun, I'm going to
leave the architectures coded at first, although I'll identify them
later.
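As a cross-check on the classification, the RULE row lends itself to a mechanical test. The sketch below is my illustration, not part of the original post: it encodes a few rows transcribed from the table that follows and counts the attributes that fall on the wrong side of each rule. For the CISC rows, the table's "# ODD" column counts rules *obeyed*, i.e., 10 minus the violations; Age carries a rule but is never marked odd in the table, so it is not counted here either.

```c
#include <assert.h>
#include <stdio.h>

/* One row of the table: Age, then attributes 3a..6b. */
struct arch {
    const char *name;
    int age, a3a, a3b, a3c, a3d, a4a, a4b, a5a, a5b, a6a, a6b;
};

/* RULE row:  <6  =1  =4  <5  =0  =0  =1  <2  =1  >4  >3
   Count how many of 3a..6b fall on the wrong side of the rule. */
int violations(const struct arch *p)
{
    return !(p->a3a == 1) + !(p->a3b == 4) + !(p->a3c < 5) +
           !(p->a3d == 0) + !(p->a4a == 0) + !(p->a4b == 1) +
           !(p->a5a < 2)  + !(p->a5b == 1) +
           !(p->a6a > 4)  + !(p->a6b > 3);
}

/* A few rows transcribed from the table below. */
const struct arch a1 = { "A1",  4,  1,  4,  1, 0, 0, 1, 0, 1, 8, 3 };
const struct arch b1 = { "B1",  5,  1,  4,  1, 0, 0, 1, 0, 1, 5, 4 };
const struct arch l4 = { "L4", 26,  4,  8,  2, 0, 1, 2, 2, 4, 4, 2 };
const struct arch m2 = { "M2", 12, 12, 12, 15, 0, 1, 2, 2, 4, 3, 3 };
```

Running the check reproduces the ODD column: A1 has 1 violation (the 3-bit FP specifier), B1 none; the CISC rows L4 and M2 obey 2 and 1 rules respectively.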
I'm going to draw a line between H1 and L4 (obviously, the RISC-CISC
line), and also, at the head of each column, I'm going to put a rule
which, in that column, most of the RISCs obey.  Any RISC that does not
obey it is marked with a +; any CISC that DOES obey it is marked with a
*.  So...

CPU   Age  3a  3b  3c  3d  4a  4b  5a  5b  6a  6b  # ODD
RULE   <6  =1  =4  <5  =0  =0  =1  <2  =1  >4  >3
-------------------------------------------------------------------------
A1      4   1   4   1   0   0   1   0   1   8   3+   1
B1      5   1   4   1   0   0   1   0   1   5   4    -
C1      2   1   4   2   0   0   1   0   1   5   4    -
D1      2   1   4   3   0   0   1   0   1   5   0+   1
E1      5   1   4  10+  0   0   1   0   1   5   4    1
F1      5   2+  4   1   0   0   1   0   1   4+  3+   3
G1      1   1   4   4   0   0   1   1   1   5   5    -
H1      2   1   4   4   0   0   1   0   1   5   4    -     RISC
---------------------------------------------------------------
L4     26   4   8   2*  0*  1   2   2   4   4   2    2     CISC
M2     12  12  12  15   0*  1   2   2   4   3   3    1
N1     10  21  21  23   1   1   2   2   4   3   3    -
O3     11  11  22  44   1   1   2   2   8   4   3    -
P3     13  56  56  22   1   1   6   2  24   4   0    -

An interesting exercise is to analyze the ODD cases.  First, observe
that of 12 architectures, in only 2 cases does an architecture have an
attribute that puts it on the wrong side of the line.

Of the RISCs:
-A1 is slightly unusual in having more integer registers, and less FP,
 than usual.
-D1 is unusual in sharing integer and FP registers (that's what D1's
 6b == 0 means).
-E1 seems odd in having a large number of address modes.  I think most
 of this is an artifact of the way that I counted, as this architecture
 really only has a fundamentally small number of ways to create
 addresses, but has several different-sized offsets and combinations,
 all within one 4-byte instruction; I believe that its addressing
 mechanisms are fundamentally MUCH simpler than, for example, M2's, or
 especially N1's, O3's, or P3's, but the specific number doesn't
 capture it very well.
-F1... is not sold any more.
-H1: one might argue that this processor has 2 sizes of instructions,
 but I'd observe that at any point in the instruction stream, the
 instructions are either 4 bytes long or 8 bytes long, with the setting
 done by a mode bit, i.e., not dynamically encoded in every instruction.

Of the processors called CISCs:
-L4 happens to be one in which you can tell the length of the
 instruction from the first few bits, has a fairly regular instruction
 decode, has relatively few addressing modes, and no indirect
 addressing.  In fact, a big subset of its instructions are actually
 fairly RISC-like, although another subset is very CISCy.
-M2 has a myriad of instruction formats, but fortunately avoided
 indirect addressing, and actually, MOST of its instructions have only 1
 address, except for a small set of string operations with 2.  I.e., in
 this case, the decode complexity may be high, but most instructions
 cannot turn into multiple-memory-address-with-side-effects things.
-N1, O3, and P3 are actually fairly clean, orthogonal architectures, in
 which most operations can consistently have operands in either memory
 or registers, and there are relatively few weirdnesses of special-cased
 uses of registers.
Unfortunately, they also have indirect addressing, and instruction
formats whose very orthogonality almost guarantees sequential decoding,
where it's hard to even know how long an instruction is until you parse
each piece, and that may have side-effects where you'd like to do a
register write-back early, but either:
   must wait until you see all of the instruction before you commit
   state, or
   must have "undo" shadow-registers, or
   must use instruction-continuation with fairly tricky exception
   handling to restore the state of the machine.

It is also interesting to note that the original member of the family to
which O3 belongs was rather simpler in some of the critical areas, with
only 5 instruction sizes, of maximum size 10 bytes, no indirect
addressing, and required alignment (i.e., it was a much more RISC-like
design, and it would be a fascinating speculation to know whether that
extra complexity was useful in practice).

Now, here's the table again, with the labels:

CPU   Age  3a  3b  3c  3d  4a  4b  5a  5b  6a  6b  # ODD
RULE   <6  =1  =4  <5  =0  =0  =1  <2  =1  >4  >3
-------------------------------------------------------------------------
A1      4   1   4   1   0   0   1   0   1   8   3+   1     AMD 29K
B1      5   1   4   1   0   0   1   0   1   5   4    -     R2000
C1      2   1   4   2   0   0   1   0   1   5   4    -     SPARC
D1      2   1   4   3   0   0   1   0   1   5   0+   1     MC88000
E1      5   1   4  10+  0   0   1   0   1   5   4    1     HP PA
F1      5   2+  4   1   0   0   1   0   1   4+  3+   3     IBM RT/PC
G1      1   1   4   4   0   0   1   1   1   5   5    -     IBM RS/6000
H1      2   1   4   4   0   0   1   0   1   5   4    -     Intel i860
---------------------------------------------------------------
L4     26   4   8   2*  0*  1   2   2   4   4   2    2     IBM 3090
M2     12  12  12  15   0*  1   2   2   4   3   3    1     Intel i486
N1     10  21  21  23   1   1   2   2   4   3   3    -     NSC 32016
O3     11  11  22  44   1   1   2   2   8   4   3    -     MC 68040
P3     13  56  56  22   1   1   6   2  24   4   0    -     VAX

General comment: this may sound weird, but in the long term, it might be
easier to deal with a really complicated bunch of instruction formats
than with a complex set of addressing modes, because at least the former
is more amenable to pre-decoding into a cache of decoded instructions
that can be pipelined
reasonably, whereas the pipeline on the latter can get very tricky
(examples to follow).  This can lead to the funny effect that a
relatively "clean", orthogonal architecture may actually be harder to
make run fast than one that is less clean.  Obviously, every weirdness
has its penalties...

But consider the fundamental difficulty of pipelining something like (on
a VAX):

	ADDL	@(R1)+,@(R1)+,@(R2)+

i.e., something that might theoretically arise from:

	int **r1, **r2;
	**r2++ = **r1++ + **r1++;

and which a RISC machine would do (most straightforwardly) as:

	lw	r3,0(r1)	*r1
	add	r1,4		r1++
	lw	r4,0(r1)	*r1 again
	add	r1,4		r1++
	lw	r5,0(r2)	r5 = *r2
	add	r6,r3,r4	sum in r6
	sw	r6,0(r5)	**r2 = sum
	add	r2,4		r2++

(Now, some RISCs might use auto-increment to get rid of, for example,
the last add; in any case, smart compilers are quite likely to generate
something more like:

	lw	r3,0(r1)	*r1
	lw	r4,4(r1)	*r1 again
	add	r1,8		r1 += 2
	lw	r5,0(r2)	r5 = *r2
	add	r6,r3,r4	sum in r6
	sw	r6,0(r5)	**r2 = sum
	add	r2,4		r2++

which has no stalls anywhere on most RISCs.)

Now, consider what the VAX has to do:

1) Decode the opcode (ADD).
2) Fetch the first operand specifier from the I-stream and work on it.
   a) Compute the memory address from (r1).
      If aligned
         run through the MMU
            if MMU miss, fixup
         access the cache
            if cache miss, do write-back/refill
      Else if unaligned
         run through the MMU for the first part of the data
            if MMU miss, fixup
         access the cache for that part of the data
            if cache miss, do write-back/refill
         run through the MMU for the second part of the data
            if MMU miss, fixup
         access the cache for the second part of the data
            if cache miss, do write-back/refill
      Now, in either case, we have a longword that holds the address of
      the actual data.
   b) Increment r1.  [Well, this is where you'd LIKE to do it, or in
      parallel with step 2a).]  However, see later why not...
   c) Now, fetch the actual data from memory, using the address just
      obtained, doing everything in step 2a) again, yielding the actual
      data, which we need to stick in a temporary buffer, since it
      doesn't actually go in a register.
3) Now, decode the second operand specifier, which goes through
   everything that we did in step 2, only again, and leaves the results
   in a second temporary buffer.  Note that we'd like to be starting
   this before we get done with all of 2 (and I THINK the VAX 9000
   probably does that??), but you have to be careful to bypass/interlock
   on potential side-effects to registers...  Actually, you may well
   have to keep shadow copies of every register that might get written
   in the instruction, since every operand can use
   auto-increment/decrement.  You'd probably want badly to try to
   compute the address of the second argument and do the MMU access
   interleaved with the memory access of the first, although the ability
   of any operand to need 2-4 MMU accesses probably makes this tricky.
   [Recall that any MMU access may well cause a page fault...]
4) Now, do the add.  [Could cause an exception.]
5) Now, do the third specifier... only, it might be a little different,
   depending on the nature of the cache; that is, you cannot modify
   cache or memory unless you know the instruction will complete.
   (Why?  Well, suppose that the location you are storing into overlaps
   with one of the indirect-addressing words pointed to by r1 or 4(r1),
   and suppose that the store was unaligned, and suppose that the last
   byte of the store crossed a page boundary and caused a page fault,
   and that you'd already written the first 3 bytes.  If you did this
   straightforwardly, and then tried to restart the instruction, it
   wouldn't do the same thing the second time.)
6) When you're sure all is well, and the store is on its way, then you
   can safely update the two registers; but you'd better wait until the
   end, or else keep copies of any modified registers until you're sure
   it's safe.  (I think both have been done??)
7) You may say that this code is unlikely, but it is legal, so the CPU
   must do it.  This style has the following effects:
   a) You have to worry about unlikely cases.
   b) You'd like to do the work with predictable uses of functional
      units, but instead, they can make unpredictable demands.
   c) You'd like to minimize the amount of buffering and state, but it
      costs you in both to go fast.
   d) Simple pipelining is very, very tough: for example, it is pretty
      hard to do much about the next instruction following the ADDL
      (except some early decode, perhaps) without a lot of gates for
      special-casing.  (I've always been amazed that CVAX chips are as
      fast as they are, and VAX 9000s are REALLY impressive...)
   e) EVERY memory operand can potentially cause 4 MMU uses, and hence
      4 MMU faults that might actually be page faults...
8) Consider how "lazy" RISC designers can be, with the RISC sequence
   shown:
   a) Every load/store uses exactly 1 MMU access.
   b) The compilers are often free to re-arrange the order, even across
      what would have been the next instruction on a CISC.  This gets
      rid of some stalls that the CISC may be stuck with (especially
      memory accesses).
   c) The alignment requirement avoids especially the problem of
      sending the first part of a store on its way before you're SURE
      that the second part of it is safe to do.

Finally, to be fair, let me add the two cases that I knew of that were
more on the borderline: the i960 and the Clipper:

CPU   Age  3a  3b  3c  3d  4a  4b  5a  5b  6a  6b  # ODD
RULE   <6  =1  =4  <5  =0  =0  =1  <2  =1  >4  >3
-------------------------------------------------------------------------
J1      5   4+  8+  9+  0   0   1   0   2   4+  3+   5     Clipper
K1      3   2+  8+  9+  0   0   1   2+  -   5   3+   5     Intel 960KB

SUMMARY:
1) RISCs share certain architectural characteristics, although there
   are differences, and some of those differences matter a lot.
2) However, the RISCs, as a group, are much more alike than the CISCs
   as a group.
3) At least some of these architectural characteristics have fairly
   serious consequences for the pipelinability of the ISA, especially
   in a virtual-memory, cached environment.
4) Counting instructions turns out to be fairly irrelevant:
   a) It's HARD to actually count instructions in a meaningful way...
      (If you disagree, I'll claim that the VAX is RISCier than any
      RISC, at least for part of its instruction set :-)
   b) More instructions aren't what REALLY hurts you, anywhere near as
      much as features that are hard to pipeline.
   c) RISCs can perfectly well have string support, or decimal
      arithmetic support, or graphics transforms... or lots of strange
      register-register transforms, and it won't cause problems...
      but compare that with the consequence of adding a single
      instruction that has 2-3 memory operands, each of which can go
      indirect, with auto-increments, and unaligned data...
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
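[A side note on the C one-liner in the ADDL example above: as written, `**r2++ = **r1++ + **r1++` modifies r1 twice with no intervening sequence point, which is undefined behavior in C, so a compiler may pick either order. The sketch below is my illustration, not from the post: it spells out the ordering the RISC expansion assumes, one dereference and one increment per step, using hypothetical `src`/`dst` arrays to stand in for memory.]

```c
#include <assert.h>

static int a = 1, b = 2, sum;     /* two source words and the target word */
static int *src[2] = { &a, &b };  /* what r1 points at */
static int *dst[1] = { &sum };    /* what r2 points at */

/* The RISC expansion of  ADDL @(R1)+,@(R1)+,@(R2)+  with explicit order. */
void addl_expanded(void)
{
    int **r1 = src, **r2 = dst;
    int t1 = **r1; r1++;          /* lw r3,0(r1); add r1,4 */
    int t2 = **r1; r1++;          /* lw r4,0(r1); add r1,4 */
    int *p = *r2;                 /* lw r5,0(r2)           */
    *p = t1 + t2;                 /* add r6,r3,r4; sw r6,0(r5) */
    r2++;                         /* add r2,4 (dead here; kept for symmetry) */
}
```

Calling addl_expanded() leaves sum holding 3, the same result the eight-instruction RISC sequence computes.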
chased@rbbb.Eng.Sun.COM (David Chase) (04/19/91)
Hey, you left out the Acorn Risc Machine. Not a big name in the workstation market, but I understand that they sell a good number of them, and the instruction set is "interesting" (a bit-twiddler/ superoptimizer's wet dream, if you ask me, but probably "risc" by many definitions of the term). VTI sells this chip in the US, they should be able to give you a spec if you want it. David Chase Sun
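[For a concrete taste of why bit-twiddlers like the ARM: every instruction carries a 4-bit condition field, so the classic subtractive-Euclid loop needs no forward branches at all: CMP r0,r1 / SUBGT r0,r0,r1 / SUBLT r1,r1,r0 / BNE loop. Below is my C rendering of the same control flow, assuming both arguments are positive; each line of the loop body corresponds to one conditionally executed instruction.]

```c
#include <assert.h>

/* gcd by repeated subtraction, shaped like the ARM idiom: the two
   subtractions are conditionally executed (SUBGT / SUBLT), so the only
   real branch is the loop-closing BNE.  Assumes a > 0 and b > 0. */
unsigned gcd(unsigned a, unsigned b)
{
    while (a != b) {            /* CMP r0,r1 ... BNE loop */
        if (a > b) a -= b;      /* SUBGT r0,r0,r1 */
        else       b -= a;      /* SUBLT r1,r1,r0 */
    }
    return a;
}
```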
henry@zoo.toronto.edu (Henry Spencer) (04/19/91)
In article <2419@spim.mips.COM> mash@mips.com (John Mashey) writes:
>-A1 [29k] is slightly unusual in having more integer registers, and less FP
>than usual. ...
>-D1 [88k] is unusual in sharing integer and FP registers

This is slightly out of date.  AMD appears to have (wisely) decided to
ditch the peculiar 29027 FPU architecture.  The 29050's on-chip floating
point uses the integer register bank (disregarding one or two odd
instructions), subject to a constraint that double-precision arithmetic
use even-odd pairs.
-- 
And the bean-counter replied,           | Henry Spencer @ U of Toronto Zoology
"beans are more important".             | henry@zoo.toronto.edu   utzoo!henry
mash@mips.com (John Mashey) (04/19/91)
In article <11810@exodus.Eng.Sun.COM> chased@rbbb.Eng.Sun.COM (David Chase) writes:
>Hey, you left out the Acorn Risc Machine.  Not a big name in the
>workstation market, but I understand that they sell a good number of
>them, and the instruction set is "interesting" (a bit-twiddler/
>superoptimizer's wet dream, if you ask me, but probably "risc" by many
>definitions of the term).  VTI sells this chip in the US, they should
>be able to give you a spec if you want it.

Sorry, it could have been included, but I just ran out of time and
space, and I thought I had enough data to make the point, which was that
there were noticeable differences between RISC and CISC architectures,
regardless of the implementation.  The ARM would certainly get
classified as a RISC, with:
	32-bit instructions, 1 size
	a handful of memory address modes
	no indirect addressing
	only loads/stores access memory
	no more than 1 memory address/instruction
	alignment (I think)
	1 use of memory control/TLB per instruction for data
	4 bits available for integer register specifiers
The manual I have shows it with no FP at all, but then it's 2 years old.

Although I didn't post them, the more complete tables that I was working
from contained multiple implementations of some of the architecture
families, from which one may find several trends:
a) Only occasionally does anyone (RISC or CISC) subtract instructions.
b) Both RISC and CISC often add instructions as time goes on.
c) Sometimes CISCs got CISCier in their ISAs, i.e., by adding
   addressing modes the way the 68020 did, or by deleting alignment
   restrictions the way 360->370 did.
d) Sometimes CISC implementations were done in more RISC-like fashion
   (i.e., trap certain opcodes and emulate).
e) I could find no architecture that clearly started as a RISC and then
   seriously evolved into a CISC, or took on any of the attributes that
   I used to distinguish CISCs and RISCs.
(So much for the idea that CRISP, a Complex Reduced Instruction Set
Processor (not AT&T's CRISP), is a merger of RISC and CISC, and that
current RISC and CISC architectures are evolving towards each other.
Nonsense.)
-- 
-john mashey	DISCLAIMER: <generic disclaimer, I speak for me only, etc>
UUCP: 	 mash@mips.com OR {ames,decwrl,prls,pyramid}!mips!mash
DDD:  	408-524-7015, 524-8253 or (main number) 408-720-1700
USPS: 	MIPS Computer Systems MS 1/05, 930 E. Arques, Sunnyvale, CA 94086
torbenm@diku.dk (Torben Ægidius Mogensen) (04/19/91)
mash@mips.com (John Mashey) writes:

>Age: number of years since first implementation sold in this family
>(or first thing of which this is binary compatible with)
>3a: # instruction sizes
>3b: maximum instruction size in bytes
>3c: number of distinct addressing modes for accessing data (not jumps)
>3d: indirect addressing: 0: no, 1: yes
>4a: load/store combined with arithmetic: 0: no, 1: yes
>4b: maximum number of memory operands
>5a: unaligned addressing of memory references allowed in load/store,
>    without specific instructions
>    0: no never (MIPS, SPARC, etc)
>    1: sometimes (as in RS/6000)
>    2: just about any time
>5b: maximum number of MMU uses for data operands in an instruction
>6a: number of bits for integer register specifier
>6b: number of bits for 64-bit or more FP register specifier,
>    distinct from integer registers

You might add ARM to the list:

CPU   Age  3a  3b  3c  3d  4a  4b  5a  5b  6a  6b  # ODD
RULE   <6  =1  =4  <5  =0  =0  =1  <2  =1  >4  >3
-------------------------------------------------------------------------
    6+ (7?) 1   4  13+  0   0   1?  0   1?  4+  0+   4     ARM

Notes: There are actually 4 bits that specify the addressing mode, but
four of the 16 modes have the same effect.  This is due to a very
orthogonal specification.  The (?) in 4b and 5b are due to the
load/store-multiple-register instructions, which use one memory access
per register.  There is no FP unit on the chip, thus no specific FP
registers.  There has been an announcement of an FPU to appear this
year, but I don't know anything about it.  I'm not certain about the
age, but it was, I think, the first commercially available RISC
processor.

	Torben Mogensen (torbenm@diku.dk)
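[The load/store-multiple ambiguity Torben mentions is easy to see if you model the instruction as a loop. This is my sketch, not from the thread, with hypothetical types standing in for the register file and memory: ARM's LDMIA performs one word access, and hence one potential MMU/fault check, per register in the list, which is why "maximum memory operands" (4b) and "MMU uses per instruction" (5b) have no single answer for it.]

```c
#include <assert.h>

/* Model of ARM LDMIA base!, {r0..r(n-1)}: load n words into the given
   registers, post-incrementing the base pointer.  Each iteration is one
   memory access, so the operand/MMU count scales with the register list. */
void ldmia(unsigned regs[], int n, unsigned **base)
{
    for (int i = 0; i < n; i++)
        regs[i] = *(*base)++;   /* one load (one MMU use) per register */
}
```

Loading four registers from a four-word block, for example, leaves the base pointer four words further on, just as the writeback form of the real instruction would.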
torbenm@diku.dk (Torben Ægidius Mogensen) (04/19/91)
I made an error in the number of addressing modes. The stated 13 modes (16 with 4 identical) should be 10, since in 6 (8) of the modes, one of the specifying bits has no effect in user mode. You could also say that there are 20 modes, as each mode can transfer either 8 or 32 bits. Torben Mogensen (torbenm@diku.dk)
richard@aiai.ed.ac.uk (Richard Tobin) (04/22/91)
In article <1991Apr19.133634.15241@odin.diku.dk> torbenm@diku.dk (Torben Ægidius Mogensen) writes:
>You might add ARM to the list

>I'm not certain about the age, but it was, I think, the first
>commercially available RISC processor.

I'm not certain either (surely there's someone from Acorn out there?)
but it was certainly being discussed at Acorn in the summer of 1983.

-- Richard
-- 
Richard Tobin,                       JANET: R.Tobin@uk.ac.ed
AI Applications Institute,           ARPA:  R.Tobin%uk.ac.ed@nsfnet-relay.ac.uk
Edinburgh University.                UUCP:  ...!ukc!ed.ac.uk!R.Tobin
johng@OCE.ORST.EDU (John A. Gregor) (06/11/91)
In article <1991Jun9.214548.23661@syacus.acus.oz.au> william@syacus.acus.oz.au (William Mason) writes:
>
> If RISC is *it* ... How come the guy with short legs
> can't win the olympic foot races ???

I don't know, but he'll still beat the guy trying to run on all fours,
wearing an army boot on one foot, a running shoe on the other, a boxing
glove on his left hand, and a football helmet. :-)

Sorry, world, to contribute noise to the RISC vs CISC noisefest, but I
couldn't resist.

JohnG
-- 
John A. Gregor                       College of Oceanography
E-mail: johng@oce.orst.edu           Oregon State University
Voice #: +1 503 737-3022             Oceanography Admin Bldg. #104
Fax #: +1 503 737-2064               Corvallis, OR 97331-5503
wkk@cbnewsl.att.com (wesley.k.kaplow) (06/18/91)
Well, this was not just another code size measurement.  I thought the
original poster requested static code information.  Well, I wanted to
stay away from any performance data, but here goes anyway.  Note: see my
previous posting for what the benchmarks are.

DYNAMIC Instruction Count:

Benchmark     MIPS Instructions / CRISP Instructions
----------------------------------------------------
BSC                          1.56
Dhrystone                    1.24
RP                           1.50

Static MIPS Instruction Distribution ('cat' + library):

Instr/Operation           % of Distribution
-------------------------------------------------
load/store                      26%
branch+Funcall                  19%
nop                             13%
arith                           12%
move reg/load immed             18%
misc                            12%

I hope that gives you some more information.  It was clear to us, and to
MIPS, that you can sacrifice some characteristics and gain in
cycles/instruction efficiency.

with disclaimer; use disclaimer;

Wesley Kaplow
AT&T Bell Labs
Whippany, NJ
201-386-4634