doug@terak.UUCP (Doug Pardee) (05/28/85)
> If we created the chip using only the instructions the compiler needed,
> we could use less logic. We could decode the instructions faster because
> the microcode is simpler (more oomph per MHz), production is simpler,
> yield is higher, speeds are faster, and everyone is happy except the
> assembler programmer.
> ...
> This is the concept behind the RISC architecture ...

RISC is an interesting concept, but I have a major reservation.  Perhaps
someone can explain to me how on earth you're going to feed instructions
to a RISC machine fast enough?

All of the popular 8 and 16 bit microprocessors are speed limited by
instruction fetch, not by instruction complexity.  I will entertain the
objection that the 6502, with its critical shortage of on-chip
registers, is also limited by operand accesses.  The usual RISC machine
has lots of registers, so operand accesses shouldn't be a problem.  And
nobody says that a non-RISC cpu can't have lots of registers.

The knee-jerk answer is "cache".  But that's only an answer if one
refuses to allow non-RISC cpus to use cache; they can fit more code
into any given cache than can RISC cpus, thereby having a better "hit
ratio" than RISC.

Of course, one could design a RISC machine with a super-high-speed ROM
or cache in which one could store the commonly used functions like
multiplication and division, and then one would only have to fetch a
subroutine call from the (slow) instruction stream.

But doesn't that sound like your everyday, garden variety microcoded
non-RISC cpu?
--
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                       ^^^^^--- soon to be CalComp
brooks@lll-crg.ARPA (Eugene D. Brooks III) (05/30/85)
The knee-jerk answer to the instruction fetch bandwidth problem, cache,
is a valid answer.  The argument that one can give as much cache to a
complicated instruction set engine and thereby get as much performance
is not valid.  The performance reduction for the complicated instruction
set comes from the time spent running microcode to decode and execute
instructions.  A good example of this is the VAX instruction set vs the
Ridge instruction set.  The Ridge achieves the same performance as the
VAX at a much lower hardware cost.  The performance increase arises from
the simplicity of the instruction set.  This difference also shows up
when you emulate these architectures in software.  The instruction
decode for such emulators on the VAX is a nightmare, and the designers
of the VAX faced the same nightmare when they designed the hardware and
wrote the microcode for it.
thoth@tellab3.UUCP (Marcus Hall) (05/31/85)
> RISC is an interesting concept, but I have a major reservation.  Perhaps
> someone can explain to me how on earth you're going to feed instructions
> to a RISC machine fast enough?

One way to help with this problem is to use fairly wide memory accesses
(at least for instruction fetches).  Thus, in one memory cycle, many
instructions may be fetched simultaneously.  Of course this is done for
non-RISC machines as well, but a non-RISC machine will become
execution-bound sooner.

Also, since the RISC instruction set is simpler, the op-codes require
fewer bits, so a memory fetch will get more RISC instructions in one
cycle than it would non-RISC instructions.

marcus hall
..!ihnp4!tellab1!tellab2!thoth
mat@amdahl.UUCP (Mike Taylor) (05/31/85)
> All of the popular 8 and 16 bit microprocessors are speed limited by
> instruction fetch, not by instruction complexity.  I will entertain the
> objection that the 6502, with its critical shortage of on-chip
> registers, is also limited by operand accesses.  The usual RISC machine
> has lots of registers, so operand accesses shouldn't be a problem.
> And nobody says that a non-RISC cpu can't have lots of registers.
>
> The knee-jerk answer is "cache".  But that's only an answer if one
> refuses to allow non-RISC cpus to use cache; they can fit more code
> into any given cache than can RISC cpus, thereby having a better "hit
> ratio" than RISC.
>
> Of course, one could design a RISC machine with a super-high-speed ROM
> or cache in which one could store the commonly used functions like
> multiplication and division, and then one would only have to fetch a
> subroutine call from the (slow) instruction stream.
>
> But doesn't that sound like your everyday, garden variety microcoded
> non-RISC cpu?
> --
> Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
>                                        ^^^^^--- soon to be CalComp

Fetching one instruction per cycle is pretty much necessary to execute
one instruction per cycle.  Whether this is a bug or a feature is a good
question.  The difference between fetching a microword every cycle and
fetching an instruction is that the microcoded machine has a fixed
microsequence in ROM while the instructions come from RAM.  Since they
come from RAM, they can be optimized by the compiler for the particular
job being done.  Clearly you pay for this by diverting fetches from
microstore to RAM.  Some RISC implementations try to get this cost back
by using the microstore's chip space for registers and/or instruction
queues, reducing total fetch traffic.

How can you tell a machine is not a RISC?  Proposal...
If it is possible to replace some instructions with sequences of other
instructions with only small performance penalties (or gains, a la VAX),
then those instructions probably have no business in the instruction
set.  Their presence is probably slowing down the whole system.  Take
them out.  Redesign to use the space you got back by removing them.
Repeat the above cycle until no more instructions can be removed or the
cycle doesn't improve performance.  You now have a RISC.  Arguments?
--
Mike Taylor  ...!{ihnp4,hplabs,amd,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
joel@peora.UUCP (Joel Upchurch) (05/31/85)
I think your argument ignores the fact that with a given fabrication
technology there is only so much function you can put on a given chip.
If you choose to have a large control store ROM for a complex
instruction set, then you must make sacrifices in other parts of the
chip.  This may mean fewer registers, or less cache, or a slower ALU, or
other trade-offs.  RISC argues that this is not a good trade, that it is
difficult to write compilers to use complicated instruction sets, and
that these high-level operations are not necessarily any faster than a
series of low-level operations.  Also, with simple instruction sets you
have the option of hardwiring the decode for additional performance.
doug@terak.UUCP (Doug Pardee) (06/03/85)
Responses to early comments on my question about how a RISC cpu can
fetch instructions fast enough to keep up with non-RISC cpus:

> The knee-jerk answer to the instruction fetch bandwidth problem, cache,
> is a valid answer.  The argument that one can give as much cache to a
> complicated instruction set engine and thereby get as much performance
> is not valid.  The performance reduction for the complicated
> instruction set comes from the time spent running microcode decode and
> execute instructions.

I'd believe this except that it ain't so.  The performance reduction on
current non-RISC single-chip cpus comes from instruction fetch.  For
example, the NS32016 can do your basic RISC-type operations in 3 or 4
clock cycles, but it takes 5 clock cycles to fetch the next instruction.
RISC-type instructions on the 32016 therefore take 5 clock cycles each.
And the 32016 is "burdened" by non-RISC "microcode decode and execute
instructions."

> One way to help with this problem is to use fairly wide memory accesses
> (at least for instruction fetches).  Thus, in one memory cycle, many
> instructions may be fetched simultaneously.  Of course this is done for
> non-RISC machines as well, but a non-RISC machine will become
> execution-bound sooner.

The NS32032 and MC68020 both fetch 32 bits of instruction data at one
time.  Both have "zillions" of pins on the package in order to pull that
off; I can't imagine building a RISC chip with 128-bit wide instruction
fetch, requiring over 250 pins on the package.  And if we're not talking
single-chip architectures, we're going to have a devil of a time making
rational comparisons.  After all, IBM has used a non-RISC architecture
for some time and it goes pretty fast :-)

I'll believe that a non-RISC machine will become execution-bound sooner,
but this is just another way of saying that the RISC machine is much
more limited by instruction fetch than the non-RISC machine is.
> Also, since the RISC instruction set is simpler, the op-codes require
> fewer bits, so a memory fetch will get more RISC instructions in one
> cycle than it would non-RISC instructions.

I thought this too, until I looked into some RISC machines.  They use
32-bit instruction words, twice as wide as the equivalent instructions
in, say, the 680xx and 320xx cpus.

Still looking for the answer...
--
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                       ^^^^^--- soon to be CalComp
steveg@hammer.UUCP (Steve Glaser) (06/05/85)
In article <1590@amdahl.UUCP> mat@amdahl.UUCP (Mike Taylor) writes:
>> All of the popular 8 and 16 bit microprocessors are speed limited by
>> instruction fetch, not by instruction complexity.  I will entertain the
>> objection that the 6502, with its critical shortage of on-chip
>> registers, is also limited by operand accesses.  The usual RISC machine
>> has lots of registers, so operand accesses shouldn't be a problem.
>> And nobody says that a non-RISC cpu can't have lots of registers.
>>
>> The knee-jerk answer is "cache".  But that's only an answer if one
>> refuses to allow non-RISC cpus to use cache; they can fit more code
>> into any given cache than can RISC cpus, thereby having a better "hit
>> ratio" than RISC.
>>
>> Of course, one could design a RISC machine with a super-high-speed ROM
>> or cache in which one could store the commonly used functions like
>> multiplication and division, and then one would only have to fetch a
>> subroutine call from the (slow) instruction stream.
>>

The basic tenets of RISC are that you throw out the useless baggage
that's in there "'cause somebody wanted it and microcode was cheap" and
get back to a reasonable minimal set (chuck out an instruction; measure
performance; if better or the same, repeat).  With the real estate
gained, you then start putting reasonable-sized caches and/or prefetch
queues on chip instead of useless instructions.

Some folks have decided that RISC means that you eliminate the microcode
and directly decode the operations.  This is too narrow a view.  It
doesn't matter how you implement what's left, as long as you only
implement what is really needed.

As for caches, one issue that hasn't been mentioned is the use of block
mode transfers for fast cache filling.  There are RAM chips out there
that can access "nearby" RAM cells significantly faster than random ones
(so-called page mode, nibble mode, or static column RAMs).
These were mainly developed for the raster display market (so you don't
waste all your memory bandwidth refreshing the screen and still get a
chance to make updates to it), but they are useful in other areas as
well.  Since cache blocks are typically a power of 2 in size and start
on nice boundaries, it's pretty easy to map these onto the notion of
"nearby" supported by a RAM chip.  The only problems you have with this
kind of architecture are (1) you need a new bus protocol, and (2) ECC
gets harder because you need to do it faster (anybody for a pipelined
ECC chip that can keep up with 50 nsec RAM cycles?).

Steve Glaser
henry@utzoo.UUCP (Henry Spencer) (06/06/85)
> I thought this too, until I looked into some RISC machines.  They use
> 32-bit instruction words, twice as wide as the equivalent instructions
> in, say, the 680xx and 320xx cpus.

Yes and no.  The Berkeley RISC project adopted 32-bit instructions for
simplicity in initial work, not because they thought it was right for
the final design.  If you look, you'll find at least one paper from them
discussing a support chip which is (a) an instruction cache, and (b) an
instruction-encoding expander.  The latter function makes a large
difference to instruction density without introducing any extra delays.

Also, don't be too sure that the 68* and 32* chips use 16-bit
instructions a lot.  Remember that things like offsets take extra bytes,
and those get used *a lot* on those machines -- virtually every memory
reference needs one.  On the pdp11, a 16-bit machine if there ever was
one, the average instruction length was in fact about 32 bits.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
doug@terak.UUCP (Doug Pardee) (06/07/85)
> Some folks have decided that RISC means that you eliminate the
> microcode and directly decode the operations.  This is too narrow a
> view.

An elephant is much like a tree, or is it a rope, or is it a snake...
If we're going to talk about RISC, we'd better decide what we're going
to call RISC.

Is RISC:
  1) a way to save silicon real estate on single-chip cpus? or
  2) a concept that's equally applicable to board-level cpus?

Is RISC:
  1) a strictly limited instruction set, essentially microcode level? or
  2) any cpu without "useless" instructions? (define "useless") or
  3) any cpu with oodles of registers?

For the moment, I'm going to talk about non-microcoded single-chip cpus
with strictly limited instruction sets.  Something you'd put in a TTL
system with dynamic RAMs.

I've come to the conclusion that for this kind of RISC, its time has
passed.  It made a lot of sense in the late '70s when cpu cycles were
250-400 ns and memory cycles were 300-400 ns.  But since then the cpu
manufacturers have concentrated on speed while the memory manufacturers
concentrated on capacity.

With cpu cycles now at 60-100 ns and memory cycles at 225-325 ns, the
way to improve speed in a system is to decrease memory cycles, not cpu
cycles.  That means making instructions even *more* complex, so that
each instruction fetched does as much work as possible.  At 100 ns per
cpu cycle, you can afford to let the cpu do a lot of work that isn't
always needed, because you can throw away *half* of it and still have
the cpu be faster than if you fetched the microinstructions from main
memory (using 120 ns access/220 ns cycle time 256K DRAMs).

By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
only 25 cpu cycles in order to compete.  All the cache in the world
ain't gonna help...
--
Doug Pardee -- Terak Corp.
-- !{ihnp4,seismo,decvax}!noao!terak!doug ^^^^^--- soon to be CalComp
hal@cornell.UUCP (Hal Perkins) (06/09/85)
In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
> By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
> 10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
> 8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
> only 25 cpu cycles in order to compete.  All the cache in the world
> ain't gonna help...

Now just a moment...  The Berkeley chips were built by academic folks
using conservative design rules, etc.  If they had been built by
experienced professional chip designers using state-of-the-art
technology, they would have been a lot faster.  It's not fair to pick on
the relatively slow clock cycle of the experimental chips -- they were
built to prove a point (which they did), not to produce a product.

(I'm not making this up -- if you read some of the RISC papers by
Patterson et al., they make the same point.)

Hal Perkins                  UUCP: {decvax|vax135|...}!cornell!hal
Cornell Computer Science     ARPA: hal@cornell  BITNET: hal@crnlcs
jans@mako.UUCP (Jan Steinman) (06/10/85)
In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> Also, don't be too sure that the 68* and 32* chips use 16-bit
> instructions a lot.  Remember that things like offsets take extra
> bytes...

In my experience 16-bit instructions get used quite often on the NS32000
in well-written assembly code.  The NS32000 has an edge over the MC68000
in code density in that instructions need not be word-aligned.  Henry is
correct in stating that offsets are used on virtually every memory
reference, but those offsets are usually 8 bits.  Following are some
statistics for one module pulled at random from my present project.  The
code is highly optimized for speed, so code density is not even as good
as it should be, i.e. "jump" is used instead of "br", saving one clock
but costing two bytes, etc.

	Size	Count
	  8	  0
	 16	 14
	 24	 13
	 32	  1
	 40	  0
	 48	  2
	Total	 30	Ave 22.1 bits per instruction.

The average instruction size of 22.1 bits is indeed more than the 16
bits the original poster assumed, but much less than the 32 bits Henry
expected, due in large part to the large number of 24-bit instructions
available on the NS32000.  I believe the MC68000 would have a larger
number of 32-bit instructions, with a consequent increase in average
bits-per-instruction.

Of course, no analogy can be drawn for compiled code.  Many compilers
under-use register space, and I suspect the average bits-per-instruction
would be much higher for compiled code on both machines.
--
:::::: Jan Steinman           Box 1000, MS 61-161    (w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans  Wilsonville, OR 97070  (h)503/657-7703 ::::::
doug@terak.UUCP (Doug Pardee) (06/11/85)
Once again, I find that I didn't do a good job of explaining myself.
Let me try again...  In the following comment, notice the word
"equivalent":

me> I thought this too, until I looked into some RISC machines.  They use
me> 32-bit instruction words, twice as wide as the equivalent instructions
me> in, say, the 680xx and 320xx cpus.

> Also, don't be too sure that the 68* and 32* chips use 16-bit
> instructions a lot.  Remember that things like offsets take extra
> bytes, and those get used *a lot* on those machines -- virtually every
> memory reference needs one.

My point was supposed to be that you could take a 680xx/320xx and limit
yourself to the RISC instruction set (the "equivalent" instructions) and
have 16-bit instructions instead of RISC's 32-bit instructions.
Presumably, the reason that the longer instructions, with their offsets
etc., are used so much is that those instructions are more effective
than the RISC instructions for the task at hand (this sounds like
another discussion that I'm involved in :-)

> instruction-encoding expander.  The latter function makes a large
> difference to instruction density without introducing any extra delays.

In a microcoded cpu, the opcode is decoded and causes a sequence of very
simple micro-instructions to be executed.

In the proposed system, the opcode is decoded and causes a sequence of
very simple RISC instructions to be executed.

What's the difference?  Where does the much-vaunted performance gain
come from?
--
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                       ^^^^^--- soon to be CalComp
john@frog.UUCP (John Woods) (06/12/85)
> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> > Also, don't be too sure that the 68* and 32* chips use 16-bit
> > instructions a lot.  Remember that things like offsets take extra
> > bytes...
>
> In my experience 16-bit instructions get used quite often on the
> NS32000 in well-written assembly code.  The NS32000 has an edge over
> the MC68000 in code density in that instructions need not be
> word-aligned.  Henry is correct in stating that offsets are used on
> virtually every memory reference, but those offsets are usually 8
> bits.  Following are some statistics for one module pulled at random
> from my present project.  The code is highly optimized for speed, so
> code density is not even as good as it should be, i.e. "jump" is used
> instead of "br", saving one clock but costing two bytes, etc.
>
>	Size	Count	Size	Count
>	  8	  0	 16	 14
>	 24	 13	 32	  1
>	 40	  0	 48	  2
>	Total	 30	Ave 22.1 bits per instruction.
>
> The average instruction size of 22.1 bits is indeed more than the 16
> bits the original poster assumed, but much less than the 32 bits Henry
> expected, due in large part to the large number of 24-bit instructions
> available on the NS32000.  I believe the MC68000 would have a larger
> number of 32-bit instructions, with a consequent increase in average
> bits-per-instruction.

I counted words for some random compiled code here for the 68000 (the
famous Knight's Tour, which I still have lying around).

	Size (bytes)	 2	 4	 6	 8
	Sample 1:	23	 4	 3	 0	Average 21.3 bits/instr. (30 instr., main())
	Sample 2:	18	 7	 4	 0	Average 23.4 bits/instr. (try())

The Greenhills compiler tries to make good use of registers.  On the
other hand, I don't know how much these instructions actually
accomplished.  I wouldn't be surprised to find that the 32032 does more
with those 22 bits per instruction, but I wouldn't be horribly shocked
to find that the 68000 does more [despite my employer, I am a closet
32032 fan, by the way -- it is almost as nice as the PDP-11 !-) ].
I suppose the true lesson is that "bits per instruction" is yet another Red Herring in Heavy Oil: a computer with 64 bit instructions would have "poor" "code density", but if one of those instructions was "Solve the (reg0)x(reg0) Knight's Tour and output the table to channel (reg1)", the program would be awfully short !-). -- John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101 ...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA Five tons of flax!
west@sdcsla.UUCP (Larry West) (06/12/85)
In article <2335@cornell.UUCP> hal@gvax.UUCP (Hal Perkins) writes:
>In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>>By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
>>10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
>>8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
>>only 25 cpu cycles in order to compete.  All the cache in the world
>>ain't gonna help...
>
>Now just a moment...  The Berkeley chips were built by academic folks
>using conservative design rules, etc.  If they had been built by
>experienced professional chip designers using state-of-the-art
>technology they would have been a lot faster.
 ... <text omitted> ...

Hal's point is valid.  Another point to consider is this: what you
really want from a general-purpose computer is throughput.  How much of
a typical instruction stream is composed of 32-bit signed multiplies?
Obviously, in many applications multiplication would take on great
significance.  But one point of RISC is that optimizing a few
special-purpose instructions can hurt you in general-purpose
applications.
--
Larry West    Institute for Cognitive Science    (USA+619-)452-6220 [x6220]
              UC San Diego (mailcode C-015)      ARPA: <west@nprdc.ARPA>
              La Jolla, CA  92093  U.S.A.
              UUCP: {ucbvax,sdcrdcf,decvax,ihnp4}!sdcsvax!sdcsla!west
                    OR ulysses!sdcsla!west
chuck@dartvax.UUCP (Chuck Simmons) (06/13/85)
> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> > Also, don't be too sure that the 68* and 32* chips use 16-bit
> > instructions a lot.  Remember that things like offsets take extra
> > bytes...
>
> In my experience 16-bit instructions get used quite often on the
> NS32000 in well-written assembly code.  The NS32000 has an edge over
> the MC68000 in code density in that instructions need not be
> word-aligned.  Henry is correct in stating that offsets are used on
> virtually every memory reference, but those offsets are usually 8
> bits.  Following are some statistics for one module pulled at random
> from my present project.  The code is highly optimized for speed, so
> code density is not even as good as it should be, i.e. "jump" is used
> instead of "br", saving one clock but costing two bytes, etc.
>
>	Size	Count
>	  8	  0
>	 16	 14
>	 24	 13
>	 32	  1
>	 40	  0
>	 48	  2
>	Total	 30	Ave 22.1 bits per instruction.
>
> The average instruction size of 22.1 bits is indeed more than the 16
> bits the original poster assumed, but much less than the 32 bits Henry
> expected, due in large part to the large number of 24-bit instructions
> available on the NS32000.  I believe the MC68000 would have a larger
> number of 32-bit instructions, with a consequent increase in average
> bits-per-instruction.
>
> Of course, no analogy can be drawn for compiled code.  Many compilers
> under-use register space, and I suspect the average
> bits-per-instruction would be much higher for compiled code on both
> machines.
> :::::: Jan Steinman    Box 1000, MS 61-161    (w)503/685-2843 ::::::

Pulling out a "random" chunk of 68000 code (the only chunk I have
sitting nearby) gives us the following table:

	Size	Count
	 16	 22
	 32	 17
	 48	  1
	Total	 40	Ave 23.6 bits per instruction.

I'm not sure that code density is really what we want to be looking at,
however.  I can recode the above algorithm to use more instructions that
are smaller on the average and thus achieve a higher code density.
(The 17 32-bit instructions can be replaced by 33 16-bit instructions and 2 32-bit instructions, giving an average of 18.4 bits per instruction.) However, this denser code would be both slower and longer. Somewhere we need to take into account the "usefulness" of each instruction. Chuck Simmons
hull@hao.UUCP (Howard Hull) (06/15/85)
In article 195@frog, John Woods writes:
> On the other hand, I don't know how much these instructions actually
> accomplished.  I wouldn't be surprised to find that the 32032 does
> more with those 22 bits per instruction, but I wouldn't be horribly
> shocked to find that the 68000 does more [despite my employer, I am a
> closet 32032 fan, by the way -- it is almost as nice as the PDP-11 !-) ].

Just out of curiosity, what *ARE* you 320xx and 68k fans using for the
PDP-11 instruction MOV (SP)+,@(R5)+ eh?  (It's useful for initializing
tables or mapped video windows from stack images.)  I only need one or
two examples, please, so read all your micro mail before replying to
this.  If you see a correct answer and you think yours is better, please
send me mail; in a subsequent posting, I'll summarize whatever I get.
Thanks.

Howard Hull
[If yet unproven concepts are outlawed in the range of discussion...
 ...Then only the deranged will discuss yet unproven concepts]
        {ucbvax!hplabs | allegra!nbires | harpo!seismo } !hao!hull
henry@utzoo.UUCP (Henry Spencer) (06/17/85)
> By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
> 10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
> 8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
> only 25 cpu cycles in order to compete.  All the cache in the world
> ain't gonna help...

The RISC II was designed by a bunch of grad students, using simplistic
design rules and mediocre MOS processing, with limited opportunity to
try again if the original chips didn't work.  The 10 MHz NS32016 was
done by a swarm of professional silicon bashers, using professional
facilities and souped-up processes, over a long period of time with
*many* design iterations.  (In fact, too %@$@%#@% many for those of us
who were waiting for things to settle down and work...)  This comparison
means little.

Also, multiplies are actually fairly infrequent operations in most
programs.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
henry@utzoo.UUCP (Henry Spencer) (06/17/85)
> My point was supposed to be that you could take a 680xx/320xx and limit
> yourself to the RISC instruction set (the "equivalent" instructions) and
> have 16-bit instructions instead of RISC's 32-bit instructions.

Try doing memory references in 16 bits on either the 680xx or 320xx.  It
can't be done, because the opcode alone is 16 bits.  And you need memory
references more on the commercial machines, because the RISC's register
architecture isn't present to reduce memory usage.

> > instruction-encoding expander.  The latter function makes a large
> > difference to instruction density without introducing any extra delays.
>
> In a microcoded cpu, the opcode is decoded and causes a sequence of
> very simple micro-instructions to be executed.
>
> In the proposed system, the opcode is decoded and causes a sequence of
> very simple RISC instructions to be executed.

Wrong, it causes *one* very simple RISC instruction to be executed.  The
expander function is purely an encoding technique; it does not introduce
another level of emulation.  The point is that the poor code density of
current RISCs is the result of optimization for experimental work rather
than production.  More compact encodings, without change to the basic
concept, are both possible and desirable once the basic issues are
understood.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
jer@peora.UUCP (J. Eric Roskos) (06/20/85)
henry@utzoo.UUCP (Henry Spencer @ U of Toronto Zoology) writes: > Also, multiplies are actually fairly infrequent operations in most programs. That is, if "most programs" don't use multidimensional arrays... -- Shyy-Anzr: J. Eric Roskos UUCP: ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer US Mail: MS 795; Perkin-Elmer SDC; 2486 Sand Lake Road, Orlando, FL 32809-7642 Bar ol bar / Gur pbyq rgpurq cyngr / Unf cevagrq gur jnez fgnef bhg.
henry@utzoo.UUCP (Henry Spencer) (06/21/85)
> > Also, multiplies are actually fairly infrequent operations in most programs. > > That is, if "most programs" don't use multidimensional arrays... Except in a few specific environments, most programs indeed do not use multidimensional arrays. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
kds@intelca.UUCP (Ken Shoemaker) (06/21/85)
> introduce another level of emulation.  The point is that the poor code
> density of current RISCs is the result of optimization for experimental
> work rather than production.  More compact encodings, without change to
> the basic concept, are both possible and desirable once the basic
> issues are understood.
> --
> Henry Spencer @ U of Toronto Zoology
> {allegra,ihnp4,linus,decvax}!utzoo!henry

er, I don't think so... one of the basic concepts of RISCs as I
understand them is to reduce the number of pipeline stages in the
execution of instructions.  Adding another just to do an
expansion/shuffling of opcode bytes flies directly in the face of that
thinking.
--
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm
{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds
---the above views are personal.  They may not represent those of the
employer of its submitter.
henry@utzoo.UUCP (Henry Spencer) (06/24/85)
> er, I don't think so... one of the basic concepts of RISCs as I
> understand them is to reduce the number of pipeline stages in the
> execution of instructions.  Adding another just to do an
> expansion/shuffling of opcode bytes flies directly in the face of that
> thinking.

Remember that the RISC is necessarily decoding its opcodes; not even on
a machine that simple is a numeric opcode a direct encoding of the
internal control signals.  The point is that the current instruction
encoding was chosen for simplicity, not compactness, and one can do
better *without* compromising the principles of the machine.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
baba@spar.UUCP (Baba ROM DOS) (06/25/85)
> The 10MHz NS32016 was done by > a swarm of professional silicon bashers, using professional facilities > and souped-up processes, over a long period of time with *many* design > iterations. > Don't you see a contradiction lurking in there? Baba ROM DOS
brooks@lll-crg.ARPA (Eugene D. Brooks III) (06/25/85)
> > > Also, multiplies are actually fairly infrequent operations in most programs. > > > > That is, if "most programs" don't use multidimensional arrays... > > Except in a few specific environments, most programs indeed do not use > multidimensional arrays. Perhaps we could do without the integer multiplier :-), but we had better not drop the floating point adder and multiplier, and they had both better operate at one op per clock after filling the pipeline.
meissner@rtp47.UUCP (Michael Meissner) (06/25/85)
Another case where multiplication occurs frequently is constructs of the
sort:

	struct {
		long	l;
		short	s;
	} *p;

	int i;

	main(){
		/*...*/
		p += i;
		p[i].l = 1;
		/*...*/
	}

I.e., pointer arithmetic involving non-constant integers, particularly if
the size is not a power of 2.
--
Michael Meissner
Data General Corporation

...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner
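The cost Meissner describes can be made concrete in C: `p[i]` forces the
compiler to compute `i * sizeof(*p)`, and when that size is not a power of
two this is a genuine multiply (or a shift/add sequence).  A minimal sketch
follows; the struct tag and function name are illustrative, not from the
post above.

```c
#include <stddef.h>

/* The struct from the post above; on many machines alignment padding
   rounds its size up to a power of two, but with 4-byte longs and no
   padding it would be 6 -- not a power of two. */
struct rec {
    long  l;
    short s;
};

/* The byte offset the compiler must compute for p[i]:
   i * sizeof(struct rec).  A single shift suffices only when the
   size is a power of two; otherwise every such access costs a
   multiply (or an equivalent shift/add sequence). */
ptrdiff_t elem_offset(const struct rec *p, int i)
{
    return (const char *)&p[i] - (const char *)p;
}
```

Whatever padding the compiler chooses, the offset always comes out as
the index scaled by the (padded) element size.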
guy@sun.uucp (Guy Harris) (06/26/85)
> > > > Also, multiplies are actually fairly infrequent operations in most
> > > > programs.
> > >
> > > That is, if "most programs" don't use multidimensional arrays...
> >
> > Except in a few specific environments, most programs indeed do not use
> > multidimensional arrays.
>
> Perhaps we could do without the integer multiplier :-), but we had better
> not drop the floating point adder and multiplier, and they had both better
> operate at one op per clock after filling the pipeline.

Methinks "most" is in the eye of the beholder.  Most UNIX utilities, and
the UNIX kernel, do relatively few multiplications *or* floating point
operations (the kernel need not do any; the 4.2BSD scheduler does, but the
Sun 4.2BSD scheduler and the 2.9BSD scheduler, which have the same
load-average-dependent decay on the average weighting for "p_cpu", don't)
and rarely use multidimensional arrays.  The same is probably true for many
other classes of applications, such as the office automation/"personal
productivity" applications popular on PCs.  However, I suspect a scientific
programmer would answer differently.  I have used multidimensional arrays
in scientific programs I've written; they are quite common in such programs
(they are a natural data structure for many types of calculations, like
those which use models of two-dimensional or three-dimensional spaces).
Those programs would also like the FP adder and multiplier; I don't know
how many integer multiplies or divides they do, other than in support of
accesses to multi-dimensional arrays.  In the case of such accesses, if the
indices being used are linear functions of the induction variable of a loop
(I don't know whether there are any looser conditions possible), a strength
reduction can make the multiplications unnecessary (C programmers may know
this trick as "using a pointer instead of an array subscript").

	Guy Harris
henry@utzoo.UUCP (Henry Spencer) (06/26/85)
> > The 10MHz NS32016 was done by
> > a swarm of professional silicon bashers, using professional facilities
> > and souped-up processes, over a long period of time with *many* design
> > iterations.
>
> Don't you see a contradiction lurking in there?

I wish I did...  Professional people and facilities plus plenty of time
would suggest getting it right the first time, except that it just makes
them more ambitious instead.  The souped-up processes help speed but don't
do a lot for correctness.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
ken@turtlevax.UUCP (Ken Turkowski) (06/26/85)
>>> Also, multiplies are actually fairly infrequent operations in most programs.
>>
>> That is, if "most programs" don't use multidimensional arrays...
>
>Except in a few specific environments, most programs indeed do not use
>multidimensional arrays.

Depending on what your product is, you may not need multiplication at all.
If the end product is a UN*X machine, most UN*X utilities do not need
multiplications.  However, if you are running any engineering or scientific
applications, the time devoted to multiplication is considerable, and may
even dominate execution time if there is no hardware support for it.

Saving the status register is an infrequent operation; why not do multiple
conditional branches instead?  Virtually no application programs need it
at all.  The operating system just needs it to switch contexts, which is a
task done infrequently, and takes so much time as it is... :-)
--
Ken Turkowski @ CADLINC, Menlo Park, CA
UUCP: {amd,decwrl,hplabs,nsc,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.ARPA

*****  Support your local number cruncher: join PIFOM!  *****
*****     (Programmers In Favor Of Multiplication)      *****
bobbyo@celerity.UUCP (Bob Ollerton) (06/29/85)
Gee, if someone built a computer that did really fast multiplication
(whatever "really fast" is) should they call it the "Rabbit"?

I am sure that there are many designs that offer fast, slow, or no
hardware support for multiplication.  The user pays the cost of
multiplication hardware in the final analysis, either in the cost of the
system, or in the time spent knitting paper clips together while running
a job.

Maybe designs without fast multiplication (and floating point!) should be
called the "Lizard".  Then someone can introduce the Micro-LizardII.

::))  <-echo.

bob.
--
Bob Ollerton; Celerity Computing;
9692 Via Excelencia; San Diego, Ca 92126; (619) 271 9940
{decvax || ucbvax || ihnp4}!sdcsvax!celerity!bobbyo
akgua!celerity!bobbyo
sasaki@harvard.ARPA (Marty Sasaki) (06/29/85)
This may be an incredibly stupid question, but I remember reading an
article (in CACM, I think) on RISC machines where most of the instructions
occupied a single byte of memory.  Through cleverness, something like 95%
of all instructions executed occupied a single byte (including operands).
This meant that programs would be smaller and would run faster by the
simple fact that they required fewer memory references to get work done.

All of the recent discussion on RISC machines hasn't mentioned this at
all.  Have things changed sufficiently in the recent past that this small
program size doesn't matter?  Is this a really dumb question?
--
----------------
  Marty Sasaki				net:   sasaki@harvard.{arpa,uucp}
  Harvard University Science Center	phone: 617-495-1270
  One Oxford Street
  Cambridge, MA 02138
mash@mips.UUCP (John Mashey) (06/30/85)
Michael Meissner ...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner writes:
> Another case where multiplication occurs frequently is constructs of the
> sort:
>	struct {
>		long	l;
>		short	s;
>	} *p;
>	int i;
>	main(){
>		/*...*/
>		p += i;
>		p[i].l = 1;
>		/*...*/
>	}
> I.e., pointer arithmetic involving non-constant integers, particularly if
> the size is not a power of 2.

1) In this example, sizeof(*p) == 8 anyway.  To get the intended effect,
use short l[2], for example.

2) In any case, at least some compilers not only do multiplies by powers
of 2 with adds or shifts, but do so for constants that are almost powers
of 2 (i.e., few 1-bits) by shifts and adds or subtracts.

3) Indeed, multiplication does occur frequently in these cases; the real
question is: how frequent are these cases, really?  [Not an answer, but a
question whose answer needs to be known when you're making tradeoffs in
CPU design.]
--
-john mashey
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
larry@mips.UUCP (Larry Weber) (07/01/85)
> Another case where multiplication occurs frequently is constructs of the
> sort:
>
>	struct {
>		long	l;
>		short	s;
>	} *p;
>
>	int i;
>
>	main(){
>		/*...*/
>		p += i;
>		p[i].l = 1;
>		/*...*/
>	}

Almost all the examples provided actually use multiplication by a
constant.  Thus, on all but the largest (and most expensive) of machines
there are shift/add/subtract sequences that do the trick nicely:

	x * 6   == ((x<<1)+x)<<1	three instructions
	x * 7   == (x<<3)-x		two instructions
	x * 119 == (((x<<4)-x)<<3)-x	four instructions

These sequences are often smaller than a call to a multiply routine and
almost always faster than a general-purpose multiply instruction.  Most
of the constants are powers of two, near powers of two (31, 255), or
simple sums of powers of two (6, 10, 24).  Some machines are better than
others at this sort of thing, so study those timings carefully.  This is
a winner on the 68k.
--
-Larry B Weber
UUCP: 	{decvax,ucbvax,ihnp4}!decwrl!mips!larry
DDD:  	415-960-1200
USPS: 	MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
kds@intelca.UUCP (Ken Shoemaker) (07/02/85)
> Remember that the RISC is necessarily decoding its opcodes; not even on
> a machine that simple is a numeric opcode a direct encoding of the internal
> control signals.  The point is that the current instruction encoding was
> chosen for simplicity, not compactness, and one can do better *without*
> compromising the principles of the machine.

I didn't say that it didn't decode them...  My understanding is that the
various bits of the instructions always come into the processor on the
same set of data lines, which eliminates the need for some other unit out
there to steer the bits from the appropriate data lines to the correct
decoding unit.
--
...and I'm sure it wouldn't interest anybody outside of a small circle
of friends...

Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm

{pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds

---the above views are personal.  They may not represent those of the
employer of its submitter.