paulsc@tekecs.UUCP (06/21/83)
I have only seen a couple of articles on RISCs, and I don't believe I
understand just what is new about a RISC.  Could someone explain the
difference between programming a RISC and writing microcode?

Paul H. Scherf
P. O. Box 1000, Del. Sta. 61-201
Tektronix Engineering Computing Systems
Wilsonville, Oregon, USA
UUCP: ...!XXX!teklabs!tekecs!paulsc
      (where XXX is one of: aat cbosg chico decvax harpo ihnss lbl-unix
       ogcvax pur-ee reed ssc-vax ucbvax zehntel)
CSNET: tekecs!paulsc @ tektronix
ARPA: tekecs!paulsc.tektronix @ rand-relay
bcase@uiucdcs.UUCP (06/28/83)
#R:tekecs:-145500:uiucdcs:27800011:000:1529
uiucdcs!bcase    Jun 27 19:03:00 1983

First, to my mind, 'RISC' may mean only the Berkeley RISC machine, but it
is rapidly becoming a generic term used to describe an architecture which
is designed to execute simple instructions quickly.  For some of the
'RISC'-like machines, there is very little difference between the language
of the machine and traditional microcode; it just depends on which one you
are talking about.  Compiling for the Stanford MIPS is very much like
compiling to microcode.  Rather than make this a dissertation, please
accept the following as recommended readings:

   "The Case for the Reduced Instruction Set Computer," Computer Arch.
      News, Vol. 8, No. 6, Oct. 1980.
   "RISC I: A Reduced Instruction Set VLSI Computer," Proceedings of the
      Eighth Symp. on Comp. Arch., May 1981.
   "Comments on 'The Case for the Reduced Instruction Set Computer,'"
      Computer Arch. News, Vol. 8, No. 6, Oct. 1980.
   "MIPS: A Microprocessor Architecture," Proceedings of the 15th
      Workshop on Microprogramming, Nov. 1982.
   "The 801 Minicomputer," Proceedings of the Symp. on Arch. Support for
      Programming Langs. and OSs, March 1982.

There are many more, especially by Patterson, but they are mostly
repetitive.  There is one in VLSI Design which describes how the first
RISC I chips performed.  Have fun.  This is an interesting area; everybody
seems to think that everyone else's design is wrong ("They'll never
succeed, they're doing it all wrong.").  (I didn't intend to show
favoritism in my list of references; if anyone has any other suggestions,
I want to read 'em.)
doug@terak.UUCP (Doug Pardee) (05/28/85)
> If we created the chip
> using only the instructions the compiler needed, we could use less logic.
> We could decode the instructions faster because the microcode is simpler
> (more ooomph per MHz), production is simpler, yield is higher, speeds are
> faster, and everyone is happy except the assembler programmer.
> ...
> This is the concept behind the RISC architecture ...

RISC is an interesting concept, but I have a major reservation.  Perhaps
someone can explain to me how on earth you're going to feed instructions
to a RISC machine fast enough?

All of the popular 8 and 16 bit microprocessors are speed limited by
instruction fetch, not by instruction complexity.  I will entertain the
objection that the 6502, with its critical shortage of on-chip
registers, is also limited by operand accesses.  The usual RISC machine
has lots of registers, so operand accesses shouldn't be a problem.
And nobody says that a non-RISC cpu can't have lots of registers.

The knee-jerk answer is "cache".  But that's only an answer if one
refuses to allow non-RISC cpus to use cache; they can fit more logic
into any given cache than can RISC cpus, thereby having a better "hit
ratio" than RISC.

Of course, one could design a RISC machine with a super-high-speed ROM
or cache in which one could store the commonly used functions like
multiplication and division, and then one would only have to fetch a
subroutine call from the (slow) instruction stream.

But doesn't that sound like your everyday, garden variety microcoded
non-RISC cpu?
--
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                               ^^^^^--- soon to be CalComp
lucius@tardis.UUCP (Lucius Chiaraviglio) (05/30/85)
> RISC is an interesting concept, but I have a major reservation.  Perhaps
> someone can explain to me how on earth you're going to feed instructions
> to a RISC machine fast enough?
>
> All of the popular 8 and 16 bit microprocessors are speed limited by
> instruction fetch, not by instruction complexity.  I will entertain the
> objection that the 6502, with its critical shortage of on-chip
> registers, is also limited by operand accesses.  The usual RISC machine
> has lots of registers, so operand accesses shouldn't be a problem.
> And nobody says that a non-RISC cpu can't have lots of registers.

How about giving the processor an instruction and operand queue in
addition to a cache, and giving both the processor-to-cache and
cache-to-memory paths a width that is some multiple of the length of a
normal data word?  The idea is to do memory access in parallel (even more
than it already is -- parallel not only at the bit level but at the word
level as well; I believe this is not a foreign idea -- if I remember
correctly the VAX has a cache-to-memory data path of this type) so as to
make up for the slow serial speed of these operations.

> The knee-jerk answer is "cache".  But that's only an answer if one
> refuses to allow non-RISC cpus to use cache; they can fit more logic
> into any given cache than can RISC cpus, thereby having a better "hit
> ratio" than RISC.

Why can't you fit as much logic into the cache of a RISC machine as into
the cache of a standard CPU?  The rest of the CPU would still be RISC; I
thought part of the idea of some RISC projects was to save hardware space
for things like this.  Of course, some RISC projects also have the idea of
making the compiler and/or the operating system take care of managing the
cache.
> Of course, one could design a RISC machine with a super-high-speed ROM
> or cache in which one could store the commonly used functions like
> multiplication and division, and then one would only have to fetch a
> subroutine call from the (slow) instruction stream.
>
> But doesn't that sound like your everyday, garden variety microcoded
> non-RISC cpu?

Yes, but with simpler and faster hardware, and no fixed microcode (unless
you take the step of putting these routines in ROM, which could be useful
for some applications).  Some CPUs have writable control store that can be
loaded without having to reboot the system, but with a RISC, software (the
compiler, for one thing) is more responsible for taking care of this, so
you can arrange to have even more flexibility and less hassle with context
switches -- for the non-assembly programmer, that is.  At the low level,
the hardware and the software do a great amount of work to make things
easy for the upper levels, and thus to make the whole system run more
smoothly; the new hardware that RISC machines put in the space they save
on conventional hardware is for this kind of thing.

So in other words you still get the advantages of your "everyday, garden
variety microcoded non-RISC cpu" (except that you have to spend more money
on lots of fast memory, including the cache, but memory is getting cheap
enough these days to make the additional expense worth the improved
performance), but lack some of the disadvantages.
--
Lucius Chiaraviglio
{ seismo!tardis!lucius | lucius@tardis.ARPA | lucius@tardis.UUCP }
brooks@lll-crg.ARPA (Eugene D. Brooks III) (05/30/85)
The knee-jerk answer to the instruction fetch bandwidth problem, cache, is
a valid answer.  The argument that one can give as much cache to a
complicated instruction set engine and thereby get as much performance is
not valid.  The performance reduction for the complicated instruction set
comes from the time spent running microcode to decode and execute
instructions.

A good example of this is the VAX instruction set vs. the Ridge
instruction set.  The Ridge achieves the same performance as the VAX at a
much lower hardware cost.  The performance increase arises from the
simplicity of the instruction set.  This difference also shows up when you
emulate these architectures in software.  The instruction decode for such
emulators on the VAX is a nightmare, and the designers of the VAX faced
the same nightmare when they designed the hardware and wrote the microcode
for it.
thoth@tellab3.UUCP (Marcus Hall) (05/31/85)
> RISC is an interesting concept, but I have a major reservation.  Perhaps
> someone can explain to me how on earth you're going to feed instructions
> to a RISC machine fast enough?

One way to help with this problem is to use fairly wide memory accesses
(at least for instruction fetches).  Thus, in one memory cycle, many
instructions may be fetched simultaneously.  Of course this is done for
non-RISC machines as well, but a non-RISC machine will become
execution-bound sooner.

Also, since the RISC instruction set is simpler, the op-codes require
fewer bits, so a memory fetch will get more RISC instructions in one cycle
than it would non-RISC instructions.

marcus hall
..!ihnp4!tellab1!tellab2!thoth
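Marcus's point about wide fetches reduces to simple arithmetic.  A minimal
sketch in Python -- the bus width and instruction sizes below are
illustrative assumptions, not figures from any particular machine:

```python
# How many instructions arrive per memory cycle, given a fetch path
# bus_bits wide and an average instruction size of instr_bits.

def instructions_per_fetch(bus_bits, instr_bits):
    return bus_bits / instr_bits

# A 64-bit fetch path feeding 16-bit (dense) vs. 32-bit (RISC) encodings:
dense = instructions_per_fetch(64, 16)   # 4 instructions per memory cycle
risc = instructions_per_fetch(64, 32)    # 2 instructions per memory cycle
```

The same bus delivers twice as many of the shorter encodings per cycle,
which is exactly the trade the rest of this thread argues about.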
mat@amdahl.UUCP (Mike Taylor) (05/31/85)
> All of the popular 8 and 16 bit microprocessors are speed limited by
> instruction fetch, not by instruction complexity.  I will entertain the
> objection that the 6502, with its critical shortage of on-chip
> registers, is also limited by operand accesses.  The usual RISC machine
> has lots of registers, so operand accesses shouldn't be a problem.
> And nobody says that a non-RISC cpu can't have lots of registers.
>
> The knee-jerk answer is "cache".  But that's only an answer if one
> refuses to allow non-RISC cpus to use cache; they can fit more logic
> into any given cache than can RISC cpus, thereby having a better "hit
> ratio" than RISC.
>
> Of course, one could design a RISC machine with a super-high-speed ROM
> or cache in which one could store the commonly used functions like
> multiplication and division, and then one would only have to fetch a
> subroutine call from the (slow) instruction stream.
>
> But doesn't that sound like your everyday, garden variety microcoded
> non-RISC cpu?
> --
> Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
>                                                ^^^^^--- soon to be CalComp

Fetching one instruction per cycle is pretty much necessary to execute one
instruction per cycle.  Whether this is a bug or a feature is a good
question.  The difference between fetching a microword every cycle and
fetching an instruction is that the microcoded machine has a fixed
microsequence in ROM while the instructions come from RAM.  Since they
come from RAM, they can be optimized by the compiler for the particular
job being done.  Clearly you pay for this by diverting fetches from
microstore to RAM.  Some RISC implementations try to get this cost back by
using the microstore's chip space for registers and/or instruction queues,
reducing total fetch traffic.

How can you tell a machine is not a RISC?  Proposal...
If it is possible to replace some instructions with sequences of other
instructions with only small performance penalties (or gains, a la VAX),
then those instructions probably have no business in the instruction set.
Their presence is probably slowing down the whole system.  Take them out.
Redesign to use the space you got back by removing them.  Repeat the above
cycle until no more instructions can be removed or the cycle doesn't
improve performance.  You now have a RISC.  Arguments?
--
Mike Taylor  ...!{ihnp4,hplabs,amd,sun}!amdahl!mat
[ This may not reflect my opinion, let alone anyone else's. ]
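Mike's proposal is effectively an iterative pruning loop.  A toy rendering
in Python, under an assumed cost model -- the instruction names, the
speedup/cost numbers, and the threshold are all invented for illustration:

```python
# Each candidate instruction carries an assumed benchmark "speedup if
# kept" and a hardware cost.  Repeatedly drop the instruction whose
# removal costs the least performance, until none falls below threshold.

def prune(instr_set, threshold=0.02):
    """instr_set: dict name -> (speedup_if_kept, hw_cost)."""
    iset = dict(instr_set)
    while iset:
        # Find the instruction that buys the least performance.
        name = min(iset, key=lambda n: iset[n][0])
        if iset[name][0] >= threshold:
            break                      # everything left earns its keep
        del iset[name]                 # take it out; redesign; repeat
    return iset

core = prune({"ADD": (0.30, 1), "LOAD": (0.25, 2),
              "POLY": (0.001, 40), "EDITPC": (0.005, 60)})
# The exotic POLY and EDITPC fall out; ADD and LOAD survive.
```

This is only the selection half of the loop, of course; the interesting
part Mike describes -- redesigning to spend the reclaimed space -- is what
the code can't capture.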
joel@peora.UUCP (Joel Upchurch) (05/31/85)
I think your argument ignores the fact that with a given fabrication
technology there is only so much function you can put on a given chip.  If
you choose to have a large control store ROM for a complex instruction
set, then you must make sacrifices in other parts of the chip.  This may
mean fewer registers, less cache, a slower ALU, or other trade-offs.  RISC
argues that this is not a good trade: that it is difficult to write
compilers to use complicated instruction sets, and that these high level
operations are not necessarily any faster than a series of low level
operations.  Also, with simple instruction sets you have the option of
hardwiring the decode for additional performance.
doug@terak.UUCP (Doug Pardee) (06/03/85)
Responses to early comments on my question about how a RISC cpu can fetch
instructions fast enough to keep up with non-RISC cpus:

> The knee jerk answer to the instruction fetch bandwidth problem, cache,
> is a valid answer.  The argument that one can give as much cache to a
> complicated instruction set engine and thereby get as much performance
> is not valid.  The performance reduction for the complicated instruction
> set comes from the time spent running microcode decode and execute
> instructions.

I'd believe this except that it ain't so.  The performance reduction on
current non-RISC single-chip cpus comes from instruction fetch.  For
example, the NS32016 can do your basic RISC-type operations in 3 or 4
clock cycles, but it takes 5 clock cycles to fetch the next instruction.
RISC-type instructions on the 32016 therefore take 5 clock cycles each.
And the 32016 is "burdened" by non-RISC "microcode decode and execute
instructions."

> One way to help with this problem is to use fairly wide memory accesses
> (at least for instruction fetches).  Thus, in one memory cycle, many
> instructions may be fetched simultaneously.  Of course this is done for
> non-RISC machines as well, but a non-RISC machine will become
> execution-bound sooner.

The NS32032 and MC68020 both fetch 32 bits of instruction data at one
time.  Both have "zillions" of pins on the package in order to pull that
off; I can't imagine building a RISC chip with 128-bit wide instruction
fetch, requiring over 250 pins on the package.  And if we're not talking
single-chip architectures, we're going to have a devil of a time making
rational comparisons.  After all, IBM has used a non-RISC architecture for
some time and it goes pretty fast :-)

I'll believe that a non-RISC machine will become execution-bound sooner,
but this is just another way of saying that the RISC machine is much more
limited by instruction fetch than the non-RISC machine is.
> Also, since the RISC instruction set is simpler, the op-codes require
> fewer bits, so a memory fetch will get more RISC instructions in one
> cycle than it would non-RISC instructions.

I thought this too, until I looked into some RISC machines.  They use
32-bit instruction words, twice as wide as the equivalent instructions in,
say, the 680xx and 320xx cpus.

Still looking for the answer...
--
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                               ^^^^^--- soon to be CalComp
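Doug's NS32016 arithmetic can be sketched directly.  The 3-5 cycle figures
are the ones quoted in his post, not independently verified against a
datasheet:

```python
# When instruction fetch is slower than execution, fetch sets the pace:
# the effective cost of an instruction is the larger of the two times.

def effective_cycles(execute_cycles, fetch_cycles):
    return max(execute_cycles, fetch_cycles)

# A RISC-style op on the 32016: executes in 3-4 cycles, but each
# instruction fetch takes 5, so every such instruction costs 5 cycles.
cost = effective_cycles(4, 5)
```

The same function shows his larger point: making execution faster than
fetch (the first argument smaller than the second) buys nothing at all.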
steveg@hammer.UUCP (Steve Glaser) (06/05/85)
In article <1590@amdahl.UUCP> mat@amdahl.UUCP (Mike Taylor) writes:
>> All of the popular 8 and 16 bit microprocessors are speed limited by
>> instruction fetch, not by instruction complexity.  I will entertain the
>> objection that the 6502, with its critical shortage of on-chip
>> registers, is also limited by operand accesses.  The usual RISC machine
>> has lots of registers, so operand accesses shouldn't be a problem.
>> And nobody says that a non-RISC cpu can't have lots of registers.
>>
>> The knee-jerk answer is "cache".  But that's only an answer if one
>> refuses to allow non-RISC cpus to use cache; they can fit more logic
>> into any given cache than can RISC cpus, thereby having a better "hit
>> ratio" than RISC.
>>
>> Of course, one could design a RISC machine with a super-high-speed ROM
>> or cache in which one could store the commonly used functions like
>> multiplication and division, and then one would only have to fetch a
>> subroutine call from the (slow) instruction stream.

The basic tenets of RISC are that you throw out the useless baggage that's
in there "'cause somebody wanted it and microcode was cheap" and get back
to a reasonable minimal set (do chuck out an instruction; measure
performance; if better or same, repeat).  With the real estate gained, you
then start putting reasonable sized caches and/or prefetch queues on chip
instead of useless instructions.

Some folks have decided that RISC means that you eliminate the microcode
and directly decode the operations.  This is too narrow a view.  It
doesn't matter how you implement what's left, as long as you only
implement what is really needed.

As for caches, one issue that hasn't been mentioned is the use of block
mode transfers for fast cache filling.  There are RAM chips out there that
can access "nearby" RAM cells significantly faster than random ones (so
called page mode, nibble mode, or static column RAMs).
These were mainly developed for the raster display market (so you don't
waste all your memory bandwidth refreshing the screen and you get a chance
to make updates to it), but are useful in other areas as well.  Since
cache blocks are typically a power of 2 in size and start on nice
boundaries, it's pretty easy to map these into the notion of "nearby"
supported by a RAM chip.  The only problems you have with this kind of
architecture are (1) you need a new bus protocol, and (2) ECC gets harder
because you need to do it faster (anybody for a pipelined ECC chip that
can keep up with 50 nsec RAM cycles?).

Steve Glaser
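The payoff of page/nibble-mode parts for cache fills is easy to put in
numbers.  A sketch under assumed timings -- the 200 ns random-access and
50 ns page-mode figures are invented for illustration, not taken from any
datasheet:

```python
# Filling an N-word cache block from page-mode DRAM: the first word pays
# the full random-access time, the remaining words of the block come at
# the faster within-page rate.

def block_fill_ns(words, random_ns=200.0, page_ns=50.0):
    return random_ns + (words - 1) * page_ns

naive = 4 * 200.0            # four independent random accesses: 800 ns
paged = block_fill_ns(4)     # one random access + three page-mode: 350 ns
```

Since cache blocks are power-of-2 sized and aligned, as Steve notes, every
fill after the first word stays within the chip's "nearby" region.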
henry@utzoo.UUCP (Henry Spencer) (06/06/85)
> I thought this too, until I looked into some RISC machines.  They use
> 32-bit instruction words, twice as wide as the equivalent instructions
> in, say, the 680xx and 320xx cpus.

Yes and no.  The Berkeley RISC project adopted 32-bit instructions for
simplicity in initial work, not because they thought it was right for the
final design.  If you look, you'll find at least one paper from them
discussing a support chip which is (a) an instruction cache, and (b) an
instruction-encoding expander.  The latter function makes a large
difference to instruction density without introducing any extra delays.

Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
a lot.  Remember that things like offsets take extra bytes, and those get
used *a lot* on those machines -- virtually every memory reference needs
one.  On the pdp11, a 16-bit machine if there ever was one, the average
instruction length was in fact about 32 bits.
--
Henry Spencer @ U of Toronto Zoology
{allegra,ihnp4,linus,decvax}!utzoo!henry
atbowler@watmath.UUCP (Alan T. Bowler [SDG]) (06/06/85)
In article <591@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>
> I thought this too, until I looked into some RISC machines.  They use
> 32-bit instruction words, twice as wide as the equivalent instructions
> in, say, the 680xx and 320xx cpus.
>
Actually, a brief examination of code actually produced by compilers on a
68000 shows that the compiler seldom generates anything but 32 bit
instructions (including addressing).  It generates 16 bit instruction
formats less often than it generates 48 bit ones (absolute address).

This is not to say I am a big fan of what is currently being called RISC.
It strikes me that a good deal of the claims about performance don't jibe
with what happened, and where there were actual performance gains they
were attributed to the wrong part of the architecture.  I am also not a
big fan of very complex instructions either.  My experience shows that
supplied "CALL" instructions that do any more than store a return address
in a register make subroutine call design difficult, and usually slower.
There is entirely too much religious faith and too little fact in most
"RISC" discussion.
doug@terak.UUCP (Doug Pardee) (06/07/85)
> Some folks have decided that RISC means that you eliminate the
> microcode and directly decode the operations.  This is too narrow of a
> view.

An elephant is much like a tree, or is it a rope, or is it a snake...  If
we're going to talk about RISC, we'd better decide what we're going to
call RISC.

Is RISC:
  1) a way to save silicon real estate on single-chip cpus?  or
  2) a concept that's equally applicable to board-level cpus?

Is RISC:
  1) strictly limited instruction set, essentially microcode level?  or
  2) any cpu without "useless" instructions?  (define "useless")  or
  3) any cpu with oodles of registers?

For the moment, I'm going to talk about non-microcoded single-chip cpus
with strictly limited instruction sets.  Something you'd put in a TTL
system with dynamic RAMs.

I've come to the conclusion that for this kind of RISC, its time has
passed.  It made a lot of sense in the late '70s when cpu cycles were
250-400 ns and memory cycles were 300-400 ns.  But since then the cpu
manufacturers have concentrated on speed while the memory manufacturers
concentrated on capacity.  With cpu cycles now at 60-100 ns and memory
cycles at 225-325 ns, the way to improve speed in a system is to decrease
memory cycles, not cpu cycles.

That means making instructions even *more* complex, so that each
instruction fetched does as much work as possible.  At 100 ns per cpu
cycle, you can afford to let the cpu do a lot of work that isn't always
needed, because you can throw away *half* of it and still have the cpu be
faster than if you fetched the microinstructions from main memory (using
120 ns access/220 ns cycle time 256K DRAMs).

By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
only 25 cpu cycles in order to compete.  All the cache in the world ain't
gonna help...
--
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                               ^^^^^--- soon to be CalComp
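Doug's "25 cpu cycles" figure follows directly from the numbers he quotes;
a quick check of the arithmetic:

```python
# How many 330 ns RISC-II cycles fit inside the NS32016's quoted
# 8.3 microsecond register-to-register signed multiply.

risc_ii_cycle_s = 330e-9
ns32016_mul_s = 8.3e-6
budget_cycles = ns32016_mul_s / risc_ii_cycle_s   # about 25.2 cycles
```

So a RISC-II multiply routine gets roughly 25 instructions of budget
before it loses to the microcoded multiply, on these clock figures.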
brownc@utah-cs.UUCP (Eric C. Brown) (06/07/85)
In article <14882@watmath.UUCP> atbowler@watmath.UUCP (Alan T. Bowler [SDG]) writes:
> I am also not a big fan of very complex instructions either.  My
> experience shows that supplied "CALL" instructions that do any
> more than store a return address in a register make subroutine call
> design difficult, and usually slower.

I would agree with you to a point: the return address should be saved on a
stack.  Jamming it into a register only means you blow an instruction
inside the subroutine saving the old pc anyway.  (Recursion?  recursion?
all I want to do is call another routine inside the first...)

Eric C. Brown
..!{decvax, ihnp4, seismo}!utah-cs!brownc
brownc@utah-cs
mash@mips.UUCP (John Mashey) (06/09/85)
Eric C. Brown ..!{decvax, ihnp4, seismo}!utah-cs!brownc writes:
> I would agree with you to a point: the return address should be saved on
> a stack.  Jamming it into a register only means you blow an instruction
> inside the subroutine saving the old pc anyway.  (Recursion?  recursion?
> all I want to do is call another routine inside the first...)

Not necessarily:
1) Leaf routines (i.e., those that make no further calls) might not want
   to burn memory references for saving/restoring the register when it
   need not be saved/restored at all.
2) As usual, one must make all sorts of design tradeoffs -- simplifying
   something somewhere may complexify something else.  In this case, one
   must carefully weigh the usefulness of having subroutine call/return
   save/restore regs, versus the complexity of fault-handling and the
   slowdown of the basic cycle to allow for it, especially in heavily
   pipelined designs.

This is not to say that that mechanism is a bad idea, merely that the
ramifications can surprise you and must be taken into account.
--
-john mashey    UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash
DDD: 415-960-1200
USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
hal@cornell.UUCP (Hal Perkins) (06/09/85)
In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
> By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
> 10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
> 8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
> only 25 cpu cycles in order to compete.  All the cache in the world
> ain't gonna help...

Now just a moment...  The Berkeley chips were built by academic folks
using conservative design rules, etc.  If they had been built by
experienced professional chip designers using state-of-the-art technology
they would have been a lot faster.  It's not fair to pick on the
relatively slow clock cycle of the experimental chips -- they were built
to prove a point (which they did), not to produce a product.

(I'm not making this up -- if you read some of the RISC papers by
Patterson et al., they make the same point.)

Hal Perkins                  UUCP: {decvax|vax135|...}!cornell!hal
Cornell Computer Science     ARPA: hal@cornell    BITNET: hal@crnlcs
jans@mako.UUCP (Jan Steinman) (06/10/85)
In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> Also, don't be too sure that the 68* and 32* chips use 16-bit
> instructions a lot.  Remember that things like offsets take extra
> bytes...

In my experience 16 bit instructions get used quite often on the NS32000
in well written assembly code.  The NS32000 has an edge over the MC68000
in code density in that instructions need not be word-aligned.  Henry is
correct in stating that offsets are used on virtually every memory
reference, but those offsets are usually 8 bits.  Following are some
statistics for one module pulled at random from my present project.  The
code is highly optimized for speed, so code density is not even as good as
it should be, i.e. "jump" is used instead of "br", saving one clock, but
costing two bytes, etc.

	Size	Count
	  8	  0
	 16	 14
	 24	 13
	 32	  1
	 40	  0
	 48	  2
	Total	 30	Ave 22.1 bits per instruction.

The average instruction size of 22.1 bits is indeed more than the 16 bits
the original poster assumed, but much less than the 32 bits Henry
expected, due in large part to the large number of 24 bit instructions
available on the NS32000.  I believe the MC68000 would have a larger
number of 32 bit instructions with the consequent increase in average
bits-per-instruction.

Of course, no analogy can be drawn for compiled code.  Many compilers
under-use register space and I suspect the average bits-per-instruction
would be much higher for compiled code on both machines.
--
:::::: Jan Steinman		Box 1000, MS 61-161	(w)503/685-2843 ::::::
:::::: tektronix!tekecs!jans	Wilsonville, OR 97070	(h)503/657-7703 ::::::
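Jan's 22.1-bit average is just a size-weighted mean over the table;
reproduced in Python:

```python
# Size-weighted average instruction length from the sampled NS32000
# module: counts maps instruction size in bits -> number of occurrences.

counts = {8: 0, 16: 14, 24: 13, 32: 1, 40: 0, 48: 2}
total = sum(counts.values())                            # 30 instructions
avg_bits = sum(size * n for size, n in counts.items()) / total
# avg_bits comes out to about 22.1 bits per instruction, as reported
```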
doug@terak.UUCP (Doug Pardee) (06/11/85)
Once again, I find that I didn't do a good job of explaining myself.  Let
me try again...  In the following comment, notice the word "equivalent":

me> I thought this too, until I looked into some RISC machines.  They use
me> 32-bit instruction words, twice as wide as the equivalent instructions
me> in, say, the 680xx and 320xx cpus.

> Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> a lot.  Remember that things like offsets take extra bytes, and those get
> used *a lot* on those machines -- virtually every memory reference needs
> one.

My point was supposed to be that you could take a 680xx/320xx and limit
yourself to the RISC instruction set (the "equivalent" instructions) and
have 16-bit instructions instead of RISC's 32-bit instructions.
Presumably, the reason that the longer instructions, with their offsets
etc., are used so much is that those instructions are more effective than
the RISC instructions for the task at hand (this sounds like another
discussion that I'm involved in :-)

> instruction-encoding expander.  The latter function makes a large
> difference to instruction density without introducing any extra delays.

In a microcoded cpu, the opcode is decoded and causes a sequence of very
simple micro-instructions to be executed.  In the proposed system, the
opcode is decoded and causes a sequence of very simple RISC instructions
to be executed.  What's the difference?  Where does the much-vaunted
performance gain come from?
--
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
                                               ^^^^^--- soon to be CalComp
john@frog.UUCP (John Woods) (06/12/85)
> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> >Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> >a lot.  Remember that things like offsets take extra bytes...
>
> In my experience 16 bit instructions get used quite often on the NS32000 in
> well written assembly code.  The NS32000 has an edge over the MC68000 in code
> density in that instructions need not be word-aligned.  Henry is correct in
> stating that offsets are used on virtually every memory reference, but those
> offsets are usually 8 bits.  Following are some statistics for one module
> pulled at random from my present project.  The code is highly optimized for
> speed, so code density is not even as good as it should be, i.e. "jump" is
> used instead of "br", saving one clock, but costing two bytes, etc.
>
> 	Size	Count		Size	Count
> 	  8	  0		 16	 14
> 	 24	 13		 32	  1
> 	 40	  0		 48	  2
> 	Total	 30	Ave 22.1 bits per instruction.
>
> The average instruction size of 22.1 bits is indeed more than the 16 bits the
> original poster assumed, but much less than the 32 bits Henry expected, due
> in large part to the large number of 24 bit instructions available on the
> NS32000.  I believe the MC68000 would have a larger number of 32 bit
> instructions with the consequent increase in average bits-per-instruction.

I counted words for some random compiled code here for the 68000 (the
famous Knight's Tour, which I still have lying around).

	Size (bytes):	 2	 4	 6	 8
	Sample 1:	23	 4	 3	 0	Average 21.3 bits/instr. (30 instr., main())
	Sample 2:	18	 7	 4	 0	Average 23.4 bits/instr. (try())

The Greenhills compiler tries to make good use of registers.  On the other
hand, I don't know how much these instructions actually accomplished.  I
wouldn't be surprised to find that the 32032 does more with those 22 bits
per instruction, but I wouldn't be horribly shocked to find that the 68000
does more [despite my employer, I am a closet 32032 fan, by the way -- it
is almost as nice as the PDP-11 !-) ].
I suppose the true lesson is that "bits per instruction" is yet another
Red Herring in Heavy Oil: a computer with 64 bit instructions would have
"poor" "code density", but if one of those instructions was "Solve the
(reg0)x(reg0) Knight's Tour and output the table to channel (reg1)", the
program would be awfully short !-).
--
John Woods, Charles River Data Systems, Framingham MA, (617) 626-1101
...!decvax!frog!john, ...!mit-eddie!jfw, jfw%mit-ccc@MIT-XX.ARPA
Five tons of flax!
west@sdcsla.UUCP (Larry West) (06/12/85)
In article <2335@cornell.UUCP> hal@gvax.UUCP (Hal Perkins) writes:
> In article <601@terak.UUCP> doug@terak.UUCP (Doug Pardee) writes:
>> By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time.  The
>> 10 MHz NS32016 can do a 32-bit register-to-register signed multiply in
>> 8.3 usec.  The RISC-II cpu would have to be able to do the multiply in
>> only 25 cpu cycles in order to compete.  All the cache in the world
>> ain't gonna help...
>
> Now just a moment...  The Berkeley chips were built by academic folks
> using conservative design rules, etc.  If they had been built by
> experienced professional chip designers using state-of-the-art
> technology they would have been a lot faster.  ... <text omitted> ...

Hal's point is valid.  Another point to consider is this: what you really
want for a general-purpose computer is throughput.  How much of a typical
instruction stream is composed of 32-bit signed multiplies?  Obviously, in
many applications, multiplication would take on great significance.  But
one point of RISC is that optimizing a few special-purpose instructions
can hurt you in general-purpose applications.
--
Larry West    Institute for Cognitive Science    (USA+619-)452-6220 [x6220]
UC San Diego (mailcode C-015), La Jolla, CA 92093 U.S.A.
ARPA: <west@nprdc.ARPA>
UUCP: {ucbvax,sdcrdcf,decvax,ihnp4}!sdcsvax!sdcsla!west OR ulysses!sdcsla!west
chuck@dartvax.UUCP (Chuck Simmons) (06/13/85)
> In article <5673@utzoo.UUCP> henry@utzoo.UUCP (Henry Spencer) writes:
> >Also, don't be too sure that the 68* and 32* chips use 16-bit instructions
> >a lot.  Remember that things like offsets take extra bytes...
>
> In my experience 16 bit instructions get used quite often on the NS32000 in
> well written assembly code.  The NS32000 has an edge over the MC68000 in code
> density in that instructions need not be word-aligned.  Henry is correct in
> stating that offsets are used on virtually every memory reference, but those
> offsets are usually 8 bits.  Following are some statistics for one module
> pulled at random from my present project.  The code is highly optimized for
> speed, so code density is not even as good as it should be, i.e. "jump" is
> used instead of "br", saving one clock, but costing two bytes, etc.
>
> 	Size	Count
> 	  8	  0
> 	 16	 14
> 	 24	 13
> 	 32	  1
> 	 40	  0
> 	 48	  2
> 	Total	 30
> 	Ave 22.1 bits per instruction.
>
> The average instruction size of 22.1 bits is indeed more than the 16 bits the
> original poster assumed, but much less than the 32 bits Henry expected, due
> in large part to the large number of 24 bit instructions available on the
> NS32000.  I believe the MC68000 would have a larger number of 32 bit
> instructions with the consequent increase in average bits-per-instruction.
>
> Of course, no analogy can be drawn for compiled code.  Many compilers under-
> use register space and I suspect the average bits-per-instruction would be
> much higher for compiled code on both machines.
>
> :::::: Jan Steinman	Box 1000, MS 61-161	(w)503/685-2843 ::::::

Pulling out a "random" chunk of 68000 code (the only chunk I have sitting
nearby) gives us the following table:

	Size	Count
	 16	 22
	 32	 17
	 48	  1
	Total	 40
	Ave 23.6 bits per instruction.

I'm not sure that code density is really what we want to be looking at,
however.  I can recode the above algorithm to use more instructions that
are smaller on the average and thus achieve a higher code density.
(The 17 32-bit instructions can be replaced by 33 16-bit instructions and 2 32-bit instructions, giving an average of 18.4 bits per instruction.) However, this denser code would be both slower and longer. Somewhere we need to take into account the "usefulness" of each instruction. Chuck Simmons
hull@hao.UUCP (Howard Hull) (06/15/85)
In article 195@frog, John Woods writes: > On the other hand, I don't know how much these instructions actually > accomplished. I wouldn't be surprised to find that the 32032 does more with > those 22 bits per instruction, but I wouldn't be horribly shocked to find that > the 68000 does more [despite my employer, I am a closet 32032 fan, by the way- > it is almost as nice as the PDP-11 !-) ]. Just out of curiosity, what *ARE* you 320xx and 68k fans using for the PDP-11 instruction MOV (SP)+,@(R5)+ eh? (It's useful for initializing tables or mapped video windows from stack images.) I only need one or two examples, please, so read all your micro mail before replying to this. If you see a correct answer, and you think yours is better, please send me mail; in a subsequent posting, I'll summarize whatever I get. Thanks. Howard Hull [If yet unproven concepts are outlawed in the range of discussion... ...Then only the deranged will discuss yet unproven concepts] {ucbvax!hplabs | allegra!nbires | harpo!seismo } !hao!hull
henry@utzoo.UUCP (Henry Spencer) (06/17/85)
> By the way, the Berkeley RISC-II chip has a 330 ns cpu cycle time. The > 10 MHz NS32016 can do a 32-bit register-to-register signed multiply in > 8.3 usec. The RISC-II cpu would have to be able to do the multiply in > only 25 cpu cycles in order to compete. All the cache in the world > ain't gonna help... The RISC II was designed by a bunch of grad students, using simplistic design rules and mediocre MOS processing, with limited opportunity to try again if the original chips didn't work. The 10MHz NS32016 was done by a swarm of professional silicon bashers, using professional facilities and souped-up processes, over a long period of time with *many* design iterations. (In fact, too %@$@%#@% many for those of us who were waiting for things to settle down and work...) This comparison means little. Also, multiplies are actually fairly infrequent operations in most programs. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
henry@utzoo.UUCP (Henry Spencer) (06/17/85)
> My point was supposed to be that you could take a 680xx/320xx and limit > yourself to the RISC instruction set (the "equivalent" instructions) and > have 16-bit instructions instead of RISC's 32-bit instructions. Try doing memory references in 16 bits on either the 680xx or 320xx. It can't be done, because the opcode alone is 16 bits. And you need memory references more on the commercial machines, because the RISC's register architecture isn't present to reduce memory usage. > > instruction-encoding expander. The latter function makes a large > > difference to instruction density without introducing any extra delays. > > In a microcoded cpu, the opcode is decoded and causes a sequence of > very simple micro-instructions to be executed. > > In the proposed system, the opcode is decoded and causes a sequence of > very simple RISC instructions to be executed. Wrong, it causes *one* very simple RISC instruction to be executed. The expander function is purely an encoding technique, it does not introduce another level of emulation. The point is that the poor code density of current RISCs is the result of optimization for experimental work rather than production. More compact encodings, without change to the basic concept, are both possible and desirable once the basic issues are understood. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
jer@peora.UUCP (J. Eric Roskos) (06/20/85)
henry@utzoo.UUCP (Henry Spencer @ U of Toronto Zoology) writes: > Also, multiplies are actually fairly infrequent operations in most programs. That is, if "most programs" don't use multidimensional arrays... -- Shyy-Anzr: J. Eric Roskos UUCP: ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer US Mail: MS 795; Perkin-Elmer SDC; 2486 Sand Lake Road, Orlando, FL 32809-7642 Bar ol bar / Gur pbyq rgpurq cyngr / Unf cevagrq gur jnez fgnef bhg.
henry@utzoo.UUCP (Henry Spencer) (06/21/85)
> > Also, multiplies are actually fairly infrequent operations in most programs. > > That is, if "most programs" don't use multidimensional arrays... Except in a few specific environments, most programs indeed do not use multidimensional arrays. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
kds@intelca.UUCP (Ken Shoemaker) (06/21/85)
> introduce another level of emulation. The point is that the poor code > density of current RISCs is the result of optimization for experimental > work rather than production. More compact encodings, without change to > the basic concept, are both possible and desirable once the basic issues > are understood. > -- > Henry Spencer @ U of Toronto Zoology > {allegra,ihnp4,linus,decvax}!utzoo!henry er, I don't think so...one of the basic concepts of RISCs as I understand them is to reduce the number of pipeline stages in the execution of instructions. Adding another just to do an expansion/shuffling of opcode bytes flies directly in the face of that thinking. -- ...and I'm sure it wouldn't interest anybody outside of a small circle of friends... Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm {pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds ---the above views are personal. They may not represent those of the employer of its submitter.
henry@utzoo.UUCP (Henry Spencer) (06/24/85)
> er, I don't think so...one of the basic concepts of RISCs as I understand > them is to reduce the number of pipeline stages in the execution of > instructions. Adding another just to do an expansion/shuffling of > opcode bytes flies directly in the face of that thinking. Remember that the RISC is necessarily decoding its opcodes; not even on a machine that simple is a numeric opcode a direct encoding of the internal control signals. The point is that the current instruction encoding was chosen for simplicity, not compactness, and one can do better *without* compromising the principles of the machine. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
baba@spar.UUCP (Baba ROM DOS) (06/25/85)
> The 10MHz NS32016 was done by > a swarm of professional silicon bashers, using professional facilities > and souped-up processes, over a long period of time with *many* design > iterations. > Don't you see a contradiction lurking in there? Baba ROM DOS
brooks@lll-crg.ARPA (Eugene D. Brooks III) (06/25/85)
> > > Also, multiplies are actually fairly infrequent operations in most programs. > > > > That is, if "most programs" don't use multidimensional arrays... > > Except in a few specific environments, most programs indeed do not use > multidimensional arrays. Perhaps we could do without the integer multiplier :-), but we had better not drop the floating point adder and multiplier, and they had both better operate at one op per clock after filling the pipeline.
meissner@rtp47.UUCP (Michael Meissner) (06/25/85)
Another case where multiplication occurs frequently is constructs of the sort:

	struct {
	    long l;
	    short s;
	} *p;
	int i;

	main() {
		/*...*/
		p += i;
		p[i].l = 1;
		/*...*/
	}

I.e., pointer arithmetic involving non-constant integers, particularly if the size is not a power of 2.
--
Michael Meissner
Data General Corporation
...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner
guy@sun.uucp (Guy Harris) (06/26/85)
> > > > Also, multiplies are actually fairly infrequent operations in most > > > > programs. > > > > > > That is, if "most programs" don't use multidimensional arrays... > > > > Except in a few specific environments, most programs indeed do not use > > multidimensional arrays. > > Perhaps we could do without the integer multiplier :-), but we had better > not drop the floating point adder and multiplier, and they had both better > operate at one op per clock after filling the pipeline. Methinks "most" is in the eye of the beholder. Most UNIX utilities, and the UNIX kernel, do relatively few multiplications *or* floating point operations (the kernel need not do any; the 4.2BSD scheduler does, but the Sun 4.2BSD scheduler and the 2.9BSD scheduler, which have the same load-average dependent decay on the average weighting for "p_cpu", don't) and rarely use multidimensional arrays. The same is probably true for many other classes of applications, such as the office automation/"personal productivity" applications popular on PCs, for instance. However, I suspect a scientific programmer would answer differently. I used multidimensional arrays in scientific programs I've written; they are quite common in such programs (they are a natural data structure for many types of calculations, like those which use models of two-dimensional or three-dimensional spaces). Those programs also would like the FP adder and multiplier; I don't know how many integer multiplies or divides they do, other than in support of accesses to multi-dimensional arrays. In the case of such accesses, if the indices being used are linear functions of the induction variable of a loop (I don't know whether there are any looser conditions possible), a strength reduction can make the multiplications unnecessary (C programmers may know this trick as "using a pointer instead of an array subscript"). Guy Harris
henry@utzoo.UUCP (Henry Spencer) (06/26/85)
> > The 10MHz NS32016 was done by > > a swarm of professional silicon bashers, using professional facilities > > and souped-up processes, over a long period of time with *many* design > > iterations. > > Don't you see a contradiction lurking in there? I wish I did... Professional people and facilities plus plenty of time would suggest getting it right the first time, except that it just makes them more ambitious instead. The souped-up processes help speed but don't do a lot for correctness. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
ken@turtlevax.UUCP (Ken Turkowski) (06/26/85)
>>> Also, multiplies are actually fairly infrequent operations in most programs. >> >> That is, if "most programs" don't use multidimensional arrays... > >Except in a few specific environments, most programs indeed do not use >multidimensional arrays. Depending on what your product is, you may not need multiplication at all. If the end product is a UN*X machine, most UN*X utilities do not need multiplications. However, if you are running any engineering or scientific applications, the time devoted to multiplication is considerable, and may even dominate execution time if there is no hardware support for it. Saving the status register is an infrequent operation; why not do multiple conditional branches instead? Virtually no application programs need it at all. The operating system just needs it to switch contexts, which is a task done infrequently, and takes so much time as it is... :-) -- Ken Turkowski @ CADLINC, Menlo Park, CA UUCP: {amd,decwrl,hplabs,nsc,seismo,spar}!turtlevax!ken ARPA: turtlevax!ken@DECWRL.ARPA ***** Support your local number cruncher: join PIFOM! ***** ***** (Programmers In Favor Of Multiplication) *****
rbt@sftig.UUCP (R.Thomas) (06/28/85)
> Saving the status register is an infrequent operation; why not do > multiple conditional branches instead? Virtually no application > programs need it at all. The operating system just needs it to switch > contexts, which is a task done infrequently, and takes so much time as > it is... :-) > -- > > Ken Turkowski @ CADLINC, Menlo Park, CA It's not as funny as you think! The old UNIVAC 1107 did just exactly that to save the status of the carry and overflow flags. And in order to restore them, it had to actually do an arithmetic sequence that would force the flags to have the right values. There were *no* instructions to directly save or restore these flags! An old time hacker, Rick Thomas attunix!rbt For that matter, some benighted hardware designs protect the instructions that load the flags register because there are flags in that register that you would rather not have the user mucking around with. On those machines, if you want to save and restore the flags from user mode, you have to resort to some such trick as the above. I know, there are better ways to design systems, but some designers are not as smart as some designers.
bobbyo@celerity.UUCP (Bob Ollerton) (06/29/85)
Gee, if someone built a computer that did really fast multiplication (whatever "really fast" is) should they call it the "Rabbit"? I am sure that there are many designs that offer fast, slow, and no hardware support for multiplication. The user pays for the cost of support for multiplication hardware in the final analysis, either in the cost of the system, or the time spent knitting paper clips together while running a job. Maybe designs without fast multiplication (and floating point!) should be called the "Lizard". Then someone can introduce the Micro-Lizard II. ::)) <-echo. bob. . -- Bob Ollerton; Celerity Computing; 9692 Via Excelencia; San Diego, Ca 92126; (619) 271 9940 {decvax || ucbvax || ihnp4}!sdcsvax!celerity!bobbyo akgua!celerity!bobbyo
sasaki@harvard.ARPA (Marty Sasaki) (06/29/85)
This may be an incredibly stupid question, but I remember reading an article (in CACM, I think) on RISC machines where most of the instructions occupied a single byte of memory. Through cleverness, something like 95% of all instructions executed occupied a single byte (including operands). This meant that programs would be smaller and would run faster by the simple fact that they required fewer memory references to get work done. All of the recent discussion on RISC machines hasn't mentioned this at all. Have things changed sufficiently in the recent past that this small program size doesn't matter? Is this a really dumb question? -- ---------------- Marty Sasaki net: sasaki@harvard.{arpa,uucp} Harvard University Science Center phone: 617-495-1270 One Oxford Street Cambridge, MA 02138
mash@mips.UUCP (John Mashey) (06/30/85)
Michael Meissner ...{ ihnp4, decvax }!mcnc!rti-sel!rtp47!meissner writes:
> Another case where multiplication occurs frequently is constructs of the
> sort:
>	struct {
>	    long l;
>	    short s;
>	} *p;
>	int i;
>
>	main() {
>		/*...*/
>		p += i;
>		p[i].l = 1;
>		/*...*/
>	}
> I.e., pointer arithmetic involving non-constant integers, particularly if
> the size is not a power of 2.

1) In this example, sizeof(*p) == 8 anyway. To get the intended effect, use short l[2], for example. 2) In any case, at least some compilers not only do multiplies of powers of 2 by adds or shifts, but do so for constants that are almost powers of 2 (i.e., few 1-bits) by shifts and adds or subtracts. 3) Indeed, multiplication does occur frequently in these cases; the real question is how frequent are these cases, really? [Not an answer, but a question whose answer needs to be known when you're making tradeoffs in CPU design]. -- -john mashey UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!mash DDD: 415-960-1200 USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
larry@mips.UUCP (Larry Weber) (07/01/85)
> Another case where multiplication occurs frequently is constructs of the
> sort:
>
>	struct {
>	    long l;
>	    short s;
>	} *p;
>	int i;
>
>	main() {
>		/*...*/
>		p += i;
>		p[i].l = 1;
>		/*...*/
>	}
>

Almost all the examples provided actually use multiplication by a constant. Thus, on all but the largest (and most expensive) of machines there are shift/add/subtract sequences that do the trick nicely:

	x * 6   == ((x<<1)+x)<<1	three instructions
	x * 7   == (x<<3)-x		two instructions
	x * 119 == (((x<<4)-x)<<3)-x	four instructions

These sequences are often smaller than a call to a multiply routine and almost always faster than a general purpose multiply instruction. Most of the constants are powers of two, near powers of two (31, 255) or simple sums of powers of two (6, 10, 24). Some machines are better than others at this sort of thing, so study those timings carefully. This is a winner on the 68k. -- -Larry B Weber UUCP: {decvax,ucbvax,ihnp4}!decwrl!mips!larry DDD: 415-960-1200 USPS: MIPS Computer Systems, 1330 Charleston Rd, Mtn View, CA 94043
kds@intelca.UUCP (Ken Shoemaker) (07/02/85)
> Remember that the RISC is necessarily decoding its opcodes; not even on > a machine that simple is a numeric opcode a direct encoding of the internal > control signals. The point is that the current instruction encoding was > chosen for simplicity, not compactness, and one can do better *without* > compromising the principles of the machine. I didn't say that it didn't decode them...My understanding is that the various bits of the instructions always come into the processor on the same set of data lines, which eliminates the need for some other unit out there to steer the bits from the appropriate data lines to the correct decoding unit. -- ...and I'm sure it wouldn't interest anybody outside of a small circle of friends... Ken Shoemaker, Microprocessor Design for a large, Silicon Valley firm {pur-ee,hplabs,amd,scgvaxd,dual,qantel}!intelca!kds ---the above views are personal. They may not represent those of the employer of its submitter.
jeff@gatech.CSNET (Jeff Lee) (07/04/85)
> > Saving the status register is an infrequent operation; why not do > > multiple conditional branches instead? Virtually no application > > programs need it at all. The operating system just needs it to switch > > contexts, which is a task done infrequently, and takes so much time as > > it is... :-) > > It's not as funny as you think! The old UNIVAC 1107 did just exactly that > to save the status of the carry and overflow flags. And in order to > restore them, it had to actually do an arithmetic sequence that would > force the flags to have the right values. There were *no* instructions to > directly save or restore these flags! And then (moving slightly off the subject) there was the code that was needed to save all your registers on a CDC-6000 type machine. It had an interesting architecture where when you loaded register A1-A5 with a number the corresponding X1-X5 received the contents of that memory location. When you loaded registers A6-A7 with a number the contents of X6-X7 was stored in that location. Registers A0 and X0 were not associated like this. In order to store a register, you had to change an A-register value. As I remember it, the code used one of the index registers (B-registers) and performed a series of 18 conditional Return-Jumps based on the sign of the chosen B-register and shifting it 1 bit to the left at a time. Then it could nuke that B-register and save A6 or A7 in it and start storing values. It then could reconstruct the killed B-register by tracing through the memory locations used as targets for the return jumps. Can you say CONVOLUTED boys and girls?? Needless to say, the software designers decided that if YOU want a register then YOU better save it because they aren't about to save them all. Enjoy, -- Jeff Lee CSNet: Jeff @ GATech ARPA: Jeff%GATech.CSNet @ CSNet-Relay.ARPA uucp: ...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!jeff
johnl@ima.UUCP (07/07/85)
Having a hardware multiplication instruction isn't as much of a win as you might think. On everybody's favorite chip, the 8088, a 16 x 16 multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles respectively. This means that for practically any constant multiplier you'll get faster code by constructing your multiply from shifts, adds, and subtracts. Even on other higher powered machines a multiply is still typically 10 times slower than an add or shift, so that it's still usually faster to do the addition chain for small multipliers such as the ones you usually run into when doing subscripting. John Levine, ima!johnl
gottlieb@cmcl2.UUCP (Allan Gottlieb) (07/09/85)
Yes the 6600 was a poor architecture for saving all the registers, but it did multiply fast :-). We are actually quite fond of the convoluted code sequence that you mentioned since it was discovered here when we had an early timesharing system for the 6600.
--
Allan Gottlieb GOTTLIEB@NYU {floyd,ihnp4}!cmcl2!gottlieb <---the character before the 2 is an el
franka@mmintl.UUCP (Frank Adams) (07/09/85)
In article <149@mips.UUCP> mash@mips.UUCP (John Mashey) writes: (concerning integer multiplies) > the real >question is how frequent are these cases, really? [Not an answer, but >a question whose answer needs to be known when you're making tradeoffs >in CPU design]. Actually, this still isn't quite the right question. The right question is how common *should* these cases be? By which I mean, assuming good programming techniques. I have seen too many structure definitions and arrays which were padded or shrunk to make the size a power of two. Programmers shouldn't (and shouldn't have to) worry about such things! Analysis of current programming practices reflects the reactions to current (and past) architectures, and tends to propagate mistakes. All right, I will admit that such analysis is better than no analysis at all. But be aware of its limitations.
richardt@orstcs.UUCP (richardt) (07/10/85)
As I read, I see a *lot* of comments about how "most 68xxx & 32xxx instructions take 32 bits anyway!" I would like to suggest that a large factor in this may be due to *sloppy compilers*! I have written 68000 code as a hobbyist for about the last three years. I have started writing BASIC interpreters several times and a FORTH-relative once. In all of those cases, I found that the average instruction width (# of bits of code / # of instructions) was ~18 bits. That's a rough estimate; it may be a little high. When you add in data, then things get interesting. A program that uses a lot of predetermined data runs up that average a lot more than you think. When I sat down and started to write op system routines, the average instruction was about 20 bits. (Note: I am using a mathematical average here.) My point is, if most of your programs have larger average instruction widths, maybe you've got a sloppily-written compiler. The average instruction width on a 'thousand needn't be anywhere near 32 bits. So look at your compiler before you go running off screaming 'Give me a RISC.' I'll step down off my soapbox now, and someone else can defend the National chips. I don't have the info (yet). ---------------------------------------- Is there an assembly-language programmer in the house? orstcs!richardt /* ---------- */
mat@amdahl.UUCP (Mike Taylor) (07/12/85)
> Having a hardware multiplication instruction isn't as much of a win as > you might think. On everybody's favorite chip, the 8088, a 16 x 16 > multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles > respectively. This means that for practically any constant multiplier > you'll get faster code by constructing your multiply from shifts, adds, > and subtracts. Unless the multiplier is very wide and smart enough to do things like sum partial products in parallel with each multiplication. This gives speed you couldn't get with the above. However, it sure is not obvious what the right thing to do is. -- Mike Taylor ...!{ihnp4,hplabs,amd,sun}!amdahl!mat [ This may not reflect my opinion, let alone anyone else's. ]
peterb@pbear.UUCP (07/13/85)
/* Written 9:12 pm Jul 11, 1985 by amdahl!mat in pbear:net.arch */ >> Having a hardware multiplication instruction isn't as much of a win as >> you might think. On everybody's favorite chip, the 8088, a 16 x 16 >> multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles >> respectively. This means that for practically any constant multiplier >> you'll get faster code by constructing your multiply from shifts, adds, >> and subtracts. >Unless the multiplier is very wide and smart enough to do things like sum >partial products in parallel with each multiplication. This gives speed >you couldn't get with the above. However, it sure is not obvious what >the right thing to do is. >-- >Mike Taylor ...!{ihnp4,hplabs,amd,sun}!amdahl!mat /* End of text from pbear:net.arch */ From the timing, one can see that since an 8 bit multiply can vary in cycles from 70 - 77 and a 16 bit multiply takes from 118 - 133, the difference in timing is 7 and 15 respectively. This is a by-product of shift/add sequences to do the multiply, and it is slower than need be. A 26116 can do a 16 by 16 multiply in about 19 clock cycles. That's about 1.9 microseconds for a 100ns cycle time. Why it is so slow I don't know. Could anybody explain this??? Peter Barada {ihnp4!inmet|{harvard|cca}!ima}!pbear!peterb
henry@utzoo.UUCP (Henry Spencer) (07/15/85)
> As I read, I see a *lot* of comments about how "most 68xxx & 32xxx > instructions take 32 bits anyway!" I would like to suggest that a large > factor in this may be due to *sloppy compilers*! ... > ... > Is there an assembly-language programmer in the house? Speaking as the one who started this particular line of discussion, the numbers I originally quoted were from people working in assembler, doing implementations of the same little interpretive-language kernel on various different machines. -- Henry Spencer @ U of Toronto Zoology {allegra,ihnp4,linus,decvax}!utzoo!henry
jeff@gatech.CSNET (Jeff Lee) (07/16/85)
> > Having a hardware multiplication instruction isn't as much of a win as > > you might think. On everybody's favorite chip, the 8088, a 16 x 16 > > multiply takes about 115 cycles, while shifts and adds are 2 and 3 cycles > > respectively. This means that for practically any constant multiplier > > you'll get faster code by constructing your multiply from shifts, adds, > > and subtracts. Most engineers would disagree with the statement that having hardware multiplication is not a big win (I am NOT an engineer). > Unless the multiplier is very wide and smart enough to do things like sum > partial products in parallel with each multiplication. This gives speed > you couldn't get with the above. However, it sure is not obvious what > the right thing to do is. I remember looking over the algorithms for the Cyber 170/750 hardware (or was it a 170/755; those old machines). The machine's central clock was on the order of 50ns. It performed weird splits of the bits to produce the intermediate products and performed a heap of adding in parallel. The result was that the first stage of the pipeline (I think it was a 2 stage job) could accept a pair of operands (48-bit mantissa, 11-bit exponent, 1-bit sign) every 2 cycles. What sort of algorithm does Cray now use for his Cray-II machine? Does he get his multiplies done in a single clock cycle or are they still pipelined? A friend of mine (previously an electrical engineer, but turned computer science) took a class in hardware algorithms. It looked pretty interesting but also not your typical "seat-of-the-pants" type algorithms. Quite a bit of thought went into the magic.... -- Jeff Lee CSNet: Jeff @ GATech ARPA: Jeff%GATech.CSNet @ CSNet-Relay.ARPA uucp: ...!{akgua,allegra,hplabs,ihnp4,linus,seismo,ulysses}!gatech!jeff
franka@mmintl.UUCP (Frank Adams) (07/16/85)
With all the discussion about multiplies, let me express one of my pet peeves, which is that integer multiplies are hardly ever (never, in my experience) defined in a manner suitable for their most common use. I would guess, conservatively, that well over 90% of the integer multiplies that are done generate a result with no more bits than the larger of the two operands. That is, one multiplies two words and wants a word result. But hardware multiplies invariably generate a two word result, leaving the high-order word to be allowed for and/or disposed of. This is a problem equally for compilers and hand-written assembler code. The same problem occurs for division. Incidentally, re the subscript computation problem, why not a command which multiplies a word by a one-byte constant? This would deal cleanly with most two-dimensional arrays and arrays of structures, yet should be implementable as a reasonably efficient instruction.
nather@utastro.UUCP (Ed Nather) (07/18/85)
> I would guess, conservatively, that well over 90% of the integer multiplies > that are done generate a result with no more bits than the larger of the > two operands. I didn't know that over 90% of multiplication operations were multiplying something by 1. Wow. You learn something new on the net every day! -- Ed Nather Astronomy Dept, U of Texas @ Austin {allegra,ihnp4}!{noao,ut-sally}!utastro!nather nather%utastro.UTEXAS@ut-sally.ARPA
mahar@weitek.UUCP (mahar) (07/18/85)
In article <493@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes: > Which is that integer multiplies are hardly ever (never, in my > experience) defined in a manner suitable for their most common use. > I would guess, conservatively, that well over 90% of the integer multiplies > that are done generate a result with no more bits than the larger of the > two operands. That is, one multiplies two words and wants a word result. > But hardware multiplies invariably generate a two word result, leaving the > high-order word to be allowed for and/or disposed of. The Decsystem-10 & 20 have an instruction IMUL which does exactly what you want. It multiplies a 36-bit value by another 36-bit value and gives a 36-bit result. They also have a divide which divides a 36-bit number.
crandell@ut-sally.UUCP (Jim Crandell) (07/19/85)
> > I would guess, conservatively, that well over 90% of the integer multiplies > > that are done generate a result with no more bits than the larger of the > > two operands. > > I didn't know that over 90% of multiplication operations were multiplying > something by 1. Wow. You learn something new on the net every day! They're not. Actually, 45% multiply by 1, and another 45%+ multiply by 0. -- Jim Crandell, C. S. Dept., The University of Texas at Austin {ihnp4,seismo,ctvax}!ut-sally!crandell
jss@sjuvax.UUCP (J. Shapiro) (07/22/85)
> > ... one multiplies two words and wants a word result.
> > But hardware multiplies invariably generate a two word result, leaving the
> > high-order word to be allowed for and/or disposed of.
>
> The Decsystem-10 & 20 has an instruction IMUL which does exactly what you
> want. It Multiplies a 36bit value by another 36 bit value and gives a 36
> bit result. They also have a divide which divides a 36 bit number.

For comparison and elucidation, so do:

	VAX
	PDP-11
	National 32016 and family
	Motorola 68000, 68020, 6800, 6809, ...
	Zilog Z8000
	Z80
	Intel 8086, 80186, 82086

-----
Bell Laboratories is an Artificial Intelligence project of the UNIX (tm) operating system... (borrowed from an unknown wit)

Jon Shapiro
franka@mmintl.UUCP (Frank Adams) (07/22/85)
[Me] >> I would guess, conservatively, that well over 90% of the integer multiplies >> that are done generate a result with no more bits than the larger of the >> two operands. [Ed Nather] >I didn't know that over 90% of multiplication operations were multiplying >something by 1. Wow. You learn something new on the net every day! Are you being intentionally obtuse? The number of bits in an operand is the number of bits being used to store the operand, not the number of significant bits in the operand. Sheesh.
jans@mako.UUCP (Jan Steinman) (07/23/85)
In article <2397@ut-sally.UUCP> Jim Crandell writes, quotes: >>>I would guess, conservatively, that well over 90% of the integer multiplies >>>that are done generate a result with no more bits than the larger of the ^^^^ >>>two operands. >> >>I didn't know that over 90% of multiplication operations were multiplying >>something by 1. Wow. You learn something new on the net every day! > >They're not. Actually, 45% multiply by 1, and another 45%+ multiply by 0. > *** Sarcasm Mode Off *** One would assume the original poster meant atomic units, such as machine words, rather than (the obviously wrong) bits. It should be stressed that such determination is essentially a run-time operation, and the additional overhead of overflow detection MUST then be done. -- :::::: Jan Steinman Box 1000, MS 61-161 (w)503/685-2843 :::::: :::::: tektronix!tekecs!jans Wilsonville, OR 97070 (h)503/657-7703 ::::::
louie@umd5.UUCP (07/23/85)
In article <1202@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
>> > ... one multiplies two words and wants a word result.
>> > But hardware multiplies invariably generate a two word result, leaving the
>> > high-order word to be allowed for and/or disposed of.
>> The Decsystem-10 & 20 has an instruction IMUL which does exactly what you
>> want. It Multiplies a 36bit value by another 36 bit value and gives a 36
>> bit result. They also have a divide which divides a 36 bit number.
>
>For comparison and elucidation, so do:
>
>	VAX
>	PDP-11
>	National 32016 and family
>	Motorola 68000, 68020, 6800, 6809, ...
>	Zilog Z8000
>	Z80

Huh?  You think that a Z80 is clever enough to MULTIPLY?  Mine sure doesn't.

By the way, the Sperry 1100 series also has a Multiply Single Integer (MSI)
instruction that gives a 36 bit result from two 36 bit operands.  This seems
to get used much more often (at least in my code) than the Multiply Integer
(MI) instruction that gives a 72 bit result.

>	Intel 8086, 80186, 82086
>
>Jon Shapiro
-- 
Louis A. Mamakos WA3YMH
University of Maryland, Computer Science Center
 Internet: louie@umd5.arpa
 UUCP: {seismo!umcp-cs, ihnp4!rlgvax}!cvl!umd5!louie
darrell@sdcsvax.UUCP (Darrell Long) (07/23/85)
In article <233@weitek.UUCP> mahar@weitek.UUCP (mahar) writes:
>In article <493@mmintl.UUCP>, franka@mmintl.UUCP (Frank Adams) writes:
>> Which is that integer multiplies are hardly ever (never, in my
>> experience) defined in a manner suitable for their most common use.
>> I would guess, conservatively, that well over 90% of the integer multiplies
>> that are done generate a result with no more bits than the larger of the
>> two operands.  That is, one multiplies two words and wants a word result.
>> But hardware multiplies invariably generate a two word result, leaving the
>> high-order word to be allowed for and/or disposed of.
>
>The Decsystem-10 & 20 has an instruction IMUL which does exactly what you
>want. It Multiplies a 36bit value by another 36 bit value and gives a 36
>bit result. They also have a divide which divides a 36 bit number.

This is also true for the VAX-11 and the AT&T 3B series.  If you want a two
word result then you must ask for it explicitly (if you can get it at all).
-- 
Darrell Long
Department of Electrical Engineering and Computer Science
University of California, San Diego

USENET: sdcsvax!darrell
ARPA:   darrell@sdcsvax
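[The "ask for it explicitly" convention above is also how C itself defines multiplication, which is worth noting since this whole thread is about what compilers want from the hardware. A minimal sketch:]

```c
#include <stdint.h>

/* In C, word * word yields a word: the high-order half of the product is
 * simply discarded (for unsigned types, the result is reduced mod 2^16
 * here).  This is the single-width multiply the original poster says
 * covers well over 90% of uses. */
uint16_t mul_word(uint16_t a, uint16_t b)
{
    return (uint16_t)((uint32_t)a * b);   /* high half thrown away */
}

/* To get the two-word result you must ask for it explicitly, by widening
 * an operand before multiplying -- the software analogue of choosing a
 * machine's double-width multiply instruction. */
uint32_t mul_word_wide(uint16_t a, uint16_t b)
{
    return (uint32_t)a * b;               /* explicit two-word result */
}
```

[On a machine with an IMUL-style instruction the first function is one instruction; on one that only produces double-width products the compiler must discard or check the high word itself.]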
doug@terak.UUCP (Doug Pardee) (07/23/85)
> But hardware multiplies invariably generate a two word result, leaving the
> high-order word to be allowed for and/or disposed of.  This is a problem
> equally for compilers and hand-written assembler code.
>
> The same problem occurs for division.

National's NS320xx CPUs do byte*byte->byte, word*word->word, and
long*long->long multiplication (and division) as the normal cases.  They
also have "extended" multiply and divide instructions with the
"conventional" double-length product/dividend, but these are a bit more
difficult to use because they are oriented toward extended-precision
arithmetic (for example, these are strictly unsigned operations).

> Incidently, re the subscript computation problem, why not a command which
> multiplies a word by a one-byte constant? ...

The NS320xx instruction set includes the INDEX instruction, which multiplies
a subscript value by a length and adds the result to a register.  The length
can be contained in a register or memory or can be a constant, and can be 8,
16, or 32 bits long (unsigned).  The catch is that it must be the same
length as the subscript.
-- 
Doug Pardee -- Terak Corp. -- !{ihnp4,seismo,decvax}!noao!terak!doug
               ^^^^^--- soon to be CalComp
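[The subscript arithmetic that INDEX is aimed at can be sketched in C. This follows the posting's description (multiply a subscript by a length and accumulate into a register), not necessarily the exact NS32000 semantics; the helper names and the array dimensions are invented for illustration.]

```c
#include <stdint.h>

/* One INDEX-style step, as described above: scale a subscript by an
 * element length and add it into an address register. */
uint32_t index_step(uint32_t reg, uint32_t subscript, uint32_t length)
{
    return reg + subscript * length;
}

/* Addressing a[i][j] in a hypothetical 10 x 20 array of 4-byte elements:
 * two INDEX-style steps, one per dimension.  Note that each multiply here
 * wants a single-width result -- the case the whole thread is about. */
uint32_t addr_2d(uint32_t base, uint32_t i, uint32_t j)
{
    uint32_t r = base;
    r = index_step(r, i, 20u * 4u);   /* i * row size (bytes)     */
    r = index_step(r, j, 4u);         /* j * element size (bytes) */
    return r;
}
```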
peterb@pbear.UUCP (07/24/85)
Jon, The 8086 / 80186 do NOT produce same magnitude results from multiplication and division. AX is mulitiplied by the operand and the result is left in DX:AX. This is 32 bits, not the original 16 that the operands are! Peter Barada {ihnp4!inmet|{harvard|cca}!ima}!pbear!peterb
tc@amd.UUCP (Tom Crawford) (07/24/85)
The SDS 9300 had an instruction called Twin Multiply.  It treated two 24-bit
operands as pairs of 12-bit integers and performed two 12 x 12 => 24
multiplies.  Who's SDS, you ask?  Scientific Data Systems, of course!

Tom Crawford
...amd!tc
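[The Twin Multiply idea above can be emulated in a few lines of C. This is a sketch of the concept only: the real SDS 9300 may well have treated the 12-bit fields as signed, whereas unsigned fields are assumed here to keep the sketch simple.]

```c
#include <stdint.h>

/* Treat each 24-bit operand as a pair of unsigned 12-bit integers and
 * form two independent 12 x 12 -> 24 products, in the spirit of the
 * SDS 9300's Twin Multiply (an early packed/SIMD-style multiply). */
void twin_multiply(uint32_t a, uint32_t b,
                   uint32_t *hi_prod, uint32_t *lo_prod)
{
    uint32_t a_hi = (a >> 12) & 0xFFFu, a_lo = a & 0xFFFu;
    uint32_t b_hi = (b >> 12) & 0xFFFu, b_lo = b & 0xFFFu;

    *hi_prod = a_hi * b_hi;   /* each product fits in 24 bits */
    *lo_prod = a_lo * b_lo;
}
```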
jer@peora.UUCP (J. Eric Roskos) (07/25/85)
> Incidently, re the subscript computation problem, why not a command which
> multiplies a word by a one-byte constant?  This would deal cleanly with
> most two-dimensional arrays and arrays of structures, yet should be
> implementable as a reasonably efficient instruction.

Why not have instructions that contain small constants built into the
opcode?  What?  Never heard of that?  I guess it's because it's not a
RISC...

Seriously, though, the above sorts of things are quite common in current
CISC research... in fact, there are some good papers (e.g., by Flynn) on
optimally encoding instructions; this is where the recently-much-maligned
ideas on having machines with different instruction sets for different
applications/compilers come from (whether or not different COMPILERS need
different instruction sets, or whether it's really the APPLICATION that
determines it, depends on who you talk to).  Actually it's much more
general than that, though, since you use an optimization scheme to minimize
the length of instructions for some desired set of programs; and a result
of this is that short constants come out encoded in the opcodes for some
common classes of programs.
-- 
Shyy-Anzr:  J. Eric Roskos
UUCP:       ..!{decvax,ucbvax,ihnp4}!vax135!petsd!peora!jer
US Mail:    MS 795; Perkin-Elmer SDC;
	    2486 Sand Lake Road, Orlando, FL 32809-7642
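[A toy sketch of the "small constants built into the opcode" idea above, in the style of the quick-immediate instructions several real machines offer. Everything here is invented for illustration: the opcode value, the mnemonic, and the 3-bit field; real encodings come out of the frequency-driven optimization the posting describes.]

```c
#include <stdint.h>

/* Hypothetical one-byte "multiply-quick" instruction: the low 3 bits of
 * the opcode hold a small constant in 1..8, so the common small
 * multipliers from subscript arithmetic need no separate immediate
 * operand byte. */
#define OP_MULQ 0xA8u                 /* 10101xxx: xxx encodes (k - 1) */

uint8_t encode_mulq(unsigned k)       /* k must be in 1..8 */
{
    return (uint8_t)(OP_MULQ | ((k - 1u) & 0x7u));
}

unsigned decode_mulq_const(uint8_t opcode)
{
    return (unsigned)(opcode & 0x7u) + 1u;   /* recover the constant */
}
```

[The payoff is in code size: for a program whose multiplies are dominated by a handful of small element lengths, the constant costs zero extra bytes, which is exactly what a frequency-based encoding optimizer would discover.]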
henry@utzoo.UUCP (Henry Spencer) (07/25/85)
> > The Decsystem-10 & 20 has an instruction IMUL which does exactly what you
> > want. It Multiplies a 36bit value by another 36 bit value and gives a 36
> > bit result. They also have a divide which divides a 36 bit number.
>
> For comparison and elucidation, so do:
>
>	VAX
>	PDP-11
>	National 32016 and family
>	Motorola 68000, 68020, 6800, 6809, ...
>	Zilog Z8000
>	Z80
>	Intel 8086, 80186, 82086

From personal experience, I can assure you that the PDP-11 and the 6809
most definitely do not.  And the 6800 has no multiply instruction at all.
I don't think the Z80 does either.  The 68000 has 16x16->32, but neither
32x32->32 nor 16x16->16 that I remember.  And my (admittedly dim) memory of
the Z8000 is that its multiply instructions yield double-width results.
-- 
				Henry Spencer @ U of Toronto Zoology
				{allegra,ihnp4,linus,decvax}!utzoo!henry
franka@mmintl.UUCP (Frank Adams) (07/25/85)
In article <1202@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
>> > ... one multiplies two words and wants a word result.
>> > But hardware multiplies invariably generate a two word result, leaving the
>> > high-order word to be allowed for and/or disposed of.
>>
>> The Decsystem-10 & 20 has an instruction IMUL which does exactly what you
>> want.
>
>For comparison and elucidation, so do:
>
>	VAX
>	PDP-11
>	National 32016 and family
>	Motorola 68000, 68020, 6800, 6809, ...
>	Zilog Z8000
>	Z80
>	Intel 8086, 80186, 82086
>

All right, I shouldn't have said *invariably*.  The PDP-10 is a good
machine.  I am not sufficiently familiar with most of the others in this
list, but the 8086 does *not* have such an instruction.  You can multiply
two 8-bit quantities to get a 16-bit result, or two 16-bit quantities to
get a 32-bit result.  The MUL/IMUL distinction on the 8086 is for signed
vs. unsigned operands.
mhs@enmasse.UUCP (Mike Schloss) (07/26/85)
> > The Decsystem-10 & 20 has an instruction IMUL which does exactly what you
> > want. It Multiplies a 36bit value by another 36 bit value and gives a 36
> > bit result. They also have a divide which divides a 36 bit number.
>
> For comparison and elucidation, so do:
>
>	VAX
>	PDP-11
>	National 32016 and family
>	Motorola 68000, 68020, 6800, 6809, ...
>	Zilog Z8000
>	Z80
>	Intel 8086, 80186, 82086
>
> Jon Shapiro

I may be missing something but my Motorola 68000 manual says nothing about
32 bit mults.  Just 16*16 signed and unsigned.  The 68020 does have 32*32
but I doubt any in the 6800 family do.

Mike Schloss
jack@boring.UUCP (08/05/85)
In article <1202@sjuvax.UUCP> jss@sjuvax.UUCP (J. Shapiro) writes:
<Included text about a machine that multiplies and divides two N bit
numbers, giving an N bit result>
>For comparison and elucidation, so do:
>
>	VAX
Dunno, could be.
>	PDP-11
Wrong.  Both MUL and DIV (If you don't own an 11/10, which doesn't even
have a DIV:-) use a register *pair* for a 32 bit quantity.
>	National 32016 and family
Right.
>	Motorola 68000, 68020
More-or-less right (though there's some tricky sign bit stuff, I remember).
>	6800
MUL?  DIV????  WHoeaaaaaaa!!!  You can be glad that it has an ADD and a SUB!!
>	6809
Wrong.  8x8->16 multiply, and no DIV, I believe.
>	Zilog Z8000
Dunno.
>	Z80
Cannot DIV or MUL.
>	Intel 8086, 80186, 82086
Dunno.

Well, that kind of invalidates your point, I believe.
-- 
	Jack Jansen, jack@mcvax.UUCP
	The shell is my oyster.