mdeale@sargas.acs.calpoly.edu (Myron Deale) (06/03/90)
Hello, I know this is a silly thing to try, but here's an article relating to COMPuter ARCHitecture. If you're a compiler-driven type, press 'n'.

   I believe that given the ever-increasing frequencies for microprocessors, the division operation would do well by using a shift-and-subtract method vs. using a multiplier-based technique. An extra shifter and subtractor would take up minimal chip area, and division could proceed *simultaneously* with multiplication.

   On the other hand, the i860 can produce a result every instruction clock cycle. Newton's method could be accomplished in about 10 cycles, although I hear the actual figure is more like 21-22 cycles (to account for exponent processing?). But then the multiply unit is tied up for the division operation, which could be a real performance hit on certain applications.

Reply via email please.

-Myron
// mdeale@cosmos.acs.calpoly.edu
aglew@oberon.csg.uiuc.edu (Andy Glew) (06/05/90)
To: mdeale@cosmos.acs.calpoly.edu
Subject: Shift Division

>   I believe that given the ever-increasing frequencies for micro-
>processors, the division operation would do well by using a
>shift-and-subtract method vs. using a multiplier-based technique.
>An extra shifter and subtractor would take up minimal chip area,
>and division could proceed *simultaneously* with multiplication.
>
>   On the other hand, the i860 can produce a result every instruction
>clock cycle. Newton's method could be accomplished in about 10 cycles,
>although I hear the actual figure is more like 21-22 cycles (to account
>for exponent processing?). But then the multiply unit is tied up for
>the division operation, which could be a real performance hit on
>certain applications.

As usual, "it depends".

In fact, at this moment most integer processors implement divide with a shift and add, except for the 88K, which has an on-chip multiplier, and the ones which borrow from the FPU chip.

For a GaAs microprocessor with a very limited device count, shift mechanisms are probably better. For a CMOS processor where, increasingly, the problem is "I can get 2 million transistors on a chip[*], but the processor only takes 350,000 -- what is the best thing to use the remaining transistors for?", the answer is not so clear.

[*] Note that 1.2 million is now. Chips being designed now are a bit larger.

What can the extra devices be used for?

- Large on-chip caches. Probably for a while longer; however, there is a fall-off point around 64K-256K bytes of cache.

- On-chip real memory. Trouble is, we don't have enough space to provide really interesting amounts of memory (top-of-the-line workstations require ~16M now). There's a market in embedded systems for just about every on-chip memory size, so these chips will be built. And, at some point, bottom-of-the-line workstations will take a sudden drop in price by using on-chip main memory as embedded processors cross over.
- Add special functional units on-chip. A lot of companies are already doing this:
	- parallel array multipliers
	- floating point
	- MMU
	- graphics ops
  However, there are only so many functional units that a general-purpose microprocessor can use. Once again, embedded processors can use many more, so there will probably be growth in that market segment. And then, in a few years, somebody will realize that an embedded processor with a HUD (heads-up display) controller for output, a single serial line for keyboard, an on-chip timer, and 4M of RAM is a decent wristwatch computer... (Ethernet on-chip, unfortunately, is a good bit further away. Sigh.)

- Parallelism. There are several types of on-chip parallelism that can be explored:
	- superscalar
	- LIW
  Both of the above are being explored pretty much already.
	- multiple processors per chip
  Certainly possible, but this doesn't help you with the single large job (the big selling point), and, to build a really large high-performance system, you have to solve the off-chip performance problem anyway. However, it is something worth considering, especially when you can make tradeoffs like providing only one multiplier array shared between two processors. Research topic?

From time to time I draw out graphs of where I think these trends are going, and when those cut-overs from embedded to general purpose will occur. I'd appreciate any data points for these graphs.

Anyway, about divides: divide is one of those things that, if you don't do it well, will exclude you from certain markets. My guess would be that if your memory access time (cache *hit* time) is below, say, 16 cycles, parallel divide using a multiplier will win if you can fit it in. Otherwise, serial divide; and serial divide when you can't fit a multiplier in.

By the way, this is what comp.arch should be talking about. Hence my posting.
--
Andy Glew, aglew@uiuc.edu
marvin@oakhill.UUCP (Marvin Denman) (06/06/90)
In article <AGLEW.90Jun4144801@oberon.csg.uiuc.edu> aglew@oberon.csg.uiuc.edu (Andy Glew) writes:
>In fact, at this moment most integer processors implement divide with a shift and add,
>except for the 88K which has an on-chip multiplier, and the ones which borrow
>from the FPU chip.

Just to clarify things, the 88k does division by shift and add. The on-chip multiplier is only used for integer and floating-point multiplication. At the time of the design there were few if any IEEE 754 implementations that used the multiplier for division. Even now I am personally aware of only 2 implementations of this type that claim IEEE accuracy. There are probably several more.
--
Marvin Denman
Motorola 88000 Design
cs.utexas.edu!oakhill!marvin