mark@mips.COM (Mark G. Johnson) (08/18/89)
In article <21353@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
>I remember reading in an old AFIPS paper that von Neumann believed
>computers of the future would all have the SQRT instruction, because of
>the importance of square root in coordinate geometry.

Several of the RISC camps, armed with gigabytes of traces and simulation
results, have decided to include FP square root as well.  I suspect it is
because of the importance of square root in computing the width of the
depletion layer at a semiconductor junction.
--
 -- Mark Johnson
MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
...!decwrl!mips!mark   (408) 991-0208
oconnordm@CRD.GE.COM (Dennis M. O'Connor) (08/18/89)
mark@mips (Mark G. Johnson) writes:
]In article <21353@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
] >I remember reading in an old AFIPS paper that von Neumann believed
] >computers of the future would all have the SQRT instruction, because of
] >the importance of square root in coordinate geometry.
]
]Several of the RISC camps, armed with gigabytes of traces and simulation
]results, have decided to include FP square root as well.  I suspect it is
]because of the importance of square root in computing the width of the
]depletion layer at a semiconductor junction.

Three years ago, a team working on a RISC at GE went and asked the
real-time-control people at GE what they need in the way of complicated
floating-point operations.  Other than +-*/, the answer wasn't square root.
It was ( drum roll ... ) Inverse Square Root ( i.e. X to the negative
one-half power ).  Seems control theory is riddled with inverse square
roots.  And it's a lot faster to do Inv.SQRT than SQRT followed by divide,
as you would expect.

However, by putting a 64-entry "first-guess" table in memory, and doing
Newton-Raphson, you can do this with multiplies almost as quickly :-) as
doing it in microcode.  There ARE other considerations, of course.  Right
and wrong aren't applicable judgements to make in this domain.
--
  Dennis O'Connor      OCONNORDM@CRD.GE.COM      UUNET!CRD.GE.COM!OCONNORDM
  Rimmer : "This isn't part of my fantasy!"  Cat : "No, it's part of mine!"
cik@l.cc.purdue.edu (Herman Rubin) (08/18/89)
There is no reason why a machine with hardware division should not have
hardware square root.  It costs almost nothing.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN 47907
Phone: (317) 494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)
stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (08/19/89)
In article <1513@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>
>There is no reason why a machine with hardware division should not have
>hardware square root. It costs almost nothing.

Ah, but that presumes that hardware division is warranted!

The Cydra-5 included hardware divide and square-root on the same board, so
your contention on this point is correct.  However, I'm not sure I'd accept
the proposition that these functions were mandatory for the majority of
applications.

I know that both operations sure did a number on scheduling the inner-most
loop.  Both operations had a long latency, and thus caused scheduling
headaches.  The other point is that this board was fairly expensive.

Does the occurrence rate of divide/square root in scientific computing
justify the cost?

How does the scientific computing community feel about this functionality?

Steve Wilson

The above are my opinions, not those of my employer.
khb@road.Sun.COM (Keith Bierman - Advanced Languages - Floating Point Group ) (08/19/89)
In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
>In article <1513@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>The Cydra-5 included hardware divide and square-root on the same
>board ....
>
>I know that both operations sure did a number on scheduling the inner-most
>loop. Both operations had a long latency, thus caused scheduling
>headaches.
>
>How does the scientific computing community feel about this functionality?

The Cydra-5's very long divide latency proved very harmful when running
customer-level application benchmarks.  Divide may be "rare", but when it
happens it happens often.  Furthermore, the very long (26ish) cycle latency
was compounded by a decision to reuse some of the stages ... so it was 26
cycles between initiations ... as opposed to 1 or 2 cycles for most other
operations.

This resulted in compile times of HOURS for some simple loops ( / / / / )
while the compiler tried to get a sensible schedule (for some reason, my
suggestion of simply having a compiler directive to give up after a couple
of minutes wasn't accepted).

Both divide and sqrt crop up when one wants to be VERY careful about
numerics ... as several really good algorithms rely on them ... there are
quicker algorithms for those applications, but usually less numerically
robust.

The Cydra 5 failed primarily for business reasons; but there were some
suboptimal technical decisions, and I'd place the 26-cycle initiation
interval for divide on that list.  Key applications went a good 10x slower;
compile times went exponentially bad (though that was fixable).  The cost
of NOT having done a better job on this was quite large.  I don't know if
the designers gave thought to giving up sqrt for a pipelined divide ...
but it would have been a very good trade.

While no one at Ardent will 'fess up, I feel pretty confident in guessing
that their next machine will have divide, or they will adopt a Cray-style
"divide".

Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   | MTS --Only my work belongs to Sun*
I Voted for Bill &  | Advanced Languages/Floating Point Group
Opus                | "When the going gets Weird .. the Weird turn PRO"
lgy@blake.acs.washington.edu (Laurence Yaffe) (08/19/89)
In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
- Ah, but that presumes that hardware division is warranted!
-
- Does the occurrence rate of divide/square root in scientific computing
- justify the cost?
-
- How does the scientific computing community feel about this functionality?
-
- Steve Wilson
From a user's perspective (having designed and used quite a range
of scientific applications):
1) Whether or not divide or square root is done in hardware (per se) is
irrelevant. What matters is the ratio of their speed to that of multiplies.
2) I'm content if divide is (typically) no more than 5 times slower
than multiply. Most of my programs probably wouldn't care if
divide was 100 times slower than multiply, but that's not good
enough - a few of my important programs do need a better divide.
3) 64 bit floating multiply should be no more than 4 or 5 times slower than
integer addition. This is critical.
4) 32 bit floating point operations are almost irrelevant, and design
tradeoffs which speed up 32 bit operations at the price of slowing down
64 bit ops are a mistake.
5) I'm not a big user of square roots. As long as square roots cost
less than 100 multiplies, I don't think I'd trade faster square roots
for slower performance in other ops.
6) As always, what REALLY matters is the speed of running real programs.
I find the ratio of instruction timings on MIPS machines fairly close to
optimum - I wouldn't trade my M/2000 for a machine with faster integer
performance but significantly slower floating point, or faster divides but
slower multiplies, etc. However, I do wish that (a) integer multiplication
was as fast as floating, and (b) that it was possible to tell the hardware
to set floating point underflows to zero without generating a trap (which
gets handled by software). When I was choosing a new machine a year ago,
I down-rated Sun 4's for poor floating point performance (compared to integer),
and down-rated the Apollo DN1000 for poor integer performance (compared to
floating point).
--
Laurence G. Yaffe Internet: lgy@newton.phys.washington.edu
University of Washington Bitnet: yaffe@uwaphast.bitnet
seibel@cgl.ucsf.edu (George Seibel) (08/19/89)
In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
>Does the occurrence rate of divide/square root in scientific computing
>justify the cost?
>How does the scientific computing community feel about this functionality?

Molecular Dynamics people are rather keen on 1/sqrt(r).  It's used in
evaluating the coulomb interaction between two charged particles.  It
wouldn't be unusual to do it 10^10 to 10^11 times in a run.  On some
machines it may amount to half the cycles.

George Seibel, UCSF
nettles@b.gp.cs.cmu.edu (Scott Nettles) (08/19/89)
In article <11758@cgl.ucsf.EDU> seibel@cgl.ucsf.edu (George Seibel) writes:
> Molecular Dynamics people are rather keen on 1/sqrt(r). It's used
>in evaluating the coulomb interaction between two charged particles.
>It wouldn't be unusual to do it 10^10 to 10^11 times in a run. On some
>machines it may amount to half the cycles.

Good MD code will NOT use square root this way.  By working with potentials
that are transformed to use the square of the distance, one can avoid doing
these costly operations for every pair potential, resulting in substantial
performance improvements.  The fastest MD codes I know of use table look-up
and interpolation to evaluate potential functions.  Another advantage of
this technique is that it works for complex potentials as well as simple
ones.

At least from my experience (and I've done both many CPU-day and CPU-week
long scientific calculations, and worked with top researchers in computer
architecture), the best thing computer designers can do is make the machine
as fast as possible on 64 bit floating point multiplies and adds (and maybe
add hardware assist for divide), and let the software do the rest.  Cray
has the right idea, but we knew that already.

Scott Nettles
Scott.Nettles@cs.cmu.edu
hankd@pur-ee.UUCP (Hank Dietz) (08/21/89)
In article <3312@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
...
>2) I'm content if divide is (typically) no more than 5 times slower
>   than multiply. Most of my programs probably wouldn't care if
>   divide was 100 times slower than multiply, but that's not good
>   enough - a few of my important programs do need a better divide.
...
>5) I'm not a big user of square roots. As long as square roots cost
>   less than 100 multiplies, I don't think I'd trade faster square roots
>   for slower performance in other ops.
...

Doing this all in software (and restricting ourselves to integer for
simplicity), we managed to get divide to be FASTER than multiply... but it
doesn't give you modulus/remainder for free.  Yeah, it surprised me too.
I still haven't been able to find a software technique to make square root
comparably fast.  Anybody got a screamer of a software technique for
computing square root?

-hankd@ecn.purdue.edu
drago@bnr-rsc.UUCP (Drago Goricanec) (08/21/89)
In article <12640@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>In article <3312@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>>5) I'm not a big user of square roots. As long as square roots cost
>>   less than 100 multiplies, I don't think I'd trade faster square roots
>>   for slower performance in other ops.
>...
>Doing this all in software (and restricting ourselves to integer for
>simplicity), we managed to get divide to be FASTER than multiply... but it
>doesn't give you modulus/remainder for free. Yeah, it surprised me too. I
>still haven't been able to find a software technique to make square root
>comparably fast. Anybody got a screamer of a software technique for
>computing square root?

Here's a method that computes the square root directly (i.e. this is not
an approximation method).  I don't remember what it's called, but I
originally saw the method done on decimal numbers.  The algorithm is much
simpler on binary numbers.  It resembles long division on paper.

I'll present it in 68030 code, then describe it.  I apologize for not
using a higher level language, but this algorithm is most suited for an
environment where bits can be manipulated directly.  The code here is for
discussion purposes only, and can be optimized.

; D1 contains the 31-bit argument whose square root is computed
; D0 will contain the 16-bit square root on exit
; D2, D3 are scratch registers
; D4 is a loop counter

        move.b  #15, D4
        clr.l   D2          ; initialize
        clr.l   D3
        clr.l   D0
loop    lsl.l   #1, D1      ; shift D1 into D2, two bits per iteration
        rol.l   #1, D2
        lsl.l   #1, D1
        rol.l   #1, D2
        add.l   D3, D3      ; same as shift D3 left 1
        add.l   D0, D0
        cmp.l   D2, D3      ; is D3 < D2?
        bcc     skip        ; skip if not
        sub.l   D3, D2      ; D2 := (D2 - D3 - 1)
        subq.l  #1, D2
        addq.l  #2, D3
        addq.l  #1, D0      ; set bit in result
skip    subq.b  #1, D4
        bpl     loop        ; while D4 >= 0

; at this point, D0 contains sqrt( D1 )
; original contents of D1 lost

The reason D1 contains a 31-bit, and not a 32-bit, argument is that the
most significant bit could be lost in the shifts above.  The code above
was not tested, but has been derived from the C program below, which does
work.

#include <stdio.h>

int main(void)                      /* assumes 32-bit unsigned int */
{
    unsigned int D2, D3, D0, num;
    int i;

    scanf( "%u", &num );

    D2 = D3 = D0 = 0;
    for( i = 0; i < 16; i++ ) {
        D2 <<= 2;
        D2 |= (num >> 30) & 0x03;   /* bring in the next two bits */
        num <<= 2;
        D3 <<= 1;
        D0 <<= 1;
        if( D3 < D2 ) {
            D2 -= (D3 + 1);
            D3 += 2;
            D0 += 1;                /* set bit in result */
        }
    }
    printf( "%u\n", D0 );
    return 0;
}

Essentially, this algorithm will compute the square root of an n-bit
number in approximately n/2 iterations of the loop.  The bits are shifted
into the result as necessary.  To round the result, you need to iterate
one more time (taking care not to lose the most significant bits in D2),
and round up or down based on the last computed bit in D0.

Drago
--
Drago Goricanec (...!bnr-vpa!bnr-rsc!drago)
Bell-Northern Research Ltd., Ottawa, Ontario (613) 763-8957
firth@sei.cmu.edu (Robert Firth) (08/21/89)
In article <12640@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>.. I still haven't been able to find a software technique to make square root
>comparably fast. Anybody got a screamer of a software technique for
>computing square root?

A fairly good way is to use bit-level hacking to get a first
approximation, and then unroll a few iterations of Newton's method.

When I did sqrt for the PDP-11, the first approximation was two shifts and
an add, and gave 5 bits of accuracy.  Two iterations then yielded 23 bits
for a reasonable single-precision result.  With everything done in
registers, the cost was certainly less than 20 multiplies.
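That recipe can be sketched in modern C.  This is a hypothetical
reconstruction, not Firth's PDP-11 routine: the first guess here works on
the IEEE-754 bit pattern (a shift and an add on the integer
representation, in the same spirit as his "two shifts and an add"), and
the magic constant 0x1fbd1df5 is a published value for this particular
trick, chosen as an assumption.

```c
#include <stdint.h>
#include <string.h>

/* Sketch of "bit-level first guess + unrolled Newton" for sqrt.
   Halving the exponent field of a float's bit pattern approximates
   taking a square root; two Newton iterations then polish it. */

float fast_sqrt(float x)
{
    uint32_t i;
    float y;

    /* First guess: (bits >> 1) + bias roughly halves the exponent.
       Good to a few percent, i.e. roughly 5 bits.  Assumes x > 0. */
    memcpy(&i, &x, sizeof i);          /* type-pun without UB */
    i = (i >> 1) + 0x1fbd1df5u;
    memcpy(&y, &i, sizeof y);

    /* Two unrolled Newton iterations for y^2 = x:  y <- (y + x/y)/2.
       Each roughly doubles the good bits, taking ~5 bits out to the
       ~23-bit precision of a float. */
    y = 0.5f * (y + x / y);
    y = 0.5f * (y + x / y);

    return y;
}
```

Note that each Newton step for sqrt costs a divide; this is one reason the
inverse-square-root form discussed earlier in the thread, whose iteration
is all multiplies, is attractive on divide-poor machines.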
slackey@bbn.com (Stan Lackey) (08/21/89)
In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
>In article <1513@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>There is no reason why a machine with hardware division should not have
>>hardware square root. It costs almost nothing.
>Ah, but that presumes that hardware division is warranted!
>Does the occurrence rate of divide/square root in scientific computing
>justify the cost?
>How does the scientific computing community feel about this functionality?

I was once involved in a trade-off between doing a divider gate array, or
just iterating through the multiplier.  Performance was such that for a
multiply of 3 cycles, the division would take 18.  Also, vector
multiplication would be one chime, and vector division 18.  (We still
considered it worthwhile to keep the vector division instructions, in case
a future implementation had a faster one, and also to make things easier
for the compiler.)

We found that over a wide range of applications the ratio of mul to div
was like 3:1 to 5:1.  Worse still, Amdahl's Law predicted that with those
numbers we'd be spending nearly 50% of our time doing division!!  In the
applications we wanted to do well at, that is.

We did the gate array.  And yes, it has square root.  I personally don't
know how important sqrt is, but it was cheap enough.

-Stan
holm@handel.mpr.ca (Terrence W. Holm) (08/24/89)
In article <911@bnr-rsc.UUCP> drago@bnr-rsc.UUCP (Drago Goricanec) writes:
>>In article <3312@blake.acs.washington.edu> (Laurence Yaffe) writes:
>>>5) I'm not a big user of square roots. As long as square roots cost
>>>   less than 100 multiplies, I don't think I'd trade faster square roots
>>>   for slower performance in other ops.
>
>Here's a method that computes the square root directly (ie. this is not
>an approximation method). I don't remember what it's called, but I
>originally saw the method done on decimal numbers. The algorithm is
>much simpler on binary numbers. It resembles long division on paper.
>
>I'll present it in 68030 code, then describe it.
> ...
>Drago Goricanec (...!bnr-vpa!bnr-rsc!drago)
>Bell-Northern Research Ltd., Ottawa, Ontario (613) 763-8957

The algorithm looked interesting to me, so I rewrote it for the Motorola
DSP56000.  It takes 188 cycles (7 microseconds).  [A multiply on this
processor takes 2 cycles.]  The actual code is below, if anybody cares
(should be in comp.dsp).

-----------------------------------------------------------------
        opt     cc

ROUND   equ     0               ; Set to 0 to enable truncated results
                                ; Set to 1 to enable rounding results

        org     p:$100

; sqrt() - Square root of an integer
;          [Not for fixed point fractional numbers]
;
; Author: Terrence W. Holm, 1989-August-23
;
; Entry:
;       a1 = 24 bit unsigned integer
;       m0 = $FFFF
;       m1 = $FFFF
;
; Exit:
;       a1 = Truncated or rounded square root of (a1)
;
; Destroys:
;       a2, a1, a0
;       b2, b1, b0
;       x1, x0
;       r1, r0
;       n1, n0
;
; Cycles: 188, (6.96 us @ 27MHz) for truncated results
;         208, (7.70 us @ 27MHz) for rounded results
;
; [From original: d0 is not necessary, d1 = b0, d2 = b1, d3 = r0]

sqrt                                    ; a1 = number
        clr     b       #>1,x1          ; b1 will be left_part(number) - (r0/2)**2
        move    b,r0                    ; r0 will be sqrt(number) * 2
        move    a1,b0                   ; b0 = number [shifts left into b1]
        move    #<2,n1

        do      #12+ROUND,end           ; 2 bits at a time for 24 bits in total
        asl     b       r0,n0
        asl     b                       ; b1 = left_part(number)
        tfr     b,a     (r0)+n0         ; a1/a0 is copy of b1/b0
                                        ; r0 = 2 * r0
        sub     x1,a    r0,r1
        move    r0,x0
        sub     x0,a    (r1)+n1         ; a1 = b1 - r0 - 1
                                        ; r1 = r0 + 2
        tpl     a,b     r1,r0           ; if ( b1 >= r0 + 1 ) then [d2>d3]
                                        ;   b1 = b1 - r0 - 1
                                        ;   r0 = r0 + 2
end
        move    r0,a
        asr     a                       ; a1 = r0/2 = sqrt(number)

        if ROUND==1
        move    #<2,a0                  ; Force round up not convergent rounding
        asr     a
        rnd     a                       ; Round up
        endif
-----------------------------------------------------------------

Terrence W. Holm
holm@mprgate.mpr.ca