[comp.arch] John von Neumann, sqrt instr

mark@mips.COM (Mark G. Johnson) (08/18/89)

In article <21353@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
    >I remember reading in an old AFIPS paper that von Neumann believed
    >computers of the future would all have the SQRT instruction, because of
    >the importance of square root in coordinate geometry.
 
Several of the RISC camps, armed with gigabytes of traces and simulation
results, have decided to include FP square root as well.  I suspect it is
because of the importance of square root in computing the width of the
depletion layer at a semiconductor junction.
-- 
 -- Mark Johnson	
 	MIPS Computer Systems, 930 E. Arques, Sunnyvale, CA 94086
	...!decwrl!mips!mark	(408) 991-0208

oconnordm@CRD.GE.COM (Dennis M. O'Connor) (08/18/89)

mark@mips (Mark G. Johnson) writes:
]In article <21353@cup.portal.com> mmm@cup.portal.com (Mark Robert Thorson) writes:
]    >I remember reading in an old AFIPS paper that von Neumann believed
]    >computers of the future would all have the SQRT instruction, because of
]    >the importance of square root in coordinate geometry.
] 
]Several of the RISC camps, armed with gigabytes of traces and simulation
]results, have decided to include FP square root as well.  I suspect it is
]because of the importance of square root in computing the width of the
]depletion layer at a semiconductor junction.

Three years ago, a team working on a RISC at GE went and asked the
real-time-control people at GE what they need in the way of complicated
floating-point operations. Other than +-*/, the answer wasn't
square root. It was ( drum roll ... )

	Inverse Square Root ( i.e. X to the negative one-half power ).

Seems control theory is riddled with inverse square roots.
And it's a lot faster to do Inv.SQRT than SQRT followed by divide,
as you would expect. However, by putting a 64-entry "first-guess"
table in memory, and doing Newton-Raphson, you can do this
with multiplies almost as quickly :-) as doing it in microcode.

There ARE other considerations, of course. Right and wrong
aren't applicable judgements to make in this domain.
--
 Dennis O'Connor      OCONNORDM@CRD.GE.COM       UUNET!CRD.GE.COM!OCONNORDM
 Rimmer : "This isn't part of my fantasy!" Cat : "No, it's part of mine!"

cik@l.cc.purdue.edu (Herman Rubin) (08/18/89)

There is no reason why a machine with hardware division should not have
hardware square root.  It costs almost nothing.
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

stevew@wyse.wyse.com (Steve Wilson xttemp dept303) (08/19/89)

In article <1513@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>
>There is no reason why a machine with hardware division should not have
>hardware square root.  It costs almost nothing.
>-- 
>Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907
>Phone: (317)494-6054
>hrubin@l.cc.purdue.edu (Internet, bitnet, UUCP)

Ah, but that presumes that hardware division is warranted!

The Cydra-5 included hardware divide and square root on the same
board, so your contention on this point is correct.  However, I'm
not sure I'd accept the proposition that these functions were mandatory
for the majority of applications.

I know that both operations sure did a number on scheduling the inner-most
loop.  Both operations had a long latency, and thus caused scheduling
headaches.  The other point is that this board was fairly expensive.
Does the occurrence rate of divide/square root in scientific computing
justify the cost?

How does the scientific computing community feel about this functionality?

Steve Wilson

The above are my opinions, not those of my employer. 

khb@road.Sun.COM (Keith Bierman - Advanced Languages - Floating Point Group ) (08/19/89)

In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
>In article <1513@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:

>The Cydra-5 included hardware divide and square-root on the some
>board ....
>
>I know that both operations sure did a number on scheduling the inner-most
>loop.  Both operations had a long latency, thus caused scheduling
>headaches.
>How does the scientific computing community feel about this functionality?

The Cydra-5's very long divide latency proved very harmful when
running customer-level application benchmarks. Divide may be "rare",
but when it happens it happens often. Furthermore, the very long
(26-ish) cycle latency was compounded by a decision to reuse some of
the stages ... so it was 26 cycles between initiations, as opposed to
1 or 2 cycles for most other operations. This resulted in compile
times of HOURS for some simple loops ( / / / / ) while the compiler
tried to get a sensible schedule (for some reason, my suggestion of
simply having a compiler directive to give up after a couple of
minutes wasn't accepted).

Both divide and sqrt crop up when one wants to be VERY careful about
numerics ... as several really good algorithms rely on them ... there
are quicker algorithms for those applications, but they are usually
less numerically robust.

The Cydra 5 failed primarily for business reasons; but there were
some suboptimal technical decisions, and I'd place the 26-cycle
initiation interval for divide on that list. Key applications went a
good 10x slower; compile times went exponentially bad (though that was
fixable). The cost of NOT having done a better job on this was quite
large. I don't know if the designers gave thought to giving up sqrt
for a pipelined divide ... but it would have been a very good trade.

While no one at Ardent will 'fess up, I feel pretty confident in
guessing that their next machine will have divide, or they will adopt
a Cray-style "divide".


Keith H. Bierman    |*My thoughts are my own. !! kbierman@sun.com
It's Not My Fault   |	MTS --Only my work belongs to Sun* 
I Voted for Bill &  | Advanced Languages/Floating Point Group            
Opus                | "When the going gets Weird .. the Weird turn PRO"

lgy@blake.acs.washington.edu (Laurence Yaffe) (08/19/89)

In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:

- Ah, but that presumes that hardware division is warranted!
- 
- Does the occurrence rate of divide/square root in scientific computing
- justify the cost?
- 
- How does the scientific computing community feel about this functionality?
- 
- Steve Wilson

    From a user's perspective (having designed and used quite a range
of scientific applications):

1) Whether or not divide or square root is done in hardware (per se) is
   irrelevant.  What matters is the ratio of their speed to that of multiplies.

2) I'm content if divide is (typically) no more than 5 times slower
   than multiply.  Most of my programs probably wouldn't care if
   divide was 100 times slower than multiply, but that's not good
   enough - a few of my important programs do need a better divide.

3) 64 bit floating multiply should be no more than 4 or 5 times slower than
   integer addition.  This is critical.

4) 32 bit floating point operations are almost irrelevant, and design
   tradeoffs which speed up 32 bit operations at the price of slowing
   down 64 bit ops are a mistake.

5) I'm not a big user of square roots.  As long as square roots cost
   less than 100 multiplies, I don't think I'd trade faster square roots
   for slower performance in other ops.

6) As always, what REALLY matters is the speed of running real programs.

I find the ratio of instruction timings on MIPS machines fairly close to
optimum - I wouldn't trade my M/2000 for a machine with faster integer
performance but significantly slower floating point, or faster divides but
slower multiplies, etc.  However, I do wish that (a) integer multiplication
was as fast as floating, and (b) that it was possible to tell the hardware
to set floating point underflows to zero without generating a trap (which
gets handled by software).  When I was choosing a new machine a year ago,
I down-rated Sun 4's for poor floating point performance (compared to integer),
and down-rated the Apollo DN1000 for poor integer performance (compared to
floating point).

-- 
Laurence G. Yaffe		Internet: lgy@newton.phys.washington.edu
University of Washington	  Bitnet: yaffe@uwaphast.bitnet

seibel@cgl.ucsf.edu (George Seibel) (08/19/89)

In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:

>Does the occurance rate of divide/square root in scientific computing
>justify the cost?

>How does the scientific computing community feel about this functionality?

  Molecular Dynamics people are rather keen on 1/sqrt(r).   It's used
in evaluating the coulomb interaction between two charged particles.
It wouldn't be unusual to do it 10^10 to 10^11 times in a run.  On some
machines it may amount to half the cycles.

George Seibel, UCSF

nettles@b.gp.cs.cmu.edu (Scott Nettles) (08/19/89)

In article <11758@cgl.ucsf.EDU> seibel@cgl.ucsf.edu (George Seibel) writes:
>  Molecular Dynamics people are rather keen on 1/sqrt(r).   It's used
>in evaluating the coulomb interaction between two charged particles.
>It wouldn't be unusual to do it 10^10 to 10^11 times in a run.  On some
>machines it may amount to half the cycles.

Good MD code will NOT use square root this way.  By working with potentials
that are transformed to use the square of the distance, one can avoid doing
these costly operations for every pair potential, resulting in substantial
performance improvements.  The fastest MD codes I know of use table look-up
and interpolation to evaluate potential functions.  Another advantage of this
technique is that it works for complex potentials as well as simple ones.

At least from my experience (and I've done both many CPU day and week long
scientific calculations, and worked with top researchers in computer
architecture), the best thing computer designers can do is make the machine
as fast as possible on 64 bit floating point multiply and adds (and maybe
add hardware assist for divide), and let the software do the rest.  Cray has
the right idea, but we knew that already.

Scott Nettles
Scott.Nettles@cs.cmu.edu

hankd@pur-ee.UUCP (Hank Dietz) (08/21/89)

In article <3312@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
...
>2) I'm content if divide is (typically) no more than 5 times slower
>   than multiply.  Most of my programs probably wouldn't care if
>   divide was 100 times slower than multiply, but that's not good
>   enough - a few of my important programs do need a better divide.
...
>5) I'm not a big user of square roots.  As long as square roots cost
>   less than 100 multiplies, I don't think I'd trade faster square roots
>   for slower performance in other ops.
....

Doing this all in software (and restricting ourselves to integer for
simplicity), we managed to get divide to be FASTER than multiply...  but it
doesn't give you modulus/remainder for free.  Yeah, it surprised me too.  I
still haven't been able to find a software technique to make square root
comparably fast.  Anybody got a screamer of a software technique for
computing square root?

							-hankd@ecn.purdue.edu

drago@bnr-rsc.UUCP (Drago Goricanec) (08/21/89)

In article <12640@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>In article <3312@blake.acs.washington.edu> lgy@newton.phys.washington.edu (Laurence Yaffe) writes:
>>5) I'm not a big user of square roots.  As long as square roots cost
>>   less than 100 multiplies, I don't think I'd trade faster square roots
>>   for slower performance in other ops.
>...
>Doing this all in software (and restricting ourselves to integer for
>simplicity), we managed to get divide to be FASTER than multiply...  but it
>doesn't give you modulus/remainder for free.  Yeah, it surprised me too.  I
>still haven't been able to find a software technique to make square root
>comparably fast.  Anybody got a screamer of a software technique for
>computing square root?

Here's a method that computes the square root directly (i.e. this is
not an approximation method).  I don't remember what it's called, but I
originally saw the method done on decimal numbers.  The algorithm is
much simpler on binary numbers.  It resembles long division on paper.

I'll present it in 68030 code, then describe it.  I apologize for not
using a higher level language, but this algorithm is most suited for an
environment where bits can be manipulated directly.  The code here is
for discussion purposes only, and can be optimized.

; D1 contains the 31-bit argument whose square root is computed
; D0 will contain the 16-bit square root on exit 
; D2, D3 are scratch registers
; D4 is a loop counter
	moveq	#15, D4	; loop counter: 16 iterations, 2 bits each
	clr.l	D2	; initialize
	clr.l	D3
	clr.l	D0

loop	lsl.l	#1, D1	; shift D1 left, MSB into X
	roxl.l	#1, D2	; rotate X into D2 (D1:D2 shifted left 2 overall)
	lsl.l	#1, D1
	roxl.l	#1, D2
	add.l	D3, D3	; same as shift D3 left 1
	add.l	D0, D0
	cmp.l	D2, D3	; is D3 < D2?
	bcc	skip	; skip if not
	sub.l	D3, D2	; D2 := (D2 - D3 - 1)
	subq.l	#1, D2
	addq.l	#2, D3
	addq.l	#1, D0	; set bit in result

skip	subq.b	#1, D4
	bpl	loop	; while D4 >= 0
	
; at this point, D0 contains sqrt( D1 )
; original contents of D1 lost

The reason D1 holds a 31-bit, and not a 32-bit, argument is that the
most significant bit would otherwise be lost in the shifts above.

The code above was not tested, but has been derived from the C program below,
which does work.

#include <stdio.h>

int main(void)
{
	unsigned int D2, D3, D0, num;
	int i;

	scanf( "%u", &num );
	D2 = D3 = D0 = 0;
	for( i = 0; i < 16; i++ ) {
		D2 <<= 2;
		D2 |= (num >> 30) & 0x03;	/* bring down the next 2 bits */
		num <<= 2;
		D3 <<= 1;			/* D3 tracks 2 * (root so far) */
		D0 <<= 1;			/* D0 is the root so far */
		if( D3 < D2 ) {			/* does a 1 bit fit here? */
			D2 -= (D3 + 1);
			D3 += 2;
			D0 += 1;		/* set bit in result */
		}
	}
	printf( "%u\n", D0 );
	return 0;
}

Essentially, this algorithm will compute the square root of an n-bit
number in approximately n/2 iterations of the loop.  The bits are
shifted into the result as necessary.  To round the result, you need
to iterate one more time (taking care not to lose the most significant
bits in D2), and round up or down, based on the last computed bit in D0.

Drago
-- 
Drago Goricanec                              (...!bnr-vpa!bnr-rsc!drago)
Bell-Northern Research Ltd., Ottawa, Ontario              (613) 763-8957

firth@sei.cmu.edu (Robert Firth) (08/21/89)

In article <12640@pur-ee.UUCP> hankd@pur-ee.UUCP (Hank Dietz) writes:
>.. I still haven't been able to find a software technique to make square root
>comparably fast.  Anybody got a screamer of a software technique for
>computing square root?

A fairly good way is to use bit-level hacking to get a first
approximation, and then unroll a few iterations of Newton's method.
When I did sqrt for the PDP-11, the first approximation was two shifts
and an add, and gave 5 bits of accuracy.  Two iterations then yielded
23 bits for a reasonable single-precision result.  With everything
done in registers, the cost was certainly less than 20 multiplies.
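
In the same spirit, a C sketch for IEEE single precision: a bit-level
first guess (shift the exponent right to roughly halve it, plus a magic
offset) followed by a few unrolled Newton steps.  The constant and the
step count are my illustrative choices; the PDP-11 routine above
predates IEEE format:

```c
#include <stdint.h>
#include <string.h>

/* Approximate sqrt; assumes x is positive, finite, and normal. */
float fast_sqrtf(float x)
{
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);           /* get at the bit pattern safely */
    i = (i >> 1) + 0x1fbd1df5u;         /* halve exponent + fudge: crude guess */
    memcpy(&y, &i, sizeof y);

    y = 0.5f * (y + x / y);             /* Newton: y <- (y + x/y) / 2 */
    y = 0.5f * (y + x / y);
    y = 0.5f * (y + x / y);
    return y;
}
```

Each Newton step here costs a divide; on a machine with slow division
one would instead iterate on 1/sqrt(x), which needs only multiplies, and
multiply by x at the end.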

slackey@bbn.com (Stan Lackey) (08/21/89)

In article <2376@wyse.wyse.com> stevew@wyse.UUCP (Steve Wilson xttemp dept303) writes:
>In article <1513@l.cc.purdue.edu> cik@l.cc.purdue.edu (Herman Rubin) writes:
>>There is no reason why a machine with hardware division should not have
>>hardware square root.  It costs almost nothing.
>Ah, but that presumes that hardware division is warranted!
>Does the occurance rate of divide/square root in scientific computing
>justify the cost?
>How does the scientific computing community feel about this functionality?

I was once involved in a trade-off between doing a divider gate array,
or just iterating through the multiplier.  Performance was such that
for a multiply of 3 cycles, the division would take 18.  Also, vector
multiplication would be one chime, and vector division 18.  (We still
considered it worthwhile to keep the vector division instructions, in
case a future implementation had a faster one, and also to make things
easier for the compiler.)  We found that over a wide range of
applications the ratio of mul to div was like 3:1 to 5:1.  Worse
still, Amdahl's Law predicted that with those numbers we'd be
spending nearly 50% of our time doing division!!  In the applications
we wanted to do well at, that is.
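
The arithmetic behind that estimate is easy to check; a small C sketch,
using the cycle counts quoted above:

```c
/* Rough check of the numbers quoted above: with a 3-cycle multiply,
   an 18-cycle divide, and 3 to 5 multiplies per divide, what fraction
   of the mul/div cycle budget goes to division? */
double divide_fraction(int muls_per_div)
{
    double mul_cycles = muls_per_div * 3.0;     /* 3 cycles per multiply */
    double div_cycles = 18.0;                   /* one 18-cycle divide */
    return div_cycles / (mul_cycles + div_cycles);
}
```

Counting only multiplies and divides this comes out at 55-67%; folding
in the rest of the instruction mix pulls it down toward the "nearly
50%" figure above.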

We did the gate array.  And yes it has square root.  I personally
don't know how important sqrt is, but it was cheap enough.
-Stan

holm@handel.mpr.ca (Terrence W. Holm) (08/24/89)

In article <911@bnr-rsc.UUCP> drago@bnr-rsc.UUCP (Drago Goricanec) writes:
>>In article <3312@blake.acs.washington.edu> (Laurence Yaffe) writes:
>>>5) I'm not a big user of square roots.  As long as square roots cost
>>>   less than 100 multiplies, I don't think I'd trade faster square roots
>>>   for slower performance in other ops.
>
>Here's a method that computes the square root directly (ie. this is not
>an approximation method).  I don't remember what it's called, but I
>originally saw the method done on decimal numbers.  The algorithm is
>much simpler on binary numbers.  It resembles long division on paper.
>
>I'll present it in 68030 code, then describe it.
> ...
>Drago Goricanec                              (...!bnr-vpa!bnr-rsc!drago)
>Bell-Northern Research Ltd., Ottawa, Ontario              (613) 763-8957


The algorithm looked interesting to me, so I rewrote it for the
Motorola DSP56000. It takes 188 cycles (7 microseconds).
[A multiply on this processor takes 2 cycles.]

The actual code is below, if anybody cares (should be in comp.dsp).

-----------------------------------------------------------------
	opt cc

ROUND	equ	0			; Set to 0 to enable truncated results
					; Set to 1 to enable rounding results

	org	p:$100

; sqrt() - Square root of an integer
;	   [Not for fixed point fractional numbers]
;
; Author: Terrence W. Holm, 1989-August-23
;
;
; Entry:
;	a1 = 24 bit unsigned integer
;	m0 = $FFFF
;	m1 = $FFFF
;
; Exit:
;	a1 = Truncated or rounded square root of (a1)
;
; Destroys:
;	a2, a1, a0
;	b2, b1, b0
;	    x1, x0
;	    r1, r0
;	    n1, n0
;
; Cycles: 188, (6.96 us @ 27MHz) for truncated results
;         208, (7.70 us @ 27MHz) for rounded results

; [From original: d0 is not necessary, d1 = b0, d2 = b1, d3 = r0]


sqrt					; a1 = number
	clr	b	#>1,x1		; b1 will be left_part(number) - (r0/2)**2
	move	b,r0			; r0 will be sqrt(number) * 2
	move	a1,b0			; b0 = number [shifts left into b1]
	move	#<2,n1

	do	#12+ROUND,end		; 2 bits at a time for 24 bits in total
	asl	b	r0,n0
	asl	b			; b1 = left_part(number)
	tfr	b,a	(r0)+n0		; a1/a0 is copy of b1/b0
					; r0 = 2 * r0
	sub	x1,a	r0,r1
	move	r0,x0
	sub	x0,a	(r1)+n1		; a1 = b1 - r0 - 1
					; r1 = r0 + 2
	tpl	a,b	r1,r0		; if ( b1 >= r0 + 1 ) then  [d2>d3]
					;    b1 = b1 - r0 - 1
					;    r0 = r0 + 2
end
	move	r0,a
	asr	a			; a1 = r0/2 = sqrt(number)
	if	ROUND==1
	move	#<2,a0			; Force round up not convergent rounding
	asr	a
	rnd	a			; Round up
	endif
-----------------------------------------------------------------
			Terrence W. Holm
			holm@mprgate.mpr.ca