[comp.arch] fast square root

firth@sei.cmu.edu (Robert Firth) (08/22/89)

For those interested in this hackery, here's a very simple
square root approximation.  We assume the number is non-negative
and in standard binary floating-point representation,
with exponent and mantissa juxtaposed and with one hidden bit.
Thus, the bit pattern k|bbb... means 2^k * (0.1bbb...).

Take the machine word holding exponent and high mantissa,
shift it right logically one place, and add (using integer
addition) the bit pattern 0000|100...

For example, a starting pattern of 

	k0|0000 (ie 2^2k * 1/2)

becomes

	k|1000  (ie 2^k * 3/4)

(If the exponent is represented in excess-2^n form, then of course
the bit pattern you add also contains the correction to the
exponent, but that's a negligible detail.)
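
For what it's worth, here is how the same trick looks on an IEEE 754
single-precision float, where the hidden bit gives 1.bbb... and the
exponent is excess-127, so the added pattern must also carry the bias
correction just mentioned.  This is a sketch of my own, not the
PDP-11 code; the name sqrt_seed and the choice of C are mine:

	#include <stdint.h>
	#include <string.h>

	/* Shift-and-add first approximation, IEEE 754 single precision.
	 * The constant 0x1FC00000 = 127 << 22 adds 63 to the exponent
	 * field (half the excess-127 bias) and 100... to the mantissa.
	 * Worst-case relative error is about 6%, the "1 in 16" below. */
	static float sqrt_seed(float x)        /* x assumed > 0 and finite */
	{
	    uint32_t i;
	    memcpy(&i, &x, sizeof i);          /* view the float as raw bits */
	    i = (i >> 1) + 0x1FC00000u;        /* shift right one, add pattern */
	    memcpy(&x, &i, sizeof i);
	    return x;
	}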

The other three cases of interest are

	k0|1111 (2^2k * (1-e)) => k|1111 (2^k * (1-e))

	k1|0000 (2^(2k+1) * 1/2) => (k+1)|0000 (2^(k+1) * 1/2)

	k1|1111 (2^(2k+1) * (1-e)) => (k+1)|0111 (2^(k+1) * (3/4 - e))

Counting all four cases, the third (k1|0000) is exact (good thing,
too), and the second is nearly so.  The first and fourth cases have
the greatest inaccuracy: in effect they take 1.5 as the first
approximation to sqrt(2), an error of 86 in 1414, or less than 1 in 16.

So we have at worst three good bits in this first approximation, whence
three iterations of Newton's rule will give us at worst 31 bits.
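
Concretely, in the same C sketch as above (fast_sqrt is my name, not
the library's): each Newton step y -> (y + x/y)/2 roughly squares the
relative error, so three good bits become 7, then 15, then 31.

	/* Seed plus three Newton steps.  From a ~6% seed the relative
	 * error drops to roughly 2^-9, then 2^-19, then past a float's
	 * 24-bit mantissa (the 31-bit figure assumes a wider word). */
	static float fast_sqrt(float x)
	{
	    float y = sqrt_seed(x);            /* seed from the sketch above */
	    y = 0.5f * (y + x / y);
	    y = 0.5f * (y + x / y);
	    y = 0.5f * (y + x / y);
	    return y;
	}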

The initial approximation is one shift and one add, which is almost
certainly faster than a table lookup.  Moreover, the time to get
the square root is quite predictable - surely within a factor of 2,
even allowing for data-dependent division instructions.

[Found in the PDP-11 Fortran IV mathlib; credit probably due to some
 anonymous person at DEC]

However, note that a lookup table indexed by the last bit of the
exponent and the first three bits of the mantissa can give you five
bits' initial accuracy, which probably lets you save an iteration.
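
For concreteness, here is one way such a table seed could look, again
on an IEEE single and again with names and table values of my own
choosing (each entry is the square root of its interval's midpoint;
the DEC table, whatever it was, is not reproduced here):

	#include <math.h>

	/* 16-entry seed table indexed by the low exponent bit and the top
	 * three mantissa bits; entries 8-15 carry the extra sqrt(2) for
	 * odd exponents.  Good to about five bits (worst case ~3%). */
	static const float seed_tab[16] = {
	    0.72887f, 0.77055f, 0.81009f, 0.84779f,
	    0.88388f, 0.91856f, 0.95197f, 0.98425f,
	    1.03078f, 1.08972f, 1.14564f, 1.19896f,
	    1.25000f, 1.29904f, 1.34629f, 1.39194f,
	};

	static float sqrt_seed_table(float x)  /* x assumed > 0 and finite */
	{
	    int e;
	    float m = frexpf(x, &e);           /* x = m * 2^e, 0.5 <= m < 1 */
	    int idx = ((e & 1) << 3) | (int)((m - 0.5f) * 16.0f);
	    return ldexpf(seed_tab[idx], (e - (e & 1)) / 2);  /* * 2^floor(e/2) */
	}

With a five-bit seed the same error-squaring argument gives roughly 11
bits after one Newton step and 23 after two, so swapping this in for
sqrt_seed above does indeed let one of the three iterations go.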