[comp.arch] float/float = integer, remainder

jbs@WATSON.IBM.COM (05/13/91)

          Herman Rubin states:
                                              I have already pointed out
that every trigonometric and exponential routine does, in some way,
float/float -> integer, remainder.  The integer is also used.

          1.  As I have already pointed out, the actual computation being
done is float/real -> integer, remainder (since pi and ln(2) are not
machine numbers).  A proposed float/float -> integer, remainder
instruction (assuming it ran at the same speed as floating divide;
actually it would probably be slower) would not benefit the
trigonometric or exponential routines on any architecture I am
familiar with.
          2.  In any case it is not true that "every" exponential routine
does this computation.  The scalar short precision exp routine in IBM's
VS FORTRAN library does not do this computation in any way.
                       James B. Shearer

pmontgom@euphemia.math.ucla.edu (Peter Montgomery) (05/13/91)

In article <9105130317.AA09926@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes:
>
>          Herman Rubin states:
>                                              I have already pointed out
>that every trigonometric and exponential routine does, in some way,
>float/float -> integer, remainder.  The integer is also used.
>
>          1.  As I have already pointed out, the actual computation being
>done is float/real -> integer, remainder (since pi and ln(2) are not
>machine numbers).  A proposed float/float -> integer, remainder
>instruction (assuming it ran at the same speed as floating divide;
>actually it would probably be slower) would not benefit the
>trigonometric or exponential routines on any architecture I am
>familiar with.

	Suppose we want sin(1234.0).  Approximate pi/2 = pi1 + pi2
where pi1 and pi2 are each double precision, | pi1 | >> | pi2 |. 
For purposes of illustration, let

		pi1 =  1.571
		pi2 = - .0002037

(decimal notation with four significant digits) so that 
pi1 + pi2 = 1.5707963 whereas pi/2 = 1.57079632679 ... .  Herman's
proposed instruction would find

		1234.0 = 785*pi1 + 0.765 .

If the quotient is always rounded to the nearest integer, then
the remainder (here 0.765) can be represented exactly even though 
785*1.571 = 1233.235 requires more than four significant digits.
How can Herman Rubin get this remainder accurately with the hardware 
and software commonly available? 

	jbs argues that this remainder 0.765 is only an approximation
to 1234.0 mod pi/2, since the divisor pi1 = 1.571 was itself only an
approximation to pi/2.
This is where pi2 can be used, together with the first quotient and remainder:

	   1234.0 - 785*(pi1 + pi2) = (1234.0 - 785*pi1) - 785*pi2
	= 0.7650 + 785*.0002037 = 0.7650 + .1599045 = 0.9249

(using 4-digit arithmetic).  The revised remainder of 0.9249 
compares to the true value 0.9248834 of 1234 - 785*(pi/2).  
With this we estimate sin(1234) ~= cos(0.9249) ~= 0.6019 (since
785 = 4*196 + 1, sin(1234) = cos(1234 - 785*(pi/2))), which is
accurate to four digits.
--
        Peter L. Montgomery 
        pmontgom@MATH.UCLA.EDU 
        Department of Mathematics, UCLA, Los Angeles, CA 90024-1555
If I spent as much time on my dissertation as I do reading news, I'd graduate.

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/13/91)

In article <9105130317.AA09926@ucbvax.Berkeley.EDU>, jbs@WATSON.IBM.COM writes:
> 
>           Herman Rubin states:
>                                               I have already pointed out
> that every trigonometric and exponential routine does, in some way,
> float/float -> integer, remainder.  The integer is also used.
                                                          ^^^^
>           1.  As I have already pointed out, the actual computation being
> done is float/real -> integer, remainder (since pi and ln(2) are not ma-
> chine numbers).  A proposed float/float -> integer, remainder instruc-
> tion (assuming it ran at the same speed as floating divide, actually it
> would probably be slower) would not benefit the trigonometric or
> exponential routines on any architecture I am familiar with

Divide has been one of the stumbling blocks of computing since the
first speedups of multiplication.  You are right about the accuracy
part, and certainly for accurate computation, corrections would have
to be made.  It might run slightly slower than floating divide, but
not much unless the divide is done by multiplication, which it might
very well be.  Even then, some decent hardware support is called for.

I believe that the quote from me was taken from an article in which
I pointed out that communication between floating and integer registers
was important.  I have underlined the key word here; however things are
done, it is still necessary to get that integer to the integer unit,
and several steps, going through memory, hardly seem to be the way.

>           2.  In any case it is not true that "every" exponential routine
> does this computation.  The scalar short precision exp routine in IBM's
> vs fortran library does not do this computation in any way.

I fail to see how one would compute exp(pi^3) even to short precision
without doing some computation like (pi^3)/ln2.
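(For concreteness: pi^3 ~= 31.0063, and 31.0063/ln(2) ~= 44.73, so
one takes n = 45 and computes exp(pi^3) = (2**45)*exp(31.0063 -
45*ln(2)) ~= (2**45)*exp(-0.1853).  The division, or a multiplication
by a stored 1/ln(2), seems unavoidable.)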
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) (05/14/91)

In article <12248@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <9105130317.AA09926@ucbvax.Berkeley.EDU>, jbs@WATSON.IBM.COM writes:
>
>>           2.  In any case it is not true that "every" exponential routine
>> does this computation.  The scalar short precision exp routine in IBM's
>> vs fortran library does not do this computation in any way.
>
>I fail to see how one would compute exp(pi^3) even to short precision
>without doing some computation like (pi^3)/ln2.

A reference to check:

Cody, William J., Jr., and William Waite, "Software Manual for the
	Elementary Functions," Prentice-Hall, 1980.

Pages 60-71 seem to describe an algorithm for the Exponential routine
that involves the division by ln2 that Herman describes. Of course, 
they recommend multiplication by the reciprocal of ln2, which can
be computed and used as a stored constant.  No divides are found in
the algorithm, as best I can tell from a cursory reading.  Perhaps
IBM's VS FORTRAN library is using the multiply instead of the divide
in the same fashion?
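
A rough C sketch of what I take their reduction step to be (my
paraphrase from a cursory reading, not their code, and I quote the
constants from memory, so treat the digits as illustrative):

    #include <math.h>

    /* Cody/Waite style exp reduction: multiply by a stored
     * 1/ln(2), convert to integer, then subtract n*ln(2) in two
     * pieces.  c1 = 355/512 has few significant bits, so n*c1 is
     * exact and no accuracy is lost to cancellation. */
    double exp_reduce(double x, int *n_out)
    {
        const double inv_ln2 = 1.4426950408889634074;   /* 1/ln(2) */
        const double c1 = 0.693359375;                  /* 355/512 */
        const double c2 = -2.121944400546905827679e-4;  /* ln(2) - c1 */
        int n = (int)(x * inv_ln2 + (x >= 0.0 ? 0.5 : -0.5));
        *n_out = n;                     /* the float -> integer step */
        return (x - n * c1) - n * c2;   /* reduced argument */
    }

No divide appears; the only conversion is in forming n.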

The Cody/Waite algorithm does do float -> integer conversion at one point,
and the presence of Herman's requested instruction would make it
possible to do things differently. But, should we be requesting a
floating multiply instruction that produces an integer result and
a floating fractional result?  This might be faster.  Or are there other
routines that cannot be altered from a divide to a multiply and are
important enough to justify the requested divide instruction?



-----------------------------------------------------------------------------
"The use of COBOL cripples the mind; its teaching should, therefore, be 
regarded as a criminal offence." E.W.Dijkstra, 18th June 1975.
|||  clc5q@virginia.edu (Clark L. Coleman)

hrubin@pop.stat.purdue.edu (Herman Rubin) (05/15/91)

In article <1991May14.141623.16040@murdoch.acc.Virginia.EDU>, clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
> In article <12248@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> >In article <9105130317.AA09926@ucbvax.Berkeley.EDU>, jbs@WATSON.IBM.COM writes:

			..................

> The Cody/Waite algorithm does do float -> integer conversion at one point,
> and the presence of Herman's requested instruction would make it
> possible to do things differently. But, should we be requesting a
> floating multiply instruction that produces an integer result and
> a floating fractional result? This might be faster. Or are there other
> routines that cannot be altered from a divide to a multiply and are
> important enough to justify the requested divide instruction ?

There are quite a few machines which cannot even find the integer part
of a floating number in hardware.  This can be a substantial problem.

There are situations in which the floating fractional result of the
multiplication leads to problems.  The way division is now done on
most computers, it might be just as good to do a multiply and subtract.
So at least float -> integer part, in both integer and floating form,
is needed.
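
In standard C the operation has to be synthesized along these lines
(a sketch):

    #include <math.h>

    /* Integer part of a float, delivered in both floating and
     * integer form, plus the fraction.  modf truncates toward
     * zero; the cast overflows if |x| is too large. */
    double split_float(double x, double *ip_float, long *ip_int)
    {
        double frac = modf(x, ip_float); /* floating integer part */
        *ip_int = (long)*ip_float;       /* same value, integer register */
        return frac;
    }

Several instructions, and possibly a trip through memory, for what
could be one hardware operation.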
-- 
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)

jbs@WATSON.IBM.COM (05/15/91)

         Peter Montgomery gives an example (computing sin(1234.) in
4 digit decimal) which purports to show how to use a float/float ->
integer, remainder instruction for argument reduction.  There are some
problems with his method (as I will discuss later).  In any case it is
beside the point.  I am not claiming you cannot use this instruction
for argument reduction.  I am claiming it will not enable you to do
argument reduction any faster.  This is because it can be expected to run
slower than floating divide.  Floating divides are so slow on most recent
machines (as compared to floating multiplies or adds) that it pays to go
to great lengths to avoid them.  In this case this is easy to do, as we
are dividing by a constant, so we can multiply by the reciprocal instead.
         Consider for example how argument reduction is done for the
exp routine (as we will see this is the easier case).  We are using the
identity exp(x)=exp(n*log(2)+(x-n*log(2)))=(2**n)*exp(y) where
y=x-n*log(2).  We want y as near to 0 as possible so we choose n to be
the nearest integer to x/log(2).  We will compute exp(y) by using a
minimax polynomial approximation on the interval (-.5*ln(2),.5*ln(2)).
If we actually use a minimax approximation valid on a slightly bigger
interval (say (-.35,.35)) then we can afford to be slightly sloppy in
choosing n.  Hence we take n to be the nearest integer to x*(1./log(2)).
Now we must compute x-n*log(2) without loss of accuracy due to
cancellation.  In this case this is easy.  Note n is at most 11 bits
long (else exp(x) will overflow or underflow; we are assuming IEEE
64-bit arithmetic).  So write ln(2)=z1+z2 where z1 contains the first
42 bits of ln(2) and z2 the next 53.  Then y=(x-n*z1)-n*z2 computes y
accurately.  Also this entire sequence will be faster than a single
floating divide on either a 3090 mainframe or a RISC System/6000.
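         In C the sequence might read as follows (a sketch of the
method just described, not actual library code; for brevity z2 here
carries only the remaining bits of the double-precision ln(2), where
a real routine would extend it with the next 53 bits of the true
ln(2)):

    #include <math.h>

    double exp_arg_reduce(double x, int *n_out)
    {
        const double ln2     = 0.69314718055994530942;
        const double inv_ln2 = 1.44269504088896340736;
        /* z1 = first 42 bits of ln(2); n has at most 11 bits, so
         * n*z1 fits in 53 bits and is computed exactly. */
        double z1 = ldexp(floor(ldexp(ln2, 42)), -42);
        double z2 = ln2 - z1;                  /* exact: the dropped bits */
        int n = (int)floor(x * inv_ln2 + 0.5); /* nearest integer to x/ln(2) */
        *n_out = n;                            /* exp(x) = (2**n)*exp(y) */
        return (x - n * z1) - n * z2;          /* y, no cancellation loss */
    }

Note only multiplies, adds and one conversion; no divide.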
         For trigonometric functions we let n be the closest integer to
x/(pi/2) and y=x-n*(pi/2).  The computation of n and y is trickier for
two reasons:
         1)  n may be very large.
         2)  y must be computed with small relative error, not just small
absolute error (since sin has a zero at zero).
         Peter Montgomery's method does not deal with these problems.
Since the quotient is returned as an integer it will overflow whenever
n>2**31.  His example assumes his integers have as many digits as his
floats.  If the integers were 2 digits long then 785 would overflow.
His example also assumes 8 digits of pi/2 suffice for accurate argument
reduction of 4 digit numbers.  This is not true, as can be seen by
trying sin(355.).  Finally, in computing 785*pi2, 785 must be converted
back to floating point, so you still have conversions.
         Problem 1 can be dealt with by requiring abs(x) to be not too
large.  Bigger arguments can be forbidden or handled by alternative
slower code.  Since almost all inputs to the trig functions lie near zero
this will not affect the average performance.
         Problem 2 can be dealt with by dividing pi/2 into more than 2
parts.  The precision of pi/2 required increases as the allowed values
of n increase.  Note if you have 3 parts pi1, pi2 and pi3 you must
compute t-n*pi2 accurately as well as t=x-n*pi1.
         The general problem of computing x-y*z accurately even when y*z
is near x can be handled in several ways without the new instruction.  On
machines such as the RISC System/6000 with a fused multiply-add there is
no problem: the multiply-add instruction does the right thing.  On
machines that can return the low part of y*z you just subtract the high
part first, then the low part.
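         Both options admit a short C sketch (assuming the fused
multiply-add is exposed as fma(); the second routine recovers the low
part of y*z in software by Dekker-style splitting):

    #include <math.h>

    /* With a fused multiply-add: x - y*z in a single rounding. */
    double residual_fma(double x, double y, double z)
    {
        return fma(-y, z, x);
    }

    /* Without one: split y and z in half, form the product's high
     * and low parts exactly, and subtract them in turn.  Accurate
     * when y*z is near x, the case that matters here. */
    double residual_split(double x, double y, double z)
    {
        const double c = 134217729.0;            /* 2^27 + 1 */
        double yh = y * c - (y * c - y), yl = y - yh;
        double zh = z * c - (z * c - z), zl = z - zh;
        double hi = y * z;                       /* rounded product */
        double lo = ((yh * zh - hi) + yh * zl + yl * zh) + yl * zl;
        return (x - hi) - lo;
    }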
                          James B. Shearer

jbs@WATSON.IBM.COM (05/17/91)

        David Hough states:
A couple of issues have been confused in recent comp.arch discussion.
One is whether argument reduction for elementary transcendental functions
involves simultaneous computation of a quotient (preferably in integer
format) and a remainder; another is whether instructions for that purpose
should be included in computer instruction set architectures.

        Well, the reason they have been "confused" is that people
advocating the float/float = integer, remainder instruction have stated
that such an instruction would be useful for argument reduction for
elementary transcendental functions.  David Hough appears (his post was
a little unclear) to agree with me that this is not the case.
        If you are going to put an instruction in your architecture
that may run for thousands of cycles and cause all sorts of design
headaches, it would be nice if it did something useful.
                       James B. Shearer

pmontgom@euphemia.math.ucla.edu (Peter Montgomery) (05/17/91)

In article <9105150230.AA26299@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes:

>         Peter Montgomery gives an example (computing sin(1234.) in
>4 digit decimal) which purports to show how to use a float/float ->
>integer, remainder instruction for argument reduction.  There are some
>problems with his method (as I will discuss later).  In any case it is
>beside the point.  I am not claiming you cannot use this instruction
>for argument reduction.  I am claiming it will not enable you to do
>argument reduction any faster.  

	Then I misunderstood you.  I believe that the remainder is needed seldom
enough that it may be reasonable to require up to (say) five floating point
instructions to get it.   But it is essential that one be able to
request this operation in a portable way which compilers can translate
into efficient object code producing an accurate result, such as
		
		Q = round(Y/X);		(take quotient, round to nearest
				 	integer, keep in floating format)
		R = subtract_product(Y, Q, X);	
					(Y - Q*X without intermediate rounding)

Writing R = Y - X*round(Y/X) (say) is unacceptable if IEEE rounding is
done after the multiply, as it will not produce the remainder
which the application wants.  The point of my previous posting was that
the remainder Herman Rubin requests can be useful.  Now the question 
is how one can obtain that remainder on current and proposed systems.  
I asked "How can Herman Rubin get this remainder accurately
with the hardware and software commonly available?"
The IBM RS/6000 has such an instruction,
but how does one write code that works there as well as on the
SPARC series, the MIPS R3000 series, and elsewhere?
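
	In C, a direct rendering might be (a sketch, assuming the
machine's fused multiply-add is exposed as fma(); without one,
subtract_product needs the splitting tricks discussed earlier in this
thread):

    #include <math.h>

    /* Q = round(Y/X): quotient rounded to nearest integer, kept
     * in floating format.  (The divide itself rounds, so Q can be
     * off by one in ties; the remainder then just lands in the
     * adjacent interval.) */
    double round_quotient(double y, double x)
    {
        return floor(y / x + 0.5);
    }

    /* R = Y - Q*X without intermediate rounding. */
    double subtract_product(double y, double q, double x)
    {
        return fma(-q, x, y);
    }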

	Shearer goes on to mention how, using ln(2) = z1 + z2
where z1 has only 42 significant bits, we can compute exponentials,
since the quotient will be at most 11 bits.  It appears sound.

>         For trigonometric functions we let n be the closest integer to
>x/(pi/2) and y=x-n*(pi/2).  The computation of n and y is trickier for
>reasons:
>        1)  n may be very large.
>         2)  y must be computed with small relative error, not just small
>absolute error (since sin has a zero at zero).

>        Peter Montgomery's method does not deal with these problems.
>Since the quotient is returned as an integer it will overflow whenever
>n>2**31.  His example assumes his integers have as many digits as his
>floats.  

	The intermediate quotient can indeed be stored as a float
to guard against this very problem, using a separate instruction
to convert it to an integer if the application knows it will be in range.

>His example also assumes 8 digits of pi/2 suffice for accurate argument
>argument reduction of 4 digit numbers.  This is not true as can be seen
>by trying sin(355.).  

	Yes, if one wants four significant figures of sin(355) ~ -.00003014.
But for many applications, the absolute error in trigonometric functions
is more important than the relative error; for example, it matters
little whether one approximates the complex number

	10000*(cos(355), sin(355)) 

as (-10000, 0) or as (-10000, -0.3014).  On a SUN 4, if I compute
sin(2549491779) using double precision arithmetic, I get 4.507E-10
whereas Maple gives 4.474E-10.  Despite the huge quotient,
the result is accurate to 11 decimal places, although to only
2 significant digits; and it is the decimal places which
matter for trig functions.

 	If this error is unacceptable, one needs a more accurate approximation
to pi/2, as Shearer suggests.  And since the subsequent corrections
won't have the direct form of a remainder from a divide,
putting the remainder directly in the floating point divide instruction 
is probably not the best solution.  But the functionality is needed somehow.

--
        Peter L. Montgomery 
        pmontgom@MATH.UCLA.EDU 
        Department of Mathematics, UCLA, Los Angeles, CA 90024-1555
If I spent as much time on my dissertation as I do reading news, I'd graduate.