jbs@WATSON.IBM.COM (05/13/91)
	Herman Rubin states:

> I have already pointed out that every trigonometric and exponential
> routine does, in some way, float/float -> integer, remainder. The
> integer is also used.

	1. As I have already pointed out, the actual computation being
done is float/real -> integer, remainder (since pi and ln(2) are not
machine numbers). A proposed float/float -> integer, remainder
instruction (assuming it ran at the same speed as floating divide;
actually it would probably be slower) would not benefit the
trigonometric or exponential routines on any architecture I am
familiar with.
	2. In any case it is not true that "every" exponential routine
does this computation. The scalar short precision exp routine in IBM's
VS FORTRAN library does not do this computation in any way.
	James B. Shearer
pmontgom@euphemia.math.ucla.edu (Peter Montgomery) (05/13/91)
In article <9105130317.AA09926@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes:
>	Herman Rubin states:
>	I have already pointed out
>that every trigonometric and exponential routine does, in some way,
>float/float -> integer, remainder. The integer is also used.
>	1. As I have already pointed out, the actual computation being
>done is float/real -> integer, remainder (since pi and ln(2) are not
>machine numbers). A proposed float/float -> integer, remainder
>instruction (assuming it ran at the same speed as floating divide,
>actually it would probably be slower) would not benefit the
>trigonometric or exponential routines on any architecture I am
>familiar with

Suppose we want sin(1234.0). Approximate pi/2 = pi1 + pi2, where pi1 and pi2 are each double precision and |pi1| >> |pi2|. For purposes of illustration, let

	pi1 = 1.571
	pi2 = -0.0002037

(decimal notation with four significant digits), so that pi1 + pi2 = 1.5707963, whereas pi/2 = 1.57079632679... Herman's proposed instruction would find

	1234.0 = 785*pi1 + 0.765 .

If the quotient is always rounded to the nearest integer, then the remainder (here 0.765) can be represented exactly, even though 785*1.571 = 1233.235 requires more than four significant digits. How can Herman Rubin get this remainder accurately with the hardware and software commonly available?

jbs argues that this remainder 0.765 is only an approximation to 1234.0 mod pi/2, since the divisor 1.571 was an approximation. This is where pi2 can be used, together with the first quotient and remainder:

	1234.0 - 785*(pi1 + pi2) = (1234.0 - 785*pi1) - 785*pi2
	                         = 0.7650 + 785*0.0002037
	                         = 0.7650 + 0.1599045
	                         = 0.9249	(using 4-digit arithmetic).

The revised remainder 0.9249 compares to the true value 0.9248834 of 1234 - 785*(pi/2). With this we estimate sin(1234) ~= cos(0.9249) ~= 0.6019, which is accurate to four digits.
--
Peter L. Montgomery	pmontgom@MATH.UCLA.EDU
Department of Mathematics, UCLA, Los Angeles, CA 90024-1555
If I spent as much time on my dissertation as I do reading news, I'd graduate.
hrubin@pop.stat.purdue.edu (Herman Rubin) (05/13/91)
In article <9105130317.AA09926@ucbvax.Berkeley.EDU>, jbs@WATSON.IBM.COM writes:
> I have already pointed out that every trigonometric and exponential
> routine does, in some way, float/float -> integer, remainder. The
> integer is also used.
                   ^^^^
> 	1. As I have already pointed out, the actual computation being
> done is float/real -> integer, remainder (since pi and ln(2) are not
> machine numbers). A proposed float/float -> integer, remainder
> instruction (assuming it ran at the same speed as floating divide,
> actually it would probably be slower) would not benefit the
> trigonometric or exponential routines on any architecture I am
> familiar with

Divide has been one of the stumbling blocks of computing since the first speedups of multiplication. You are right about the accuracy part, and certainly for accurate computation, corrections would have to be made. It might run slightly slower than floating divide, but not much, unless the divide is done by multiplication, which it might very well be. Even then, some decent hardware support is called for.

I believe that the quote from me was taken from an article in which I pointed out that communication between floating and integer registers was important. I have underlined the key word here; however things are done, it is still necessary to get that integer to the integer unit, and several steps, going through memory, hardly seems to be the way.

> 	2. In any case it is not true that "every" exponential routine
> does this computation. The scalar short precision exp routine in IBM's
> VS FORTRAN library does not do this computation in any way.

I fail to see how one would compute exp(pi^3) even to short precision without doing some computation like (pi^3)/ln2.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) (05/14/91)
In article <12248@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
>In article <9105130317.AA09926@ucbvax.Berkeley.EDU>, jbs@WATSON.IBM.COM writes:
>
>> 	2. In any case it is not true that "every" exponential routine
>> does this computation. The scalar short precision exp routine in IBM's
>> VS FORTRAN library does not do this computation in any way.
>
>I fail to see how one would compute exp(pi^3) even to short precision
>without doing some computation like (pi^3)/ln2.

A reference to check: Cody, William J., Jr., and William Waite, "Software Manual for the Elementary Functions," Prentice-Hall, 1980. Pages 60-71 seem to describe an algorithm for the exponential routine that involves the division by ln2 that Herman describes. Of course, they recommend multiplication by the reciprocal of ln2, which can be computed and used as a stored constant. No divides are found in the algorithm, as best I can tell from a cursory reading. Perhaps IBM's VS FORTRAN library is using the multiply instead of the divide in the same fashion?

The Cody/Waite algorithm does do float -> integer conversion at one point, and the presence of Herman's requested instruction would make it possible to do things differently. But should we be requesting a floating multiply instruction that produces an integer result and a floating fractional result? This might be faster. Or are there other routines that cannot be altered from a divide to a multiply and are important enough to justify the requested divide instruction?
-----------------------------------------------------------------------------
"The use of COBOL cripples the mind; its teaching should, therefore, be
regarded as a criminal offence." E.W. Dijkstra, 18th June 1975.
||| clc5q@virginia.edu (Clark L. Coleman)
hrubin@pop.stat.purdue.edu (Herman Rubin) (05/15/91)
In article <1991May14.141623.16040@murdoch.acc.Virginia.EDU>, clc5q@hemlock.cs.Virginia.EDU (Clark L. Coleman) writes:
> In article <12248@mentor.cc.purdue.edu> hrubin@pop.stat.purdue.edu (Herman Rubin) writes:
> >In article <9105130317.AA09926@ucbvax.Berkeley.EDU>, jbs@WATSON.IBM.COM writes:
	..................
> The Cody/Waite algorithm does do float -> integer conversion at one point,
> and the presence of Herman's requested instruction would make it
> possible to do things differently. But should we be requesting a
> floating multiply instruction that produces an integer result and
> a floating fractional result? This might be faster. Or are there other
> routines that cannot be altered from a divide to a multiply and are
> important enough to justify the requested divide instruction?

There are quite a few machines which cannot even find the integer part of a floating number in hardware. This can be a substantial problem. There are situations in which the floating fractional result of the multiplication leads to problems. The way division is now going on most computers, it might be just as good to do a multiply and subtract. So at least float -> integer part, in both integer and floating, is needed.
--
Herman Rubin, Dept. of Statistics, Purdue Univ., West Lafayette IN47907-1399
Phone: (317)494-6054
hrubin@l.cc.purdue.edu (Internet, bitnet)   {purdue,pur-ee}!l.cc!hrubin(UUCP)
jbs@WATSON.IBM.COM (05/15/91)
	Peter Montgomery gives an example (computing sin(1234.) in
4-digit decimal) which purports to show how to use a float/float ->
integer, remainder instruction for argument reduction. There are some
problems with his method (as I will discuss later). In any case it is
beside the point. I am not claiming you cannot use this instruction
for argument reduction. I am claiming it will not enable you to do
argument reduction any faster. This is because it can be expected to
run slower than floating divide. Floating divides are so slow on most
recent machines (as compared to floating multiplies or adds) that it
pays to go to great lengths to avoid them. In this case this is easy
to do: since we are dividing by a constant, we can multiply by the
reciprocal instead.
	Consider for example how argument reduction is done for the exp
routine (as we will see this is the easier case). We are using the
identity exp(x) = exp(n*log(2) + (x - n*log(2))) = (2**n)*exp(y), where
y = x - n*log(2). We want y as near to 0 as possible, so we choose n to
be the nearest integer to x/log(2). We will compute exp(y) by using a
minimax polynomial approximation on the interval (-.5*ln(2),.5*ln(2)).
If we actually use a minimax approximation valid on a slightly bigger
interval (say (-.35,.35)) then we can afford to be slightly sloppy in
choosing n. Hence we take n to be the nearest integer to x*(1./log(2)).
Now we must compute x - n*log(2) without loss of accuracy due to
cancellation. In this case this is easy. Note n is at most 11 bits long
(else exp(x) will overflow or underflow; we are assuming IEEE 64-bit
arithmetic). So write ln(2) = z1 + z2, where z1 contains the first 42
bits of ln(2) and z2 the next 53. Then y = (x - n*z1) - n*z2 computes y
accurately. Also this entire sequence will be faster than a single
floating divide on either a 3090 mainframe or a Risc System 6000.
	For trigonometric functions we let n be the closest integer to
x/(pi/2) and y = x - n*(pi/2). The computation of n and y is trickier
for two reasons:
	1) n may be very large.
	2) y must be computed with small relative error, not just small
absolute error (since sin has a zero at zero).
	Peter Montgomery's method does not deal with these problems.
Since the quotient is returned as an integer it will overflow whenever
n > 2**31. His example assumes his integers have as many digits as his
floats. If the integers were 2 digits long then 785 would overflow.
His example also assumes 8 digits of pi/2 suffice for accurate argument
reduction of 4-digit numbers. This is not true, as can be seen by
trying sin(355.). Finally, in computing 785*pi2, 785 must be converted
back to floating point, so you still have conversions.
	Problem 1 can be dealt with by requiring abs(x) to be not too
large. Bigger arguments can be forbidden or handled by alternative
slower code. Since almost all inputs to the trig functions lie near
zero, this will not affect the average performance. Problem 2 can be
dealt with by dividing pi/2 into more than 2 parts. The precision of
pi/2 required increases as allowed values of n increase. Note if you
have 3 parts pi1, pi2 and pi3 you must compute t - n*pi2 accurately as
well as t = x - n*pi1.
	The general problem of computing x - y*z accurately even when
y*z is near x can be handled in several ways without the new
instruction. On machines such as the Risc System 6000 with a fused
multiply-add there is no problem; the multiply-add instruction does the
right thing. On machines that can return the low part of y*z you just
subtract the high part first, then the low part.
	James B. Shearer
jbs@WATSON.IBM.COM (05/17/91)
	David Hough states:

> A couple of issues have been confused in recent comp.arch discussion.
> One is whether argument reduction for elementary transcendental
> functions involves simultaneous computation of a quotient (preferably
> in integer format) and a remainder; another is whether instructions
> for that purpose should be included in computer instruction set
> architectures.

	Well, the reason they have been "confused" is that people
advocating the float/float -> integer, remainder instruction have
stated that such an instruction would be useful for argument reduction
for elementary transcendental functions. David Hough appears (his post
was a little unclear) to agree with me that this is not the case. If
you are going to put an instruction in your architecture that may run
for 1000's of cycles and causes all sorts of design headaches, it would
be nice if it did something useful.
	James B. Shearer
pmontgom@euphemia.math.ucla.edu (Peter Montgomery) (05/17/91)
In article <9105150230.AA26299@ucbvax.Berkeley.EDU> jbs@WATSON.IBM.COM writes:
>	Peter Montgomery gives an example (computing sin(1234.) in
>4 digit decimal) which purports to show how to use a float/float ->
>integer, remainder instruction for argument reduction. There are some
>problems with his method (as I will discuss later). In any case it is
>beside the point. I am not claiming you cannot use this instruction
>for argument reduction. I am claiming it will not enable you to do
>argument reduction any faster.

Then I misunderstood you. I believe that the remainder is needed seldom enough that it may be reasonable to require up to (say) five floating point instructions to get it. But it is essential that one be able to request this operation in a portable way which compilers can translate into efficient object code producing an accurate result, such as

	Q = round(Y/X);
		(take quotient, round to nearest integer, keep in floating format)
	R = subtract_product(Y, Q, X);
		(Y - Q*X without intermediate rounding)

Writing R = Y - X*round(Y/X) (say) is unacceptable if IEEE rounding is done after the multiply, as it will not produce the remainder which the application wants.

The point of my previous posting was that the remainder Herman Rubin requests can be useful. Now the question is how one can obtain that remainder on current and proposed systems. I asked "How can Herman Rubin get this remainder accurately with the hardware and software commonly available?" The IBM RS/6000 has such an instruction, but how does one write code that works there as well as on the SPARC series, the MIPS R3000 series, and elsewhere?

Shearer goes on to mention how, using ln(2) = z1 + z2 where z1 has only 42 significant bits, we can compute exponentials, since the quotient will be at most 11 bits. It appears sound.

>	For trigonometric functions we let n be the closest integer to
>x/(pi/2) and y = x - n*(pi/2). The computation of n and y is trickier
>for two reasons:
>	1) n may be very large.
>	2) y must be computed with small relative error, not just small
>absolute error (since sin has a zero at zero).
>	Peter Montgomery's method does not deal with these problems.
>Since the quotient is returned as an integer it will overflow whenever
>n > 2**31. His example assumes his integers have as many digits as his
>floats.

The intermediate quotient can indeed be stored as a float to guard against this very problem, using a separate instruction to convert it to integer if the application knows it will be in range.

>His example also assumes 8 digits of pi/2 suffice for accurate
>argument reduction of 4-digit numbers. This is not true, as can be
>seen by trying sin(355.).

Yes, if one wants four significant figures of sin(355) ~ -.00003014. But for many applications, the absolute error in trigonometric functions is more important than the relative error; for example, it matters little whether one approximates the complex number 10000*(cos(355), sin(355)) as (-10000, 0) or as (-10000, -0.3014).

On a SUN 4, if I compute sin(2549491779) using double precision arithmetic, I get 4.507E-10 whereas Maple gives 4.474E-10. Despite the huge quotient, the result is accurate to 11 decimal places, although to only 2 significant digits; however, it is the decimal places which are important for trig functions. If this error is unacceptable, one needs a more accurate approximation to pi/2, as Shearer suggests. And since the subsequent corrections won't have the direct form of a remainder from divide, putting the remainder directly in the floating point divide instruction is probably not the best solution. But the functionality is needed somehow.
--
Peter L. Montgomery	pmontgom@MATH.UCLA.EDU
Department of Mathematics, UCLA, Los Angeles, CA 90024-1555
If I spent as much time on my dissertation as I do reading news, I'd graduate.