dgh@sun.uucp (David Hough) (11/25/85)
nroff source for this posting is available from the author. Extending Floating Point Chip Architecture David Hough _A_B_S_T_R_A_C_T The Motorola 68881 and Weitek 1164/1165 chips have become popular floating point implementations and represent different decisions about perfor- mance, complexity, and conformance to the IEEE standard. Future chip sets in the 1164/1165 style should incorporate some of the ideas that shaped the 68881. _T_h_i_s _p_a_p_e_r _c_o_n_t_a_i_n_s _p_e_r_s_o_n_a_l _o_p_i_n_i_o_n_s _o_f _t_h_e _a_u_t_h_o_r _a_n_d _s_h_o_u_l_d _n_o_t _b_e _c_o_n_s_t_r_u_e_d _a_s _t_h_e _p_o_s_i_t_i_o_n _o_f _a_n_y _o_t_h_e_r _p_e_r_s_o_n _o_r _o_r_g_a_n_i_z_a_t_i_o_n. The Motorola 68881 establishes a new standard in per- formance for single-chip floating point coprocessors. But even more performance can be realized from multichip imple- mentations of floating point units. The Weitek 1164/1165 chip set has been very popular in recent designs of this type. Other semiconductor manufacturers have noticed Weitek's continuing profitability and have undertaken or contemplated competitive products. Certain features of the 1164/1165 architecture have meant that compromises must be made between performance, complexity, and conformance to the IEEE standard for binary floating point. Eliminating these compromises is an opportunity for Weitek to expand its lead or its competitors to narrow it. What follows is a list of possibilities for Weitek and its competitors for developing the successors to the 1164/1165 chip set. _M_a_r_k_e_t _r_e_q_u_i_r_e_m_e_n_t_s: It may help to understand the market needs which motivate my suggestions. The intended market is general-purpose double precision floating point computations expressed in higher level languages, particu- larly Fortran. The calculations to be performed are not known in advance, so the floating point hardware to be sup- plied must be of the highest possible quality and completely conforming to the IEEE standard, possibly in conjunction with appropriate software. This differs from some other applications such as signal processing and graphics transformations in which the single precision calculations are well understood and for which deviations from the IEEE standard may be acceptable. General-purpose computation is often done in double precision in the hope that roundoff D. Hough 24 November 1985 Extending Floating Point Chip Architecture 2 error analysis may thereby by avoided. This hope is not totally unfounded but roundoff analysis may yet be required if the "atomic" double precision computations are not as accurate as possible. What is thought of as atomic varies among languages and programmers, but most expect justifiably that rational operations + - * / and format conversions be as accurate as possible, with no more error than half a unit the last place of double precision; many expect, but with less justification, the same accuracy in all the intrinsic Fortran functions including exponential, logarithmic, and trigonometric elementary functions and complex arithmetic. _E_X_T_E_N_D_E_D _P_R_E_C_I_S_I_O_N The greatest advantage the 68881 has over the 1164/1165 is that extended precision is used for all intermediate results. This means that computations like double precision complex magnitude sqrt(x*x + y*y) can be computed in a straightforward way with no difficul- ties due to intermediate underflow or underflow. No 1164/1165 program is equally robust and efficient, even given a fast square root. Consequently the most important extension to the 1164/1165 architecture would be to include extended precision operands. It is true that extended precision increases the com- plexity and reduces the performance of the multiplier by an amount which is some function of (64/53). But when coding algorithms intended to be robust, that loss of efficiency on a single operation is often more than regained by simpler algorithms overall. Here's one robust double precision com- plex magnitude algorithm designed for the 1164/1165, assum- ing an efficient square root is already programmed: if (abs(x) < abs(y)) then /* swap */ t = x x = y y = t q = y/x /* q <=1 or is NaN */ if (6 <= result status <= 10) then ... /* compute correct q by wrapping x or y, repeating q = x/y and unwrapping q */ if (result status = 15) then return abs(x) /* q = 0/0 or inf/inf */ save weitek mode set weitek mode to "fast" /* so q*q is safe */ return abs(x) * sqrt( 1 + q*q ) restore weitek mode The division could be eliminated by constructing an even more complicated algorithm that involves exponent extraction and scaling. The point is that no elaborate planning is D. Hough 24 November 1985 Extending Floating Point Chip Architecture 3 required to obtain satisfactory results with extended preci- sion. In the 68881 architecture, all numerical results are computed to extended precision, then rounded if desired to a shorter precision in a subsequent operation. While this seems less efficient than computing to the precision of the argument, it may actually be more efficient; for instance, a product of single precision numbers fits in double or extended precision exactly without need for rounding or overflow/underflow checking. If the product is to be used immediately as the operand to an addition then it might as well be left in higher precision and the addition performed in the higher precision. The most general approach is to provide instructions which produce results in the highest precision available, as well as instructions which produce results in the precision of the operands. The first is usu- ally faster and more accurate for iterative accumulations; the latter is better for isolated calculations A = B op C. _A_c_c_u_m_u_l_a_t_i_o_n: The most typical common floating point operation pattern in scientific computation is a multiplica- tion followed by an add/subtract involving the product. Floating point architectures should make this operation as efficient as possible. Representative simple inner loops include: do 1 i = 1, n 1 s = s + x(i) * y(i) do 1 i = 1, n 1 x(i) = x(i) + k * y(i) do 1 i = 1, n 1 p = c(i) + x * p do 1 i = 1, n /* subtract then multiply */ 1 p = p * (x - r(i)) Some representative complicated inner loops are contained in the Livermore Loops benchmark program of McMahon. _B_u_s _s_i_z_e: Although 32 bit buses are most common at the system level, many implementations could support local 64 bit buses between, for instance, floating point chips and registers. It would be helpful to be able to be able to transmit one 64 bit operand or two 32 bit operands in a sin- gle cycle. Extended formats might as well match the 68881, using only 80 bits. Unused bits make the extended format 96 bits wide in 32 bit systems and 128 bits wide in 64 bit systems. D. Hough 24 November 1985 Extending Floating Point Chip Architecture 4 _A_D_D_I_T_I_O_N_A_L _I_N_S_T_R_U_C_T_I_O_N_S _D_i_v_i_s_i_o_n, _s_q_u_a_r_e _r_o_o_t, _a_n_d _r_e_m_a_i_n_d_e_r: Correct IEEE division, square root, and remainder, along with Fortran MOD, can all be implemented with a fairly uniform algorithm along the lines of work done by George Taylor at Berkeley and ELXSI. Taylor's work is published in various recent volumes of the proceedings of the IEEE Computer Arithmetic Symposia. These instructions would be worth the cost of a third chip if the payoff were correctly rounded division and square root in two to four multiplication times. If only modest performance is provided for correctly rounded divi- sion, then an additional division or reciprocal instruction could be offered that does not meet IEEE rounding specifica- tions, if it can provide an approximate quotient in only a few mutiplication times. Just as correctly rounded division is tedious to pro- vide with only multiplication in hardware, a correctly rounded square root is tedious with only division in hardware, yet the inherent complexities of division and square root are comparable. One such tedious algorithm is contained in a July 1980 note from W. Kahan to the IEEE P754 committee. Since a complete IEEE remainder can require a long time to compute, it must be interruptible and restartable, or only a partial remainder operation may be provided, as on the Intel 8087. An additional status register is required to hold the partial/complete flag and the least 8, 16, or 32 bits of quotient from a remainder operation. _Q_u_a_d_r_u_p_l_e _p_r_e_c_i_s_i_o_n _s_u_p_p_o_r_t: If extended precision is not supported or if correctly rounded division or square root are omitted, then quadruple precision support instruc- tions can be used to allow efficient assembly-language software to replace the missing hardware. Quadruple preci- sion support operations for add, subtract, multiplication, and division take two double precision operands and return a double precision result. Thus the quadruple precision sup- port multiplication instruction returns a double precision result that, added to the result of a double precision mul- tiplication, would yield the exact quadruple precision pro- duct. Correctly rounded double precision division and square root can be constructed from approximations if an exact residual can be computed; the quadruple precision sup- port multiplication facilitates the latter. The quadruple precision division support instruction returns the double precision remainder from a correctly rounded double preci- sion division; that remainder can be used to construct the IEEE remainder if it is not otherwise provided. _C_o_n_v_e_r_s_i_o_n _t_o _i_n_t_e_g_e_r: There is considerable confusion about what functions are required for converting floating D. Hough 24 November 1985 Extending Floating Point Chip Architecture 5 point arguments to integral values in integer formats (For- tran INT, NINT) or to integral values in floating point for- mats (Fortran AINT, ANINT). What is certain is that making the result of the operation dependent on dynamic modes is a mistake, except to the extent the operation is intended to reflect dynamic IEEE rounding modes as discussed later. The following operations cover all requirements in this area: A) convert to integral value in integer format: i) rounding toward zero (INT), ii) biased rounding to nearest (NINT), iii) rounding according to current IEEE rounding mode (Intel IRINT); B) convert to integral value in floating point format: i) rounding toward zero (AINT), ii) biased rounding to nearest (ANINT), iii) rounding according to current IEEE rounding mode (Intel RINT). A further question is whether to set the inexact excep- tion in cases like converting 1.4 to an integer value. The IEEE P854 committee has at various times voted to define operation B)iii) above with and without setting the inexact flag. A noncommittal approach would be to provide two op codes for each of the six operations. _N_a_N_s: The 1164/1165 do not distinguish quiet from sig- nalling NaNs. To meet the IEEE standard the distinction has to be maintained in software by trapping on all occurrences of NaNs. It would be better to adopt the 68881 and Zilog 8070 convention that signalling NaNs are indicated by the leading fraction bit zero and quiet by that bit one. Quiet NaNs propagate without exception in operations that produce floating point results, while signalling NaNs produce an invalid exception on every operation. _O_p_t_i_m_i_z_i_n_g _i_n_s_t_r_u_c_t_i_o_n_s: These instructions can be used by optimizing compilers to avoid more costly multiplica- tions: square(x) computes x*x transmitting x only once scale(x,i) computes x*(2**i) for 32 bit integer i Square(x) produces the same numerical results and exceptions as x*x, and scale(x,i) produces the same numerical results and exceptions as x*(2**i); in particular, scale(x,1) is the same as 2*x and x+x. _I_E_E_E _A_p_p_e_n_d_i_x _F_u_n_c_t_i_o_n_s: Certain computations such as sqrt(x*x+y*y) can be made more robust in the absence of extended precision if exponent extraction and scaling instructions are available. For this purpose exponent extraction and scaling may implement the IEEE logb and scalb functions or may simply return the unbiased exponent field as an integer and provide the x*2**i operation mentioned above. D. Hough 24 November 1985 Extending Floating Point Chip Architecture 6 These functions and the IEEE copysign, nextafter, and classify functions are trivial to implement in hardware com- pared to addition or multiplication, yet are tedious to implement in software, particularly from higher level languages. _E_l_e_m_e_n_t_a_r_y _t_r_a_n_s_c_e_n_d_e_n_t_a_l _f_u_n_c_t_i_o_n_s: All the elementary transcendental functions e**x-1, e**x, 2**x, 10**x, loge(1+x), loge, log2, log10, sinh, cosh, tanh, atanh, sin, cos, sincos, tan, asin, acos, and atan are provided by the 68881. asinh and acosh could have been provided as well for completeness. The core approximations for restricted argu- ments, which are computed by rational functions on chips like the 8087, are computed by a unified cordic algorithm on the 68881. Such algorithms are described in papers like J. S. Walther's, reproduced in Swartzlander's 1980 compendium _C_o_m_p_u_t_e_r _A_r_i_t_h_m_e_t_i_c. Since cordic algorithms are not unlike certain division and square root algorithms, they might be conveniently placed on the same chip with the division operations. Indeed, division and square root algorithms have been cast in cordic form. Cordic algorithms typically require substantial extra intermediate precision and a large constant rom. The greatest difficulty with elementary transcendental functions is argument reduction, which is complicated by special cases, particularly when high accuracy and appropri- ate signalling of IEEE exceptions are desired. It may be better in high performance systems to only implement these functions for restricted arguments, especially for sin, cos, and tan, which can require long times for the remainder operation. Where rational approximations are to be used on reduced arguments, data paths should facilitate polynomial evaluation; an on-chip accumulator may be helpful. Argument reduction for exponential and logarithmic functions is facilitated by the logb and scalb IEEE appendix functions mentioned earlier. Simple integer instructions in the ALU (add/subtract, logical, shift) simplify argument reduction in systems in which the floating point unit is autonomous from the CPU. If complete transcendental functions are provided, they should be in the same chip with the IEEE remainder function for argument reduction purposes. Transcendental functions of two arguments such as X**Y and ATAN2 are best left to software. The principal question is where the benefit of extremely fast elementary transcendental functions accrues. The 68881's fast elementary transcendental functions cause amazing performance on the Whetstone marketing benchmark, relative to the 68881's performance on real floating-point intensive applications. In the Livermore Loops benchmark program mentioned earlier, the only elementary D. Hough 24 November 1985 Extending Floating Point Chip Architecture 7 transcendental function to appear in an inner loop is exp(x)-1, which appears once in the 24 loops. If the Liver- more Loops are an accurate reflection of scientific comput- ing, elementary transcendental function speed is not criti- cal. Perhaps a more important benefit of incorporating a 68881 into a system is that a high quality, high performance elementary transcendental function library is available with no software effort. _M_O_D_E_S The use of dynamic modes to define arithmetic opera- tions leads to inefficiencies and difficulties in implemen- tation and debugging. Thus it is better to provide separate instructions for "fast" and "IEEE" multiplication, rather than have one instruction whose behavior is determined at run time according to mode bits which are sometimes set by mistake! Unfortunately the IEEE rounding modes for direction and precision themselves are perhaps best left dynamic rather than incorporated into op codes. The consensus of the IEEE P754 and P854 committees seems to favor an implementation via dynamic modes, but a vocal minority favors static mode declarations at compile time. Dynamic modes preclude noth- ing and thus maximize flexibility; static modes, by preclud- ing side effects, simplify compiled code generation and verification and simplify high performance pipelined hardware. _Q_U_A_L_I_T_Y By negligence or plan, debugging floating point hardware is a job performed by the buyer as much as the seller. Hardware manufacturers' EE-oriented management, that would not expect its engineers to produce competitive products with 20-year old oscilloscopes, is sometimes blind to the user-level testing requirements of powerful develop- ment systems, operating systems, higher level language com- pilers, and personnel familiar enough with floating point computation and standards to test and debug new hardware effectively. Eventually testing programs and procedures for IEEE standard floating point will be sufficiently widely known that suppliers and users will be able to easily verify the correctness of new floating point implementations. By then there will be very little need or demand for gratuitously incorrect or deviant implementations. D. Hough 24 November 1985
ken@turtlevax.UUCP (Ken Turkowski) (11/25/85)
In article <3026@sun.uucp> dgh@sun.uucp (David Hough) writes: > _T_h_i_s _p_a_p_e_r _c_o_n_t_a_i_n_s _p_e_r_s_o_n_a_l _o_p_i_n_i_o_n_s > _o_f _t_h_e _a_u_t_h_o_r _a_n_d _s_h_o_u_l_d _n_o_t _b_e _c_o_n_s_t_r_u_e_d _a_s _t_h_e > _p_o_s_i_t_i_o_n _o_f _a_n_y _o_t_h_e_r _p_e_r_s_o_n _o_r _o_r_g_a_n_i_z_a_t_i_o_n. I was unable to read the above and other similar sections on my terminal, as the words were overlapped. Please don't try to include terminal-dependent escape codes in news postings. Regarding computations of functions of two variables (x**y or atan2(x,y) ala Fortran), I would suggest that atan2() be implemented with CORDIC iterations rather than being programmed in software, especially if sin() and the other circular and hyperbolic trigonometric functions are implemented with CORDIC. Probably the most common transendental operation is conversion between polar and rectangular coordinates (or some variant thereof), so these should be optimized. Similarly, when sin() is needed, usually cos() is needed as well. Provision should be made to allow return of two values (i.e. an array of length 2 or a two-dimensional function), especially since with CORDIC computations, the second one comes for free. -- Ken Turkowski @ CIMLINC (formerly CADLINC), Menlo Park, CA UUCP: {amd,decwrl,hplabs,seismo,spar}!turtlevax!ken ARPA: turtlevax!ken@DECWRL.DEC.COM