[net.arch] Floating Point Chip Architecture

dgh@sun.uucp (David Hough) (11/25/85)

nroff source for this posting is available from the author.

Extending Floating Point Chip Architecture

David Hough

_A_B_S_T_R_A_C_T

The Motorola 68881 and Weitek 1164/1165 chips
have become popular floating point implementations
and represent different decisions about perfor-
mance, complexity, and conformance to the IEEE
standard. Future chip sets in the 1164/1165 style
should incorporate some of the ideas that shaped
the 68881. _T_h_i_s _p_a_p_e_r _c_o_n_t_a_i_n_s _p_e_r_s_o_n_a_l _o_p_i_n_i_o_n_s
_o_f _t_h_e _a_u_t_h_o_r _a_n_d _s_h_o_u_l_d _n_o_t _b_e _c_o_n_s_t_r_u_e_d _a_s _t_h_e
_p_o_s_i_t_i_o_n _o_f _a_n_y _o_t_h_e_r _p_e_r_s_o_n _o_r _o_r_g_a_n_i_z_a_t_i_o_n.

The Motorola 68881 establishes a new standard in per-
formance for single-chip floating point coprocessors. But
even more performance can be realized from multichip imple-
mentations of floating point units. The Weitek 1164/1165
chip set has been very popular in recent designs of this
type. Other semiconductor manufacturers have noticed
Weitek's continuing profitability and have undertaken or
contemplated competitive products. Certain features of the
1164/1165 architecture have meant that compromises must be
made between performance, complexity, and conformance to the
IEEE standard for binary floating point. Eliminating these
compromises is an opportunity for Weitek to expand its lead
or its competitors to narrow it. What follows is a list of
possibilities for Weitek and its competitors for developing
the successors to the 1164/1165 chip set.

_M_a_r_k_e_t _r_e_q_u_i_r_e_m_e_n_t_s: It may help to understand the
market needs which motivate my suggestions. The intended
market is general-purpose double precision floating point
computations expressed in higher level languages, particu-
larly Fortran. The calculations to be performed are not
known in advance, so the floating point hardware to be sup-
plied must be of the highest possible quality and completely
conforming to the IEEE standard, possibly in conjunction
with appropriate software. This differs from some other
applications such as signal processing and graphics
transformations in which the single precision calculations
are well understood and for which deviations from the IEEE
standard may be acceptable. General-purpose computation is
often done in double precision in the hope that roundoff

D. Hough 24 November 1985

Extending Floating Point Chip Architecture 2

error analysis may thereby by avoided. This hope is not
totally unfounded but roundoff analysis may yet be required
if the "atomic" double precision computations are not as
accurate as possible. What is thought of as atomic varies
among languages and programmers, but most expect justifiably
that rational operations + - * / and format conversions be
as accurate as possible, with no more error than half a unit
the last place of double precision; many expect, but with
less justification, the same accuracy in all the intrinsic
Fortran functions including exponential, logarithmic, and
trigonometric elementary functions and complex arithmetic.

_E_X_T_E_N_D_E_D _P_R_E_C_I_S_I_O_N

The greatest advantage the 68881 has over the 1164/1165
is that extended precision is used for all intermediate
results. This means that computations like double precision
complex magnitude

sqrt(x*x + y*y)

can be computed in a straightforward way with no difficul-
ties due to intermediate underflow or underflow. No
1164/1165 program is equally robust and efficient, even
given a fast square root. Consequently the most important
extension to the 1164/1165 architecture would be to include
extended precision operands.

It is true that extended precision increases the com-
plexity and reduces the performance of the multiplier by an
amount which is some function of (64/53). But when coding
algorithms intended to be robust, that loss of efficiency on
a single operation is often more than regained by simpler
algorithms overall. Here's one robust double precision com-
plex magnitude algorithm designed for the 1164/1165, assum-
ing an efficient square root is already programmed:
if (abs(x) < abs(y)) then /* swap */
t = x
x = y
y = t
q = y/x /* q <=1 or is NaN */
if (6 <= result status <= 10) then ... /* compute correct q
by wrapping x or y,
repeating q = x/y
and unwrapping q */
if (result status = 15) then
return abs(x) /* q = 0/0 or inf/inf */
save weitek mode
set weitek mode to "fast" /* so q*q is safe */
return abs(x) * sqrt( 1 + q*q )
restore weitek mode
The division could be eliminated by constructing an even
more complicated algorithm that involves exponent extraction
and scaling. The point is that no elaborate planning is

D. Hough 24 November 1985

Extending Floating Point Chip Architecture 3

required to obtain satisfactory results with extended preci-
sion.

In the 68881 architecture, all numerical results are
computed to extended precision, then rounded if desired to a
shorter precision in a subsequent operation. While this
seems less efficient than computing to the precision of the
argument, it may actually be more efficient; for instance, a
product of single precision numbers fits in double or
extended precision exactly without need for rounding or
overflow/underflow checking. If the product is to be used
immediately as the operand to an addition then it might as
well be left in higher precision and the addition performed
in the higher precision. The most general approach is to
provide instructions which produce results in the highest
precision available, as well as instructions which produce
results in the precision of the operands. The first is usu-
ally faster and more accurate for iterative accumulations;
the latter is better for isolated calculations A = B op C.

_A_c_c_u_m_u_l_a_t_i_o_n: The most typical common floating point
operation pattern in scientific computation is a multiplica-
tion followed by an add/subtract involving the product.
Floating point architectures should make this operation as
efficient as possible. Representative simple inner loops
include:

do 1 i = 1, n
1 s = s + x(i) * y(i)

do 1 i = 1, n
1 x(i) = x(i) + k * y(i)

do 1 i = 1, n
1 p = c(i) + x * p

do 1 i = 1, n /* subtract then multiply */
1 p = p * (x - r(i))

Some representative complicated inner loops are contained in
the Livermore Loops benchmark program of McMahon.

_B_u_s _s_i_z_e: Although 32 bit buses are most common at the
system level, many implementations could support local 64
bit buses between, for instance, floating point chips and
registers. It would be helpful to be able to be able to
transmit one 64 bit operand or two 32 bit operands in a sin-
gle cycle.

Extended formats might as well match the 68881, using
only 80 bits. Unused bits make the extended format 96 bits
wide in 32 bit systems and 128 bits wide in 64 bit systems.

D. Hough 24 November 1985

Extending Floating Point Chip Architecture 4

_A_D_D_I_T_I_O_N_A_L _I_N_S_T_R_U_C_T_I_O_N_S

_D_i_v_i_s_i_o_n, _s_q_u_a_r_e _r_o_o_t, _a_n_d _r_e_m_a_i_n_d_e_r: Correct IEEE
division, square root, and remainder, along with Fortran
MOD, can all be implemented with a fairly uniform algorithm
along the lines of work done by George Taylor at Berkeley
and ELXSI. Taylor's work is published in various recent
volumes of the proceedings of the IEEE Computer Arithmetic
Symposia. These instructions would be worth the cost of a
third chip if the payoff were correctly rounded division and
square root in two to four multiplication times. If only
modest performance is provided for correctly rounded divi-
sion, then an additional division or reciprocal instruction
could be offered that does not meet IEEE rounding specifica-
tions, if it can provide an approximate quotient in only a
few mutiplication times.

Just as correctly rounded division is tedious to pro-
vide with only multiplication in hardware, a correctly
rounded square root is tedious with only division in
hardware, yet the inherent complexities of division and
square root are comparable. One such tedious algorithm is
contained in a July 1980 note from W. Kahan to the IEEE P754
committee.

Since a complete IEEE remainder can require a long time
to compute, it must be interruptible and restartable, or
only a partial remainder operation may be provided, as on
the Intel 8087. An additional status register is required
to hold the partial/complete flag and the least 8, 16, or 32
bits of quotient from a remainder operation.

_Q_u_a_d_r_u_p_l_e _p_r_e_c_i_s_i_o_n _s_u_p_p_o_r_t: If extended precision is
not supported or if correctly rounded division or square
root are omitted, then quadruple precision support instruc-
tions can be used to allow efficient assembly-language
software to replace the missing hardware. Quadruple preci-
sion support operations for add, subtract, multiplication,
and division take two double precision operands and return a
double precision result. Thus the quadruple precision sup-
port multiplication instruction returns a double precision
result that, added to the result of a double precision mul-
tiplication, would yield the exact quadruple precision pro-
duct. Correctly rounded double precision division and
square root can be constructed from approximations if an
exact residual can be computed; the quadruple precision sup-
port multiplication facilitates the latter. The quadruple
precision division support instruction returns the double
precision remainder from a correctly rounded double preci-
sion division; that remainder can be used to construct the
IEEE remainder if it is not otherwise provided.

_C_o_n_v_e_r_s_i_o_n _t_o _i_n_t_e_g_e_r: There is considerable confusion
about what functions are required for converting floating

D. Hough 24 November 1985

Extending Floating Point Chip Architecture 5

point arguments to integral values in integer formats (For-
tran INT, NINT) or to integral values in floating point for-
mats (Fortran AINT, ANINT). What is certain is that making
the result of the operation dependent on dynamic modes is a
mistake, except to the extent the operation is intended to
reflect dynamic IEEE rounding modes as discussed later. The
following operations cover all requirements in this area:

A) convert to integral value in integer format:
i) rounding toward zero (INT),
ii) biased rounding to nearest (NINT),
iii) rounding according to current IEEE rounding mode (Intel IRINT);
B) convert to integral value in floating point format:
i) rounding toward zero (AINT),
ii) biased rounding to nearest (ANINT),
iii) rounding according to current IEEE rounding mode (Intel RINT).

A further question is whether to set the inexact excep-
tion in cases like converting 1.4 to an integer value. The
IEEE P854 committee has at various times voted to define
operation B)iii) above with and without setting the inexact
flag. A noncommittal approach would be to provide two op
codes for each of the six operations.

_N_a_N_s: The 1164/1165 do not distinguish quiet from sig-
nalling NaNs. To meet the IEEE standard the distinction has
to be maintained in software by trapping on all occurrences
of NaNs. It would be better to adopt the 68881 and Zilog
8070 convention that signalling NaNs are indicated by the
leading fraction bit zero and quiet by that bit one. Quiet
NaNs propagate without exception in operations that produce
floating point results, while signalling NaNs produce an
invalid exception on every operation.

_O_p_t_i_m_i_z_i_n_g _i_n_s_t_r_u_c_t_i_o_n_s: These instructions can be used
by optimizing compilers to avoid more costly multiplica-
tions:
square(x) computes x*x transmitting x only once
scale(x,i) computes x*(2**i) for 32 bit integer i

Square(x) produces the same numerical results and exceptions
as x*x, and scale(x,i) produces the same numerical results
and exceptions as x*(2**i); in particular, scale(x,1) is the
same as 2*x and x+x.

_I_E_E_E _A_p_p_e_n_d_i_x _F_u_n_c_t_i_o_n_s: Certain computations such as
sqrt(x*x+y*y) can be made more robust in the absence of
extended precision if exponent extraction and scaling
instructions are available. For this purpose exponent
extraction and scaling may implement the IEEE logb and scalb
functions or may simply return the unbiased exponent field
as an integer and provide the x*2**i operation mentioned
above.

D. Hough 24 November 1985

Extending Floating Point Chip Architecture 6

These functions and the IEEE copysign, nextafter, and
classify functions are trivial to implement in hardware com-
pared to addition or multiplication, yet are tedious to
implement in software, particularly from higher level
languages.

_E_l_e_m_e_n_t_a_r_y _t_r_a_n_s_c_e_n_d_e_n_t_a_l _f_u_n_c_t_i_o_n_s: All the elementary
transcendental functions e**x-1, e**x, 2**x, 10**x,
loge(1+x), loge, log2, log10, sinh, cosh, tanh, atanh, sin,
cos, sincos, tan, asin, acos, and atan are provided by the
68881. asinh and acosh could have been provided as well for
completeness. The core approximations for restricted argu-
ments, which are computed by rational functions on chips
like the 8087, are computed by a unified cordic algorithm on
the 68881. Such algorithms are described in papers like J.
S. Walther's, reproduced in Swartzlander's 1980 compendium
_C_o_m_p_u_t_e_r _A_r_i_t_h_m_e_t_i_c. Since cordic algorithms are not unlike
certain division and square root algorithms, they might be
conveniently placed on the same chip with the division
operations. Indeed, division and square root algorithms
have been cast in cordic form. Cordic algorithms typically
require substantial extra intermediate precision and a large
constant rom.

The greatest difficulty with elementary transcendental
functions is argument reduction, which is complicated by
special cases, particularly when high accuracy and appropri-
ate signalling of IEEE exceptions are desired. It may be
better in high performance systems to only implement these
functions for restricted arguments, especially for sin, cos,
and tan, which can require long times for the remainder
operation. Where rational approximations are to be used on
reduced arguments, data paths should facilitate polynomial
evaluation; an on-chip accumulator may be helpful. Argument
reduction for exponential and logarithmic functions is
facilitated by the logb and scalb IEEE appendix functions
mentioned earlier. Simple integer instructions in the ALU
(add/subtract, logical, shift) simplify argument reduction
in systems in which the floating point unit is autonomous
from the CPU.

If complete transcendental functions are provided, they
should be in the same chip with the IEEE remainder function
for argument reduction purposes. Transcendental functions
of two arguments such as X**Y and ATAN2 are best left to
software.

The principal question is where the benefit of
extremely fast elementary transcendental functions accrues.
The 68881's fast elementary transcendental functions cause
amazing performance on the Whetstone marketing benchmark,
relative to the 68881's performance on real floating-point
intensive applications. In the Livermore Loops benchmark
program mentioned earlier, the only elementary

D. Hough 24 November 1985

Extending Floating Point Chip Architecture 7

transcendental function to appear in an inner loop is
exp(x)-1, which appears once in the 24 loops. If the Liver-
more Loops are an accurate reflection of scientific comput-
ing, elementary transcendental function speed is not criti-
cal. Perhaps a more important benefit of incorporating a
68881 into a system is that a high quality, high performance
elementary transcendental function library is available with
no software effort.

_M_O_D_E_S

The use of dynamic modes to define arithmetic opera-
tions leads to inefficiencies and difficulties in implemen-
tation and debugging. Thus it is better to provide separate
instructions for "fast" and "IEEE" multiplication, rather
than have one instruction whose behavior is determined at
run time according to mode bits which are sometimes set by
mistake!

Unfortunately the IEEE rounding modes for direction and
precision themselves are perhaps best left dynamic rather
than incorporated into op codes. The consensus of the IEEE
P754 and P854 committees seems to favor an implementation
via dynamic modes, but a vocal minority favors static mode
declarations at compile time. Dynamic modes preclude noth-
ing and thus maximize flexibility; static modes, by preclud-
ing side effects, simplify compiled code generation and
verification and simplify high performance pipelined
hardware.

_Q_U_A_L_I_T_Y

By negligence or plan, debugging floating point
hardware is a job performed by the buyer as much as the
seller. Hardware manufacturers' EE-oriented management,
that would not expect its engineers to produce competitive
products with 20-year old oscilloscopes, is sometimes blind
to the user-level testing requirements of powerful develop-
ment systems, operating systems, higher level language com-
pilers, and personnel familiar enough with floating point
computation and standards to test and debug new hardware
effectively.

Eventually testing programs and procedures for IEEE
standard floating point will be sufficiently widely known
that suppliers and users will be able to easily verify the
correctness of new floating point implementations. By then
there will be very little need or demand for gratuitously
incorrect or deviant implementations.

D. Hough 24 November 1985

ken@turtlevax.UUCP (Ken Turkowski) (11/25/85)

In article <3026@sun.uucp> dgh@sun.uucp (David Hough) writes:
>  _T_h_i_s _p_a_p_e_r _c_o_n_t_a_i_n_s _p_e_r_s_o_n_a_l  _o_p_i_n_i_o_n_s
>     _o_f  _t_h_e  _a_u_t_h_o_r _a_n_d _s_h_o_u_l_d _n_o_t _b_e _c_o_n_s_t_r_u_e_d _a_s _t_h_e
>     _p_o_s_i_t_i_o_n _o_f _a_n_y _o_t_h_e_r _p_e_r_s_o_n _o_r _o_r_g_a_n_i_z_a_t_i_o_n.

I was unable to read the above and other similar sections on my
terminal, as the words were overlapped.  Please don't try to include
terminal-dependent escape codes in news postings.

Regarding computations of functions of two variables (x**y or
atan2(x,y) ala Fortran), I would suggest that atan2() be implemented
with CORDIC iterations rather than being programmed in software,
especially if sin() and the other circular and hyperbolic trigonometric
functions are implemented with CORDIC.

Probably the most common transendental operation is conversion between
polar and rectangular coordinates (or some variant thereof), so these
should be optimized.  Similarly, when sin() is needed, usually cos() is
needed as well.  Provision should be made to allow return of two values
(i.e. an array of length 2 or a two-dimensional function), especially
since with CORDIC computations, the second one comes for free.
-- 
Ken Turkowski @ CIMLINC (formerly CADLINC), Menlo Park, CA
UUCP: {amd,decwrl,hplabs,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.DEC.COM