[net.arch] Floating Point Chip Architecture

dgh@sun.uucp (David Hough) (11/25/85)

nroff source for this posting is available from the author.


         Extending Floating Point Chip Architecture


                        David Hough



                           ABSTRACT

          The Motorola 68881 and Weitek 1164/1165 chips
     have become popular floating point implementations
     and represent different  decisions  about  perfor-
     mance,  complexity,  and  conformance  to the IEEE
     standard. Future chip sets in the 1164/1165  style
     should  incorporate  some of the ideas that shaped
     the 68881.  This paper contains personal opinions
     of the author and should not be construed as the
     position of any other person or organization.



     The Motorola 68881 establishes a new standard  in  per-
formance  for  single-chip floating point coprocessors.  But
even more performance can be realized from multichip  imple-
mentations  of  floating  point units.  The Weitek 1164/1165
chip set has been very popular in  recent  designs  of  this
type.    Other   semiconductor  manufacturers  have  noticed
Weitek's continuing profitability  and  have  undertaken  or
contemplated  competitive products.  Certain features of the
1164/1165 architecture have meant that compromises  must  be
made between performance, complexity, and conformance to the
IEEE standard for binary floating point.  Eliminating  these
compromises  is an opportunity for Weitek to expand its lead
or its competitors to narrow it.  What follows is a list of
possibilities for Weitek and its competitors in developing
the successors to the 1164/1165 chip set.

     Market requirements: It may help to understand the
market  needs  which  motivate my suggestions.  The intended
market is general-purpose double  precision  floating  point
computations  expressed  in higher level languages, particu-
larly Fortran.  The calculations to  be  performed  are  not
known  in advance, so the floating point hardware to be sup-
plied must be of the highest possible quality and completely
conforming  to  the  IEEE  standard, possibly in conjunction
with appropriate software.  This  differs  from  some  other
applications   such   as   signal  processing  and  graphics
transformations in which the single  precision  calculations
are  well  understood and for which deviations from the IEEE
standard may be acceptable.  General-purpose computation  is
often  done  in  double  precision in the hope that roundoff
error analysis may thereby be avoided.  This hope is not
totally  unfounded but roundoff analysis may yet be required
if the "atomic" double precision  computations  are  not  as
accurate  as  possible.  What is thought of as atomic varies
among languages and programmers, but most justifiably expect
that rational operations + - * / and format conversions be
as accurate as possible, with no more error than half a unit
in the last place of double precision; many expect, but with
less justification, the same accuracy in all  the  intrinsic
Fortran  functions  including  exponential, logarithmic, and
trigonometric elementary functions and complex arithmetic.

EXTENDED PRECISION

     The greatest advantage the 68881 has over the 1164/1165
is  that  extended  precision  is  used for all intermediate
results.  This means that computations like double precision
complex magnitude

        sqrt(x*x + y*y)

can be computed in a straightforward way with  no  difficul-
ties due to intermediate overflow or underflow.  No
1164/1165 program is  equally  robust  and  efficient,  even
given  a  fast square root.  Consequently the most important
extension to the 1164/1165 architecture would be to  include
extended precision operands.
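
     As a sketch of the straightforward approach, assume a C
compiler whose long double maps to the 80-bit extended format,
as 68881-hosted compilers do; the function name and the C99
sqrtl are this note's assumptions, not part of any chip:

        #include <math.h>

        /* Complex magnitude via extended intermediates.  The
           extended format's 15-bit exponent means xe*xe and
           ye*ye cannot overflow or underflow for any double
           arguments, so no scaling or status checks are needed. */
        double cabs_ext(double x, double y)
        {
                long double xe = x, ye = y;
                return (double) sqrtl(xe*xe + ye*ye);
        }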

     It is true that extended precision increases  the  com-
plexity  and reduces the performance of the multiplier by an
amount which is some function of (64/53).  But  when  coding
algorithms intended to be robust, that loss of efficiency on
a single operation is often more than  regained  by  simpler
algorithms overall.  Here's one robust double precision com-
plex magnitude algorithm designed for the 1164/1165,  assum-
ing an efficient square root is already programmed:
        if (abs(x) < abs(y)) then {                     /* swap */
                t = x
                x = y
                y = t
        }
        q = y/x                                         /* q <= 1 or is NaN */
        if (6 <= result status <= 10) then ...          /* compute correct q
                                                           by wrapping x or y,
                                                           repeating q = y/x,
                                                           and unwrapping q */
        if (result status = 15) then
                return abs(x)                           /* q = 0/0 or inf/inf */
        save weitek mode
        set weitek mode to "fast"                       /* so q*q is safe */
        m = abs(x) * sqrt( 1 + q*q )
        restore weitek mode                             /* before returning */
        return m
The division could be eliminated  by  constructing  an  even
more complicated algorithm that involves exponent extraction
and scaling.  The point is that  no  elaborate  planning  is
required to obtain satisfactory results with extended preci-
sion.

     In the 68881 architecture, all  numerical  results  are
computed to extended precision, then rounded if desired to a
shorter precision in a  subsequent  operation.   While  this
seems  less efficient than computing to the precision of the
argument, it may actually be more efficient; for instance, a
product  of  single  precision  numbers  fits  in  double or
extended precision exactly  without  need  for  rounding  or
overflow/underflow  checking.   If the product is to be used
immediately as the operand to an addition then it  might  as
well  be left in higher precision and the addition performed
in the higher precision.  The most general  approach  is  to
provide  instructions  which  produce results in the highest
precision available, as well as instructions  which  produce
results in the precision of the operands.  The first is usu-
ally faster and more accurate for  iterative  accumulations;
the latter is better for isolated calculations A = B op C.
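
     The exact-product observation is easy to state in C terms;
a sketch, with float and double standing in for single and
double precision:

        /* Two 24-bit significands need at most 48 product bits,
           which fit exactly in double's 53-bit significand, so
           this multiplication never rounds, overflows, or
           underflows. */
        double exact_product(float a, float b)
        {
                return (double)a * (double)b;
        }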

     Accumulation: The most common floating point
operation pattern in scientific computation is a multiplica-
tion followed by  an  add/subtract  involving  the  product.
Floating  point  architectures should make this operation as
efficient as possible.  Representative  simple  inner  loops
include:

        do 1 i = 1, n
 1      s = s + x(i) * y(i)

        do 1 i = 1, n
 1      x(i) = x(i) + k * y(i)

        do 1 i = 1, n
 1      p = c(i) + x * p

        do 1 i = 1, n                   /* subtract then multiply */
 1      p = p * (x - r(i))

Some representative complicated inner loops are contained in
the Livermore Loops benchmark program of McMahon.
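
     For illustration, here is the first loop above in C,
accumulated in the widest format and rounded once at the end,
in the "highest precision available" style; a sketch assuming
long double is wider than double:

        /* Dot product with an extended accumulator. */
        double dot(const double *x, const double *y, int n)
        {
                long double s = 0.0L;
                for (int i = 0; i < n; i++)
                        s += (long double)x[i] * (long double)y[i];
                return (double)s;       /* one final rounding */
        }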

     Bus size: Although 32 bit buses are most common at the
system  level,  many  implementations could support local 64
bit buses between, for instance, floating  point  chips  and
registers.  It would be helpful to be able to transmit one
64 bit operand or two 32 bit operands in a single cycle.

     Extended formats might as well match the  68881,  using
only  80 bits.  Unused bits make the extended format 96 bits
wide in 32 bit systems and 128 bits wide in 64 bit systems.

ADDITIONAL INSTRUCTIONS

     Division, square root, and remainder: Correct IEEE
division,  square  root,  and  remainder, along with Fortran
MOD, can all be implemented with a fairly uniform  algorithm
along  the  lines  of work done by George Taylor at Berkeley
and ELXSI.  Taylor's work is  published  in  various  recent
volumes  of  the proceedings of the IEEE Computer Arithmetic
Symposia. These instructions would be worth the  cost  of  a
third chip if the payoff were correctly rounded division and
square root in two to four multiplication  times.   If  only
modest  performance  is provided for correctly rounded divi-
sion, then an additional division or reciprocal  instruction
could be offered that does not meet IEEE rounding specifica-
tions, if it can provide an approximate quotient in  only  a
few multiplication times.

     Just as correctly rounded division is tedious  to  pro-
vide  with  only  multiplication  in  hardware,  a correctly
rounded  square  root  is  tedious  with  only  division  in
hardware,  yet  the  inherent  complexities  of division and
square root are comparable.  One such tedious  algorithm  is
contained in a July 1980 note from W. Kahan to the IEEE P754
committee.

     Since a complete IEEE remainder can require a long time
to  compute,  it  must  be interruptible and restartable, or
only a partial remainder operation may be  provided,  as  on
the  Intel  8087.  An additional status register is required
to hold the partial/complete flag and the least 8, 16, or 32
bits of quotient from a remainder operation.
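
     The two remainder flavors differ as follows; a sketch using
the C99 library's fmod() and remainder(), which correspond to
Fortran MOD and the IEEE remainder respectively:

        #include <math.h>
        #include <stdio.h>

        int main(void)
        {
                /* MOD style: sign of the dividend, and
                   magnitude smaller than the divisor's. */
                printf("%g\n", fmod(5.0, 3.0));         /* prints 2 */
                /* IEEE style: quotient rounded to nearest even,
                   so the remainder may be negative. */
                printf("%g\n", remainder(5.0, 3.0));    /* prints -1 */
                return 0;
        }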

     Quadruple precision support: If extended precision is
not  supported  or  if  correctly rounded division or square
root are omitted, then quadruple precision support  instruc-
tions  can  be  used  to  allow  efficient assembly-language
software to replace the missing hardware.  Quadruple  preci-
sion  support  operations for add, subtract, multiplication,
and division take two double precision operands and return a
double  precision result.  Thus the quadruple precision sup-
port multiplication instruction returns a  double  precision
result  that, added to the result of a double precision mul-
tiplication, would yield the exact quadruple precision  pro-
duct.   Correctly  rounded  double  precision  division  and
square root can be constructed  from  approximations  if  an
exact residual can be computed; the quadruple precision sup-
port multiplication facilitates the latter.   The  quadruple
precision  division  support  instruction returns the double
precision remainder from a correctly rounded  double  preci-
sion  division;  that remainder can be used to construct the
IEEE remainder if it is not otherwise provided.
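
     The residual technique can be sketched in C.  Here C99's
fma(), which rounds an exact product-plus-addend only once,
stands in for the proposed quadruple precision support multiply;
both make the exact residual a - q*b computable:

        #include <math.h>

        /* Refine an approximate quotient by one residual step.
           If q were already correctly rounded, the correction
           leaves it unchanged. */
        double refine_quotient(double a, double b, double q)
        {
                double r = fma(-q, b, a);       /* exact a - q*b */
                return q + r/b;                 /* fix the last bits */
        }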

     Conversion to integer: There is considerable confusion
about  what  functions  are required for converting floating
point arguments to integral values in integer formats  (For-
tran INT, NINT) or to integral values in floating point for-
mats (Fortran AINT, ANINT). What is certain is  that  making
the  result of the operation dependent on dynamic modes is a
mistake, except to the extent the operation is  intended  to
reflect  dynamic IEEE rounding modes as discussed later. The
following operations cover all requirements in this area:

A) convert to integral value in integer format:
        i) rounding toward zero (INT),
        ii) biased rounding to nearest (NINT),
        iii) rounding according to current IEEE rounding mode (Intel IRINT);
B) convert to integral value in floating point format:
        i) rounding toward zero (AINT),
        ii) biased rounding to nearest (ANINT),
        iii) rounding according to current IEEE rounding mode (Intel RINT).

     A further question is whether to set the inexact excep-
tion  in cases like converting 1.4 to an integer value.  The
IEEE P854 committee has at various  times  voted  to  define
operation  B)iii) above with and without setting the inexact
flag.  A noncommittal approach would be to  provide  two  op
codes for each of the six operations.
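
     For concreteness, the six operations, plus the two variants
of B)iii, map onto C99 math library names as follows; a sketch,
not part of the proposal itself:

        #include <math.h>

        void conversions(double x)
        {
                long   ai   = (long) x;     /* A)i:   toward zero (INT) */
                long   aii  = lround(x);    /* A)ii:  biased nearest (NINT) */
                long   aiii = lrint(x);     /* A)iii: current IEEE mode */
                double bi   = trunc(x);     /* B)i:   toward zero (AINT) */
                double bii  = round(x);     /* B)ii:  biased nearest (ANINT) */
                double biii = rint(x);      /* B)iii: raises inexact */
                double biv  = nearbyint(x); /* B)iii: no inexact flag */
                (void)ai; (void)aii; (void)aiii;
                (void)bi; (void)bii; (void)biii; (void)biv;
        }

C took just such a noncommittal approach: rint() raises the
inexact flag and nearbyint() does not.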

     NaNs: The 1164/1165 do not distinguish quiet from sig-
nalling NaNs.  To meet the IEEE standard the distinction has
to be maintained in software by trapping on all  occurrences
of  NaNs.   It  would be better to adopt the 68881 and Zilog
8070 convention that signalling NaNs are  indicated  by  the
leading  fraction bit zero and quiet by that bit one.  Quiet
NaNs propagate without exception in operations that  produce
floating  point  results,  while  signalling NaNs produce an
invalid exception on every operation.
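
     Under that convention the test is a single bit; a C sketch
for the IEEE double format, whose leading fraction bit is bit
51:

        #include <stdint.h>
        #include <string.h>

        /* Returns 1 for a quiet NaN under the 68881/Z8070
           convention (leading fraction bit set), 0 otherwise. */
        int is_quiet_nan(double x)
        {
                uint64_t b;
                memcpy(&b, &x, sizeof b);
                int nan = ((b >> 52) & 0x7FF) == 0x7FF &&
                          (b & 0xFFFFFFFFFFFFFULL) != 0;
                return nan && (b & (1ULL << 51)) != 0;
        }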

     Optimizing instructions: These instructions can be used
by  optimizing  compilers  to  avoid more costly multiplica-
tions:
        square(x)       computes x*x            transmitting x only once
        scale(x,i)      computes x*(2**i)       for 32 bit integer i

Square(x) produces the same numerical results and exceptions
as  x*x,  and scale(x,i) produces the same numerical results
and exceptions as x*(2**i); in particular, scale(x,1) is the
same as 2*x and x+x.

     IEEE Appendix Functions: Certain computations such as
sqrt(x*x+y*y)  can  be  made  more  robust in the absence of
extended  precision  if  exponent  extraction  and   scaling
instructions  are  available.   For  this  purpose  exponent
extraction and scaling may implement the IEEE logb and scalb
functions  or  may simply return the unbiased exponent field
as an integer and provide  the  x*2**i  operation  mentioned
above.
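
     A C sketch of the scaling approach, with the library's
logb() and scalbn() playing the roles of the exponent extraction
and scale(x,i) operations above; finite nonzero arguments are
assumed, and a complete version must handle 0, infinities, and
NaNs:

        #include <math.h>

        /* Robust sqrt(x*x + y*y) without extended precision:
           divide out the larger exponent so both squares stay
           near 1, then restore it afterward. */
        double hypot_scaled(double x, double y)
        {
                double big = fabs(x) > fabs(y) ? x : y;
                int k = (int) logb(big);
                double xs = scalbn(x, -k);
                double ys = scalbn(y, -k);
                return scalbn(sqrt(xs*xs + ys*ys), k);
        }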

     These functions and the IEEE copysign,  nextafter,  and
classify functions are trivial to implement in hardware com-
pared to addition or  multiplication,  yet  are  tedious  to
implement   in  software,  particularly  from  higher  level
languages.
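
     As an indication of the tedium, even a drastically
simplified nextafter, handling only finite positive x in the
increasing direction, needs bit-level access to the format; a C
sketch:

        #include <stdint.h>
        #include <string.h>

        /* Next representable double above finite x > 0.
           Incrementing the bit pattern works because positive
           IEEE doubles order the same as their encodings. */
        double next_up(double x)
        {
                uint64_t b;
                memcpy(&b, &x, sizeof b);
                b += 1;
                memcpy(&x, &b, sizeof b);
                return x;
        }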

     Elementary transcendental functions: All the elementary
transcendental   functions   e**x-1,   e**x,   2**x,  10**x,
loge(1+x), loge, log2, log10, sinh, cosh, tanh, atanh,  sin,
cos,  sincos,  tan, asin, acos, and atan are provided by the
68881.  asinh and acosh could have been provided as well for
completeness.   The core approximations for restricted argu-
ments, which are computed by  rational  functions  on  chips
like the 8087, are computed by a unified cordic algorithm on
the 68881.  Such algorithms are described in papers like  J.
S.  Walther's,  reproduced in Swartzlander's 1980 compendium
Computer Arithmetic.  Since cordic algorithms are not unlike
certain  division  and square root algorithms, they might be
conveniently placed on  the  same  chip  with  the  division
operations.   Indeed,  division  and  square root algorithms
have been cast in cordic form.  Cordic algorithms  typically
require substantial extra intermediate precision and a large
constant ROM.
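
     A minimal rotation-mode cordic sketch in C, illustrative
only: hardware would take atan(p) from the ROM just mentioned
and do the multiplications by powers of two as shifts:

        #include <math.h>

        /* Compute sin(t) and cos(t) for |t| <= pi/2.  Starting
           x at the cordic gain constant K makes the final (x, y)
           converge to (cos t, sin t). */
        void cordic_sincos(double t, double *s, double *c)
        {
                double x = 0.60725293500888125617;      /* K */
                double y = 0.0, p = 1.0;                /* p = 2**-i */
                for (int i = 0; i < 48; i++) {
                        double d = (t >= 0.0) ? 1.0 : -1.0;
                        double nx = x - d*y*p;
                        double ny = y + d*x*p;
                        t -= d * atan(p);       /* table lookup in hardware */
                        x = nx; y = ny;
                        p *= 0.5;
                }
                *c = x; *s = y;
        }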

     The greatest difficulty with elementary  transcendental
functions  is  argument  reduction,  which is complicated by
special cases, particularly when high accuracy and appropri-
ate  signalling  of  IEEE  exceptions are desired. It may be
better in high performance systems to only  implement  these
functions for restricted arguments, especially for sin, cos,
and tan, which can require  long  times  for  the  remainder
operation.   Where rational approximations are to be used on
reduced arguments, data paths should  facilitate  polynomial
evaluation; an on-chip accumulator may be helpful.  Argument
reduction  for  exponential  and  logarithmic  functions  is
facilitated  by  the  logb and scalb IEEE appendix functions
mentioned earlier. Simple integer instructions  in  the  ALU
(add/subtract,  logical,  shift) simplify argument reduction
in systems in which the floating point  unit  is  autonomous
from the CPU.
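
     For the exponential, the standard reduction looks like this
in C; the library exp(r) stands in for the restricted-argument
core, and ln 2 is split into a high part and a low correction so
the reduction itself loses no accuracy:

        #include <math.h>

        /* exp(x) = 2**k * exp(r) with |r| <= (ln 2)/2; the
           final scaling is the scalb operation noted above. */
        double exp_sketch(double x)
        {
                static const double ln2hi  = 6.93147180369123816490e-01;
                static const double ln2lo  = 1.90821492927058770002e-10;
                static const double invln2 = 1.44269504088896338700e+00;
                double k = rint(x * invln2);
                double r = (x - k*ln2hi) - k*ln2lo;     /* reduced argument */
                return scalbn(exp(r), (int) k);
        }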

     If complete transcendental functions are provided, they
should  be in the same chip with the IEEE remainder function
for argument reduction purposes.   Transcendental  functions
of  two  arguments  such  as X**Y and ATAN2 are best left to
software.

     The  principal  question  is  where  the   benefit   of
extremely  fast elementary transcendental functions accrues.
The 68881's fast elementary transcendental  functions  cause
amazing  performance  on  the Whetstone marketing benchmark,
relative to the 68881's performance on  real  floating-point
intensive  applications.   In  the Livermore Loops benchmark
program   mentioned    earlier,    the    only    elementary
transcendental  function  to  appear  in  an  inner  loop is
exp(x)-1, which appears once in the 24 loops.  If the Liver-
more  Loops are an accurate reflection of scientific comput-
ing, elementary transcendental function speed is not  criti-
cal.   Perhaps  a  more important benefit of incorporating a
68881 into a system is that a high quality, high performance
elementary transcendental function library is available with
no software effort.

MODES

     The use of dynamic modes to  define  arithmetic  opera-
tions  leads to inefficiencies and difficulties in implemen-
tation and debugging.  Thus it is better to provide separate
instructions  for  "fast"  and "IEEE" multiplication, rather
than have one instruction whose behavior  is  determined  at
run  time  according to mode bits which are sometimes set by
mistake!

     Unfortunately the IEEE rounding modes for direction and
precision  themselves  are  perhaps best left dynamic rather
than incorporated into op codes. The consensus of  the  IEEE
P754  and  P854  committees seems to favor an implementation
via dynamic modes, but a vocal minority favors  static  mode
declarations  at compile time.  Dynamic modes preclude noth-
ing and thus maximize flexibility; static modes, by preclud-
ing  side  effects,  simplify  compiled  code generation and
verification  and  simplify   high   performance   pipelined
hardware.
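
     Dynamic directed rounding permits, for example, bracketing
a result at run time; a C sketch using the later standardized
<fenv.h> interface, where volatile discourages the compiler from
reusing one sum for both directions:

        #include <fenv.h>

        /* Lower and upper bounds on a + b via dynamic rounding
           modes: the flexibility argument for dynamic modes. */
        void bracket_sum(double a, double b, double *lo, double *hi)
        {
                volatile double t;
                int old = fegetround();
                fesetround(FE_DOWNWARD);  t = a + b;  *lo = t;
                fesetround(FE_UPWARD);    t = a + b;  *hi = t;
                fesetround(old);          /* restore caller's mode */
        }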

QUALITY

     By  negligence  or  plan,  debugging   floating   point
hardware  is  a  job  performed  by the buyer as much as the
seller.  Hardware manufacturers' EE-oriented management,
which would not expect its engineers to produce competitive
products with 20-year-old oscilloscopes, is sometimes blind
to the user-level testing requirements for powerful develop-
ment systems, operating systems, higher level language com-
pilers, and personnel familiar enough with floating point
computation and standards to test  and  debug  new  hardware
effectively.

     Eventually testing programs  and  procedures  for  IEEE
standard  floating  point  will be sufficiently widely known
that suppliers and users will be able to easily  verify  the
correctness  of new floating point implementations.  By then
there will be very little need or  demand  for  gratuitously
incorrect or deviant implementations.

ken@turtlevax.UUCP (Ken Turkowski) (11/25/85)

In article <3026@sun.uucp> dgh@sun.uucp (David Hough) writes:
>  This paper contains personal opinions
>     of the author and should not be construed as the
>     position of any other person or organization.

I was unable to read the above and other similar sections on my
terminal, as the words were overlapped.  Please don't try to include
terminal-dependent escape codes in news postings.

Regarding computations of functions of two variables (x**y or
atan2(x,y) a la Fortran), I would suggest that atan2() be implemented
with CORDIC iterations rather than being programmed in software,
especially if sin() and the other circular and hyperbolic trigonometric
functions are implemented with CORDIC.

Probably the most common transcendental operation is conversion between
polar and rectangular coordinates (or some variant thereof), so these
should be optimized.  Similarly, when sin() is needed, usually cos() is
needed as well.  Provision should be made to allow return of two values
(i.e. an array of length 2 or a two-dimensional function), especially
since with CORDIC computations, the second one comes for free.
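
A vectoring-mode CORDIC sketch in C shows the pairing: driving y
to zero makes the magnitude and the angle fall out of the same
iteration (x > 0 assumed; other quadrants need folding first):

        #include <math.h>

        /* Rectangular to polar: x converges to the magnitude
           times the cordic gain 1/K, and z to atan2(y, x). */
        void rect_to_polar(double x, double y, double *r, double *theta)
        {
                double z = 0.0, p = 1.0;
                for (int i = 0; i < 48; i++) {
                        double d = (y >= 0.0) ? 1.0 : -1.0;
                        double nx = x + d*y*p;
                        double ny = y - d*x*p;
                        z += d * atan(p);
                        x = nx; y = ny;
                        p *= 0.5;
                }
                *r = x * 0.60725293500888125617;        /* remove gain 1/K */
                *theta = z;
        }
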
-- 
Ken Turkowski @ CIMLINC (formerly CADLINC), Menlo Park, CA
UUCP: {amd,decwrl,hplabs,seismo,spar}!turtlevax!ken
ARPA: turtlevax!ken@DECWRL.DEC.COM