[comp.arch] rounded vs. chopped floating-point arithmetic

dgh@validgh.com (David G. Hough on validgh) (06/20/91)

Nelson Beebe (beebe@math.utah.edu) recollected the following message
to a colleague:

 -------------------------------------------

  The following little program can be used to illustrate the effect
  that truncating arithmetic has on your larger program:
  
        real dt,t0,t1,t2,tend
        integer n
  
        n = 0
        dt = 0.018
        t0 = 4000.0
        tend = 5000.0
        t1 = t0
        t2 = t0
  
   10   n = n + 1
        t1 = t1 + dt
        t2 = t0 + float(n)*dt
        if (t2 .lt. tend) go to 10
        write (6,*) t1,t2,(t1 - t2)/t2
        end
  
  On the IBM 3090, this single precision version prints:
  
     4879.89844       5000.00781     -0.240218341E-01
  
  That is, the relative error is 2.4%.  On the Sun 4, it produces
  
      5003.70    5000.01    7.37889E-04
  
  The effect of truncating arithmetic on the running sum is large:
  t1 picks up one arithmetic error on each of its roughly 56,000
  additions, while t2 is recomputed afresh from n each trip, with
  only one multiplication and one addition.
  
  The double precision version is:
  
        double precision dt,t0,t1,t2,tend
        integer n
  
        n = 0
        dt = 0.018D+00
        t0 = 4000.0D+00
        tend = 5000.0D+00
        t1 = t0
        t2 = t0
  
   10   n = n + 1
        t1 = t1 + dt
        t2 = t0 + dfloat(n)*dt
        if (t2 .lt. tend) go to 10
        write (6,*) t1,t2,(t1 - t2)/t2
        end
  
  The IBM 3090 result is
  
  5000.00799995563648       5000.00799999999981     -0.887265231637227285E-11
  
  The Sun 4 result is
  
  5000.0080000016    5000.0080000000    3.2341579848518D-13

 -------------------------------------------

Note that satisfactory results are obtained if you use enough precision
or if you round rather than chop.  Also note that this is not the program
that failed, but rather a drastic simplification of the user's actual
application to reveal the essential problem.  It's a simple example where
the superior statistics of rounding rather than chopping imply a broader
domain of applicability for a particular program.

Correct rounding and chopping, and several other good paradigms, can be
characterized by the property

	The rounded computed result is chosen from the two machine-representable
	numbers nearest the unrounded infinitely-precise exact result,
	according to a rule that depends only on the infinitely-precise
	exact result, and not on the operands or operation (or phase of moon). 

Most "fast" sort-of-rounding or sort-of-chopping schemes invented by 
hardware designers eventually frustrate error analysts because they 
can't be so characterized.
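
To make the property concrete, here is a little program in the style
of the ones above (a sketch, not from the original postings: it
treats a double precision value as the "infinitely-precise exact
result" and assumes it lies in [1,2), where one single precision ulp
is 2**-23):

      double precision x, ulp, xlo, xhi, xchop, xrnd
      x   = 1.0d0 + 1.0d0/3.0d0
      ulp = 2.0d0**(-23)
C     the two single precision neighbors of x
      xlo = dint(x/ulp)*ulp
      xhi = xlo + ulp
C     chopping always takes the neighbor toward zero
      xchop = xlo
C     rounding takes whichever neighbor is closer to x
      if (x - xlo .lt. xhi - x) then
         xrnd = xlo
      else
         xrnd = xhi
      end if
      write (6,*) 'neighbors:', xlo, xhi
      write (6,*) 'chopped:  ', xchop
      write (6,*) 'rounded:  ', xrnd
      end

Both results are drawn from the same two neighbors, and the choice
between them depends only on x; that is exactly the property above.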

As for the first IBM RS/6000 implementation, I have heard that the original
floating-point unit was designed to implement IBM 370 arithmetic, and was
changed to IEEE 754 format relatively late in the game.  If true then it
would not be surprising that some aspects of 754 were problematic to add in.
The interesting question would then be which aspects of IEEE
arithmetic will be really problematic for a high-performance RS/6000
implementation designed from scratch to support 754.
-- 

David Hough

dgh@validgh.com		uunet!validgh!dgh	na.hough@na-net.ornl.gov

jbs@watson.ibm.com (06/22/91)

         David Hough posted the following:
>Nelson Beebe (beebe@math.utah.edu) recollected the following message
>to a colleague:
>-------------------------------------------
> [test programs and IBM 3090 / Sun 4 results omitted; they appear in
> full in David Hough's posting above]
>-------------------------------------------
>Note that satisfactory results are obtained if you use enough precision
>or if you round rather than chop.  Also note that this is not the program
>that failed, but rather a drastic simplification of the user's actual
>application to reveal the essential problem.  It's a simple example where
>the superior statistics of rounding rather than chopping imply a broader
>domain of applicability for a particular program.
         I believe this example is misleading for two reasons.
    1.)  The main source of error in the hex results is the following.
Since 16**3 = 4096 and t0 and tend have been chosen so that the
running sum t1 goes from 4000 to 5000, most of the time t1 will have
leading hex digit 1, i.e. three leading zero bits in its fraction and
hence three fewer significant bits.  As is well known this causes a
loss of accuracy in the hex computation.  Running the sum from 3000 to
4000 improves the relative error in the hex results by factors of 10
and 23.  While the possibility of this sort of accuracy loss is
undeniably a defect in the hex representation, it says nothing about
the relative merits of BINARY chopped vs. binary rounded.
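         To put a number on the wobble: IBM 370 single precision
carries a 6-hex-digit (24-bit) fraction scaled by a power of 16, so
one ulp jumps by a factor of 16 as the sum crosses 4096 = 16**3.  A
sketch (not part of the original post) that evaluates the two
spacings:

      double precision ulplo, ulphi
C     one ulp just below 4096 (fraction nearly 1, exponent 16**3)
C     and just above (leading hex digit 1, exponent 16**4)
      ulplo = 16.0d0**3 * 16.0d0**(-6)
      ulphi = 16.0d0**4 * 16.0d0**(-6)
      write (6,*) 'ulp just below 4096:', ulplo
      write (6,*) 'ulp just above 4096:', ulphi
      end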
    2.)  As pointed out above, we should run the program using IEEE
chopped for a fair comparison.  Now the relative error is -.44*10**-2
(vs. .73*10**-3 for rounded) for single precision and -.92*10**-11
(vs. .32*10**-12 for rounded) for double precision.  Hence the
magnitude of the chopped error is about 6 times the magnitude of the
rounded error in the single precision case and about 29 times in the
double precision case.  These ratios seem substantial but they are not
typical, for the following reason.  As soon as the running sum
surpasses 2**12=4096, the binary representation of .018 is such that
chopping makes an error of .86 ulps per addition in the single
precision case and .968 ulps per addition in the double precision
case.  Rounding reduces this error to .14 ulps in the single precision
case and .032 ulps in the double precision case.  The ratios .86/.14
and .968/.032 are consistent with the observed difference in accuracy.
         However it is easy to see that the average error made by
chopping is .5 ulps, just twice the average error made by rounding
(.25 ulps).  Similarly the worst case error made by chopping is 1 ulp,
just twice the worst case error (.5 ulps) made by rounding.  Hence on
the average rounding gains one bit of accuracy.  This is because the
uncertainty in the rounded answer is two-sided while the uncertainty
in the chopped answer is one-sided; the width of the interval of
uncertainty is no smaller for rounding.  If the rounding errors were
independent (which is not the case here) then the answer would be more
likely to lie near the center of the uncertainty interval than near
the ends.  This would decrease the expected error in the rounded
answer vs. the chopped answer by a factor of about sqrt(n), where n is
the number of roundings.  The worst case error ratio, however, would
remain 2.
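         The .86 and .968 figures are easy to check.  Once the running
sum lies between 4096 = 2**12 and 8192, one ulp is 2**12 * 2**-23 =
2**-11 in single precision and 2**12 * 2**-52 = 2**-40 in double
precision, and chopping discards the fractional part of dt/ulp on
every addition.  A small verification sketch (not part of the original
post; it uses the double precision value of .018 for both cases, which
changes the single precision fraction only negligibly):

      double precision dt, f32, f64
      dt  = 0.018d+00
C     dt measured in single and double precision ulps of the sum
      f32 = dt * 2.0d0**11
      f64 = dt * 2.0d0**40
C     the fractional parts are the ulps lost to chopping per addition
      write (6,*) 'single: chop loses', f32 - dint(f32), ' ulps'
      write (6,*) 'double: chop loses', f64 - dint(f64), ' ulps'
      end

This should print roughly .864 and .968, matching the figures above.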
         David Hough also said
>Correct rounding and chopping, and several other good paradigms, can be
>characterized by the property
>       The rounded computed result is chosen from the two machine-representable
>       numbers nearest the unrounded infinitely-precise exact result,
>       according to a rule that depends only on the infinitely-precise
>       exact result, and not on the operands or operation (or phase of moon).
>Most "fast" sort-of-rounding or sort-of-chopping schemes invented by
>hardware designers eventually frustrate error analysts because they
>can't be so characterized.
        The chopping scheme I mentioned, if properly interpreted, has
this property.  The machine numbers for this scheme lie exactly half
way between the IEEE machine numbers (i.e. they have an implicit 1 bit
at the end instead of an implicit 0 bit).  Chopping then corresponds
to rounding to nearest (with half way cases always going up).  As far
as I can see the main problem with this scheme is that small integers
are no longer machine numbers, which some would no doubt find
unsettling.
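         A sketch of the reinterpretation (not from the original post;
the value of x and the [1,2) scaling, where one ulp is u = 2**-23, are
chosen only for illustration):

      double precision x, u, xchop
      u = 2.0d0**(-23)
      x = 1.0d0 + 1.0d0/3.0d0
C     keep the leading bits of x (chop), then append the implicit
C     1 bit: the result (k + 1/2)*u is the scheme's machine number
C     nearest to x, with an x exactly half way between two of them
C     (an IEEE machine number) always going up
      xchop = (dint(x/u) + 0.5d0)*u
      write (6,*) 'x            =', x
      write (6,*) 'chopped x    =', xchop
      write (6,*) 'error in ulps=', (xchop - x)/u
      end

The error never exceeds half an ulp in magnitude, which is why
chopping in this scheme behaves like rounding to nearest.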
         David Hough also said:
>As for the first IBM RS/6000 implementation, I have heard that the original
>floating-point unit was designed to implement IBM 370 arithmetic, and was
>changed to IEEE 754 format relatively late in the game.  If true then it
>would not be surprising that some aspects of 754 were problematic to add in.
>The interesting question would then be
>which aspects of IEEE arithmetic will be really
>problematic for a high-performance RS/6000 implementation designed from
>scratch to support 754.
         There may be some truth to this.  On the other hand the
design may have benefited from the designers' efforts to lose as
little performance as possible relative to hex.  Certainly the
performance of this chip seems to compare favorably to competing
designs that were IEEE from the start.  IEEE designers may take the
standard for granted and not spend a lot of time thinking about how
fast the alternatives would be, and thus may not fully understand how
much performance the IEEE standard is giving up.
                          James B. Shearer