[comp.arch] IEEE FP denorms and Deming's Arithmetics With Variable Precision

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (09/30/89)

(Is this a good place for computer arithmetic questions?)

I just read a paper by Deming in the 8th Computer Arithmetic
Symposium, which basically deflates some of the proposed alternate
arithmetics, such as level index (where a floating point number
consists of three fields: pointer, exponent, and mantissa, and the
pointer indicates how many bits are "borrowed" from the mantissa to
expand the exponent and prevent overflow or underflow) or what I call
the meta-exponential form, where a number is expressed as
exp(exp(...(exp(m)))), and you store the number of exponentiations
necessary to bring the value m within the desired range.
    Level index basically has a much larger range than normal FP,
so it overflows more rarely, while the meta-exponential form can be
shown to be closed under +-*/ for a given number of bits in the
representation.
    I.e. both of these representations trade range for reduced
relative precision at the extrema of the range.
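
For concreteness, here is a toy C sketch of the "borrowed bits" idea (the
3-bit pointer, the field widths, and the bias are all invented for
illustration; this is not the actual level-index encoding):

#include <stdio.h>
#include <math.h>

double decode_tapered(unsigned int w)
{
    unsigned int ptr     = w >> 29;              /* 3-bit pointer               */
    unsigned int expbits = 8 + ptr;              /* exponent width grows...     */
    unsigned int manbits = 29 - expbits;         /* ...and the mantissa shrinks */
    int e = (int)((w >> manbits) & ((1u << expbits) - 1))
            - (1 << (expbits - 1));              /* remove the exponent bias    */
    unsigned int m = w & ((1u << manbits) - 1);
    /* implicit leading 1, as in ordinary normalized FP */
    return ldexp(1.0 + (double)m / (double)(1u << manbits), e);
}

int main(void)
{
    /* ptr = 0: 8 exponent bits, 21 mantissa bits;
       ptr = 7: 15 exponent bits, only 14 mantissa bits left. */
    printf("%g\n", decode_tapered(0x00200000u));  /* ptr=0, e=-127, m=0 */
    return 0;
}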

Deming shows how this tradeoff moves the complexity of coding
reliable numerical software from avoiding overflow to handling
roundoff.
    I.e. reduced precision makes rounding error analysis more
complicated.

QUESTION:
    Don't the same arguments apply to IEEE Floating Point with
denormalized numbers?  I.e. don't denormalized numbers complicate
roundoff error analysis in the same way reduced precision complicates
the other arithmetics?

Deming suggests a sticky register which tracks the least relative
precision ever used in the calculation of intermediate results, which
will give you worst-case rounding error.
    Would such a register be worthwhile tracking the most extremely
denormalized IEEE FP number encountered?
    Does anyone do this sort of thing?
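
To make the question concrete, a pure-software version of such a register
might look like the sketch below (track() and min_sig_bits are names I just
made up; every intermediate result gets passed through track()):

#include <float.h>
#include <math.h>

static int min_sig_bits = DBL_MANT_DIG;     /* the "sticky precision register" */

static double track(double x)
{
    if (x != 0.0 && fpclassify(x) == FP_SUBNORMAL) {
        int e;
        frexp(x, &e);                       /* x = f * 2^e, 0.5 <= |f| < 1 */
        /* significant bits left once the exponent has bottomed out */
        int bits = DBL_MANT_DIG - (DBL_MIN_EXP - e);
        if (bits < min_sig_bits)
            min_sig_bits = bits;
    }
    return x;
}

/* usage: d = track(x - y);  r = track(d * z);
   afterwards min_sig_bits is the fewest significant bits any denormal
   intermediate had, so roughly 2^-min_sig_bits bounds the worst relative
   spacing encountered along the way. */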

--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

ccplumb@rose.waterloo.edu (Colin Plumb) (10/03/89)

(I'm going a bit out on a limb, since someone with more experience than I
may prove all my ideas total nonsense, but this is what I learned in
conversation with a member of the IEEE 754 standards committee.)

In article <AGLEW.89Sep29135055@chant.urbana.mcd.mot.com> aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes:
> Deming shows how this tradeoff moves the complexity of coding
> reliable numerical software from avoiding overflow, to handling
> roundoff.  
>     IE. reduced precision makes rounding error analysis more
> complicated.

True.  There are some really useful axioms binary floating-point obeys
that, say, IBM-style base-16 stuff doesn't.  E.g. (a+b)/2 lies in the closed
interval between a and b.  In base-16, if a and b both have the high bit
of the high 4-bit digit set, then adding them causes you to shift by 4
bits at once, dropping 3 off the bottom.  Dividing by 2 causes 0's to be
shifted in and makes a mess of things.  This can happen anywhere you can lose
more than 1 bit of mantissa in an addition step, such as variable-size
exponent encodings.
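
(A quick empirical check of that axiom for IEEE binary doubles; a spot test,
not a proof, and it ignores the overflow case:)

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

int main(void)
{
    for (int i = 0; i < 1000000; i++) {
        double a = ldexp((double)rand(), rand() % 100 - 50);
        double b = ldexp((double)rand(), rand() % 100 - 50);
        double m = (a + b) / 2.0;
        double lo = a < b ? a : b, hi = a < b ? b : a;
        if (m < lo || m > hi) {
            printf("violation: a=%g b=%g mid=%g\n", a, b, m);
            return 1;
        }
    }
    printf("no violations found\n");
    return 0;
}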

> QUESTION:
>     Don't the same arguments apply to IEEE Floating Point with
> denormalized numbers?  Ie. don't denormalized numbers complicate
> roundoff error analysis in the same way reduced precision complicates
> the other arithmetics?

Surprisingly, no... they improve things!  I've seen a letter from no less
eminent an authority than Our Lord Knuth retracting his opposition to
denormalised numbers.  This is because denormalised numbers let you add
and subtract near the lower end of the expressible range without losing
absolute precision.  Consider a representation without denormalised
numbers.  There is some minimum exponent, giving a smallest scale factor
of 2^-min, which can be multiplied by a mantissa from 1.111...111 down to
1.000...000.  The difference made by a 1 in the least significant bit of
the mantissa is 2^-min * 0.000...001, i.e. 2^-(min+mantsize).  You can add
and subtract a lot of these units, but
once you get down to 1x2^-min, the jump is 2^-min all the way down to
zero.  Rather annoying!  Denormalised numbers let you express the difference
between any two representable numbers with as much absolute accuracy
as the least precise of the inputs.  Rather useful for fiddling with the
last few bits of error term in some messy polynomial approximation or whatever.
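
(A tiny illustration of that in C, assuming IEEE doubles with gradual
underflow enabled:)

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    double x = DBL_MIN;                  /* smallest normal, 2^-1022 */
    double y = nextafter(x, 0.0);        /* one ulp below it         */
    double d = x - y;                    /* 2^-1074, a denormal      */
    printf("x - y = %g (%s)\n", d,
           fpclassify(d) == FP_SUBNORMAL ? "denormal" : "normal");
    /* without gradual underflow this difference would flush to zero,
       even though x != y */
    printf("x != y: %d, x - y != 0: %d\n", x != y, d != 0.0);
    return 0;
}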

> Deming suggests a sticky register which tracks the least relative
> precision ever used in the calculation of intermediate results, which
> will give you worst-case rounding error.
>     Would such a register be worthwhile tracking the most extremely
> denormalized IEEE FP number encountered?
>     Does anyone do this sort of thing?

I don't know... generally those who are really concerned about such things
do interval arithmetic, keeping two answers at all stages between which the
true answer is guaranteed to lie.  There are problems with covariance (even
if x is x +/- epsilon, x-x is exactly zero, not 0 +/- 2*epsilon), but it
provides good worst-case bounds.  Addition and multiplication do rather different
things to error bounds.  For the former, an absolute error bound is
best; for the latter, a relative error.  Mixing the two leads to all sorts
of messy analysis.
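
(For the record, a minimal sketch of what I mean by interval arithmetic, with
directed rounding doing the bounds; interval and iv_add are made-up names,
and multiplication would need the min/max over the four endpoint products
with the same rounding trick:)

#include <fenv.h>
#include <stdio.h>
#pragma STDC FENV_ACCESS ON

typedef struct { double lo, hi; } interval;

static interval iv_add(interval a, interval b)
{
    interval r;
    fesetround(FE_DOWNWARD);  r.lo = a.lo + b.lo;   /* lower bound rounds down */
    fesetround(FE_UPWARD);    r.hi = a.hi + b.hi;   /* upper bound rounds up   */
    fesetround(FE_TONEAREST);
    return r;
}

int main(void)
{
    interval x = { 1.0, 1.0 };
    interval t = { 0.1, 0.1 };                /* the double nearest 0.1 */
    interval s = iv_add(x, t);
    printf("[%.17g, %.17g]\n", s.lo, s.hi);   /* lo < hi: rounding captured */
    return 0;
}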

This is one of the reasons that specifying the rounding mode in the
instruction rather than in a mode register is A Good Thing.
-- 
	-Colin

aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) (10/03/89)

A while back I posted concerning the following paper:

    "On Error Analysis in Arithmetic with Varying Relative Precision",
    James W. Demmel (Courant), Proc 8th Symp Comp Arith, 1987, pp. 148-152.

(I incorrectly named the author as Deming, instead of Demmel. My apologies
to both Messrs. Deming and Demmel)

I asked whether the arguments against varying relative precision for
level-index types of arithmetic do not also apply to denormalized numbers
in IEEE floating point.


I have received many answers, both by email and news, of the form "Of
course not - denorms preserve absolute precision on +/-".
    This is true enough.
    But isn't it also true that denorms lose relative precision?

E.g.
    If I compute (x-y)*z
    and x-y produces a denorm,
    then, instead of relative precision related to 1/2^M, 
where there are M bits in the mantissa,
    do you not have relative precision related to 1/2^D, 
where there are D valid bits in the denormalized difference?
    In fact, if you permit denorm-denorm, might you not be reducing
relative accuracy to 1/2 (since you can conceivably have only one
significant denormalized bit)?
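
To put numbers on that with IEEE doubles (just an illustration, not an
argument that real code hits this often):

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    double ulp = DBL_MIN * DBL_EPSILON;   /* smallest denormal, 2^-1074 */
    double x = DBL_MIN + 3.0 * ulp;
    double y = DBL_MIN;
    double d = x - y;                     /* exactly 3 * 2^-1074: two significant bits */
    double gap = nextafter(d, 1.0) - d;   /* spacing of representable numbers at d */
    printf("d = %g\n", d);
    printf("relative spacing at d = %g  (DBL_EPSILON = %g)\n",
           gap / d, DBL_EPSILON);
    return 0;
}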

    Note that I am not pushing the alternative, flushing to zero so that
(x-y)*z => 0*z => 0, which may have even lower relative precision.
If anything, I was wondering whether the sticky precision register would
be useful.



--
Andy "Krazy" Glew,  Motorola MCD,    	    	    aglew@urbana.mcd.mot.com
1101 E. University, Urbana, IL 61801, USA.          {uunet!,}uiucuxc!udc!aglew
   
My opinions are my own; I indicate my company only so that the reader
may account for any possible bias I may have towards our products.

dik@cwi.nl (Dik T. Winter) (10/04/89)

In article <AGLEW.89Oct3123439@chant.urbana.mcd.mot.com> aglew@urbana.mcd.mot.com (Andy-Krazy-Glew) writes:
 > Eg.
 >     If I compute (x-y)*z
 >     and x-y produces a denorm,
 >     then, instead of relative precision related to 1/2^M, 
 > where there are M bits in the mantissa,
 >     do you not have relative precision related to 1/2^D, 
 > where there are D valid bits in the denormalized difference.
This is only true if the multiplication produces a denorm too.
Note that if underflow to zero occurs, the relative precision would be 0.
Numerical analysis has always had to take care of underflow and near underflow.
But the use of denorms simplifies matters in a number of cases (and does not
make it more difficult in other cases).  The major advantage of denorms is,
in my opinion, that statements like:
	if(x-y != 0.0) z = x/(x-y);
will not cause a divide-by-zero trap but gives an (approximately) correct
result.
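
(To make that concrete, a complete toy version, with x and y chosen so that
x - y lands in the denormal range:)

#include <stdio.h>
#include <float.h>
#include <math.h>

int main(void)
{
    double x = nextafter(DBL_MIN, 1.0);   /* one ulp above the smallest normal */
    double y = DBL_MIN;
    double z = 0.0;
    if (x - y != 0.0)                     /* x - y is the smallest denormal */
        z = x / (x - y);
    printf("z = %g\n", z);                /* about 2^52: finite, no trap */
    return 0;
}
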
-- 
dik t. winter, cwi, amsterdam, nederland
INTERNET   : dik@cwi.nl
BITNET/EARN: dik@mcvax