[comp.lang.c] Floating point non-exactness

rick@.tetrauk.UUCP (Rick Jones) (07/30/90)

(This is nothing to do with unsigned integers :-)

I don't know if this has been discussed in the dim past, but I haven't seen
anything since I have been reading News.

The problem is well-known - not every rational decimal number has an exact
representation in binary floating-point form.  This leads to anomalies such
as:
	a = .291 / 2.0 ;
	b = .1455 ;

	if (a == b)		/* false! */

Evaluated in exact decimal arithmetic, both expressions yield .1455, but the
computed results may not in fact be equal.  The comparison fails on my system
when the three values
are generated from strings using atof().  What does and does not fail will
presumably depend on the FP representation used, and possibly on the actual
hardware and compiler.

The reason is that .1455 does not have an exact binary representation.  In this
case the nearest two values are .14549999999999999 and .14550000000000002.  The
two routes to the result get different nearest approximations.  Thus the two
results are not actually equal.
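
A minimal program to reproduce it (a sketch; what it prints will of course
depend on the FP representation, hardware, and compiler):

	#include <stdio.h>
	#include <stdlib.h>

	int main(void)
	{
	    /* the same three values, generated from strings via atof() */
	    double a = atof(".291") / 2.0;
	    double b = atof(".1455");

	    printf("a = %.17g\n", a);
	    printf("b = %.17g\n", b);
	    printf("a == b is %s\n", a == b ? "true" : "false");
	    return 0;
	}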

Is there a general solution to this problem without continuously rounding
everything?  I believe it's possible by applying a "maximum reliable precision"
concept to FP numbers.  For IEEE 64-bit numbers this seems to be 1 in 10^16, or
15 significant digits.  Using this, a function "fpcmp" can be written which
takes 2 arguments:

	fpcmp (double a, double b)

		returns 0 if a equals b within the reliable precision,
		else returns 1 if a > b, or -1 if a < b

Real problem:  how do you write fpcmp() ?

I've got one or two possible solutions, but they're not very nice, and
computationally inefficient.  Has anyone got a really good answer to this?

Associated problems:

Can you ensure that "fpcmp (a, b)" and "fpcmp (a-b, 0.0)" yield the same result
when a and b are very close?

What about a rounding function which will consistently round on "5"?  E.g.
using (s)printf to convert .1455 to 3 dec. places will give .145 and .146 for
the two different close values above.


While it's a good discussion subject, I suggest that anything extensive, or
with large chunks of code, you email to me.  I'll summarise such contributions
later.

-- 
Rick Jones					You gotta stand for something
Tetra Ltd.  Maidenhead, Berks			Or you'll fall for anything
rick@tetrauk.uucp (...!ukc!tetrauk.uucp!rick)	     - John Cougar Mellencamp

ok@goanna.cs.rmit.oz.au (Richard A. O'Keefe) (07/31/90)

In article <622@.tetrauk.UUCP>, rick@.tetrauk.UUCP (Rick Jones) writes:
> 	fpcmp (double a, double b)
> 
> 		returns 0 if a equals b within the reliable precision,
> 		else returns 1 if a > b, or -1 if a < b
> 
> Real problem:  how do you write fpcmp() ?

See Knuth, "The Art of Computer Programming", vol. 2, "Seminumerical Algorithms".
He defines "approximate" versions of <, >, and *TWO* approximate versions of
==.   Note that the appropriate "tolerance" to use does NOT depend only on
the properties of your floating-point arithmetic, but on details of your
algorithm.
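
If I remember Knuth's definitions right, they look roughly like this in C
(treat it as a sketch; eps is supplied by the caller to suit the algorithm):

	#include <math.h>

	/* "approximately equal": tolerance scaled by the LARGER magnitude */
	int approx_eq(double u, double v, double eps)
	{
	    double mu = fabs(u), mv = fabs(v);
	    return fabs(u - v) <= eps * (mu > mv ? mu : mv);
	}

	/* "essentially equal": tolerance scaled by the SMALLER magnitude,
	   a strictly stronger test */
	int essentially_eq(double u, double v, double eps)
	{
	    double mu = fabs(u), mv = fabs(v);
	    return fabs(u - v) <= eps * (mu < mv ? mu : mv);
	}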

The ultimate answer to your question is to take a course on numerical
analysis.  I sometimes wonder whether there ought to be a generally
understood qualification for computer programmers:  LFP (licensed to
use floating-point) (:-).

> Can you ensure that "fpcmp (a, b)" and "fpcmp (a-b, 0.0)" yield the same result
> when a and b are very close?

No.  In many computer arithmetic systems, a-b may collapse to a precise 0.
See the axioms Knuth gave.
-- 
Science is all about asking the right questions.  | ok@goanna.cs.rmit.oz.au
I'm afraid you just asked one of the wrong ones.  | (quote from Playfair)

carlos@urz.unibas.ch (07/31/90)

Oooops, sorry for message nr. 2653, it was our first try ever....
Hope this time works......

In article <622@.tetrauk.UUCP>, rick@.tetrauk.UUCP (Rick Jones) writes:
>       fpcmp (double a, double b)
>               returns 0 if a equals b within the reliable precision,
>               else returns 1 if a > b, or -1 if a < b
>
> Real problem:  how do you write fpcmp() ?


Since there is only a finite set of fp numbers on the computer, a pair of
procedures Increase(x) and Decrease(x) (returning the next and the previous
representable fp number after x) could do it.  The tolerance that you build
into your function fpcmp() depends on the fp system you use.


You can implement these functions very easily for IEEE fp formats:

 Just regard the bit pattern representing your fp number as an integer,
 and increase or decrease it by 1.  Reinterpreted as a double, the new
 integer is the next or the previous representable fp number.

 This automatically handles crossing exponent boundaries and the
 transition between normalized and denormalized numbers.

 You only have to watch out for a change of sign, and note that
 decreasing a negative number means increasing its mantissa (and
 exponent).
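
 For IEEE doubles, a minimal sketch of Increase() along these lines.  It
 assumes a 64-bit unsigned integer type and handles only positive, finite
 x; signs, zeros, infinities and NaNs need the extra care mentioned above.

	#include <string.h>

	double Increase(double x)
	{
	    unsigned long long bits;

	    memcpy(&bits, &x, sizeof bits);  /* view the bit pattern as an integer */
	    bits += 1;                       /* step to the next bit pattern */
	    memcpy(&x, &bits, sizeof x);     /* reinterpret it as a double */
	    return x;
	}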


> Can you ensure that "fpcmp (a, b)" and "fpcmp (a-b, 0.0)" yield the same result
> when a and b are very close?

Yes, if your fp system conforms to IEEE Std. 754, thanks to the existence of
denormalized numbers.

                         Good luck,   Stefan & Carlos

rob@baloo.eng.ohio-state.edu (Rob Carriere) (08/01/90)

In article <622@.tetrauk.UUCP> rick@tetrauk.UUCP (Rick Jones) writes:
>
>(This is nothing to do with unsigned integers :-)

Are you sure?  It's a floating subject after all...

>The problem is well-known - not every rational decimal number has an exact
>representation in binary floating-point form.  This leads to anomalies such
>as:
[example: computation != mathematical result]
>
>Is there a general solution to this problem without continuously rounding
>everything?  I believe it's possible by applying a "maximum reliable precision"
>concept to FP numbers.  For IEEE 64-bit numbers this seems to be 1 in 10^16, or
>15 significant digits.  Using this, a function "fpcmp" can be written which
>takes 2 arguments: [and finds whether they are within float epsilon of each
other] 

Hm.. Are you sure this is what you want?  In your normal everyday floating
point problem, the inaccuracies resulting from real-world problems tend to
greatly exceed those from the computer's FP system.  If I measure the
`resistance' of your average resistor, I had better not peek much further than
3 to 4 decimal digits of precision, or I will see that the resistance of this
thing isn't constant after all --> the model (Ohm's Law in this case) doesn't
match reality.  Further, I doubt my multimeter is going to give me more than 2
digits anyway (and I wouldn't stake anything on the exactness of the 2nd digit
unless the thing's been calibrated recently) --> measurement error.

OK, so this resistor is 3.1 kOhm.  I tell my nifty-dandy program this and it
comes back with the result that the eggs will take 182.14142314159 s to
finish.  How many digits of that answer do you trust?

And we haven't even touched the numerical conditioning of that program yet. :-)

In other words, usually what you want is something like

int fcmp( double a, double b, double eps )
{
   if (  fabs( a-b ) < eps  ) return 0;
   if (  a > b  ) return 1;
   return -1;
}

where eps will be more than large enough that you don't have to worry about
the FP system messing up the comparison.  (Optimization of the above code is
left as an exercise for the reader.  After all, I already did the easy part
:-)

SR
---

ark@alice.UUCP (Andrew Koenig) (08/01/90)

In article <5467@quanta.eng.ohio-state.edu>, rob@baloo.eng.ohio-state.edu (Rob Carriere) writes:

> In other words, usually what you want is something like

> int fcmp( double a, double b, double eps )
> {
>    if (  fabs( a-b ) < eps  ) return 0;
>    if (  a > b  ) return 1;
>    return -1;
> }

Probably not.

The trouble is that if a and b are big enough, then a-b will either be
exactly 0 or will be larger than eps.  (Here, `big enough' is approximately
eps * 2^n, where n is the number of significant bits in a floating-point
fraction.)  There is also the possibility that a-b might overflow,
in which case the comparison would surely have been valid if done directly.
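
A concrete illustration of the first point (assuming IEEE doubles, with
made-up numbers):

	double a   = 1.0e17;   /* near 1e17, adjacent doubles are 16 apart */
	double eps = 1.0e-6;

	/* a + 1.0 == a, so fabs(a - (a + 1.0)) is exactly 0.0       */
	/* a + 16.0 is the next representable double, and            */
	/* fabs(a - (a + 16.0)) is 16.0, vastly larger than eps      */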

Floating point arithmetic is *hard*.
-- 
				--Andrew Koenig
				  ark@europa.att.com

johnb@srchtec.UUCP (John Baldwin) (08/02/90)

In article <1990Jul31.101803.852@urz.unibas.ch> carlos@urz.unibas.ch writes:
>
>> Can you ensure that "fpcmp (a, b)" and "fpcmp (a-b, 0.0)" yield the same
           ^^^^^^[emphasis mine]
>> result when a and b are very close?
>
>Yes, if your fp system conforms IEEE Std. 754 due to the existence of 
>denormalized numbers.

Actually, it depends on what the previous writer means by "ensure."
If the code has to be portable, one can't rely on IEEE-754 representation
of floating point values, since that's NOT part of the C standard.

Portably, you should be able to "get away with" using the values built in
to <float.h> (a required part of the standard library) for the epsilon
of your floating point representation.  [Just in case this slipped past
anyone, epsilon here is the difference between 1.0 and the smallest
representable value greater than 1.0.]
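
A sketch of how fpcmp() might use it (the factor of 4 is an arbitrary
illustration; as noted elsewhere in the thread, the right tolerance really
depends on the algorithm):

	#include <float.h>
	#include <math.h>

	int fpcmp(double a, double b)
	{
	    /* relative comparison, scaled by the larger magnitude */
	    double scale = fabs(a) > fabs(b) ? fabs(a) : fabs(b);

	    if (fabs(a - b) <= 4.0 * DBL_EPSILON * scale)   /* 4 is arbitrary */
	        return 0;
	    return a > b ? 1 : -1;
	}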

-- 
John T. Baldwin                      |  johnb@srchtec.uucp
Search Technology, Inc.              |  johnb%srchtec.uucp@mathcs.emory.edu
standard disclaimer:                 |  ...uunet!samsung!emory!stiatl!srchtec..
opinions and mistakes purely my own. |  ...mailrus!gatech!stiatl!srchtec...

r91400@memqa.uucp (Michael C. Grant) (08/06/90)

In article <11117@alice.UUCP>, ark@alice.UUCP (Andrew Koenig) writes:
> In article <5467@quanta.eng.ohio-state.edu>, rob@baloo.eng.ohio-state.edu (Rob Carriere) writes:
>> In other words, usually what you want is something like
>> int fcmp( double a, double b, double eps )
>> {
>>    if (  fabs( a-b ) < eps  ) return 0;
>>    if (  a > b  ) return 1;
>>    return -1;
>> }
> Probably not.
> 
> The trouble is that if a and b are big enough, then a-b will either be
> exactly 0 or will be larger than eps.  (Here, `big enough' is approximately
> eps * 2^n, where n is the number of significant bits in a floating-point
> freaction.)  There is also the possibility that a-b might overflow,
> in which case the comparison would surely have been valid if done directly.

Perhaps a solution to this problem would be an implementation-
dependent library function fcmp, in which 'eps' is not some absolute
number, but a fraction of the two nearly-equal numbers themselves:

	if (fabs(a-b) < eps*fabs(a)) return 0;
	(assuming a is almost equal to b)

The reason it would need to be a library function is that implementing
it efficiently would involve extracting the mantissa and exponent and
examining them separately.  For example, assuming a standard which
requires normalization, this algorithm might work (note this is 
just a near-equality test, not a complete comparison):

	if (exponent_a!=exponent_b) return false
	if abs(mantissa_a-mantissa_b)<error_limit return true
	   else return false

Of course, if you don't care about portable code just write
this function yourself!

Michael C. Grant
no .sig, no .identity

ark@alice.UUCP (Andrew Koenig) (08/06/90)

In article <4958@memqa.uucp>, r91400@memqa.uucp (Michael C. Grant) writes:

> For example, assuming a standard which
> requires normalization, this algorithm might work (note this is 
> just a near-equality test, not a complete comparison):

> 	if (exponent_a!=exponent_b) return false
> 	if abs(mantissa_a-mantissa_b)<error_limit return true
> 	   else return false


You can use frexp() to split numbers into mantissa and exponent
(though I forget whether frexp is actually part of the ANSI standard
or not).  However, the algorithm shown here doesn't work because
the two numbers being compared could straddle a power of 2,
which would give them different exponents even though they're
arbitrarily close together.
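
A tiny demonstration of the straddle, assuming IEEE doubles (the two
values are adjacent, yet frexp() reports different exponents):

	#include <float.h>
	#include <math.h>
	#include <stdio.h>

	int main(void)
	{
	    double a = 1.0 - DBL_EPSILON / 2;   /* largest double below 1.0 */
	    double b = 1.0;
	    int ea, eb;

	    frexp(a, &ea);              /* a = m * 2^0, m just below 1 */
	    frexp(b, &eb);              /* b = 0.5 * 2^1               */
	    printf("%d %d\n", ea, eb);  /* prints: 0 1                 */
	    return 0;
	}
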
-- 
				--Andrew Koenig
				  ark@europa.att.com

gdtltr@freezer.it.udel.edu (Gary Duzan) (08/06/90)

   All practical applications aside, isn't it philosophically unnatural to
apply an equivalence comparison to two floating point (presumably real)
numbers? I know there are finite representations in the machine, etc., but
what mathematics I have had would make me think twice about this anyway.

                                        Gary Duzan
                                        Time  Lord
                                    Third Regeneration



--
                          gdtltr@freezer.it.udel.edu
   _o_                    --------------------------                      _o_
 [|o o|] If you can square, round, or cube a number, why not sphere it? [|o o|]
  |_O_|         "Don't listen to me; I never do." -- Doctor Who          |_O_|

ark@alice.UUCP (Andrew Koenig) (08/08/90)

In article <26691@nigel.ee.udel.edu>, gdtltr@freezer.it.udel.edu (Gary Duzan) writes:

>    All practical applications aside, isn't it philosophically unnatural to
> apply an equivalence comparison to two floating point (presumably real)
> numbers?

Not if you're doing appropriate operations on a machine with IEEE
floating point or some other good floating point system.

For example, IEEE guarantees that if I read the decimal representation
of a number that can be represented exactly in floating point,
that's what I'll get, period.  It also guarantees exact results
for primitive operations including addition, subtraction, multiplication,
division, and square root.

Suppose, then, that I want to do calculations on integers that are
too big to fit in `int' values but small enough to be represented
exactly in floating point.  I see no reason not to do any comparison
on floating point numbers that I would do on integers.

As a simple example,

	double d;
	for (d = 0; d < 1e10; d++) {
		/* stuff */
	}

under IEEE floating point can be counted on to execute precisely 1e10
(ten billion) times.

Yes, such techniques are not portable to machines with lousy floating
point, but it might be that your program wouldn't run there anyway
for some reason.  As good floating point systems become more widespread,
this becomes less of an issue.
-- 
				--Andrew Koenig
				  ark@europa.att.com

doug@ozdaltx.UUCP (Doug Matlock) (08/13/90)

In article <11160@alice.UUCP>, ark@alice.UUCP (Andrew Koenig) writes:
> For example, IEEE guarantees that if I read the decimal representation
> of a number that can be represented exactly in floating point,
> that's what I'll get, period.  It also guarantees exact results
> for primitive operations including addition, subtraction, multiplication,
> division, and square root.

What if these primitive operations generate a result that cannot be
represented exactly in IEEE floating-point notation?  Many simple rational
(or irrational, since you've included square root) numbers cannot be
expressed exactly in decimal notation.

> As a simple example,
> 
> 	double d;
> 	for (d = 0; d < 1e10; d++) {
> 		/* stuff */
> 	}
> 
> under IEEE floating point can be counted on to execute precisely 1e10
> (ten billion) times.

The obvious pitfall here is exceeding the precision of floating point.
For 8-byte double, I think the turning point is somewhere around 1e15.
If I take a floating point number greater than 1e15 (or so) and add 1
to it, I will get the same number back.  (This is generally lots of fun
when running computer simulations. ;-) )
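
For IEEE 8-byte doubles the exact turning point is 2^53 (about 9.0e15); a
tiny illustration:

	double x = 9007199254740992.0;   /* 2^53 */

	/* x + 1.0 == x is true: from 2^53 upward, consecutive doubles
	   are at least 2 apart, so adding 1 rounds straight back to x */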

-- 
Doug.

"If you want Peace, work for Justice."

bright@Data-IO.COM (Walter Bright) (08/16/90)

In article <6855@ozdaltx.UUCP> doug@ozdaltx.UUCP (Doug Matlock) writes:
<In article <11160@alice.UUCP>, ark@alice.UUCP (Andrew Koenig) writes:
<< For example, IEEE guarantees that if I read the decimal representation
<< of a number that can be represented exactly in floating point,
<< that's what I'll get, period.  It also guarantees exact results
<< for primitive operations including addition, subtraction, multiplication,
<< division, and square root.
<What if these primitive operations generate a result that cannot be
<represented exactly in IEEE floating point notation?

IEEE enables you to specify that an exception is generated if the result
is not exact.
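
A minimal sketch of testing the sticky "inexact" flag, assuming a C99
<fenv.h> environment (whether it can be made to trap is a separate,
platform-specific setting):

	#include <fenv.h>
	#include <stdio.h>

	#pragma STDC FENV_ACCESS ON

	int main(void)
	{
	    volatile double x;

	    feclearexcept(FE_INEXACT);   /* clear the flag first */
	    x = 1.0 / 3.0;               /* result must be rounded */
	    if (fetestexcept(FE_INEXACT))
	        printf("1.0/3.0 raised the inexact exception\n");
	    return 0;
	}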

ark@alice.UUCP (Andrew Koenig) (08/17/90)

In article <2646@dataio.Data-IO.COM>, bright@Data-IO.COM (Walter Bright) writes:
> In article <6855@ozdaltx.UUCP> doug@ozdaltx.UUCP (Doug Matlock) writes:

> <In article <11160@alice.UUCP>, ark@alice.UUCP (Andrew Koenig) writes:

> << It also guarantees exact results
> << for primitive operations including addition, subtraction, multiplication,
> << division, and square root.

> <What if these primitive operations generate a result that cannot be
> <represented exactly in IEEE floating point notation?

> IEEE enables you to specify that an exception is generated if the result
> is not exact.

Yes.  Also, and this is the point I was trying to make before, for the
primitive operations I mentioned, IEEE requires that the result of
each operation be exactly what would be obtained by doing the operation
in infinite precision and then rounding it.  In other words, for
those operations, all IEEE implementations must give bit-for-bit
identical results.
-- 
				--Andrew Koenig
				  ark@europa.att.com