ark@alice.UucP (Andrew Koenig) (12/22/85)
The purpose of this note is to illustrate that some issues are much more complicated than they appear on the surface. Consider the following C statement:

	a = a + 1.0;

and the superficially similar statement:

	a = a + b;

In these examples, a and b are presumed to be of type float, and 1.0 is (by definition) of type double. The question: is it allowable for a compiler to quietly substitute a single-precision addition for the double-precision addition that the language definition requires? In other words: will the results always be equivalent both ways?

To explore this, I need to define a few terms and assume a few things about the architecture. First of all, realize that any floating-point value is the exact representation of some rational number, and that the sum of any two rationals is a rational. However, this sum may not be exactly representable as a floating-point value, even if both operands started out that way. Thus, given a rational number, we need a way of deciding how to represent it. In general, we will choose one of the two nearest representable neighbors. Which one we choose is determined by the rounding rule of the implementation, and for a good implementation the choice will be unique for any particular value being rounded. If some process gives the correctly rounded value of the exact mathematical result, I'll call the result of that process "correct."

Now, let's assume that both single- and double-precision addition on our machine give correct results. We then want to know: will the two examples above give correct results in both single and double precision? If not, which one is correct?

It is straightforward to see that doing the addition in single precision gives correct results in both cases. The second case (a = a + b) follows immediately from the definition of "correct" and the assumption that single-precision addition gives correct results.
The first case is true given the further assumption that 1.0 can be exactly represented in single precision, an assumption that is true of every floating-point architecture I have ever seen.

But what about the double-precision case? Will (double)a + (double)b always have the same value as (single)a + (single)b? What about (single)((double)a + (double)b)? It's the last one that is the toughie, because what is really happening in the double-precision case is that an exact mathematical result is formed, rounded to double, and then rounded a second time to single. If we knew that this double rounding was always harmless, then we could say with confidence that the result will be the same either way. But is it harmless?

The answer is: NO, double rounding is NOT harmless. Consider a result to be rounded twice under the default IEEE rounding rule: round to nearest, with ties going to the even neighbor. Apply this rule to, say:

	.....1|01111111111|11...

Here the ... represent uninteresting fraction bits and the | mark the two rounding points (single precision on the left, double on the right). If we round this value directly to single, the 0 just after the left cut point clearly indicates rounding down. However, if we round to double first, we round up, and the carry propagates through the run of 1s into the 0, making it a 1. This produces an exact tie in the second rounding, and the round-to-even rule says we must round up, since the last bit kept in single is a 1. So here is an example of a case where rounding directly to single gives different results from rounding first to double and then to single.

The next questions are these: (a) is it possible to generate such a result by adding two single-precision values? (b) is it possible to generate such a result if one of the operands is 1 and the other is a single-precision value? Answer: I'm not sure yet -- I have to think about it some more.