[comp.lang.c] Floating point puzzle

riddle@emory.uucp (Larry Riddle) (08/07/88)

The following is a very simple C program compiled and run on a Sun-4
with no special command line options.

main()
{
	float x,y;
	x = 1.0/10.0;
	y = 1677721.0/16777216.0; 
	printf("x: %x",x);
	printf("%20.17f\n",x);
	printf("y: %x",y);
	printf("%20.17f\n",y);
}

Here is the output:

x: 3fb99999 0.10000000149011612 
y: 3fb99999 0.09999996423721313

Notice that x and y, which have been declared as floats, and thus have
a 32 bit representation (according to the manual this obeys IEEE
floating point arithmetic standards), both are printed the same in hex,
but give different values when printed as floats. I believe that the
hex is a straight translation of the internal bit representation. The
division in computing x and y is done in double precision (64 bits) and
then converted to floats.

Can anyone enlighten me on why this output comes out this way?

**********

According to what I have read about the IEEE standard, floats should
have 1 sign bit, a biased exponent of 8 bits, and a 23 bit normalized
mantissa. However, my experiments seem to imply that floats have an 11
bit biased exponent (offset by 1023) and only a 20 bit normalized
mantissa, exactly the same as doubles, except double has a 52 bit
mantissa. For example, the bit pattern 3fb99999 given above for 1/10
corresponds to

    exponent   mantissa
0 01111111011 10011001100110011001

The 11 bits of this exponent gives 1019-1023 = -4 which coupled with
the mantissa gives the binary number

.0001100110011001100110011001...

which is the (non-terminating) binary representation for 1/10.  Notice
also that this 32 bit representation has been chopped, rather than
rounded.

I don't understand this discrepancy either. Any suggestions?

Thanks.

***********
-- 
Larry Riddle        |  gatech!emory!riddle               USENET
Emory University    |  riddle@emory                      CSNET,BITNET
Dept of Math and CS |  riddle.emory@csnet-relay          ARPANET 
Atlanta, Ga 30322   |  (404) 727-7922                    AT&T

mcdonald@uxe.cso.uiuc.edu (08/07/88)

I tried running the following program on my IBM PC, using Microsoft C 5.1
with command line switch /Od - no optimization.

#include <stdio.h>

main()
{
union { 
    float fl;
    unsigned long  lo;
} x,y;
	x.fl = 1.0/10.0;
	y.fl = 1677721.0/16777216.0; 
	printf("x: %lx", x.lo);
	printf("%20.17f\n",x.fl);
	printf("y: %lx", y.lo);
	printf("%20.17f\n",y.fl);
}

/* output is:

x: 3dcccccd 0.10000000149011610
y: 3dccccc8 0.09999996423721313

*/

Which is exactly what I would expect. The first one is 0.1 to almost
8 places, while the second is a bit smaller - just as it should be.

Try running this program, or, if your compiler has 32-bit ints, 
substitute "int" for "long" and "%x" for "%lx".

Doug McDonald

rob@kaa.eng.ohio-state.edu (Rob Carriere) (08/08/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
> [ C program on Sun-4 gives strange single precision results, to wit,
>   the results look like mutilated doubles ]

I don't *know*, but you might want to look into what your floating
point hardware does.  Most floating point hardware I know of only
implements IEEE to whatever extent was convenient.  Specifically, many
of these devices leave results hanging around in the (extended
precision) floating registers if at all possible.  This means that you
may get more precision than you asked for, i.e. your results probably
really are doubles, despite your asking for singles.  This would also
explain why you see truncated rather than rounded doubles: you are
doing the truncation manually when you ask for the hex output.

Note that this behaviour, while not IEEE, is ANSI C acceptable.

Rob Carriere

jgk@speech2.cs.cmu.edu (Joe Keane) (08/08/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
>However, my experiments seem to imply that floats have an 11
>bit biased exponent (offset by 1023) and only a 20 bit normalized
>mantissa, exactly the same as doubles, except double has a 52 bit
>mantissa.

(This belongs on comp.lang.c!)  The problem is, a float parameter gets
converted to double.  So you're looking at the first 32 bits of the
double representation.  To look at the float representation, use this:
	printf("x: %lx", * (long *) &x);

--Joe

braner@batcomputer.tn.cornell.edu (braner) (08/08/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
>
>main()
>{
>	float x,y;
>	x = 1.0/10.0;
>	y = 1677721.0/16777216.0; 
>	printf("x: %x",x);
>	printf("%20.17f\n",x);
>	printf("y: %x",y);
>	printf("%20.17f\n",y);
>}
>
>Here is the output:
>
>x: 3fb99999 0.10000000149011612 
>y: 3fb99999 0.09999996423721313

Seems that the calculation of y is just different enough from 1/10 to cause
a difference in the last bit or so of the floats.  In the printf() calls the
floats are converted to doubles before being passed to the function, and
since a double has a larger exponent field, the difference in the
mantissa is pushed past the 32 bits that the "%x" picks up.
I suggest you try:

	float	x;
	long	*p;
	...
	p = (long *) &x;
	printf ("representation of x: %lx\n", *p);
	...and so on...

- Moshe Braner

ok@quintus.uucp (Richard A. O'Keefe) (08/08/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
>Notice that x and y, which have been declared as floats, and thus have
>a 32 bit representation (according to the manual this obeys IEEE
>floating point arithmetic standards), both are printed the same in hex,

This is >>C<< remember?  Floats are 32-bits IN MEMORY, but when you
operate on them or pass them to functions they are automatically
converted to double.

Since you specifically mention the Sun-4, I suggest that you read your
copy of the Sun Floating-Point Programmer's Guide.  In particular, if
you really want to pass 32-bit floating-point numbers in float format,
you will need the "-fsingle2" compiler option.  (I haven't tried this
on a Sun-4, but it works fine on Sun-3s.)

The two flags are

	-fsingle	If the operands of a floating-point operation are
			both 'float', the operation will be done in single
			precision instead of the normal double precision.

	-fsingle2	Float arguments are passed to functions as 32 bits,
			and float results are returned as 32 bits.  (Useful
			for calling Fortran functions.)

TRAP:  floating-point constants such as 1.0 are DOUBLE precision, so if the
compiler sees float x; ... x+1.0, it will do the addition in double
precision.  In such a case, I do	float one = 1.0; ... x+one...
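
To make that concrete, here is a small illustration of the idiom (a
sketch only: whether the second addition really stays in single
precision depends on the compiler and on flags such as -fsingle, and
for most inputs the two printed values will agree anyway; the trap is
mostly about dragging every such addition through double-precision
arithmetic):

#include <stdio.h>

main()
{
	float x, one, a, b;

	x = 1.0/3.0;
	one = 1.0;	/* single-precision copy of the constant */

	a = x + 1.0;	/* 1.0 is a double constant: x is promoted, the
			   add is done in double, and the result is then
			   narrowed back to float */
	b = x + one;	/* both operands are float, so a compiler given
			   -fsingle may do the add in single precision */

	printf("%20.17f\n%20.17f\n", a, b);
}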

carl@csli.STANFORD.EDU (Carl Schaefer) (08/08/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
> The following is a very simple C program compiled and run on a Sun-4
> with no special command line options.
> 
> main()
> {
> 	float x,y;
> 	x = 1.0/10.0;
> 	y = 1677721.0/16777216.0; 
> 	printf("x: %x",x);
> 	printf("%20.17f\n",x);
> 	printf("y: %x",y);
> 	printf("%20.17f\n",y);
> }
> 
> Here is the output:
> 
> x: 3fb99999 0.10000000149011612 
> y: 3fb99999 0.09999996423721313
> 
> Notice that x and y, which have been declared as floats, and thus have
> a 32 bit representation (according to the manual this obeys IEEE
> floating point arithmetic standards), both are printed the same in hex,
> but give different values when printed as floats. I believe that the
> hex is a straight translation of the internal bit representation. The
> division in computing x and y is done in double precision (64 bits) and
> then converted to floats.
> 
> Can anyone enlighten me on why this output comes out this way?

Floats are converted to doubles when passed as function arguments.
The following inherently nonportable code prints the bit patterns of
your floats as floats and doubles:

#include <stdio.h>

void main()
{
  union {
    double d;
    float f;
    int i[2];
    } X;
  float x, y;

  x = 1.0/10.0;
  y = 1677721.0/16777216.0;
  X.f = x;
  (void) printf("x (float): %8x, %20.17f\n", X.i[0], X.f);
  X.f = y;
  (void) printf("y (float): %8x, %20.17f\n", X.i[0], X.f);
  X.d = x;
  (void) printf("x (double): %8x/%8x, %20.17f\n", X.i[0], X.i[1], X.d);
  X.d = y;
  (void) printf("y (double): %8x/%8x, %20.17f\n", X.i[0], X.i[1], X.d);
  exit(0);
}

produces:

x (float): 3dcccccd,  0.10000000149011612
y (float): 3dccccc8,  0.09999996423721313
x (double): 3fb99999/a0000000,  0.10000000149011612
y (double): 3fb99999/       0,  0.09999996423721313

when run on a Sun-4, where sizeof(int) == sizeof(float) == 4 and sizeof(double) == 8.

leo@philmds.UUCP (Leo de Wit) (08/08/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
|The following is a very simple C program compiled and run on a Sun-4
|with no special command line options.
|
|main()
|{
|	float x,y;
|	x = 1.0/10.0;
|	y = 1677721.0/16777216.0; 
|	printf("x: %x",x);
|	printf("%20.17f\n",x);
|	printf("y: %x",y);
|	printf("%20.17f\n",y);
|}
|
|Here is the output:
|
|x: 3fb99999 0.10000000149011612 
|y: 3fb99999 0.09999996423721313
|
|Notice that x and y, which have been declared as floats, and thus have
|a 32 bit representation (according to the manual this obeys IEEE
|floating point arithmetic standards), both are printed the same in hex,
|but give different values when printed as floats. I believe that the
|hex is a straight translation of the internal bit representation. The
|division in computing x and y is done in double precision (64 bits) and
|then converted to floats.
|
|Can anyone enlighten me on why this output comes out this way?

I think I can. On a SUN-3 the same results were achieved. The problem
is that you do not get the internal representation of x and y, but of
x and y converted to double (because that's what is passed to printf).
By coincidence the part of the double that is printed for x and y is
the same (I tried it also on Ultrix, where it was _different_).
By using a union that contains a float and an int, the following values
for x and y were obtained (writing into the float part and reading
from the int part):
x: 3dcccccd
y: 3dccccc8
which shows both that x and y _have_ different binary representations,
and that the value is quite different from the original one (3fb99999).

All this might also answer the second question in the original article
about the IEEE float representation, simply because the wrong values
were used.
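
For the record, the genuine single-precision layout is 1 sign bit, an
8 bit exponent biased by 127, and a 23 bit fraction with an implicit
leading 1 bit.  Here is a little decoding program (a sketch, assuming
32-bit unsigned longs, and ignoring zeros, denormals, infinities and
NaNs) applied to the real pattern for 1/10:

#include <stdio.h>
#include <math.h>

main()
{
	unsigned long bits, sign, expo, frac;
	double value;

	bits = 0x3dcccccd;		/* the true float pattern for 1.0/10.0 */
	sign = (bits >> 31) & 0x1;
	expo = (bits >> 23) & 0xff;	/* 8-bit exponent, bias 127 */
	frac = bits & 0x7fffff;		/* 23-bit fraction; 2^23 = 8388608 */

	/* value = (-1)^sign * 1.fraction * 2^(exponent - 127) */
	value = ldexp(1.0 + frac / 8388608.0, (int) expo - 127);
	if (sign)
		value = -value;

	printf("sign %lu, exponent %lu (unbiased %ld), fraction %06lx\n",
		sign, expo, (long) expo - 127, frac);
	printf("value: %20.17f\n", value);
}

This gives an unbiased exponent of -4 and a value of
0.10000000149011612, and the last hex digit (d rather than c) shows the
pattern was rounded, not chopped; the apparent chopping in the original
posting came from looking at the top half of a double.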

Note that I do not recommend this union kludge; it is however probably
the only way to get at the bit patterns.  Alternatively, take the
address of x or y and cast it to a pointer to an integral type, so that
the value is not converted to double when you dereference it (this is
of course not portable, because the integral type needs to be the same
size as a float).

Any more Riddles 8-) ?

     Leo.

markhall@pyramid.pyramid.com (Mark Hall) (08/08/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
>The following is a very simple C program compiled and run on a Sun-4
>with no special command line options.
>
>main()
>{
>	float x,y;
>	x = 1.0/10.0;
>	y = 1677721.0/16777216.0; 
>	printf("x: %x",x);
>	printf("%20.17f\n",x);
>	printf("y: %x",y);
>	printf("%20.17f\n",y);
>}
>
>Here is the output:
>
>x: 3fb99999 0.10000000149011612 
>y: 3fb99999 0.09999996423721313
>
>Notice that x and y, which have been declared as floats, and thus have
>a 32 bit representation.

A float used in an expression (in this case, as an argument to
printf) is always converted to a double.  For example, witness the
assembly generated on the Pyramid:

        movw    $0x3dcccccd,lr0
        movw    $0x3dccccc8,lr1		#note the diff here in last 3 bits.
        cvtfd   lr0,tr1			#here's the conversion
        movw    $"x: %x",tr0
        call    _printf

Since the exponent of a double takes up more bits than that of a
float, any difference in the mantissas may be pushed out of the first
32 bits by the call to printf.  Try printf("x: %x %x", x) and compare
the bit patterns on your machine.  On the Pyramid, they
turn out to be:

x: 3fb99999 a0000000 0.10000000149011611 #close to the sun output, eh?
y: 3fb99999 00000000 0.09999996423721313

hence the difference when you print out the float.  I leave it as
an exercise (mostly because of the flames for non-portable code)
how to print out the bits without the conversion to double.

-Mark Hall (smart mailer): markhall@pyramid.pyramid.com
	   (uucp paths  ): 
		{amdahl|decwrl|sun|seismo|lll-lcc}!pyramid!markhall

turk@Apple.COM (Ken "Turk" Turkowski) (08/11/88)

In article <3117@emory.uucp> riddle@emory.uucp (Larry Riddle) writes:
>	float x,y;
>	x = 1.0/10.0;
>	y = 1677721.0/16777216.0; 
>	printf("x: %x",x);
>	printf("%20.17f\n",x);
>	printf("y: %x",y);
>	printf("%20.17f\n",y);
>Here is the output:
>
>x: 3fb99999 0.10000000149011612 
>y: 3fb99999 0.09999996423721313
>
>Notice that x and y, which have been declared as floats, and thus have
>a 32 bit representation (according to the manual this obeys IEEE
>floating point arithmetic standards), both are printed the same in hex,
>but give different values when printed as floats. I believe that the
>hex is a straight translation of the internal bit representation. The
>division in computing x and y is done in double precision (64 bits) and
>then converted to floats.

The problem is that floats are converted to doubles (or extendeds on the
macintosh).  A double has an 11-bit exponent, whereas a float only has 8.
If you print out the next word, you'll see that the two hex representations
differ somewhere in the next 3 bits.
Ken Turkowski @ Apple Computer, Inc., Cupertino, CA
UUCP: {mtxinu,sun,decwrl,nsc,voder}!apple!turk
CSNET: turk@Apple.CSNET
ARPA: turk%Apple@csnet-relay.ARPA
Domain-based address: turk@apple.com

flaps@dgp.toronto.edu (Alan J Rosenthal) (08/11/88)

In article <259@quintus.UUCP> ok@quintus.UUCP (Richard A. O'Keefe) writes:
>TRAP:  floating-point constants such as 1.0 are DOUBLE precision, so if the
>compiler sees float x; ... x+1.0, it will do the addition in double
>precision.  In such a case, I do	float one = 1.0; ... x+one...

AACKK.. Pascal programmer detected.  Use a cast.  "x + (float)1.0" is the
same as "x + one" in your example.  At least, it should be.

--
owotd (rot13): fabgentf

jfh@rpp386.UUCP (John F. Haugh II) (08/11/88)

In article <15370@apple.Apple.COM> turk@apple.com.UUCP (Ken "Turk" Turkowski) writes:
>The problem is that floats are converted to doubles (or extendeds on the
>macintosh).  A double has an 11-bit exponent, whereas a float only has 8.
>If you print out the next word, you'll see that the two hex representations
>differ somewhere in the next 3 bits.

correct.  obviously there is a difference.  here is how i faked out the
conversion rules using a union:

Script is typescript, started Thu Aug 11 10:27:39 1988
Subscript out of range.
1 - rpp386-> cat puzzle.c
main ()
{
	float x,y;
	union {
		long ul;
		float uf;
	} ux, uy;

	ux.uf = x = 1.0/10.0;
	uy.uf = y = 1677721.0/16777216.0; 

	printf("x: %08x:%08x, ux.ul: %08x,",ux.uf,ux.ul);
	printf("%20.17f\n",x);
	printf("y: %08x:%08x, uy.ul: %08x,",uy.uf,uy.ul);
	printf("%20.17f\n",y);
}
2 - rpp386-> cc puzzle.c
puzzle.c
3 - rpp386-> a.out
x: a0000000:3fb99999, ux.ul: 3dcccccd, 0.10000000149011612
y: 00000000:3fb99999, uy.ul: 3dccccc8, 0.09999996423721313
4 - rpp386-> logout
Not a terminal: Not a character device
John's Words of Wisdumb -
A great many people think they are thinking when they are merely
rearranging their prejudices.
		-- William James
Script done Thu Aug 11 10:27:59 1988

the output shows that the floats, which were passed as longs for the
ux.ul and uy.ul outputs, do differ within their 32 bits.
-- 
John F. Haugh II                 +--------- Cute Chocolate Quote ---------
HASA, "S" Division               | "USENET should not be confused with
UUCP:   killer!rpp386!jfh        |  something that matters, like CHOCOLATE"
DOMAIN: jfh@rpp386.uucp          |         -- apologizes to Dennis O'Connor

scs@athena.mit.edu (Steve Summit) (08/14/88)

In article <225800048@uxe.cso.uiuc.edu> mcdonald@uxe.cso.uiuc.edu writes:
>I tried running the following program...
>
>#include <stdio.h>
>
>main()
>{
>union { 
>    float fl;
>    unsigned long  lo;
>} x,y;
>	x.fl = 1.0/10.0;
>	y.fl = 1677721.0/16777216.0; 
>	printf("x: %lx", x.lo);
>	printf("%20.17f\n",x.fl);
>	printf("y: %lx", y.lo);
>	printf("%20.17f\n",y.fl);
>}
>
>/* output is:
>
>x: 3dcccccd 0.10000000149011610
>y: 3dccccc8 0.09999996423721313
>
>*/
>
>Try running this program, or, if your compiler has 32-bit ints, 
>substitute "int" for "long" and "%x" for "%lx".

Look, folks, if you're going to write machine-dependent programs,
at least do so portably!  Use sizeof() so you don't have to know
how big floats and doubles are on your machine:

	main()
	{
	printfloat(1.0/10.0);
	printfloat(1677721.0/16777216.0);
	}

	int littlendian = 1;

	union u	{
		double d;
		char c[sizeof(double)];
		};

	printfloat(d)
	double d;
	{
	union u u;
	int i;

	u.d = d;
	printf("%-10.7g = ", d);
	for(i = 0; i < sizeof(d); i++)
		{
		int ii = littlendian ? sizeof(d) - 1 - i : i;
		printf("%02x", u.c[ii] & 0xff);
		}
	printf("\n");
	}

Oops, it's still a little machine-dependent: for a pleasingly-
arranged output, you have to know whether your machine stores
things least-significant or most-significant byte first, which is
an issue I won't digress into now.
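
(One way to avoid hard-coding the littlendian flag is to ask the
machine at run time; a hypothetical islittlendian(), sketched below,
could be called once at the top of main() and its result assigned to
littlendian:)

	int islittlendian()
	{
	int one = 1;

	/* the first byte of the int is 1 exactly when the machine
	   stores its least-significant byte first */
	return *(char *)&one;
	}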

You can leave out the &0xff in the next-to-last printf if you
make the second member of the union an unsigned char array, if
your compiler understands them.  (Most newer ones do.)

If you don't like unions, you can do it with pointer punning:

	printfloat(d)
	double d;
	{
	int i;
	printf("%-10.7g = ", d);
	for(i = 0; i < sizeof(d); i++)
		{
		int ii = littlendian ? sizeof(d) - 1 - i : i;
		printf("%02x", ((char *)&d)[ii] & 0xff);
		}
	printf("\n");
	}

The output (on a VAX; your actual funny hex numbers may vary) is:

	0.1        = cccdcccccccc3ecc
	0.09999996 = 00000000ccc83ecc

They're quite a bit different, but the reasons why have been
explained already.

                                            Steve Summit
                                            scs@adam.pika.mit.edu