[comp.lang.c] IEEE floating point format

manoj@hpldsla.HP.COM (Manoj Joshi) (07/26/89)

I am not sure if this is the right newsgroup to ask this
question, but here goes...

What is the format for the IEEE floating point storage
convention? In other words (for a 32-bit float) where is 
the exact position of the 4 fields (1 Byte each):

<mantissa sign> <mantissa> <exponent> <exponent sign>

Also what is the size (in bits) of each of these?

Similarly how is this stored in a 64-bit double precision
real number? 

Are there any standard routines under UNIX to convert real
numbers between the formats of different architectures?  I
realize that MSC has a few such routines for DOS.

Thanks in advance for the help in this regard.

Manoj Joshi
manoj%hpldas5.HP.COM@hp-sde
(415)857-7099

knighten@pinocchio (Bob Knighten) (07/29/89)

Single format:  Numbers in the single format are composed of three fields:
	A 1-bit sign S
	A biased exponent e = exponent + 127
	A fraction f = 0.b1 b2 ... b23

The exponent range is -126 to +127.  Exponent = 128 is used to encode
+/-infinity and NaNs.  Exponent = -127 is used to encode +/-0 and denormalized
numbers.

The layout is:

	1     8             23              width
       |S|    e   |          f           |
          m      l m                    l   order

where m stands for most significant bit
and   l stands for least significant bit
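
For the curious, here is a quick sketch of pulling these fields apart
in C.  It assumes float is IEEE single format and that unsigned int is
32 bits wide -- neither of which the C language guarantees.

#include <stdio.h>
#include <string.h>

int main()
{
    float x = -6.25;                 /* -1.1001 (binary) * 2^2 */
    unsigned int bits;

    memcpy(&bits, &x, sizeof bits);  /* reinterpret the 32 bits */

    printf("S = %u\n", (bits >> 31) & 1);
    printf("e = %u (exponent = %d)\n",
           (bits >> 23) & 0xFF, (int)((bits >> 23) & 0xFF) - 127);
    printf("f = 0x%06X\n", bits & 0x7FFFFF);
    return 0;
}

For -6.25 this prints S = 1, e = 129 (exponent = 2), f = 0x480000.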

 
Double format:  Numbers in the double format are composed of three fields:
	A 1-bit sign S
	A biased exponent e = exponent + 1023
	A fraction f = 0.b1 b2 ... b52

The exponent range is -1022 to +1023.  Exponent = 1024 is used to encode
+/-infinity and NaNs.  Exponent = -1023 is used to encode +/-0 and denormalized
numbers.

The layout is:

     1     11                              52                            width
    |S|     e     |                         f                          |
       m         l m                                                  l  order

where m stands for most significant bit
and   l stands for least significant bit

 
The value of a number represented in either of these formats is
	x = (-1)^S * 1.f * 2^exponent
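
As a C sketch, you can rebuild a double from its three fields this way.
It assumes IEEE double format; the helper name from_fields and the test
values are just for illustration.  Compile with -lm for ldexp().

#include <stdio.h>
#include <math.h>

/* Rebuild x = (-1)^S * 1.f * 2^exponent.  The fraction is passed as
   the 52-bit integer b1 b2 ... b52 (carried in a double, to avoid
   assuming a 64-bit integer type); 4503599627370496.0 is 2^52. */
double from_fields(int S, int exponent, double f)
{
    double x = ldexp(1.0 + f / 4503599627370496.0, exponent);
    return S ? -x : x;
}

int main()
{
    /* S = 1, exponent = 2, f = .100...0 (b1 = 1): -(1.5 * 4) = -6 */
    printf("%g\n", from_fields(1, 2, 2251799813685248.0));
    return 0;
}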

--
Bob Knighten                        
Encore Computer Corp.
257 Cedar Hill St.        
Marlborough, MA 01752
(508) 460-0500 ext. 2626

Internet:  knighten@encore.com
UUCP:  {bu-cs,decvax,necntc,talcott}!encore!knighten

ark@alice.UUCP (Andrew Koenig) (07/29/89)

In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:

> What is the format for the IEEE floating point storage
> convention? In other words (for a 32-bit float) where is 
> the exact position of the 4 fields (1 Byte each):

> Similarly how is this stored in a 64-bit double precision
> real number? 

The IEEE spec gives the format only modulo permutation of the bits.
That is, different machines are allowed to put the bits in different
parts of the word.

The format is:

	field		32-bit format		64-bit format

	sign			1			1
	exponent		8			12
	fraction		23			55

If the exponent is all 0-bits or all 1-bits, the number is a
special case that I'll discuss later.  Otherwise, flip the high-order
bit of the exponent and treat the result as a 2's-complement number.
Put a binary point between the first fraction bit and the rest
of them.  Put a 1 ahead of all the fraction bits.  The value
of the number is the fraction times 2^exponent.

For example, the 32-bit representation of 1 is:

	sign		0
	exponent	01111111			(means -1)
	fraction	(1)0.0000000000000000000000	(means 2)

So the number is 2*(2^-1) or 1.  This number might be
represented this way:

	0 01111111 00000000000000000000000		0x3F800000

and indeed it is on some machines.
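
If you want to check this on your own machine, a two-line test will do.
(A sketch: it assumes float and unsigned int are both 32 bits.)

#include <stdio.h>
#include <string.h>

int main()
{
    float one = 1.0;
    unsigned int bits;

    memcpy(&bits, &one, sizeof bits);
    printf("0x%08X\n", bits);        /* 0x3F800000 on such machines */
    return 0;
}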

Now the special cases:

	exponent all 1's	fraction == 0		infinity
	exponent all 1's	fraction != 0		NaN
	exponent all 0's				denormalized

Infinity can be positive or negative.  NaN means `not a number'
and is used to signal things like 0 divided by 0 or infinity - infinity.
Denormalized numbers are little tiny numbers too small to represent
otherwise -- for denormalized numbers the leading 1 isn't added
and the exponent is offset by 1 to compensate.  If all bits are 0,
the number is 0.
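
Here is a sketch of a classifier built from that table.  As before it
assumes IEEE single format and a 32-bit unsigned int.

#include <stdio.h>
#include <string.h>

const char *fpclass(float x)
{
    unsigned int bits, e, f;

    memcpy(&bits, &x, sizeof bits);
    e = (bits >> 23) & 0xFF;         /* 8 exponent bits */
    f = bits & 0x7FFFFF;             /* 23 fraction bits */

    if (e == 0xFF)
        return f ? "NaN" : "infinity";
    if (e == 0)
        return f ? "denormalized" : "zero";
    return "normalized";
}

int main()
{
    float big = 1e30f;
    float inf = big * big;           /* overflows to infinity */

    printf("%s %s %s %s\n", fpclass(1.0f), fpclass(0.0f),
           fpclass(inf), fpclass(inf - inf));
    return 0;
}

This prints "normalized zero infinity NaN".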

-- 
				--Andrew Koenig
				  ark@europa.att.com

hascall@atanasoff.cs.iastate.edu (John Hascall) (07/29/89)

In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:
 
>> What is the format for the IEEE floating point storage
 
>The format is:
 
>	field		32-bit format		64-bit format
 
>	sign			1			1
>	exponent		8			12
>	fraction		23			55
                            ------                    -----
				32 (ok!)                68 (huh?)
 
      Is it <1,8,55>, <1,12,51> or some other thing?
      (or have they found a way for more hidden bits :-)


      John

tim@crackle.amd.com (Tim Olson) (07/30/89)

In article <1270@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
| In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
| >In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:
|  
| >> What is the format for the IEEE floating point storage
|  
| >The format is:
|  
| >	field		32-bit format		64-bit format
|  
| >	sign			1			1
| >	exponent		8			12
| >	fraction		23			55
|                             ------                    -----
| 				32 (ok!)                68 (huh?)
|  
|       Is it <1,8,55>, <1,12,51> or some other thing?
|       (or have they found a way for more hidden bits :-)

Neither.  Double precision fields are 1 sign, 11 exponent, and 52
fraction bits.


	-- Tim Olson
	Advanced Micro Devices
	(tim@amd.com)

ark@alice.UUCP (Andrew Koenig) (07/30/89)

In article <1270@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
> In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
> >In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:

> >> What is the format for the IEEE floating point storage

> >The format is:

> >	field		32-bit format		64-bit format

> >	sign			1			1
> >	exponent		8			12
> >	fraction		23			55
>                             ------                    -----
> 				32 (ok!)                68 (huh?)

>       Is it <1,8,55>, <1,12,51> or some other thing?
>       (or have they found a way for more hidden bits :-)

Oh well.  I meant 51 bits in double precision, plus a hidden bit.
Similarly, the 23 bits in single precision doesn't include the
hidden bit.  Therefore there are effectively 24 significant bits
in single precision and 52 in double precision.

The largest integer N such that N and N-1 can both be exactly
represented in single precision is (2^24)-1; the largest for
double precision is (2^52)-1.

I hope I got it right this time.
-- 
				--Andrew Koenig
				  ark@europa.att.com

ark@alice.UUCP (Andrew Koenig) (07/30/89)

In article <26532@amdcad.AMD.COM>, tim@crackle.amd.com (Tim Olson) writes:

> Neither.  Double precision fields are 1 sign, 11 exponent, and 52
> fraction bits.

OK, I stopped relying on my memory and went and looked at
a paper I wrote about it a few years ago.

You're both right -- in the introductory section of that
paper I said

	An IEEE double precision floating point number
	is 64 bits: a sign bit, an 11-bit exponent, and
	a 52-bit fraction ... A single bit, not explicitly
	stored, precedes the binary point; this bit is called
	the hidden bit and its value is determined by that
	of the exponent.

Forgive my dropped neurons; I am quite confident about this
version because I checked it against the IEEE standard at the
time I wrote it.
-- 
				--Andrew Koenig
				  ark@europa.att.com

bph@buengc.BU.EDU (Blair P. Houghton) (07/31/89)

In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>	exponent all 1's	fraction == 0		infinity
>	exponent all 1's	fraction != 0		NaN
>	exponent all 0's				denormalized

Fascinating; but, what does it mean to say "denormalized" in this context?

				--Blair
				  "Webster is mute on this
				   topic, and I'm only a SM."

gwyn@smoke.BRL.MIL (Doug Gwyn) (08/02/89)

In article <3554@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
>Fascinating; but, what does it mean to say "denormalized" in this context?

Numbers sufficiently near zero can have an exponent smaller than is
representable, but if you're willing to lose some bits of precision,
you can sometimes represent them as having the smallest possible
exponent and most-significant bit of the significand (aka "mantissa")
0, instead of 1 as it usually would be.  Such a representation is
called "denormalized" (normalized numbers are either exactly 0 or
their MSB is 1).
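
A two-line experiment shows this gradual underflow in action (assuming
your machine's doubles are IEEE and <float.h> defines DBL_MIN):

#include <stdio.h>
#include <float.h>

int main()
{
    double d = DBL_MIN;     /* smallest normalized positive double */

    printf("%g\n", d);      /* 2.22507e-308 */
    printf("%g\n", d / 2);  /* 1.11254e-308: denormalized, not zero */
    return 0;
}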

ark@alice.UUCP (Andrew Koenig) (08/02/89)

In article <3554@buengc.BU.EDU>, bph@buengc.BU.EDU (Blair P. Houghton) writes:
> In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
> >	exponent all 1's	fraction == 0		infinity
> >	exponent all 1's	fraction != 0		NaN
> >	exponent all 0's				denormalized

> Fascinating; but, what does it mean to say "denormalized" in this context?

It means the number is being represented in `gradual underflow'
mode.  To be more specific: the smallest positive number that can be
represented in IEEE 64-bit form without going into denormalized mode
is 2^-1022.  That number is represented this way:

	0  00000000001  0000000000000000000000000000000000000000000000000000

If you count the way I did in my last note, this means an exponent
of -1021 and a fraction of .(1)0000000....

The 1 is in parentheses because it's the `hidden bit' -- it's not
actually stored.

The next smaller number is represented this way:

	0  00000000000  1111111111111111111111111111111111111111111111111111

This is the largest denormalized number: its value is 2^-1021 times
.(0)11111...    That is, the hidden bit becomes 0 when all the exponent
bits are 0.  Thus it is possible to represent numbers that are too
small for the normal exponent range, albeit with reduced precision.

As a result of this notation, the smallest positive number that
can be represented in IEEE 64-bit floating-point is 2^-1074.
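
You can verify these bit patterns directly.  A sketch only: it assumes
IEEE doubles, an 8-byte unsigned long (substitute whatever 64-bit type
you have), and that integer and floating-point byte order agree.

#include <stdio.h>
#include <string.h>

static double from_bits(unsigned long bits)
{
    double d;
    memcpy(&d, &bits, sizeof d);   /* reinterpret the 64 bits */
    return d;
}

int main()
{
    printf("%.17g\n", from_bits(0x0010000000000000UL)); /* 2^-1022 */
    printf("%.17g\n", from_bits(0x000FFFFFFFFFFFFFUL)); /* largest denormal */
    printf("%.17g\n", from_bits(0x0000000000000001UL)); /* 2^-1074 */
    return 0;
}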
-- 
				--Andrew Koenig
				  ark@europa.att.com

bph@buengc.BU.EDU (Blair P. Houghton) (08/04/89)

In article <9725@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>In article <3554@buengc.BU.EDU>, bph@buengc.BU.EDU (Blair P. Houghton) writes:
>
>> Fascinating; but, what does it mean to say "denormalized" in this context?
>
>the smallest positive number that can be
>represented in IEEE 64-bit form without going into denormalized mode
>is 2^-1022.  That number is represented this way:
>
>	0  00000000001  0000000000000000000000000000000000000000000000000000
>
>If you count the way I did in my last note, this means an exponent
>of -1021 and a fraction of .(1)0000000....
>
>The next smaller number is represented this way:
>
>	0  00000000000  1111111111111111111111111111111111111111111111111111
>
>This is the largest denormalized number: its value is 2^-1021 times
>.(0)11111...    That is, the hidden bit becomes 0 when all the exponent
>bits are 0.  Thus it is possible to represent numbers that are too
>small for the normal exponent range, albeit with reduced precision.

Okay, so "normalization" refers to ensuring that the precision is
53 bits for any number with a nonzero exponent-field.

Next question:  do C compilers (math libraries, I expect I should mean)
on IEEE-FP-implementing machines generally limit doubles to normalized
numbers, or do they blithely allow precision to waft away in the name
of a slight increase in the number-range?

I expect the answer is "the compiler has nothing to do with it", so the
next question would be, are there machines that don't permit the loss
of precision without specific orders to do so?

				--Blair
				  "Or Fortran compilers, but I don't
				   need those, and this ain't the
				   group for it, this being
				   comp.lang.c.pointer.addition...."

ark@alice.UUCP (Andrew Koenig) (08/05/89)

In article <3591@buengc.BU.EDU>, bph@buengc.BU.EDU (Blair P. Houghton) writes:

> Next question:  do C compilers (math libraries, I expect I should mean)
> on IEEE-FP-implementing machines generally limit doubles to normalized
> numbers, or do they blithely allow precision to waft away in the name
> of a slight increase in the number-range?

> I expect the answer is "the compiler has nothing to do with it", so the
> next question would be, are there machines that don't permit the loss
> of precision without specific orders to do so?

If you implement IEEE floating point, you must implement denormalized
numbers -- they're part of the spec.

I don't see, though, why you describe denormalized numbers as `the
loss of precision'.  Compared with the alternative, it's a gain in
precision.  After all, the only other thing you could do would be
to underflow to 0, which would lose all precision.

I don't remember whether IEEE requires you to be able to generate
a trap as a side effect of an operation whose result is denormalized.
-- 
				--Andrew Koenig
				  ark@europa.att.com

barmar@think.COM (Barry Margolin) (08/07/89)

In article <9740@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>I don't see, though, why you describe denormalized numbers as `the
>loss of precision'.  Compared with the alternative, it's a gain in
>precision.  After all, the only other thing you could do would be
>to underflow to 0, which would lose all precision.

Denormalized numbers have less precision than normalized numbers.  In
a denormalized number, the leading zero bits of the mantissa don't
contribute to the precision of the number.
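
One way to see the precision loss concretely (a sketch, assuming IEEE
doubles): scale a full 53-bit value down into the denormalized range
and back, and watch the low-order bits fall off.

#include <stdio.h>
#include <float.h>

int main()
{
    double x = DBL_MIN * (1.0 + 1024 * DBL_EPSILON);    /* 53 significant bits */
    double y = (x / 1099511627776.0) * 1099511627776.0; /* down 2^40, back up */

    printf("%d\n", x == y);  /* 0: the trip through denormals lost bits */
    return 0;
}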

You are confusing accuracy with precision.  Think back to your high
school and college science courses, where you had to write the
precision of experimental results explicitly.  When you write 1.3, it
implies that you only had two digits of precision (and you might write
1.3+/-.05); however, if you use a high-precision device you might
measure something as 1.3000, which is +/-.00005.  Precision,
therefore, is the number of significant digits you are sure of.

A denormalized number is more accurate than underflowing to zero, but
it isn't necessarily more precise than zero.

Barry Margolin
Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar

penneyj@servio.UUCP (D. Jason Penney) (08/08/89)

In article <3591@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
>Next question:  do C compilers (math libraries, I expect I should mean)
>on IEEE-FP-implementing machines generally limit doubles to normalized
>numbers, or do they blithely allow precision to waft away in the name
>of a slight increase in the number-range?
>
>I expect the answer is "the compiler has nothing to do with it", so the
>next question would be, are there machines that don't permit the loss
>of precision without specific orders to do so?

This is an interesting question.  The early drafts of IEEE P754 had a
"warning mode" -- when it was set, an operation with normal operands
that produced a subnormal result signalled an exception.  ("Subnormal"
is the preferred term for "denormalized" now, by the way.)

It was eventually removed because 1) checking for this condition
was expensive, and 2) it did not seem to be very useful.

The use of subnormal representations actually caused quite a bit of
controversy, by the way.  Most pre-IEEE floating points underflow from
the lowest normalized value directly to true zero.  IEEE 754 and 854 support
the notion of "gradual underflow", where precision is gradually lost
(through the use of subnormal values).

I won't give a full discussion of the benefit of gradual underflow, but
note that with truncating underflow, it is possible to have two floating 
point values X and Y such that X != Y and yet (X - Y) == 0.0, 
thus vitiating such precautions as,

if (X == Y)
  perror("zero divide");
else
  something = 1.0 / (X - Y);

[Example thanks to Professor Kahan...]
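
For the record, here is a runnable rendering of that example (assuming
IEEE doubles and <float.h>); with gradual underflow the guard really
does protect the division:

#include <stdio.h>
#include <float.h>

int main()
{
    double X = 1.5 * DBL_MIN;   /* tiny, but exactly representable */
    double Y = DBL_MIN;
    double something;

    if (X == Y)
        printf("zero divide\n");
    else {
        /* X - Y is 2^-1023, a subnormal: nonzero under gradual
           underflow.  A truncating machine would flush it to 0.0
           and the division would blow up. */
        something = 1.0 / (X - Y);
        printf("1/(X - Y) = %g\n", something);
    }
    return 0;
}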

-- 
D. Jason Penney                  Ph: (503) 629-8383
Beaverton, OR 97006              uucp: ...uunet!servio!penneyj
STANDARD DISCLAIMER:  Should I or my opinions be caught or killed, the
company will disavow any knowledge of my actions...

bph@buengc.BU.EDU (Blair P. Houghton) (08/12/89)

In article <152@servio.UUCP> penneyj@servio.UUCP (D. Jason Penney) writes:
>In article <3591@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:

This is out of the realm of C, now.  You may find it continued on comp.misc.

				--Blair
				  "'Scuse me while I buckle on
				   my asbestos straitjacket..."

karl@haddock.ima.isc.com (Karl Heuer) (08/12/89)

In article <152@servio.UUCP> penneyj@servio.UUCP (D. Jason Penney) writes:
>note that with truncating underflow, it is possible to have two floating 
>point values X and Y such that X != Y and yet (X - Y) == 0.0, 
>thus vitiating such precautions as,
>	if (X == Y) error("zero divide"); else something = 1.0 / (X - Y);

I used to have similar problems with the SIMULA compiler on TOPS-10, which
apparently used fuzzy compare even against zero: (example in C notation)
	if (X < 0.0) error("neg sqrt"); else something = sqrt(X);
would die because -1.0E-20 was considered "equal" to zero rather than
negative, yet the sqrt() routine wasn't fooled.  What a pain to work around!

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
(What's this doing in comp.lang.c?  Followups to comp.lang.misc.)

meissner@dg-rtp.dg.com (Michael Meissner) (08/18/89)

In article <3591@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
| Next question:  do C compilers (math libraries, I expect I should mean)
| on IEEE-FP-implementing machines generally limit doubles to normalized
| numbers, or do they blithely allow precision to waft away in the name
| of a slight increase in the number-range?
| 
| I expect the answer is "the compiler has nothing to do with it", so the
| next question would be, are there machines that don't permit the loss
| of precision without specific orders to do so?

The current versions of the Motorola 88000 trap to the kernel to
handle denormalized numbers.  Some early versions of the kernel just
stuff a zero where the denormalized number was.
--
Michael Meissner, Data General.
Uucp:		...!mcnc!rti!xyzzy!meissner		If compiles were much
Internet:	meissner@dg-rtp.DG.COM			faster, when would we
Old Internet:	meissner%dg-rtp.DG.COM@relay.cs.net	have time for netnews?