manoj@hpldsla.HP.COM (Manoj Joshi) (07/26/89)
I am not sure if this is the right note to ask this question, but here
it goes...  What is the format for the IEEE floating point storage
convention?  In other words (for a 32-bit float) where is the exact
position of the 4 fields (1 Byte each):

	<mantissa sign> <mantissa> <exponent> <exponent sign>

Also what is the size (in bits) of each of these?  Similarly how is
this stored in a 64-bit double precision real number?

Are there any standard routines to convert real numbers in different
architectures for UNIX?  I realize that MSC has a few routines for
DOS.  Thanks in advance for the help in this regard.

Manoj Joshi    manoj%hpldas5.HP.COM@hp-sde    (415) 857-7099
knighten@pinocchio (Bob Knighten) (07/29/89)
Single format:

Numbers in the single format are composed of three fields:

	A 1-bit sign S
	A biased exponent e = exponent + 127
	A fraction f = 0.b1 b2 ... b23

The exponent range is -126 to +127.  Exponent = 128 is used to encode
+/-infinity and NaNs.  Exponent = -127 is used to encode +/-0 and
denormalized numbers.  The layout is:

	 1     8          23       width
	|S|    e     |     f     |
	   m      l   m        l   order

where m stands for most significant bit and l stands for least
significant bit.

Double format:

Numbers in the double format are composed of three fields:

	A 1-bit sign S
	A biased exponent e = exponent + 1023
	A fraction f = 0.b1 b2 ... b52

The exponent range is -1022 to +1023.  Exponent = 1024 is used to
encode +/-infinity and NaNs.  Exponent = -1023 is used to encode +/-0
and denormalized numbers.  The layout is:

	 1     11         52       width
	|S|    e     |     f     |
	   m      l   m        l   order

where m stands for most significant bit and l stands for least
significant bit.

The value of a number represented in either of these formats is

	x = (-1)^S * 1.f * 2^exponent
--
Bob Knighten
Encore Computer Corp., 257 Cedar Hill St., Marlborough, MA 01752
(508) 460-0500 ext. 2626
Internet: knighten@encore.com
UUCP: {bu-cs,decvax,necntc,talcott}!encore!knighten
ark@alice.UUCP (Andrew Koenig) (07/29/89)
In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:

> What is the format for the IEEE floating point storage
> convention?  In other words (for a 32-bit float) where is
> the exact position of the 4 fields (1 Byte each):

> Similarly how is this stored in a 64-bit double precision
> real number?

The IEEE spec gives the format only modulo permutation of the bits.
That is, different machines are allowed to put the bits in different
parts of the word.  The format is:

	field		32-bit format	64-bit format

	sign		      1		      1
	exponent	      8		     12
	fraction	     23		     55

If the exponent is all 0-bits or all 1-bits, the number is a special
case that I'll discuss later.  Otherwise, flip the high-order bit of
the exponent and treat it as a 2's-complement number.  Put a binary
point between the first fraction bit and the rest of them.  Put a 1
ahead of all the fraction bits.  The value of the number is the
fraction times 2^exponent.

For example, the 32-bit representation of 1 is:

	sign		0
	exponent	01111111			(means -1)
	fraction	(1)0.0000000000000000000000	(means 2)

So the number is 2*(2^-1) or 1.  This number might be represented
this way:

	0 01111111 00000000000000000000000		0x3F800000

and indeed it is on some machines.

Now the special cases:

	exponent all 1's	fraction == 0	infinity
	exponent all 1's	fraction != 0	NaN
	exponent all 0's			denormalized

Infinity can be positive or negative.  NaN means `not a number' and
is used to signal things like 0 divided by 0 or infinity - infinity.
Denormalized numbers are little tiny numbers too small to represent
otherwise -- for denormalized numbers the leading 1 isn't added and
the exponent is offset by 1 to compensate.  If all bits are 0, the
number is 0.
--
				--Andrew Koenig
				  ark@europa.att.com
hascall@atanasoff.cs.iastate.edu (John Hascall) (07/29/89)
In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:
>> What is the format for the IEEE floating point storage

>The format is:
>	field		32-bit format	64-bit format
>	sign		      1		      1
>	exponent	      8		     12
>	fraction	     23		     55
			 ------		 -----
			     32	(ok!)	    68 (huh?)

Is it <1,8,55>, <1,12,51> or some other thing?
(or have they found a way for more hidden bits :-)

John
tim@crackle.amd.com (Tim Olson) (07/30/89)
In article <1270@atanasoff.cs.iastate.edu> hascall@atanasoff.cs.iastate.edu.UUCP (John Hascall) writes:
| In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
| >In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:
|
| >> What is the format for the IEEE floating point storage
|
| >The format is:
|
| >	field		32-bit format	64-bit format
|
| >	sign		      1		      1
| >	exponent	      8		     12
| >	fraction	     23		     55
| 			 ------		 -----
| 			     32	(ok!)	    68 (huh?)
|
| Is it <1,8,55>, <1,12,51> or some other thing?
| (or have they found a way for more hidden bits :-)

Neither.  Double precision fields are 1 sign, 11 exponent, and 52
fraction bits.
--
	Tim Olson
	Advanced Micro Devices
	(tim@amd.com)
ark@alice.UUCP (Andrew Koenig) (07/30/89)
In article <1270@atanasoff.cs.iastate.edu>, hascall@atanasoff.cs.iastate.edu (John Hascall) writes:
> In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
> >In article <2170002@hpldsla.HP.COM>, manoj@hpldsla.HP.COM (Manoj Joshi) writes:
> >> What is the format for the IEEE floating point storage

> >The format is:
> >	field		32-bit format	64-bit format
> >	sign		      1		      1
> >	exponent	      8		     12
> >	fraction	     23		     55
> 			 ------		 -----
> 			     32	(ok!)	    68 (huh?)

> Is it <1,8,55>, <1,12,51> or some other thing?
> (or have they found a way for more hidden bits :-)

Oh well.  I meant 51 bits in double precision, plus a hidden bit.
Similarly, the 23 bits in single precision doesn't include the hidden
bit.  Therefore there are effectively 24 significant bits in single
precision and 52 in double precision.

The largest integer N such that N and N-1 can both be exactly
represented in single precision is (2^24)-1; the largest for double
precision is (2^52)-1.

I hope I got it right this time.
--
				--Andrew Koenig
				  ark@europa.att.com
ark@alice.UUCP (Andrew Koenig) (07/30/89)
In article <26532@amdcad.AMD.COM>, tim@crackle.amd.com (Tim Olson) writes:

> Neither.  Double precision fields are 1 sign, 11 exponent, and 52
> fraction bits.

OK, I've stopped relying on my memory and went and looked at a paper
I wrote about it a few years ago.  You're both right -- in the
introductory section of that paper I said

	An IEEE double precision floating point number is 64 bits: a
	sign bit, an 11-bit exponent, and a 52-bit fraction ...  A
	single bit, not explicitly stored, precedes the binary point;
	this bit is called the hidden bit and its value is determined
	by that of the exponent.

Forgive my dropped neurons; I am quite confident about this version
because I checked it against the IEEE standard at the time I wrote it.
--
				--Andrew Koenig
				  ark@europa.att.com
bph@buengc.BU.EDU (Blair P. Houghton) (07/31/89)
In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>	exponent all 1's	fraction == 0	infinity
>	exponent all 1's	fraction != 0	NaN
>	exponent all 0's			denormalized

Fascinating; but, what does it mean to say "denormalized" in this
context?

				--Blair
				  "Webster is mute on this topic,
				   and I'm only a SM."
gwyn@smoke.BRL.MIL (Doug Gwyn) (08/02/89)
In article <3554@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
>Fascinating; but, what does it mean to say "denormalized" in this context?

Numbers sufficiently near zero can have an exponent smaller than is
representable, but if you're willing to lose some bits of precision,
you can sometimes represent them as having the smallest possible
exponent and most-significant bit of the significand (aka "mantissa")
0, instead of 1 as it usually would be.  Such a representation is
called "denormalized" (normalized numbers are either exactly 0 or
their MSB is 1).
ark@alice.UUCP (Andrew Koenig) (08/02/89)
In article <3554@buengc.BU.EDU>, bph@buengc.BU.EDU (Blair P. Houghton) writes:

> In article <9697@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
> >	exponent all 1's	fraction == 0	infinity
> >	exponent all 1's	fraction != 0	NaN
> >	exponent all 0's			denormalized

> Fascinating; but, what does it mean to say "denormalized" in this context?

It means the number is being represented in `gradual underflow' mode.

To be more specific: the smallest positive number that can be
represented in IEEE 64-bit form without going into denormalized mode
is 2^-1022.  That number is represented this way:

	0 00000000001 0000000000000000000000000000000000000000000000000000

If you count the way I did in my last note, this means an exponent of
-1021 and a fraction of .(1)0000000....

The next smaller number is represented this way:

	0 00000000000 1111111111111111111111111111111111111111111111111111

This is the largest denormalized number: its value is 2^-1021 times
.(0)11111...  That is, the hidden bit becomes 0 when all the exponent
bits are 0.  Thus it is possible to represent numbers that are too
small for the normal exponent range, albeit with reduced precision.

As a result of this notation, the smallest positive number that can
be represented in IEEE 64-bit floating-point is 2^-1074.
--
				--Andrew Koenig
				  ark@europa.att.com
bph@buengc.BU.EDU (Blair P. Houghton) (08/04/89)
In article <9725@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>In article <3554@buengc.BU.EDU>, bph@buengc.BU.EDU (Blair P. Houghton) writes:
>
>> Fascinating; but, what does it mean to say "denormalized" in this context?
>
>the smallest positive number that can be
>represented in IEEE 64-bit form without going into denormalized mode
>is 2^-1022.  That number is represented this way:
>
>	0 00000000001 0000000000000000000000000000000000000000000000000000
>
>If you count the way I did in my last note, this means an exponent
>of -1021 and a fraction of .(1)0000000....
>
>The next smaller number is represented this way:
>
>	0 00000000000 1111111111111111111111111111111111111111111111111111
>
>This is the largest denormalized number: its value is 2^-1021 times
>.(0)11111...  That is, the hidden bit becomes 0 when all the exponent
>bits are 0.  Thus it is possible to represent numbers that are too
>small for the normal exponent range, albeit with reduced precision.

Okay, so "normalization" refers to ensuring that the precision is 53
bits for any number with a nonzero exponent-field.

Next question: do C compilers (math libraries, I expect I should mean)
on IEEE-FP-implementing machines generally limit doubles to normalized
numbers, or do they blithely allow precision to waft away in the name
of a slight increase in the number-range?

I expect the answer is "the compiler has nothing to do with it", so the
next question would be, are there machines that don't permit the loss
of precision without specific orders to do so?

				--Blair
				  "Or Fortran compilers, but I don't need
				   those, and this ain't the group for it,
				   this being comp.lang.c.pointer.addition...."
ark@alice.UUCP (Andrew Koenig) (08/05/89)
In article <3591@buengc.BU.EDU>, bph@buengc.BU.EDU (Blair P. Houghton) writes:

> Next question: do C compilers (math libraries, I expect I should mean)
> on IEEE-FP-implementing machines generally limit doubles to normalized
> numbers, or do they blithely allow precision to waft away in the name
> of a slight increase in the number-range?

> I expect the answer is "the compiler has nothing to do with it", so the
> next question would be, are there machines that don't permit the loss
> of precision without specific orders to do so?

If you implement IEEE floating point, you must implement denormalized
numbers -- they're part of the spec.

I don't see, though, why you describe denormalized numbers as `the
loss of precision'.  Compared with the alternative, it's a gain in
precision.  After all, the only other thing you could do would be to
underflow to 0, which would lose all precision.

I don't remember whether IEEE requires you to be able to generate a
trap as a side effect of an operation whose result is denormalized.
--
				--Andrew Koenig
				  ark@europa.att.com
barmar@think.COM (Barry Margolin) (08/07/89)
In article <9740@alice.UUCP> ark@alice.UUCP (Andrew Koenig) writes:
>I don't see, though, why you describe denormalized numbers as `the
>loss of precision'.  Compared with the alternative, it's a gain in
>precision.  After all, the only other thing you could do would be
>to underflow to 0, which would lose all precision.

Denormalized numbers have less precision than normalized numbers.  In
a denormalized number, the leading zero bits of the mantissa don't
contribute to the precision of the number.

You are confusing accuracy with precision.  Think back to your high
school and college science courses, where you had to write the
precision of experimental results explicitly.  When you write 1.3, it
implies that you only had two digits of precision (and you might
write 1.3+/-.05); however, if you use a high-precision device you
might measure something as 1.3000, which is +/-.00005.  Precision,
therefore, is the number of significant digits you are sure of.

A denormalized number is more accurate than underflowing to zero, but
it isn't necessarily more precise than zero.

Barry Margolin
Thinking Machines Corp.

barmar@think.com
{uunet,harvard}!think!barmar
penneyj@servio.UUCP (D. Jason Penney) (08/08/89)
In article <3591@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
>Next question: do C compilers (math libraries, I expect I should mean)
>on IEEE-FP-implementing machines generally limit doubles to normalized
>numbers, or do they blithely allow precision to waft away in the name
>of a slight increase in the number-range?
>
>I expect the answer is "the compiler has nothing to do with it", so the
>next question would be, are there machines that don't permit the loss
>of precision without specific orders to do so?

This is an interesting question.  The early drafts of IEEE P754 had a
"warning mode": when "warning mode" was set, an operation with normal
operands that produced a subnormal result signalled an exception
("subnormal" is the preferred term instead of "denormalized" now, by
the way).  It was eventually removed because 1) checking for this
condition was expensive, and 2) it did not seem to be very useful.

The use of subnormal representations actually caused quite a bit of
controversy, by the way.  Most pre-IEEE floating points underflow
from the lowest normalized value directly to true zero.  IEEE 754 and
854 support the notion of "gradual underflow", where precision is
gradually lost (through the use of subnormal values).

I won't give a full discussion of the benefit of gradual underflow,
but note that with truncating underflow, it is possible to have two
floating point values X and Y such that X != Y and yet (X - Y) == 0.0,
thus vitiating such precautions as

	if (X == Y)
		error("zero divide");
	else
		something = 1.0 / (X - Y);

[Example thanks to Professor Kahan...]
--
D. Jason Penney			Ph: (503) 629-8383
Beaverton, OR 97006		uucp: ...uunet!servio!penneyj
STANDARD DISCLAIMER: Should I or my opinions be caught or killed, the
company will disavow any knowledge of my actions...
bph@buengc.BU.EDU (Blair P. Houghton) (08/12/89)
In article <152@servio.UUCP> penneyj@servio.UUCP (D. Jason Penney) writes:
>In article <3591@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:

This is out of the realm of C, now.  You may find it continued on
comp.misc.

				--Blair
				  "'Scuse me while I buckle on my
				   asbestos straitjacket..."
karl@haddock.ima.isc.com (Karl Heuer) (08/12/89)
In article <152@servio.UUCP> penneyj@servio.UUCP (D. Jason Penney) writes:
>note that with truncating underflow, it is possible to have two floating
>point values X and Y such that X != Y and yet (X - Y) == 0.0,
>thus vitiating such precautions as,
>	if (X == Y) error("zero divide"); else something = 1.0 / (X - Y);

I used to have similar problems with the SIMULA compiler on TOPS-10,
which apparently used fuzzy compare even against zero (example in C
notation):

	if (X < 0.0)
		error("neg sqrt");
	else
		something = sqrt(X);

would die because -1.0E-20 was considered "equal" to zero rather than
negative, yet the sqrt() routine wasn't fooled.  What a pain to work
around!

Karl W. Z. Heuer (ima!haddock!karl or karl@haddock.isc.com), The Walking Lint
(What's this doing in comp.lang.c?  Followups to comp.lang.misc.)
meissner@dg-rtp.dg.com (Michael Meissner) (08/18/89)
In article <3591@buengc.BU.EDU> bph@buengc.bu.edu (Blair P. Houghton) writes:
| Next question: do C compilers (math libraries, I expect I should mean)
| on IEEE-FP-implementing machines generally limit doubles to normalized
| numbers, or do they blithely allow precision to waft away in the name
| of a slight increase in the number-range?
|
| I expect the answer is "the compiler has nothing to do with it", so the
| next question would be, are there machines that don't permit the loss
| of precision without specific orders to do so?

The current versions of the Motorola 88000 trap to the kernel to
handle denormalized numbers.  Some early versions of the kernel just
stuff a zero where the denormalized number was.
--
Michael Meissner, Data General.
Uucp:		...!mcnc!rti!xyzzy!meissner
Internet:	meissner@dg-rtp.DG.COM
Old Internet:	meissner%dg-rtp.DG.COM@relay.cs.net

"If compiles were much faster, when would we have time for netnews?"