Floating-Point in .NET Part I: Concepts and Format

Standards for Floating-Point Arithmetic

We will now look at the details of the single and double precision formats. Although you can get by without knowing the internals, this information can help you understand some of the nuances of working with floating-point numbers. To keep the discussion clear, we will at first consider only the single precision format. The double precision format is similar, and will be summarized later.

Normalized numbers

A typical binary floating-point number has the form $s \times (m / 2^{N-1}) \times 2^e$, where s is either -1 or +1, m and e are the mantissa (or significand) and exponent mentioned earlier, and N is the number of bits in the significand, which is a constant for a specific number format. For single-precision numbers, N = 24. The numbers s, m and e are packed into 32 bits. The layout is shown in the table below:

| Part | Sign | Exponent | Fraction |
|---|---|---|---|
| Bit # | 31 | 23-30 | 0-22 |

The sign s is stored in the most significant bit. A value of 0 indicates a positive value, while 1 indicates a negative value.

The exponent field is an 8-bit unsigned integer called the biased exponent. It is equal to the exponent e plus a constant called the bias, which has a value of 127 for single-precision numbers. This means that, for example, an exponent of -44 is stored as -44 + 127 = 83, or 01010011 in binary. There are two reserved exponent values: 0 and 255. The reason for this will be explained shortly. As a result, the smallest actual exponent is -126, and the largest is +127.

The number format appears to be ambiguous: You can multiply m by two and subtract 1 from e and get the same number. This ambiguity is resolved by minimizing the exponent and maximizing the size of the significand. This process is called normalization. As a result, the significand m always has 24 bits, with the leading bit always equal to 1. Since we know it is always equal to 1, we don't have to store this bit, and so we end up with the significand taking up only 23 bits instead of 24.

Put another way, normalization means that the number $m / 2^{N-1}$ is always at least 1 and less than 2. The 23 stored bits are also what comes after the binary point when the significand is divided by $2^{N-1}$. For this reason, these bits are sometimes called the fraction.
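The field layout and the formula above can be checked with a few lines of code. The following is a minimal sketch in Python, chosen here only for brevity. The helper names decode_single and normalized_value are hypothetical, and the reconstruction handles normalized numbers only.

```python
import struct

def decode_single(x):
    """Round x to single precision and split it into its three bit fields."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                      # bit 31
    biased_exponent = (bits >> 23) & 0xFF  # bits 23-30
    fraction = bits & 0x7FFFFF             # bits 0-22
    return sign, biased_exponent, fraction

def normalized_value(sign, biased_exponent, fraction):
    """Rebuild s * (m / 2^(N-1)) * 2^e for a normalized number, with N = 24."""
    s = -1 if sign else 1
    m = (1 << 23) | fraction   # restore the implicit leading 1 bit
    e = biased_exponent - 127  # remove the bias
    return s * (m / 2**23) * 2.0**e

sign, biased, fraction = decode_single(-6.5)
print(sign, biased, fraction)                    # 1 129 5242880
print(normalized_value(sign, biased, fraction))  # -6.5
```

Here -6.5 equals -1.625 × 2^2, so the biased exponent is 2 + 127 = 129 and the stored fraction is 0.625 × 2^23 = 5242880.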

Zero and subnormal numbers

At this point, you may wonder how the number zero is stored. After all, neither m nor s can be zero, and so their product cannot be zero either. The answer is that 0 is a special number with a special representation. In fact, it has two representations!

The numbers we have been describing so far, whose significand has maximum length, are called normalized numbers. They represent the vast majority of numbers represented by the floating-point format. The smallest positive normalized value is $2^{23} \times 2^{-126+1-24} = 2^{-126} \approx$ 1.17549e-38. The largest value is $(2^{24}-1) \times 2^{127+1-24} \approx$ 3.40282e+38.

Recall that the biased exponent has two reserved values. The biased exponent 0 is used to represent the number zero as well as subnormal or denormalized numbers. These are numbers whose significand is not normalized and has a maximum length of 23 bits. The actual exponent used is the same as for the smallest normalized numbers, -126, so the scale factor is $2^{-126+1-24} = 2^{-149}$. This results in a smallest positive number of $2^{-149} \approx$ 1.401298e-45.

When both the biased exponent and the significand are zero, the resulting value is equal to 0. Changing the sign of zero does not change its value, so we have two possible representations of zero: one with a positive sign, and one with a negative sign. As it turns out, it is meaningful to have a 'negative zero' value. Although its value equals the value of normal 'positive zero,' it behaves differently in some situations, which we will get into shortly.
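The limits quoted in this section can be reproduced by building the bit patterns directly. This is a small Python sketch; from_bits is a hypothetical helper that reinterprets a 32-bit integer as a single-precision number, and the comparisons run in Python's double precision, which represents all of these values exactly.

```python
import math
import struct

def from_bits(bits):
    """Interpret a 32-bit unsigned integer as a single-precision number."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

smallest_normal = from_bits(0x00800000)     # biased exponent 1, fraction 0
largest_normal = from_bits(0x7F7FFFFF)      # biased exponent 254, fraction all 1's
smallest_subnormal = from_bits(0x00000001)  # biased exponent 0, fraction 1
negative_zero = from_bits(0x80000000)       # only the sign bit is set

print(smallest_normal == 2.0**-126)              # True
print(largest_normal == (2**24 - 1) * 2.0**104)  # True
print(smallest_subnormal == 2.0**-149)           # True
print(negative_zero == 0.0)                      # True
print(math.copysign(1.0, negative_zero))         # -1.0
```

The last two lines show that negative zero compares equal to positive zero even though its sign bit is still observable.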

Infinities and Not-a-Number

We still need to explain the use of the other reserved biased exponent value of 255. This exponent is used to represent infinities and Not-a-Number values.

If the biased exponent is all 1's (i.e. equal to 255) and the significand is all 0's, then the number represents infinity. The sign bit indicates whether we're dealing with positive or negative infinity. These numbers are returned for operations that either do not have a finite value (e.g. 1/0) or are too large to be represented by a normalized number (e.g. $2^{1,000,000,000}$).

The sign of the result of a division by zero depends on the signs of both the numerator and the denominator. If you divide +1 by negative zero, the result is negative infinity. Similarly, if you divide -1 by positive infinity, the result is negative zero.
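The sketch below, again in Python with a hypothetical helper named single_fields, prints the bit fields of the two infinities and checks the sign rule for -1 divided by infinity. Note one language difference: Python raises ZeroDivisionError for a floating-point division by zero instead of returning an infinity, so that case is not shown. The arithmetic here runs in double precision, which follows the same rules.

```python
import math
import struct

def single_fields(x):
    """Return the sign, exponent and fraction bits of x as single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return f"{s[0]} {s[1:9]} {s[9:]}"

print(single_fields(math.inf))   # 0 11111111 00000000000000000000000
print(single_fields(-math.inf))  # 1 11111111 00000000000000000000000

quotient = -1.0 / math.inf
print(quotient, math.copysign(1.0, quotient))  # -0.0 -1.0

# A product too large for the format overflows to infinity.
print(1e308 * 10.0)  # inf
```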

If the biased exponent is 255 and the significand is different from 0, the value represents a Not-a-Number value or NaN. NaN's come in two flavors: signaling and non-signaling or quiet. They are distinguished by the leading bit of the stored significand. On most current hardware, including the x86 family, this bit is 1 for a quiet NaN and 0 for a signaling NaN. This distinction is not very important in practice, although the 2008 revision of the standard kept both kinds.

NaN's are produced when the result of a calculation does not exist (e.g. Math.Sqrt(-1) is not a real number) or cannot be determined (infinity / infinity). One of the peculiarities of NaN's is that all arithmetic operations involving NaN's return a NaN, except when the result would be the same regardless of the value. For example, the function hypot(x, y) = Math.Sqrt(x*x+y*y) with x infinite always equals positive infinity, regardless of the value of y. As a result, hypot(infinity, NaN) = infinity.

Also, any comparison of a NaN with any other number including NaN returns false. The one exception is the inequality operator, which always returns true even if the value being compared is also NaN!
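These rules are easy to observe directly. The following Python sketch uses math.nan and math.inf from the standard library. Python's math.sqrt(-1) raises a ValueError instead of returning a NaN, so the NaN is produced here by dividing infinity by infinity.

```python
import math

nan = math.inf / math.inf        # an indeterminate result gives NaN
print(math.isnan(nan))           # True

# Arithmetic involving a NaN returns a NaN.
print(math.isnan(nan + 5.3))     # True

# Every ordered or equality comparison with a NaN is false...
print(nan == nan)                # False
print(nan < 1.0, nan > 1.0)      # False False

# ...except inequality, which is always true.
print(nan != nan)                # True

# hypot is infinite whenever one argument is infinite, even if the other is NaN.
print(math.hypot(math.inf, nan)) # inf
```

A practical consequence is that x != x is true only when x is a NaN, which is why a dedicated test such as math.isnan is the clearer way to check for one.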

The significand bits of a NaN can be set to an arbitrary value, sometimes called the payload. The IEC 60559 standard specifies that the payload should propagate through calculations. For example, when a NaN is added to a normal number, say 5.3, then the result is a NaN with the same payload as the first operand. When both operands are NaN's, then the resulting NaN carries the payload of either one of the operands. This makes it possible to pass on potentially useful information in NaN values. Unfortunately, this feature is hardly ever used.

Some examples

Let's look at some numbers and their corresponding bit patterns.

| Number | Sign | Exponent | Fraction |
|---|---|---|---|
| 0 | 0 | 00000000 | 00000000000000000000000 |
| -0 | 1 | 00000000 | 00000000000000000000000 |
| 1 | 0 | 01111111 | 00000000000000000000000 |
| +Infinity | 0 | 11111111 | 00000000000000000000000 |
| NaN | 1 | 11111111 | 10000000000000000000000 |
| 3.141593 | 0 | 10000000 | 10010010000111111011100 |
| -3.141593 | 1 | 10000000 | 10010010000111111011100 |
| 100000 | 0 | 10001111 | 10000110101000000000000 |
| 0.00001 | 0 | 01101110 | 01001111100010110101100 |
| 1/3 | 0 | 01111101 | 01010101010101010101011 |
| 4/3 | 0 | 01111111 | 01010101010101010101011 |
| $2^{-144}$ | 0 | 00000000 | 00000000000000000100000 |

Notice the exponent field for 1 and 4/3. Both these numbers are between 1 and 2, and so their unbiased exponent is zero. The biased exponent is therefore equal to the bias, which is 127, or 01111111 in binary. Numbers equal to or larger than 2 have biased exponents greater than 127. Numbers smaller than 1 have biased exponents smaller than 127.

The last number in the table ($2^{-144}$) is denormalized. The biased exponent is zero, and since $2^{-144} = 32 \times 2^{-149}$, the fraction is $32 = 2^5$.
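Bit patterns like the ones in the table can be regenerated with a short script. This Python sketch uses a hypothetical helper, fields, that rounds a value to single precision and returns its sign, exponent and fraction bits as strings.

```python
import struct

def fields(x):
    """Return (sign, exponent, fraction) bit strings of x as single precision."""
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    s = f"{bits:032b}"
    return s[0], s[1:9], s[9:]

samples = [
    ("1", 1.0),
    ("100000", 100000.0),
    ("1/3", 1 / 3),
    ("4/3", 4 / 3),
    ("2^-144", 2.0**-144),
]
for label, value in samples:
    print(label, *fields(value))
# 1 0 01111111 00000000000000000000000
# 100000 0 10001111 10000110101000000000000
# 1/3 0 01111101 01010101010101010101011
# 4/3 0 01111111 01010101010101010101011
# 2^-144 0 00000000 00000000000000000100000
```

Converting the fraction of the last row with int(fields(2.0**-144)[2], 2) gives 32, matching the explanation above.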

About the author

Jeffrey Sax Canada

Jeffrey Sax has been writing numerical software for many years. He is founder and president of Extreme Optimization (http://www.extremeoptimization.com), a Toronto based provider of numerical co...
