Floating-Point in .NET Part I: Concepts and Format

Double and extended precision formats

Double-precision floating-point numbers are stored in a way that is completely analogous to the single-precision format; only some of the constants differ. The sign still takes up 1 bit, no surprise there. The biased exponent takes up 11 bits, with a bias value of 1023. The significand takes up 52 bits, with a 53rd bit implicitly set to 1 for normalized numbers.
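To make the layout concrete, here is a minimal sketch in Python, whose float type is an IEEE double on the usual platforms. The helper name decode_double is made up for this illustration; it only splits the 64 bits into the three fields described above.

```python
import struct

def decode_double(x):
    """Split a double into (sign, biased exponent, 52-bit fraction)."""
    # Reinterpret the 8 bytes of the double as one unsigned 64-bit integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    biased_exponent = (bits >> 52) & 0x7FF  # 11 bits, bias 1023
    fraction = bits & ((1 << 52) - 1)     # 52 stored significand bits
    return sign, biased_exponent, fraction

# -6.5 is -1.625 * 2**2, so the biased exponent is 2 + 1023.
sign, biased_exponent, fraction = decode_double(-6.5)

# Rebuild the value, putting back the implicit leading 1 (normalized numbers only).
rebuilt = (-1) ** sign * (1 + fraction / 2 ** 52) * 2.0 ** (biased_exponent - 1023)
```

The rebuild step is only valid for normalized numbers; denormalized numbers, zeros, infinities and NaNs use the reserved exponent values and need separate handling.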

The IEC 60559 standard doesn't specify exact values for the parameters of the extended floating-point format. It only specifies minimum values. The extended format used by Intel processors since the 8087 is 80 bits long, with 1 bit for the sign, 15 bits for the exponent and 64 bits for the significand. Unlike the other formats, the extended format stores the leading bit of the significand explicitly, enabling certain processor optimizations and saving some precious real estate on the chips.

The following table summarizes the features of the single-, double- and extended-precision formats.

Format                              Single         Double                   Extended
Length (bits)                       32             64                       80
Exponent bits                       8              11                       15
Exponent bias                       127            1023                     16383
Smallest exponent                   -126           -1022                    -16382
Largest exponent                    +127           +1023                    +16383
Precision (bits)                    24             53                       64
Smallest positive value             1.4012985e-45  4.9406564584124654e-324  3.64519953188247460253e-4951
Smallest positive normalized value  1.1754944e-38  2.2250738585072014e-308  3.36210314311209350626e-4932
Largest positive value              3.4028235e+38  1.7976931348623157e+308  1.18973149535723176502e+4932
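The rows of the table all follow from two parameters, the number of exponent bits and the precision. The sketch below derives them in Python for the single and double formats. The function name format_limits is an assumption of this example, and the extended format is left out because its limits do not fit in a Python float.

```python
import math
import sys

def format_limits(exponent_bits, precision):
    """Derive the table rows for a binary format from its two parameters."""
    bias = 2 ** (exponent_bits - 1) - 1
    min_exp = 1 - bias          # smallest exponent of a normalized number
    max_exp = bias              # largest exponent of a finite number
    # Smallest normalized value: significand 1.0 with the smallest exponent.
    smallest_normal = math.ldexp(1.0, min_exp)
    # Smallest denormalized value: only the last significand bit is set.
    smallest_positive = math.ldexp(1.0, min_exp - (precision - 1))
    # Largest value: significand of all ones with the largest exponent.
    largest = math.ldexp(2.0 - math.ldexp(1.0, 1 - precision), max_exp)
    return bias, min_exp, max_exp, smallest_positive, smallest_normal, largest

single = format_limits(8, 24)
double = format_limits(11, 53)
```

For the double format the results can be compared with sys.float_info.min and sys.float_info.max, which report the smallest normalized and the largest finite double.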

What about the decimal format?

The Decimal type in the .NET Framework is a non-standard floating-point type with base 10. It takes up 128 bits: 96 bits hold the mantissa, which is an unsigned integer, 1 bit holds the sign, and 5 bits hold the exponent, which can range from 0 to 28. The exponent acts as a scale factor: the value is the mantissa divided by 10 raised to the exponent. The remaining bits are unused. The format does not follow any existing or planned standard. There are no infinities or NaNs.

Any decimal number that can be written with no more than 28 digits in total, counting the digits on both sides of the decimal point, can be represented exactly. This is great for financial calculations, but comes at a significant cost. Calculating with decimals is an order of magnitude slower than calculating with the intrinsic floating-point types. A decimal also takes up twice as much memory as a double, and four times as much as a single.
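The following Python sketch models the representation just described: a sign, an integer mantissa of at most 96 bits, and a scale between 0 and 28. It is an illustration of the idea only, not the .NET implementation, and the name to_decimal_parts is hypothetical.

```python
from fractions import Fraction

MAX_MANTISSA = 2 ** 96 - 1   # largest 96-bit unsigned integer
MAX_SCALE = 28               # largest power of ten the value may be divided by

def to_decimal_parts(text):
    """Model a decimal literal as (sign, mantissa, scale).

    The value represented is (-1)**sign * mantissa / 10**scale.
    """
    sign = 1 if text.startswith("-") else 0
    whole, _, frac = text.lstrip("+-").partition(".")
    scale = len(frac)                 # digits after the decimal point
    mantissa = int(whole + frac)      # all digits, read as one integer
    if scale > MAX_SCALE or mantissa > MAX_MANTISSA:
        raise OverflowError("does not fit in a 96-bit mantissa with scale 0..28")
    return sign, mantissa, scale

# 0.1 is stored exactly as 1 / 10**1 ...
sign, mantissa, scale = to_decimal_parts("0.1")
exact_as_decimal = Fraction(mantissa, 10 ** scale) == Fraction(1, 10)
# ... whereas the binary double closest to 0.1 is not exactly one tenth.
exact_as_double = Fraction(0.1) == Fraction(1, 10)
```

Because the mantissa is an integer and the scale is a power of ten, any value that fits is stored without rounding, which is what makes the type attractive for money amounts.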

Other parts of the standard

In addition to the number formats, the IEC 60559 standard also precisely defines the behavior of the basic arithmetic operations +, -, *, /, and square root.

It also specifies the details of rounding. There are four possible ways to round a number, called rounding modes in floating-point jargon:

  1. towards the nearest number (round up or down, whichever produces the smaller error)
  2. towards zero (round down for positive numbers, and up for negative numbers)
  3. towards +infinity (always round up)
  4. towards -infinity (always round down)

In general, the first option leads to the smallest round-off error, which is why it is the default in most compilers. However, it is also the least predictable. The other rounding modes have more predictable properties. In some cases, it is easier to compensate for round-off error using these modes.
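The four modes can be imitated in software. The sketch below, in Python, rounds an exact rational number to a double in each mode. It does not change the processor's rounding mode; it assumes Python 3.9 or later for math.nextafter and relies on CPython converting a Fraction to the nearest float. The function name round_to_double is made up for this example.

```python
import math
from fractions import Fraction

def round_to_double(exact, mode):
    """Round an exact Fraction to a double in one of the four modes."""
    nearest = float(exact)            # conversion rounds to the nearest double
    if Fraction(nearest) == exact:
        return nearest                # exactly representable, nothing to round
    # Find the two neighbouring doubles that bracket the exact value.
    if Fraction(nearest) < exact:
        lower, upper = nearest, math.nextafter(nearest, math.inf)
    else:
        lower, upper = math.nextafter(nearest, -math.inf), nearest
    if mode == "nearest":
        return nearest
    if mode == "zero":
        return lower if exact > 0 else upper
    if mode == "up":
        return upper
    if mode == "down":
        return lower
    raise ValueError("unknown rounding mode: " + mode)

tenth = Fraction(1, 10)
results = {m: round_to_double(tenth, m) for m in ("nearest", "zero", "up", "down")}
```

For one tenth, the nearest double lies slightly above the exact value, so rounding to nearest and rounding up agree, while rounding down and rounding towards zero both give the next smaller double.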

Exceptions are another feature of the standard, and an underused one. Exceptions signal that something unusual has happened during a calculation. Exceptions are not fatal errors. A flag is set and a default value is returned. There are five exceptions in all:

Exception          Situation                                                                               Return value
Invalid operation  An operand is invalid for the operation to be performed.                                NaN
Division by zero   An attempt is made to divide a non-zero value by zero.                                  Infinity (1/-0 = negative infinity)
Overflow           The result of an operation is too large to be represented by the floating-point format. Positive or negative infinity
Underflow          The result of an operation is too small to be represented by the floating-point format. Positive or negative zero, or a denormalized number
Inexact            The rounded result of an operation is not exact.                                        The calculated value

The return value in case of overflow and underflow actually depends on the rounding mode. The values given are those for rounding to nearest, which is the default.
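Some of these default results can be observed directly with Python floats, which are IEEE doubles on the usual platforms. The sketch below is only an illustration: Python does not expose the exception flags for floats, and it deliberately turns floating-point division by zero into a ZeroDivisionError of its own instead of returning infinity.

```python
import math

invalid = math.inf - math.inf        # invalid operation: result is NaN
overflow = 1e308 * 10.0              # too large: positive infinity
negative_overflow = -1e308 * 10.0    # too large in magnitude: negative infinity
underflow = 5e-324 / 2.0             # too small: positive zero
negative_underflow = -5e-324 / 2.0   # too small: negative zero
inexact = 0.1 + 0.2                  # rounded result, not exactly three tenths

# Python itself intercepts division by zero on floats.
try:
    1.0 / 0.0
    division_by_zero = "returned a value"
except ZeroDivisionError:
    division_by_zero = "raised ZeroDivisionError"
```

The sign of the zero produced by an underflow can be checked with math.copysign, since 0.0 and -0.0 compare equal.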

Floating-point exceptions behave much like integer overflow in the CLR. By default, no action is taken when an integer overflows. In a checked context, however, a CLR exception is thrown when integer overflow occurs. Similarly, the IEEE 754/IEC 60559 standard defines a trap mechanism that passes control to a trap handler when an exception occurs.
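The difference between a flag and a trap can be tried out with Python's decimal module, whose contexts keep a set of flags and a set of enabled traps. Note the assumption here: this module does decimal rather than binary arithmetic, so it is an analogy for the mechanism and not a demonstration of the processor's floating-point unit.

```python
from decimal import Context, Decimal, DivisionByZero, Inexact

# With no traps enabled, an exception only sets a flag and returns a default value.
quiet = Context(traps=[])
quotient = quiet.divide(Decimal(1), Decimal(0))
flag_raised = bool(quiet.flags[DivisionByZero])

# The inexact exception is flagged in the same way when a result must be rounded.
third = quiet.divide(Decimal(1), Decimal(3))
inexact_raised = bool(quiet.flags[Inexact])

# With the trap enabled, the same operation passes control to a handler,
# which in Python means that an exception is raised.
trapping = Context(traps=[DivisionByZero])
try:
    trapping.divide(Decimal(1), Decimal(0))
    trapped = False
except DivisionByZero:
    trapped = True
```

The flags are sticky: they stay set until the program clears them, so a long calculation can be checked once at the end instead of after every operation.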

About the author

Jeffrey Sax Canada

Jeffrey Sax has been writing numerical software for many years. He is founder and president of Extreme Optimization (http://www.extremeoptimization.com), a Toronto based provider of numerical co...