Floating-Point in .NET Part I: Concepts and Format

Finally, some real code

Nearly all discussions up to this point have been theoretical. It's time to do some coding!

We will stick with the IEEE-754 standard and implement some of the 'recommended functions' listed in the annex to the standard for double-precision numbers. These are:

Function          Description
CopySign(x, y)    Copies the sign of y to x.
Scalb(y, n)       Computes y × 2^n for integer n without computing 2^n.
Logb(x)           Returns the unbiased exponent of x.
NextAfter(x, y)   Returns the next representable neighbor of x in the direction of y.
Finite(x)         Returns true if x is a finite real number.
Unordered(x, y)   Returns true if x and y are unordered, i.e. if either x or y is a NaN.
Class(x)          Returns the floating-point class of the number x.
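To give a flavor of what such functions look like, here is a minimal sketch of four of them for double values. The class name Ieee754Sketch, the constant names, and the decision to have Logb reject anything that is not a normalized number are assumptions made for this illustration. They are not taken from the standard or from any library.

```csharp
using System;

// Illustrative sketch only: the class and constant names are hypothetical.
public static class Ieee754Sketch
{
    private const long SignMask = long.MinValue;            // bit 63
    private const long ExponentMask = 0x7FF0000000000000L;  // bits 52 to 62
    private const int ExponentBias = 1023;

    // Copies the sign of y to x by splicing the sign bit of y
    // onto the remaining 63 bits of x.
    public static double CopySign(double x, double y)
    {
        long xBits = BitConverter.DoubleToInt64Bits(x);
        long yBits = BitConverter.DoubleToInt64Bits(y);
        return BitConverter.Int64BitsToDouble((xBits & ~SignMask) | (yBits & SignMask));
    }

    // Infinities and NaNs have an exponent field of all ones;
    // every other bit pattern is a finite number.
    public static bool Finite(double x)
    {
        return (BitConverter.DoubleToInt64Bits(x) & ExponentMask) != ExponentMask;
    }

    // Two values are unordered when at least one of them is a NaN.
    public static bool Unordered(double x, double y)
    {
        return double.IsNaN(x) || double.IsNaN(y);
    }

    // Unbiased exponent of a normalized number. Zeros, denormalized
    // numbers, infinities and NaNs need special handling that this
    // sketch leaves out, so it rejects them instead (an assumption).
    public static int Logb(double x)
    {
        long field = (BitConverter.DoubleToInt64Bits(x) & ExponentMask) >> 52;
        if (field == 0 || field == 0x7FF)
            throw new ArgumentOutOfRangeException("x", "Normalized numbers only.");
        return (int)field - ExponentBias;
    }
}
```

All four work directly on the 64-bit pattern: the sign is the top bit, and the exponent field is the next eleven bits.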

We won't go into much detail here, since the code is mostly self-explanatory, but a few points are worth a closer look.

Converting to and from binary representations

Most of these functions operate directly on the binary representation of a floating-point number. A Single value occupies 32 bits, the same as an int, and a Double value occupies 64 bits, the same as a long.

For double values, the BitConverter class contains two useful methods: DoubleToInt64Bits and Int64BitsToDouble. As the names suggest, these methods convert a double to and from a 64-bit integer. The version of the framework this article was written for has no equivalent for Single values (later versions of .NET add BitConverter.SingleToInt32Bits and Int32BitsToSingle). Fortunately, one line of unsafe code will do the trick.
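The conversions can be sketched as follows. The class and method names are hypothetical. So that the example compiles without the /unsafe switch, the Single conversion goes through BitConverter.GetBytes instead of a pointer cast; the unsafe one-liner is shown only in a comment.

```csharp
using System;

// Illustrative sketch only: the class and method names are hypothetical.
public static class BitPatterns
{
    // Double <-> long, using the two BitConverter methods named above.
    public static long DoubleToBits(double value)
    {
        return BitConverter.DoubleToInt64Bits(value);
    }

    public static double BitsToDouble(long bits)
    {
        return BitConverter.Int64BitsToDouble(bits);
    }

    // Single <-> int without unsafe code: round-trip through a byte array.
    // The unsafe equivalent would be a pointer cast such as
    //     return *(int*)&value;
    // inside an unsafe method, compiled with /unsafe.
    public static int SingleToBits(float value)
    {
        return BitConverter.ToInt32(BitConverter.GetBytes(value), 0);
    }

    public static float BitsToSingle(int bits)
    {
        return BitConverter.ToSingle(BitConverter.GetBytes(bits), 0);
    }
}
```

For example, BitPatterns.DoubleToBits(1.0) yields 0x3FF0000000000000 and BitPatterns.SingleToBits(1.0f) yields 0x3F800000: a zero sign bit, a biased exponent equal to the bias, and an all-zero fraction. The byte-array route allocates a small array on each call, which is the price paid here for avoiding unsafe code.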

Finding the next representable neighbor

Finding the next representable neighbor of a floating-point number, which is the purpose of the NextAfter method, appears to be a rather complicated operation. We have to deal with positive and negative numbers, normalized and denormalized numbers, exponents and significands, as well as zeros, infinities, and NaNs!

Fortunately, a special property of the floating-point formats comes to the rescue: the values are ordered like sign-magnitude integers. Setting aside the sign bit for the moment, this means that floating-point numbers sort in the same order as their binary representations read as integers. So, to find the next neighbor, all we have to do is increment or decrement the binary representation: incrementing moves away from zero, decrementing moves toward it. There are a few special cases, but the handling of exponents takes care of itself, because a carry out of the significand bits simply rolls over into the exponent field.
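A minimal sketch of this idea for double values follows. The class name and the exact treatment of the special cases (returning a NaN when either argument is a NaN, returning x unchanged when x equals y) are assumptions made for this illustration rather than a definitive implementation.

```csharp
using System;

// Illustrative sketch only: the class name is hypothetical.
public static class NeighborSketch
{
    public static double NextAfter(double x, double y)
    {
        // Special case: a NaN has no neighbors.
        if (double.IsNaN(x) || double.IsNaN(y))
            return double.NaN;

        // Special case: nothing to move toward (also covers +0 versus -0).
        if (x == y)
            return x;

        // Special case: stepping off zero gives the smallest denormalized
        // number, with its sign chosen by the direction of travel.
        if (x == 0.0)
            return y > 0.0 ? double.Epsilon : -double.Epsilon;

        long bits = BitConverter.DoubleToInt64Bits(x);

        // Sign-magnitude ordering: adding 1 to the bit pattern increases
        // the magnitude, subtracting 1 decreases it. We move away from
        // zero when the direction of y matches the sign of x.
        bool awayFromZero = (y > x) == (x > 0.0);
        bits += awayFromZero ? 1 : -1;

        return BitConverter.Int64BitsToDouble(bits);
    }
}
```

Note that there is no exponent arithmetic anywhere. Incrementing the bit pattern of double.MaxValue, for instance, carries into the exponent field and produces the pattern for positive infinity, which is the correct neighbor.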

Conclusion

In this first article in a three-part series, we introduced the basic concepts of numerical computing: number formats, accuracy, precision, range, and round-off error. We described the most common number formats (single, double and extended precision) and the standard that defines them. Finally, we wrote some code to implement some floating-point functions.

The next part in this series will be of much more direct practical value. We will look more in depth at the dangers that come with doing calculations with floating-point numbers, and we'll show you how you can avoid them.

About the author

Jeffrey Sax Canada

Jeffrey Sax has been writing numerical software for many years. He is founder and president of Extreme Optimization (http://www.extremeoptimization.com), a Toronto based provider of numerical co...
