Floating Point Arithmetic

We had a discussion about floating point arithmetic on 32-bit processors in our team this morning: here’s a brief note I made on the subject a long time ago.

double a = 95.0;
double b = 94.0001;
double c = a – b;

In .Net this give c = 0.99989999999999668. It should be 0.9999 obviously. This is a classic floating point problem: big numbers and small numbers don’t mix well for addition and subtraction. You get the same problem in Java. This is because there’s an IEEE standard for how 32-bit doubles (and other floating point types) should work, which is used in both languages.

The real problem is there’s no easy way of dealing with the fact that the real numbers are infinite and computers just aren’t. More to the point, for efficiency we really want to put some ceiling on our level of accuracy, and in particular we really only want to use 32 bits on a 32-bit processor.

There’s a lot of detail on this on the web, some of it extremely tedious with geeks showing off how mathematical they are, some of it containing sensible advice:



In both .Net and Java there are specific types that are designed to be slow but allow arbitrary accuracy: these are Decimal (.Net) and BigDecimal (Java). Actually Decimal isn’t designed for arbitrary accuracy, it’s 128 bits, but it solves the problem above.

In Excel of course you can’t use these. Excel cheats a little bit in that if you only display 4 DPs (which it does by some default magic) it rounds to that so it appears you get 0.9999. In fact if you set it to show 15DP you get the same answer as above.

Microsoft gives some sensible non-geeky advice about dealing with this in Excel: round. You can even round to what’s displayed because that’s designed to work and there’s a specific option to do it (although it warns that, of course, you’re losing quite a bit of accuracy if you do this everywhere in a complex sheet). In the .Net example at the top of the page you get the right answer if you do Math.Round(c, 14). Another technique used in Excel is to test for a floating point value to within a small epsilon (i.e. don’t test for 0, test for 0 plus or minus epsilon).



Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s