If you’ve been a software engineer for long enough, it is very likely that you’ve seen this example of floating point perfidy:

>>> 3.0 / 10
0.3
>>> 0.1 * 3
0.30000000000000004

We understand that this happens because floating point numbers, stored in just 64 bits, cannot represent every point on the real number line. Moreover, when we perform operations on these floating point numbers, the errors inherent in their representation can accumulate and multiply. The moral of the story is: never use a floating point number to represent money.
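
(A quick illustration of that moral, as a minimal sketch: repeatedly adding a floating point penny drifts away from the exact total, while integer cents stay exact.)

# A million pennies, summed in floating point:
total = 0.0
for _ in range(1_000_000):
    total += 0.01
print(total)         # close to, but not exactly, 10000.0

# The same sum in integer cents is exact:
total_cents = 0
for _ in range(1_000_000):
    total_cents += 1
print(total_cents)   # exactly 1000000, i.e. $10,000.00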

At least, that is the moral for financial applications. At Square, we used a long amount_cents, and we got on with our lives. However, for most applications that have a good reason to use floating point, this can’t be the end of the story. If floating point were the unpredictable, unreliable thing I once believed it to be, we wouldn’t be able to numerically solve differential equations, or linear systems, or land on the moon. Rather, there is a science of floating point error, part of a science of numerical errors in general, which seeks to tame and understand what happens to errors as they flow through our calculations. In fact, numerical error is something that can be quantified rather precisely, as we’ll see. Along the way we’ll meet the Fundamental Axiom of Floating Point Arithmetic, which, at the very least, sounds way cooler than “10 things every developer should know about floating point numbers”.

Machine Epsilon

Fundamentally, error in floating point is due to the problem of “roundoff”. Just as in base 10, where we cannot represent the number 1/3 without rounding it off somewhere:

$$\frac{1}{3} = 0.33333\ldots,$$

in base 2, we cannot represent many numbers without rounding. Of course, some numbers we can represent exactly. The number 1, for example. Or any integer in the range $(-2^{53}, 2^{53})$. Also notably, many fractions can be exactly represented:

$$\frac{1}{2} = (0.1)_2, \qquad \frac{3}{4} = (0.11)_2, \qquad \frac{5}{8} = (0.101)_2.$$
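
Both of those claims are easy to check from the Python prompt: integers stay exact until you step past $2^{53}$, and dyadic fractions compare exactly:

>>> 2.0**53 == 2.0**53 + 1   # the + 1 is lost to rounding past 2**53
True
>>> 0.5 + 0.25 == 0.75       # dyadic fractions: exact, no rounding anywhere
True
>>> 0.1 + 0.2 == 0.3         # non-dyadic fractions: rounded at every step
False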

However, a number like 1/10, just like 1/3 in base 10, must be truncated to fit in the 24 or 53 bits of the mantissa. When we enter 0.1 in a console or in source code, the value that is actually stored is very slightly different from 0.1. According to this excellent IEEE-754 Floating Point Converter, the floating point number that is actually stored (for 64-bit floating point) is

0.1000000000000000055511151231257827021181583404541015625

So the initial input to our calculation was flawed! We weren’t calculating 0.1 * 3, we were actually calculating

0.1000000000000000055511151231257827021181583404541015625 * 3
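
Incidentally, you don’t need an online converter for this: Python’s standard decimal module will show you the exact value of the double you stored, since Decimal(0.1) converts the binary value digit for digit:

>>> from decimal import Decimal
>>> Decimal(0.1)
Decimal('0.1000000000000000055511151231257827021181583404541015625')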

How much of an error is this? We can get an idea by counting the zeros in between the significant digits, 0.100…00055. In this case, there are 16. So in simply entering a number which is not representable exactly in floating point, we have incurred a relative error of roughly $10^{-16}$.
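
We can compute that relative error exactly with the same trick, comparing the stored value against the true decimal value (a quick sketch using only the standard library):

from decimal import Decimal

stored = Decimal(0.1)    # the exact value of the double nearest to 0.1
true = Decimal('0.1')    # the true decimal value
print(abs(stored - true) / true)   # about 5.55E-17, i.e. roughly 10^-16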

Indeed, in all cases we can expect to incur a relative error of roughly $10^{-16}$. This magnitude is called machine epsilon, often written $\epsilon_{\text{machine}}$. It comes from the relative difference between two successive floating point numbers. Unlike the number line, for every representable floating point number $x$, there is a next floating point number, and it is approximately $x + \epsilon_{\text{machine}} x$. So an arbitrary real number $x_0$ falls between two floating point values $x$ and $x + \epsilon x$ (leaving off the subscript for conciseness). When we represent $x_0$ in floating point, we will get one of these two values. Denote the floating point representation of $x_0$ by $\mathrm{fl}(x_0)$. The absolute error is

$$e_{\text{abs}} = |\mathrm{fl}(x_0) - x_0| \le \max\bigl(|x_0 - x|,\; |x_0 - (x + \epsilon x)|\bigr) \le |\epsilon x|.$$

The relative error, then, i.e. the absolute error divided by the true value, is

$$e_{\text{rel}} = \frac{|\mathrm{fl}(x_0) - x_0|}{|x_0|} \le \frac{|\epsilon x|}{|x_0|} \approx \epsilon_{\text{machine}}.$$

Cool! So we’ve seen that the worst we can do, in relative terms, when representing a number in floating point is approximately $10^{-16}$. This is, for almost all practical purposes, very good. Because remember, we’re speaking of a relative error. That means we’re able to represent even very small values, very accurately. Here’s the nearest floating point number to $10^{-20}$:

0.0000000000000000000099999999999999994515…

If you’re curious, I invite you to count the 9’s. There are 16 of them. Even when dealing with extremely small numbers, we maintain the same relative precision.
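
If you want to poke at these gaps yourself, math.nextafter (available since Python 3.9) returns the neighboring representable double, and the relative gap comes out around $10^{-16}$ at every scale; a quick sketch:

import math
import sys

for x in (1.0, 1e-20, 1e20):
    gap = math.nextafter(x, math.inf) - x   # distance to the next double up
    print(x, gap / x)                       # relative gap: about 1e-16 each time

print(sys.float_info.epsilon)   # machine epsilon for doubles: 2.220446049250313e-16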

The Fundamental Axiom of Floating Point Arithmetic

Now you might be thinking: wait! It’s all well and good that we can get excellent relative accuracy when representing a number in floating point, but what about when we go to do something with our numbers? Here we’ve got two floating point numbers, both of which are inexact, and we’re about to multiply them! Who knows what might happen?

This concern is well-founded, because the algorithms of floating point arithmetic must themselves be implemented with finite precision. If we are asked to multiply two large numbers with pen and paper, the algorithm most of us will use is the one we learned in school, which involves lots of addition, carrying, and partial products. If all of those intermediate steps are carried out in finite precision, then the intermediate results may be losing precision as we go along! This seems like a recipe for disaster. Fortunately, there is a property we can require of a floating point implementation, one which is satisfied by IEEE-754 and most other popular floating point standards, that saves us from total anarchy. It is what Trefethen and Bau (Numerical Linear Algebra) call the Fundamental Axiom of Floating Point Arithmetic:

All floating point arithmetic operations are exact up to a relative error of $\epsilon_{\text{machine}}$.
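
In symbols, following Trefethen and Bau’s formulation: if $\circledast$ denotes the floating point version of an operation $*$, then for every pair of floating point numbers $x$ and $y$,

$$x \circledast y = (x * y)(1 + \epsilon) \quad \text{for some } |\epsilon| \le \epsilon_{\text{machine}}.$$

We can spot-check this in Python, using the standard fractions module for exact reference arithmetic (a sketch; Fraction(float) converts a stored double exactly, so both sides of the comparison are exact rationals):

from fractions import Fraction

x, y = 0.1, 3.0
exact = Fraction(x) * Fraction(y)   # the exact product of the two stored values
computed = Fraction(x * y)          # the exact value of what the hardware returned
rel_err = abs(computed - exact) / exact
print(float(rel_err))               # comfortably below 2.220446049250313e-16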