Floating point numbers are the computer's way of representing real numbers. If you've ever typed 0.1 + 0.2 into a JavaScript console and gotten 0.30000000000000004, you've experienced the quirks of floating point arithmetic.
Unlike integers, which map neatly to binary bits, real numbers present a challenge. How do you store infinite precision in a finite number of bits? The answer lies in the IEEE 754 standard, which acts like scientific notation for computers.
The Anatomy of a Float
A standard single-precision float (like float in C/C++ or Java) takes up 32 bits. These bits are divided into three parts:
- Sign (1 bit): Determines if the number is positive (0) or negative (1).
- Exponent (8 bits): Determines the magnitude of the number, allowing us to represent very large or very small numbers. It is stored with a "bias" of 127: subtracting 127 from the stored value gives the actual power of two.
- Fraction / Mantissa (23 bits): Represents the significant digits of the number. It assumes an implicit leading 1.
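As a rough illustration of the three fields, here is a minimal Python sketch that pulls them out of a number. The helper name float32_fields is made up for this example; it relies only on the standard struct module to reinterpret the 4 bytes of a single-precision float as an integer.

```python
import struct

def float32_fields(x):
    """Hypothetical helper: split x (rounded to single precision) into its three bit fields."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned integer.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31                # top bit
    exponent = (bits >> 23) & 0xFF   # next 8 bits, still including the bias of 127
    fraction = bits & 0x7FFFFF       # low 23 bits
    return sign, exponent, fraction

sign, exponent, fraction = float32_fields(-6.5)
print(sign, exponent, f"{fraction:023b}")  # 1 129 10100000000000000000000
```

Here -6.5 is -1.625 × 2^2, so the stored exponent is 2 + 127 = 129 and the fraction bits encode the .625 that follows the implicit leading 1.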
While we've been looking at 32-bit floats (single precision), there are also 64-bit floats (double precision). These are what languages like JavaScript and Python use by default. A 64-bit float works exactly the same way, but it has 1 sign bit, 11 exponent bits (with a bias of 1023 instead of 127), and 52 fraction bits, giving it a much larger range and significantly higher precision.
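The same field extraction can be sketched for a 64-bit float. This is an illustrative snippet, not a canonical implementation: float64_fields is a name invented here, and the bias of 1023 mentioned in the comment is the double-precision counterpart of the 127 used for 32-bit floats.

```python
import struct

def float64_fields(x):
    """Hypothetical helper: split a Python float (a 64-bit double) into its bit fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                  # 1 sign bit
    exponent = (bits >> 52) & 0x7FF    # 11 exponent bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)  # 52 fraction bits
    return sign, exponent, fraction

sign, exponent, fraction = float64_fields(0.1)
print(sign, exponent, hex(fraction))  # 0 1019 0x999999999999a
```

A stored exponent of 1019 means a true exponent of 1019 - 1023 = -4, which fits: 0.1 lies between 2^-4 and 2^-3.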
Scientific Notation in Binary
Think of how scientific notation works in decimal. In -1.5 × 10^4 we have a sign (-), a base number (1.5), and an exponent (4).
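IEEE 754 does the exact same thing, but in base 2. For a normal 32-bit float, the formula is:
value = (-1)^sign × (1 + fraction) × 2^(exponent - 127)
Here exponent is the stored 8-bit field read as an unsigned integer, and fraction is the 23-bit field read as a binary fraction between 0 and 1.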
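To check the formula against real hardware, a small Python sketch can apply it to a raw 32-bit pattern. The function name decode_normal_float32 is an assumption for illustration, and the sketch only covers normal numbers (stored exponent from 1 to 254), not the special cases.

```python
import struct

def decode_normal_float32(bits):
    """Hypothetical helper: apply the formula above to a 32-bit pattern (normal numbers only)."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = (bits & 0x7FFFFF) / 2**23  # 23 bits read as a binary fraction in [0, 1)
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

print(decode_normal_float32(0xC0D00000))  # -6.5

# Cross-check against the value the machine itself decodes from the same bytes.
(reference,) = struct.unpack(">f", (0x3DCCCCCD).to_bytes(4, "big"))
print(decode_normal_float32(0x3DCCCCCD) == reference)  # True
```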
Special Values
What happens if we want to represent zero? We can't use the standard formula, because the implicit leading 1 means we'd always have at least 1.0!
To solve this, IEEE 754 reserves specific exponent values for special cases:
- Zero: Exponent is all 0s, Fraction is all 0s. (Interestingly, this means there is both +0 and -0!).
- Subnormals: Exponent is all 0s, Fraction is non-zero. This allows representing numbers extremely close to zero without the implicit leading 1.
- Infinity: Exponent is all 1s, Fraction is all 0s. (The sign bit selects +Infinity or -Infinity.)
- NaN (Not a Number): Exponent is all 1s, Fraction is non-zero. Result of undefined operations like 0/0.
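The four special cases above come down to two checks on the exponent and fraction fields. A minimal sketch for 32-bit patterns, with classify_float32 as a made-up name for this example:

```python
def classify_float32(bits):
    """Hypothetical helper: name the category of a 32-bit float pattern."""
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    if exponent == 0:       # exponent all 0s
        return "zero" if fraction == 0 else "subnormal"
    if exponent == 0xFF:    # exponent all 1s
        return "infinity" if fraction == 0 else "nan"
    return "normal"

print(classify_float32(0x80000000))  # zero (this pattern is -0)
print(classify_float32(0x00000001))  # subnormal
print(classify_float32(0x7F800000))  # infinity
print(classify_float32(0x7FC00000))  # nan
```

The sign bit plays no part in the classification, which is why both +0 and -0 report as zero.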
Precision Issues: The 0.1 + 0.2 Problem
Now that we know how floats are stored, let's understand why there are floating point precision errors.
In our standard Base-10 (decimal) system, a fraction terminates only if its denominator (in lowest terms) has no prime factors other than 2 and 5. Every other fraction repeats infinitely. For example, 1/3 is 0.333333....
Computers represent numbers using Base-2 (binary), where the only prime factor is 2. This means the only fractions that can be represented cleanly in binary are those whose denominators are powers of 2 (like 1/2, 1/4, 1/8).
When you try to convert 0.1 (which is 1/10) to binary, the denominator is not a power of 2. Because of this, 0.1 becomes an infinitely repeating binary fraction. The same happens for 0.2.
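One way to see the repetition is to generate the binary digits by hand: double the value, take the integer part as the next bit, and continue with the remainder. The sketch below does this with exact rational arithmetic from Python's fractions module; binary_digits is a name chosen for this example.

```python
from fractions import Fraction

def binary_digits(value, count):
    """Hypothetical helper: first `count` binary digits after the point of a fraction in [0, 1)."""
    digits = []
    for _ in range(count):
        value *= 2
        bit = int(value >= 1)  # the integer part is the next binary digit
        digits.append(str(bit))
        value -= bit           # keep only the remainder
    return "".join(digits)

print(binary_digits(Fraction(1, 10), 20))  # 00011001100110011001
print(binary_digits(Fraction(1, 8), 20))   # 00100000000000000000
```

For 1/10 the pattern 0011 repeats forever after the first digit, while 1/8 terminates after three digits because its denominator is a power of 2.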
Since a 32-bit float only has 23 bits for the fraction (and a 64-bit float has 52), the computer has to cut off that repeating binary fraction at some point and round to the nearest value it can store. The values stored in memory aren't exactly 0.1 and 0.2; they are very close approximations.
- Stored 0.1 (as a 64-bit float) is exactly: 0.1000000000000000055511151231257827021181583404541015625
- Stored 0.2 (as a 64-bit float) is exactly: 0.200000000000000011102230246251565404236316680908203125
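When you add these two approximations together, the tiny inaccuracies compound. The true sum of the two stored values cannot be represented either, so it is rounded once more, and the value that ends up in memory is exactly 0.3000000000000000444089209850062616169452667236328125. The 64-bit float closest to 0.3 is a different value, slightly below 0.3.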
When the language converts this back to a manageable string to display to you, it picks the shortest decimal that still identifies the stored float uniquely. That string is 0.30000000000000004 instead of 0.3!
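These exact values can be inspected directly in Python, since converting a float to decimal.Decimal is lossless and shows every digit of what is stored. The variable names below are just for illustration.

```python
from decimal import Decimal

stored_tenth = Decimal(0.1)      # exact value of the double nearest to 0.1
stored_sum = Decimal(0.1 + 0.2)  # exact value of the rounded sum
stored_three_tenths = Decimal(0.3)

print(stored_tenth)         # 0.1000000000000000055511151231257827021181583404541015625
print(stored_sum)           # 0.3000000000000000444089209850062616169452667236328125
print(stored_three_tenths)  # 0.299999999999999988897769753748434595763683319091796875
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False
```

Because the sum and the literal 0.3 land on different floats, code that compares computed floats usually checks closeness with a tolerance (for example math.isclose in Python) rather than using ==.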
The Rounding Bit
How exactly do we arrive at these specific approximations? It comes down to rounding. A 32-bit float has 24 bits of precision (1 implicit bit + 23 fraction bits). When converting a repeating fraction, such as 0.1, the computer calculates the first 24 significant bits (counting from the first 1) and then looks at the 25th bit to determine how to round. A 64-bit float follows the same logic with 53 bits of precision.
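If the 25th bit is a 0, it rounds down (truncates). If it's a 1 and any later bit is also a 1, it rounds up by adding 1 to the 24th bit. In the exact-tie case, where the 25th bit is a 1 followed only by 0s, the default IEEE 754 rounding mode picks whichever neighbor has a 0 as its last bit.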
For 0.1, the 25th bit happens to be a 1. Therefore, the computer rounds the 24th bit up. This means the value actually stored in memory is slightly larger than the true infinite binary fraction of 0.1! The exact same thing happens for 0.2, which is 0.1 × 2 and so has the same significant bits with an exponent one higher.
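The rounding step can be reproduced with exact rational arithmetic. The sketch below is a simplified model, not how hardware does it: round_to_24_bits is a hypothetical helper that handles only positive values in the normal range, and it leans on Python's built-in round(), which rounds a Fraction to the nearest integer with ties going to even.

```python
from fractions import Fraction
import struct

def round_to_24_bits(value):
    """Hypothetical helper: round a positive Fraction to 24 significant bits (ties to even)."""
    # Find e such that 2**e <= value < 2**(e + 1).
    e = 0
    while Fraction(2) ** (e + 1) <= value:
        e += 1
    while Fraction(2) ** e > value:
        e -= 1
    unit = Fraction(2) ** (e - 23)     # weight of the 24th significant bit
    return round(value / unit) * unit  # nearest multiple of that weight

tenth = Fraction(1, 10)
stored = round_to_24_bits(tenth)
print(stored > tenth)  # True: the stored value is slightly larger than 0.1

# Cross-check against what the machine stores for a 32-bit 0.1.
(machine,) = struct.unpack(">f", struct.pack(">f", 0.1))
print(Fraction(machine) == stored)  # True
```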
Floats are a brilliant compromise between range, precision, and speed, powering everything from 3D graphics to scientific simulations!