How Do Floats Actually Work?

June 16, 2026 · 8 min read · Computer Science · Computer Architecture · Exploring C

Floating point numbers are the computer's way of representing real numbers. If you've ever typed 0.1 + 0.2 into a JavaScript console and gotten 0.30000000000000004, you've experienced the quirks of floating point arithmetic.

Unlike integers, which map neatly to binary bits, real numbers present a challenge. How do you store infinite precision in a finite number of bits? The answer lies in the IEEE 754 standard, which acts like scientific notation for computers.

The Anatomy of a Float

A standard single-precision float (like float in C/C++ or Java) takes up 32 bits. These bits are divided into three parts:

Sign (1 bit): 0 for positive, 1 for negative.
Exponent (8 bits): the power of two, stored with a bias of 127.
Fraction (23 bits): the digits after the binary point, also called the mantissa.
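To make the three fields concrete, here is a minimal Python sketch that splits a number into its sign, exponent, and fraction bits. The helper name `float32_fields` is made up for this illustration, and it relies on the standard `struct` module to round the value to single precision first.

```python
import struct

def float32_fields(x):
    """Split x (rounded to single precision) into sign, exponent, and fraction fields."""
    # Pack as a big-endian 32-bit float, then reread the same 4 bytes as an unsigned int.
    (bits,) = struct.unpack(">I", struct.pack(">f", x))
    sign = bits >> 31               # top bit
    exponent = (bits >> 23) & 0xFF  # next 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF      # low 23 bits
    return sign, exponent, fraction

for x in (1.0, -2.5, 0.1):
    sign, exponent, fraction = float32_fields(x)
    print(f"{x:>5}: sign={sign} exponent={exponent:08b} fraction={fraction:023b}")
```

For 1.0 the stored exponent is 127 (that is, 127 - 127 = 0) and the fraction is all zeros.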

While we've been looking at 32-bit floats (single precision), there are also 64-bit floats (double precision). These are what languages like JavaScript and Python use by default. A 64-bit float works exactly the same way, but it has 1 sign bit, 11 exponent bits, and 52 fraction bits, giving it a much larger range and significantly higher precision.

What Are Single Precision and Double Precision?
In single precision, the 23 fraction bits give about 7 significant decimal digits. In double precision, the 52 fraction bits give about 15 to 16 significant decimal digits.
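One way to see these limits is to feed each format an integer that needs one more significant bit than it can hold. The sketch below is only an illustration; `to_float32` is a hypothetical helper that uses the `struct` module to round a Python float (a 64-bit double) to single precision.

```python
import struct
import sys

def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# 2**24 + 1 needs 25 significant bits, one more than single precision keeps.
print(to_float32(16777217.0))            # 16777216.0

# 2**53 + 1 needs 54 significant bits, one more than double precision keeps.
print(float(2**53 + 1) == float(2**53))  # True

# Python reports the double-precision limits directly.
print(sys.float_info.mant_dig)  # 53 significant bits on IEEE 754 platforms
print(sys.float_info.dig)       # 15 decimal digits that always survive a round trip
```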

Scientific Notation in Binary

Think of how scientific notation works in decimal: in -1.5 × 10^4, we have a sign (-), a base number (1.5), and an exponent (4).

IEEE 754 does the exact same thing, but in base 2. The formula is:

pseudocode
value = (-1)^sign × (1 + fraction) × 2^(exponent - 127)
The Implicit Leading 1
Because normalized scientific notation requires the digit before the point to be non-zero (e.g., 1.5, not 0.15), and since binary has only the digits 0 and 1, the digit before the binary point must always be 1. Since it's always 1, IEEE 754 doesn't bother storing it! This gives us a free extra bit of precision.
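The formula, including the implicit leading 1, can be tried out directly. This is a small sketch rather than a full decoder: `decode_float32` is a name chosen for this example, and it deliberately rejects the reserved exponent values, since the formula only covers ordinary (normal) numbers.

```python
import struct

def decode_float32(bits):
    """Evaluate the normal-number formula for a 32-bit pattern given as an int."""
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction_bits = bits & 0x7FFFFF
    if exponent in (0, 255):
        raise ValueError("reserved exponent: the formula covers normal numbers only")
    fraction = fraction_bits / 2**23  # the 23 bits read as a value in [0, 1)
    return (-1) ** sign * (1 + fraction) * 2.0 ** (exponent - 127)

pattern = 0xC1480000  # sign = 1, exponent = 130, fraction = 0.5625
print(decode_float32(pattern))                             # -12.5
print(struct.unpack(">f", pattern.to_bytes(4, "big"))[0])  # -12.5
```

Both lines print the same value: the first by applying the formula by hand, the second by letting `struct` reinterpret the same four bytes as a float.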

Special Values

What happens if we want to represent zero? We can't use the standard formula, because the implicit leading 1 means we'd always have at least 1.0!

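To solve this, IEEE 754 reserves the two extreme exponent values, all 0s and all 1s, for special cases:

Exponent all 0s, fraction all 0s: zero (+0 or -0, depending on the sign bit).
Exponent all 0s, fraction non-zero: a subnormal number, which drops the implicit leading 1 so values can get even closer to zero.
Exponent all 1s, fraction all 0s: infinity (+∞ or -∞, depending on the sign bit).
Exponent all 1s, fraction non-zero: NaN ("not a number"), the result of operations like 0/0.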
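The reserved encodings can be inspected by building bit patterns by hand. In this sketch, `from_bits` is a hypothetical helper that reinterprets a 32-bit integer as a single-precision float; the patterns are the standard single-precision encodings for zero, the smallest subnormal, infinity, and a NaN.

```python
import math
import struct

def from_bits(bits):
    """Reinterpret a 32-bit integer pattern as a single-precision float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

print(from_bits(0x00000000))  # 0.0   (exponent and fraction all 0s)
print(from_bits(0x80000000))  # -0.0  (same, with the sign bit set)
print(from_bits(0x00000001))  # smallest subnormal, 2**-149
print(from_bits(0x7F800000))  # inf   (exponent all 1s, fraction all 0s)
print(from_bits(0xFF800000))  # -inf
print(math.isnan(from_bits(0x7FC00000)))  # True (exponent all 1s, fraction non-zero)
```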

Precision Issues: The 0.1 + 0.2 Problem

Now that we know how floats are stored, let's understand why there are floating point precision errors.

In our standard Base-10 (decimal) system, a fraction repeats infinitely when its denominator (in lowest terms) has a prime factor other than 2 or 5. For example, 1/3 is 0.333333....

Computers represent numbers using Base-2 (binary), where the only prime factor is 2. This means the only fractions that can be represented cleanly in binary are those whose denominators are powers of 2 (like 1/2, 1/4, 1/8).

When you try to convert 0.1 (which is 1/10) to binary, the denominator is not a power of 2. Because of this, 0.1 becomes an infinitely repeating binary fraction. The same happens for 0.2.
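The repeating pattern is easy to generate with exact rational arithmetic. The following sketch uses Python's `fractions` module; both function names are invented for this example. Doubling the value and peeling off the integer part yields one binary digit at a time.

```python
from fractions import Fraction

def terminates_in_binary(value):
    """True if the reduced denominator is a power of two."""
    d = Fraction(value).denominator
    return (d & (d - 1)) == 0

def binary_digits(value, count):
    """First `count` binary digits after the point of a fraction in [0, 1)."""
    value = Fraction(value)
    digits = []
    for _ in range(count):
        value *= 2
        bit = int(value)  # 1 if doubling reached 1 or more, else 0
        digits.append(str(bit))
        value -= bit
    return "".join(digits)

print(terminates_in_binary("3/8"))   # True
print(terminates_in_binary("1/10"))  # False
print(binary_digits("1/10", 20))     # 00011001100110011001
print(binary_digits("1/5", 20))      # 00110011001100110011
```

The block 0011 keeps repeating forever in both expansions, so no finite number of bits can hold either value exactly.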

Since a 32-bit float only has 23 bits for the fraction (and a 64-bit float has 52), the computer has to cut off that repeating binary fraction at some point and round what remains. The values stored in memory aren't exactly 0.1 and 0.2; they are very close approximations.
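To see the approximations themselves, Python's `decimal` module can print the exact value of whatever is stored, because converting a float to `Decimal` involves no rounding. This is a small sketch; `to_float32` is a hypothetical helper for rounding to single precision.

```python
from decimal import Decimal
import struct

def to_float32(x):
    """Round a Python float (a 64-bit double) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]

# Decimal(float) shows the stored binary value exactly.
print(Decimal(0.1))  # 0.1000000000000000055511151231257827021181583404541015625
print(Decimal(0.2))  # 0.200000000000000011102230246251565404236316680908203125

# The 32-bit approximation of 0.1 is much coarser.
print(Decimal(to_float32(0.1)))  # 0.100000001490116119384765625
```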

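When you add these two approximations together, the tiny inaccuracies compound, and the sum itself has to be rounded to the nearest representable value. With 64-bit floats, the result stored in memory is exactly 0.3000000000000000444089209850062616169452667236328125.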

When the language rounds this back to a manageable string to display to you, it outputs 0.30000000000000004 instead of exactly 0.3!
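The whole chain can be reproduced in a few lines of Python, which uses 64-bit floats by default. This is only an illustrative sketch: `Decimal` and `Fraction` are used here because they convert a float exactly, so they show what is really stored.

```python
from decimal import Decimal
from fractions import Fraction

total = 0.1 + 0.2
print(Decimal(total))  # 0.3000000000000000444089209850062616169452667236328125
print(repr(total))     # 0.30000000000000004
print(total == 0.3)    # False

# The exact sum of the two stored approximations is not representable,
# so the addition rounds once more, in this case upward.
exact_sum = Fraction(0.1) + Fraction(0.2)
print(Fraction(total) > exact_sum)  # True
```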

The Rounding Bit

How exactly do we arrive at these specific approximations? It comes down to rounding. A 32-bit float has 24 bits of precision (1 implicit bit + 23 fraction bits). When converting a repeating fraction, such as 0.1, the computer calculates the first 24 bits and then looks at the 25th bit to determine how to round.

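If the 25th bit is a 0, it rounds down (truncates). If it's a 1, it rounds up by adding 1 to the 24th bit. (The one exception is an exact tie, where the 25th bit is a 1 and every bit after it is 0; the default rounding mode then picks whichever neighbor has a 0 in the 24th bit. A repeating fraction never lands on a tie.)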

For 0.1, the 25th bit happens to be a 1. Therefore, the computer rounds the 24th bit up. This means the value actually stored in memory is slightly larger than the true infinite binary fraction of 0.1! The exact same thing happens for 0.2.
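The rounding step can be replayed with exact fractions. The sketch below is a simplified model, not how hardware does it: `round_to_24_bits` is a made-up helper that only looks at the 25th bit and ignores exact ties, which is enough for repeating fractions like 0.1 and 0.2. The result is checked against what the `struct` module actually stores.

```python
from fractions import Fraction
import struct

def round_to_24_bits(value):
    """Round a positive fraction to 24 significant bits using only the 25th bit."""
    value = Fraction(value)
    exponent = 0
    # Normalize so that 1 <= value < 2.
    while value < 1:
        value *= 2
        exponent -= 1
    while value >= 2:
        value /= 2
        exponent += 1
    bits25 = int(value * 2**24)  # the first 25 significant bits as an integer
    round_bit = bits25 & 1       # the 25th bit
    significand = (bits25 >> 1) + round_bit  # keep 24 bits, round up if the bit is 1
    return round_bit, Fraction(significand, 2**23) * Fraction(2) ** exponent

for text in ("1/10", "1/5"):
    round_bit, rounded = round_to_24_bits(text)
    as_float32 = struct.unpack("f", struct.pack("f", float(Fraction(text))))[0]
    print(text, round_bit, rounded == Fraction(as_float32), rounded > Fraction(text))
```

Each line prints a round bit of 1, followed by True twice: the hand-rounded value matches the stored 32-bit float, and it is slightly larger than the true fraction.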


Floats are a brilliant compromise between range, precision, and speed, powering everything from 3D graphics to scientific simulations!