During day-to-day programming, or even general computer use, it's easy to overlook how the computer actually represents numbers. But this quickly becomes a problem when we try to optimize a solution, and sometimes in situations we simply cannot avoid.
What really is the danger?
Computers represent numbers using bits, their most basic binary unit. Every memory cell has a fixed bit capacity defined by the hardware, so if we imagine a computer that operates with only 3 bits, we have the following situation:
Source: Image by the author.
When we add 1 while already at the maximum value those bits can hold, we run into a problem: we would need 1 more bit to represent the result. In computing, this is called an integer overflow. When a sum produces an integer that is too large for the chosen type, the result "wraps around".
For example, in Python:
import numpy as np

# Define the largest 32-bit signed integer
x = np.int32(2147483647)
print("Before overflow:", x)

# Add 1 -> causes overflow
x = x + np.int32(1)
print("After overflow:", x)
Output:
Before overflow: 2147483647
After overflow: -2147483648
This behavior isn’t a bug, but rather a consequence of the limits of binary representation. And it has caused several famous real-world problems.
Unexpected famous cases
The Boeing 787 Case (2015)
In 2015, Boeing discovered that the Boeing 787 Dreamliner’s generators could shut down mid-flight if they were left on for 248 consecutive days without being restarted.
The reason? An internal timer, based on 32-bit integers, would overflow after this period, leading to a failure in the aircraft’s power management.
The fix was simple: periodically restart the system to reset the counter to zero. But the potential impact was enormous.
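A quick back-of-the-envelope check shows where the 248 days come from, assuming (as widely reported, not stated above) that the internal counter ticked in hundredths of a second:

# Assumption: the timer counted hundredths of a second, as widely reported
max_ticks = 2**31 - 1          # largest value of a signed 32-bit integer
seconds = max_ticks / 100      # each tick = 1/100 of a second
days = seconds / (60 * 60 * 24)
print(f"{days:.1f} days")      # ~248.5 days, matching the reported limit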
The Level 256 Bug in Pac-Man
Those who played Pac-Man in the arcades may be familiar with the “Kill Screen.” After level 255, the level counter (stored in 8 bits) overflows upon reaching 256. This creates a glitched screen, with half the maze unreadable, making the game impossible to complete.
The developers didn’t expect anyone to play 256 levels of Pac-Man, so they didn’t handle this exception!
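A minimal sketch of the same wrap-around with an unsigned 8-bit counter, using NumPy as in the earlier int32 example:

import numpy as np

level = np.uint8(255)          # 8-bit level counter at its maximum
print("Level counter:", level)

# Adding one more level overflows the 8-bit counter
# (recent NumPy versions may also emit an overflow warning here)
level = level + np.uint8(1)
print("After one more level:", level)   # wraps around to 0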
Source: Image by techspot
The Bug of 2038
Just before the year 2000, the Millennium Bug (Y2K) was a hot topic: it was said that many computers could fail when the date rolled over from 31/12/99 to 01/01/00 at midnight. Fortunately, everything turned out fine, but now another potentially catastrophic event looms, like a new Mayan prophecy.
Many Unix and C-based systems use a signed 32-bit integer to count the seconds elapsed since January 1, 1970 (the famous **Unix timestamp**). This counter will reach its limit of 2,147,483,647 seconds on **January 19, 2038** and overflow. If left unfixed, any software that relies on this time representation could exhibit unpredictable behavior.
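As a quick sanity check, a short sketch using only the Python standard library shows exactly when the 32-bit counter runs out:

from datetime import datetime, timezone

# Largest value a signed 32-bit counter of seconds can hold
max_seconds = 2**31 - 1   # 2,147,483,647

# Convert it to a date, counting from the Unix epoch (1970-01-01 UTC)
rollover = datetime.fromtimestamp(max_seconds, tz=timezone.utc)
print(rollover)           # 2038-01-19 03:14:07+00:00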
And these situations don’t just happen with integers. With floating-point numbers the situation is even more delicate, especially when we talk about numerical precision in areas such as Machine Learning.
How Float Variables Work
Floats (floating-point numbers) are used to represent real numbers in computers, but unlike integers, they cannot represent every value exactly. Instead, they store numbers approximately using a sign, exponent, and mantissa (according to the IEEE 754 standard).
And just like the integers in the previous examples, the mantissa and exponent are stored in a finite number of bits. The representable range and precision depend on the total number of bits (16, 32 or 64) chosen when the variable is declared.
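As a small illustration before looking at the concrete formats below, here is a sketch (using NumPy, and assuming the IEEE 754 single-precision layout of 1 sign bit, 8 exponent bits and 23 mantissa bits) that exposes the raw bits of a float32:

import numpy as np

x = np.array(-6.25, dtype=np.float32)
bits = format(int(x.view(np.uint32)), "032b")  # same 32 bits, read as an unsigned int

print("sign    :", bits[0])     # 1 bit  (1 here, because the number is negative)
print("exponent:", bits[1:9])   # 8 bits (biased exponent)
print("mantissa:", bits[9:])    # 23 bits (fractional part of the significand)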
Float16 (16 bits):
- Can represent values roughly from 6.1 × 10⁻⁵ to 6.5 × 10⁴
- Precision of about 3–4 decimal digits
- Uses 2 bytes (16 bits) of memory

Float32 (32 bits):
- Can represent values roughly from 1.4 × 10⁻⁴⁵ to 3.4 × 10³⁸
- Precision of about 7 decimal digits
- Uses 4 bytes of memory

Float64 (64 bits):
- Can represent values roughly from 5 × 10⁻³²⁴ to 1.8 × 10³⁰⁸
- Precision of about 16 decimal digits
- Uses 8 bytes of memory
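These limits don't have to be memorized. A quick sketch with NumPy's finfo lets you query them directly:

import numpy as np

# Query the range and precision of each float type
# Note: info.tiny is the smallest positive *normal* value; subnormals
# (like the 1.4e-45 quoted above for float32) go even lower.
for dtype in (np.float16, np.float32, np.float64):
    info = np.finfo(dtype)
    print(f"{np.dtype(dtype).name}: smallest normal ~{info.tiny:.3g}, "
          f"max ~{info.max:.3g}, ~{info.precision} decimal digits")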
The trade-offs in Machine Learning
The higher precision of float64 uses twice the memory of float32 and can, because of that, be slower than float32 and float16. But is float64 really necessary?
Deep learning models can have hundreds of millions of parameters, so using float64 would double the memory consumption. For many ML models, including neural networks, float32 is sufficient and allows faster computation with lower memory usage. Some practitioners are even exploring float16.
In theory, always using the highest-precision type seems safe, but in practice modern GPUs (the RTX line, for example) perform poorly on **float64**, while they are optimized for float32 and, in some cases, float16. For example, float64 operations can be 10–30x slower on GPUs optimized for float32.
A simple benchmark can be made by multiplying matrices:
import numpy as np
import time

# Matrix size
N = 500

# Matrices with different float bit sizes
A32 = np.random.rand(N, N).astype(np.float32)
B32 = np.random.rand(N, N).astype(np.float32)
A64 = A32.astype(np.float64)
B64 = B32.astype(np.float64)
A16 = A32.astype(np.float16)
B16 = B32.astype(np.float16)

def benchmark(A, B, dtype_name):
    start = time.time()
    C = A @ B  # matrix multiplication
    end = time.time()
    print(f"{dtype_name}: {end - start:.5f} seconds")

benchmark(A16, B16, "float16")
benchmark(A32, B32, "float32")
benchmark(A64, B64, "float64")
Example output (it will depend on your computational resources):
float16: 0.01 seconds
float32: 0.02 seconds
float64: 0.15 seconds
That said, an **important point** is that common numerical problems in Machine Learning, such as vanishing gradients, are not solved simply by increasing precision, but rather by making good architectural choices.
Some good practices to address it
In deep networks, gradients can become **very small** after traversing several layers. In **float32**, values smaller than ~1e-45 literally become **zero**. This means that the weights are no longer updated: the infamous vanishing gradient problem. But the solution isn’t to migrate to float64. Instead, we have smarter options, listed after the short sketch below.
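A tiny sketch makes the underflow concrete (using NumPy; the exact threshold is the smallest subnormal float32, about 1.4e-45):

import numpy as np

g = np.float32(1e-40)          # a very small but still representable gradient
for step in range(3):
    g = g * np.float32(1e-3)   # each "layer" shrinks the gradient further
    print(step, g)

# After a couple of steps the value underflows to exactly 0.0,
# and any weight update multiplied by it is lost.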
**ReLU:** Unlike sigmoid and tanh, which flatten values and make the gradient disappear, ReLU keeps the derivative equal to 1 for x > 0. This prevents the gradient from reaching zero too quickly.
ReLU function
**Batch Normalization:** Normalizes the activations in each batch to keep means close to 0 and variances close to 1. This way, the values remain within the safe range of float32 representation.
**Residual Connections (ResNet):** They create “shortcuts” that add a block’s input directly to its output, so the gradient can travel across many layers without disappearing. They allow networks with 100+ layers to work well in float32.
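To see why architecture beats brute-force precision, here is a rough NumPy illustration (a sketch with a single made-up activation value, not a real network) comparing the chained gradient factor after many layers for sigmoid versus ReLU:

import numpy as np

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)               # never larger than 0.25

def relu_grad(x):
    return (x > 0).astype(np.float32)  # exactly 1 for positive inputs

# Chain rule over 100 layers: one derivative factor per layer gets multiplied in
x = np.float32(0.5)                    # made-up pre-activation value
sig_chain = np.prod([sigmoid_grad(x)] * 100, dtype=np.float32)
relu_chain = np.prod([relu_grad(x)] * 100, dtype=np.float32)

print("sigmoid chain factor:", sig_chain)   # underflows to 0.0 in float32
print("relu chain factor   :", relu_chain)  # stays at 1.0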