December 20, 2025
Last year, Lemire wrote about an optimized variation of the Euclidean algorithm for computing the greatest common divisor of two numbers, called the binary Euclidean algorithm, or Stein’s algorithm. It’s a best-in-class implementation, though it’s currently only used by libc++.
The post also briefly mentions the extended Euclidean algorithm, a related algorithm most often used to compute the modular multiplicative inverse (given a remainder $a$ and a modulus $m$, find $x$ such that $a \cdot x \bmod m = 1$):
There is also a binary version of the extended Euclidean algorithm[,] although it is quite a bit more involved and it is not clear that it […] can be implemented at high speed, leveraging fast instructions, when working on integers that fit in general-purpose registers. […]
My implementation of the binary extended Euclidean algorithm is quite a bit slower and not recommended. I expect that it should be possible to optimize it further.
– Lemire
That’s a big shame, because the extended Euclidean algorithm can be optimized in a very similar manner, and the underlying ideas were described in a 2020 paper. It’s probably not well-known because the paper focuses on constant-time evaluation and long arithmetic, so people might have assumed it’s irrelevant.
I’m hoping to bring justice to the extended Stein’s algorithm with this post. I’ll cover how the algorithm works, its limitations, some optimizations compared to Pornin’s paper, and potential further improvements.
My implementation is available on GitHub as part of a Rust modular arithmetic library.
The textbook algorithm can be used not only to compute inverses, but also to solve linear Diophantine equations. I will focus on the former in this post, since that’s where the optimizations shine. I’ll briefly cover the general case at the end of the post.
I won’t make claims about exact performance, because something strange is going on with Lemire’s benchmarking results and I don’t want to add to the mess. I’ve measured that my implementation of the algorithm is 1.3–2 times faster than the textbook implementation on average, even on M4, but you may see a completely different picture if your compiler produces slightly different codegen.
Lemire’s benchmark seems to be skewed by the choice of the compiler (GCC vs Clang), its version (Clang 18 vs Clang 21), optimization flags (-O2 vs -O3), the microarchitecture (Haswell vs Ice Lake vs Zen 2), and minutiae of the benchmarking code. The results don’t make much sense mathematically and look disproportionately affected by microarchitectural conditions. If you want to get the fastest implementation, I suggest you inspect the assembly more closely than I did, because I have no idea what’s going on.
Nevertheless, here is some raw data for transparency. The benchmark measures the time per inversion (in ns), the cell format is “Stein’s algorithm / Euclidean algorithm”.
| CPU | 8 bits | 16 bits | 32 bits | 64 bits |
|---|---|---|---|---|
| Haswell | 11.38 / 19.21 (-41%) | 17.48 / 33.96 (-49%) | 29.76 / 61.69 (-52%) | 67.18 / 152.19 (-56%) |
| Alder Lake | 8.20 / 10.19 (-20%) | 13.77 / 16.87 (-18%) | 21.47 / 31.00 (-31%) | 50.38 / 69.57 (-28%) |
| Zen 5 | 7.77 / 10.56 (-26%) | 9.43 / 14.80 (-36%) | 13.96 / 23.98 (-42%) | 34.58 / 49.24 (-30%) |
| M1 | 14.58 / 13.05 (+12%) | 11.48 / 18.63 (-38%) | 19.74 / 35.47 (-44%) | 43.14 / 71.14 (-39%) |
| M2 | 8.93 / 10.26 (-13%) | 11.00 / 17.90 (-39%) | 19.38 / 33.78 (-43%) | 41.33 / 68.03 (-39%) |
| M4 | 5.28 / 8.60 (-39%) | 8.07 / 14.77 (-45%) | 13.63 / 28.05 (-51%) | 28.68 / 56.22 (-49%) |
| Cortex-A72 | 29.80 / 33.48 (-11%) | 38.30 / 49.36 (-22%) | 61.28 / 83.63 (-27%) | 162.55 / 151.77 (+7%) |
| Snapdragon 8 Gen 3 | 9.72 / 12.13 (-20%) | 14.97 / 21.91 (-32%) | 28.51 / 39.89 (-29%) | 70.11 / 75.46 (-7%) |
| Kryo 485 | 15.08 / 19.36 (-22%) | 21.54 / 30.41 (-29%) | 33.63 / 50.96 (-34%) | 90.32 / 94.76 (-5%) |
Let’s start with the algorithm for computing the GCD of a and b. Suppose for now that b is odd. Here’s the core idea:
- If $a$ is divisible by $2^k$, this factor can be removed: $\gcd(2^k a, b) = \gcd(a, b)$. This decreases the bit length of $a$ by at least 1, guaranteeing $\mathcal{O}(\log a)$ time complexity if we can apply this reduction consistently.
- If both $a$ and $b$ are odd, rewriting $\gcd(a, b) = \gcd(a - b, b)$ guarantees $a' = a - b$ will be even and reducible on the next iteration. To avoid negative integers, swap $a$ and $b$ if $a < b$ beforehand; the new $b$ remains odd because $a$ was odd.
The implementation is very short:
while a != 0 {
a >>= a.trailing_zeros();
if a < b {
(a, b) = (b, a);
}
a -= b;
}
return b;
If the initial b is not guaranteed to be odd, some adjustments are necessary:
let shift = (a | b).trailing_zeros(); // == min(a.trailing_zeros(), b.trailing_zeros())
b >>= b.trailing_zeros();
/* loop from the previous snippet */
return b << shift;
But for modular inversion, the modulus is usually odd, so I won’t dwell on this.
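Putting the two snippets together, a complete function might look like this (a sketch; the name and the zero handling are my additions):

```rust
// A sketch combining the two snippets above; `binary_gcd` is a hypothetical
// name, and the zero check is my addition.
fn binary_gcd(mut a: u64, mut b: u64) -> u64 {
    if a == 0 || b == 0 {
        return a | b; // gcd(x, 0) = x
    }
    // Extract the common power of two and make b odd.
    let shift = (a | b).trailing_zeros();
    b >>= b.trailing_zeros();
    while a != 0 {
        a >>= a.trailing_zeros();
        if a < b {
            (a, b) = (b, a);
        }
        a -= b;
    }
    b << shift
}
```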
This covers the general structure of the algorithm, but some optimizations are crucial for getting good performance.
The conditional swap should be compiled to branchless code to avoid branch misprediction. Compiler hints like __builtin_unpredictable or core::hint::select_unpredictable may be useful.
The loop has a high latency because trailing_zeros, >>=, if, and -= are computed sequentially. But since (-a).trailing_zeros() == a.trailing_zeros(), a.trailing_zeros() can in principle be computed before the swap on the previous iteration:
let mut q = a.trailing_zeros();
while a != 0 {
a >>= q;
q = (a - b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
} else {
(a, b) = (a - b, b);
}
}
This brings the latency down to 3 operations: >>=; a - b and b - a computed in parallel; trailing_zeros and if computed in parallel. It also slightly increases the number of operations (computing b - a and a - b and only using one), but the tradeoff pays off.
Pay close attention to trailing_zeros if you’re implementing this in C. The algorithm can invoke it with a zero input on the last iteration. This is well-defined in Rust, which maps 0 to the bit width of the data type, but in C __builtin_ctz(0) is UB. Use __builtin_ctzg, which takes an optional second argument to return when the input is zero. In C++, std::countr_zero(0) is well-defined.
GCC documents __builtin_ctz(0) as having an “undefined result”, so I initially assumed it means an indeterminate value. In reality, GCC maintainers consider it UB and LLVM documents it as UB… but the optimizers seem to model it exactly like an indeterminate value? (e.g. LLVM considers @llvm.cttz(0) to produce poison.) This is frankly ridiculous; someone do something about it.
You might be wondering how this algorithm is related to modular inversion.
The trick is to express the values of $a$ and $b$ at each point as weighted sums of the original $a$ and $b$ (denoted $a_0, b_0$) with some coefficients $k_i, l_i$:

$$\begin{cases} a = k_0 a_0 + l_0 b_0 \\ b = k_1 a_0 + l_1 b_0 \end{cases}$$
If $a_0$ is invertible modulo $b_0$, their GCD is 1, and so at the end of the algorithm $b = 1$. This gives us:

$$k_1 a_0 + l_1 b_0 = 1 \implies k_1 a_0 \equiv 1 \pmod{b_0}$$
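For example, with $a_0 = 3$ and $b_0 = 7$, one such identity is $3 \cdot 5 + 7 \cdot (-2) = 1$, so $k_1 = 5$ is the inverse of $3$ modulo $7$.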
That is, $k_1$ is the inverse of $a_0$ modulo $b_0$. So all we need to do is track the coefficients across iterations. We start with:

$$\begin{cases} a = a_0 = 1 \cdot a_0 + 0 \cdot b_0 \\ b = b_0 = 0 \cdot a_0 + 1 \cdot b_0 \end{cases}$$

When $a$ is divided by $2^q$, the coefficients are divided by the same value:

$$a = k_0 a_0 + l_0 b_0 \implies \frac{a}{2^q} = \frac{k_0}{2^q} a_0 + \frac{l_0}{2^q} b_0$$
When $a$ and $b$ are swapped, the pairs $(k_0, l_0)$ and $(k_1, l_1)$ are swapped.
When $b$ is subtracted from $a$, the coefficients are subtracted:

$$a - b = (k_0 - k_1) a_0 + (l_0 - l_1) b_0$$

In other words, whatever we do to $a$ and $b$, we also do to the coefficient pairs $(k_i, l_i)$.
Implementation attempts quickly reveal a problem: coefficients are not necessarily divisible by 2q, so it’s not clear how to represent them. Surely not with floats.
This is actually a core difference between Stein’s algorithm and the textbook Euclidean algorithm, which is implemented as $\gcd(a, b) = \gcd(b, a \bmod b)$.
The Euclidean algorithm uses division (q = a / b), but only to compute constant factors. The values are updated with subtraction and multiplication alone (a -= b * q). Stein’s algorithm divides values (a /= 2^q), causing non-integer coefficients.
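For reference, here’s a minimal sketch of a textbook extended Euclidean inversion of the kind the benchmarks above compare against (not necessarily the exact benchmarked code; the i128 widening is my choice to avoid overflow):

```rust
// A minimal textbook extended Euclidean inversion, for comparison. Tracks only
// the coefficient of `a`, since that's all an inverse needs.
fn textbook_inverse(a: u64, m: u64) -> Option<u64> {
    // Invariant: r0 = k0 * a (mod m) and r1 = k1 * a (mod m).
    let (mut r0, mut r1) = (m as i128, a as i128);
    let (mut k0, mut k1) = (0i128, 1i128);
    while r1 != 0 {
        let q = r0 / r1; // division only computes a constant factor
        (r0, r1) = (r1, r0 - q * r1);
        (k0, k1) = (k1, k0 - q * k1);
    }
    // r0 is now gcd(a, m); the inverse exists iff it is 1.
    if r0 == 1 { Some(k0.rem_euclid(m as i128) as u64) } else { None }
}
```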
This is likely why the extended Stein’s algorithm is unpopular. We’ll use tricks tailored to modular inverse, but the general-purpose case covered at the end of the post essentially boils down to “compute modular inverse and post-process”. I believe it can still be faster than the textbook implementation, but I haven’t tested it.
We can track the coefficients as fractions to stay in integers. The most efficient approach uses the same denominator $2^p$ for all variables:

$$\begin{cases} a = 2^{-p} (k_0 a_0 + l_0 b_0) \\ b = 2^{-p} (k_1 a_0 + l_1 b_0) \end{cases}$$

We start with $p = 0$. Instead of dividing $k_0, l_0$ by $2^q$, we increase $p$ by $q$ and multiply $k_1, l_1$ by $2^q$. Subtraction can ignore $p$ because all coefficients use the same precision.
This seems pointless at first, since we need to know $2^{-p} \bmod b_0$, but if the modulus is fixed, we can precompute it. Each iteration reduces the total bit length of $a$ and $b$ by at least $q$, and after the last right-shift $a, b \neq 0$, so if the input numbers fit in $k$ bits, the sum of the $q$ values (and thus $p$) is bounded by $2k - 2$. This means that we can increase the precision to $2k - 2$ at the end and use a single precomputed value $2^{-(2k-2)} \bmod b_0$.
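If you’re wondering how to precompute $2^{-(2k-2)} \bmod b_0$ without already having an inversion routine: halving modulo an odd $b_0$ is always exact, so repeatedly halving 1 works. A sketch (the function name is mine; for the 32-bit case below, the precomputed constant would be inverse_of_pow2(62, b0 as u64)):

```rust
// Computes 2^(-e) mod b0 for odd b0 by halving 1 modulo b0, e times.
// The odd case is written overflow-free: (x + b0) / 2 = x/2 + b0/2 + 1.
fn inverse_of_pow2(e: u32, b0: u64) -> u64 {
    let mut x = 1u64;
    for _ in 0..e {
        x = if x % 2 == 0 { x / 2 } else { x / 2 + b0 / 2 + 1 };
    }
    x
}
```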
The multitude of variables is getting confusing, so let’s simplify it. We’re looking for $k_1 \bmod b_0$ and don’t care about $l_i$, so tracking just $k_0$ and $k_1$ suffices. Let’s rename these variables to $u$ and $v$ respectively to get rid of indices. This gives us:
// Example for 32-bit inputs (k = 32).
let mut u = 1;
let mut v = 0;
let mut p = 0;
let mut q = a.trailing_zeros();
while a != 0 {
a >>= q;
v <<= q;
p += q;
q = (a - b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(u, v) = (v, u);
} else {
(a, b) = (a - b, b);
}
u -= v;
}
assert!(b == 1, "not invertible");
v <<= 62 - p;
return (v * inverse_of_2p62) % b0;
We don’t apply the latency-reducing trick to u and v because the latency is dominated by other calculations. Computing both u - v and v - u would most likely reduce performance, since we’re already pushing the CPU limit of parallel operations.
It’s easy to prove by induction that at the beginning of each iteration,
$$\begin{cases} -2^{p+1} < u < 2^{p+1} \\ -2^p < v \le 2^p \end{cases}$$

This means that $u$ and $v$ fit in signed $(p+2)$-bit integers. Since $p \le 2k - 2$, that amounts to $2k$-bit types, i.e. twice as wide as the input. And that’s a problem: while it works just fine for 32-bit inputs, 64-bit inputs require i128 arithmetic, which slows down the algorithm considerably. We’ll discuss what to do about it in a bit.
Before we do this, though, let’s finish the 32-bit case. There’s just one thing left to improve: computing $v \cdot 2^{-62} \bmod b_0$.
On the face of it, this is one multiplication and one reduction, but Montgomery multiplication demonstrates that these operations can be performed faster together.
Assume for a moment that $v$ is non-negative. The idea is to subtract a multiple of $b_0$ from $v$ such that the bottom 62 bits become zero, so that the remainder remains the same, but the division by $2^{62}$ can be performed with a shift. We’re looking for $t$ such that

$$v - t \cdot b_0 \equiv 0 \pmod{2^{62}}$$

This is equivalent to $t \equiv v \cdot b_0^{-1} \pmod{2^{62}}$, and by precomputing $j = b_0^{-1} \bmod 2^{62}$, we obtain $v - (vj \bmod 2^{62})\, b_0$ as the easily divisible value. Since $v$ and $(vj \bmod 2^{62})\, b_0$ have equal bottom 62 bits,

$$\frac{v - (vj \bmod 2^{62})\, b_0}{2^{62}} = \left\lfloor \frac{v}{2^{62}} \right\rfloor - \left\lfloor \frac{(vj \bmod 2^{62})\, b_0}{2^{62}} \right\rfloor$$

We’ve just found that $v \le 2^p \le 2^{62}$, so unless $v = 2^{62}$ exactly, this is just

$$-\left\lfloor \frac{(vj \bmod 2^{62})\, b_0}{2^{62}} \right\rfloor = -\left\lfloor \frac{(v \cdot 4j \bmod 2^{64})\, b_0}{2^{64}} \right\rfloor$$

This number is in the range $[-b_0 + 1; 0]$. We know that 0 can never be an inverse, so it’s actually $[-b_0 + 1; -1]$, and by adding $b_0$, we obtain the exact remainder. This can be computed with only two multiplications and some glue:
fn redc62(v: i64) -> u32 {
if v == (1 << 62) {
1
} else {
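        // |v|*(4j) mod 2^64 equals 4*(|v|*j mod 2^62), so the high half of the
        // 64x64 product with b0 is floor((|v|*j mod 2^62) * b0 / 2^62),
        // matching the derivation above.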
let x = v.unsigned_abs().wrapping_mul(j << 2).widening_mul(b0 as u64).1 as u32;
if v > 0 { b0 - x } else { x }
}
}
That’s it for 32-bit and smaller inputs. Yay! Buy yourself a cupcake.
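If you want to convince yourself the routine works, a quick (hypothetical) check does the job:

```rust
// A hypothetical sanity check: the product of a0 and its computed inverse
// must be 1 modulo the odd modulus b0.
fn check_inverse(a0: u32, b0: u32, inv: u32) {
    assert_eq!((a0 as u64 * inv as u64) % b0 as u64, 1);
}
```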
For 64-bit inputs, the coefficients only fit in i128. This makes each operation twice as slow. We can reduce $u$ and $v$ modulo $b_0$ on each iteration so that the coefficients fit in 64 bits, since we only need $v \bmod b_0$, but this tanks performance too.
Hmm. Notice that at the beginning of the algorithm, u and v fit in 1 bit and then grow slowly. Only once their length exceeds 64 bits do we need long integers. What if we could somehow reset the length every few iterations, so that 64-bit integers suffice?
Just like $a$ and $b$ can be represented as weighted sums of $a_0, b_0$, $u$ and $v$ can be represented as weighted sums of their earlier versions $u_0, v_0$:

$$\begin{cases} u = f_0 u_0 + g_0 v_0 \\ v = f_1 u_0 + g_1 v_0 \end{cases}$$

The trick is to save $u_0, v_0$ and update the short coefficients $f_i, g_i$ instead of the long values $u, v$ in the loop. We start with $u_0 = 1, v_0 = 0$ and trivial coefficients:

$$\begin{cases} u = u_0 = 1 \cdot u_0 + 0 \cdot v_0 \\ v = v_0 = 0 \cdot u_0 + 1 \cdot v_0 \end{cases}$$

When the coefficients $f_i, g_i$ grow past 64 bits, we pause, compute $u, v$ based on these formulas, replace $u_0, v_0$ with $u, v$, and reset the coefficients $f_i, g_i$ back to trivial, bringing the length back to 1.
let mut u0 = 1;
let mut v0 = 0;
let mut q = a.trailing_zeros();
while a != 0 {
// The coefficients relating (u, v) to (u0, v0).
    let (mut f0, mut g0) = (1, 0);
    let (mut f1, mut g1) = (0, 1);
let mut p = 0;
// Run the algorithm until p reaches the limit.
while a != 0 && p + q <= 62 {
a >>= q;
f1 <<= q;
g1 <<= q;
p += q;
q = (a - b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(f0, f1) = (f1, f0);
(g0, g1) = (g1, g0);
} else {
(a, b) = (a - b, b);
}
f0 -= f1;
g0 -= g1;
}
// This section means different things depending on the reason the loop stopped:
// - If we ran out of precision, this performs as much of the last action as possible and
// adjusts `q` so that the operation completes on the next iteration.
// - If `a = 0`, this effectively raises the precision of f1/g1 to 62. It doesn't adjust
// `f0, g0` correctly, but this doesn't matter because `u` is not read on the exit path.
a >>= 62 - p;
f1 <<= 62 - p;
g1 <<= 62 - p;
q -= 62 - p;
// Apply the coefficients.
let f0 = redc62(f0);
let g0 = redc62(g0);
let f1 = redc62(f1);
let g1 = redc62(g1);
(u0, v0) = ((f0 * u0 + g0 * v0) % b0, (f1 * u0 + g1 * v0) % b0);
}
assert!(b == 1, "not invertible");
return v0;
The astute among you might realize this doesn’t improve much, since we went from updating two 128-bit numbers in a loop to updating four 64-bit numbers in a loop. But since we apply the exact same operations to fi and gi, we can vectorize them.
We can’t use SIMD because x86 doesn’t have cmov for vector registers, but we can decrease the coefficient length to 32 bits and pack two coefficients into one integer:
$$\begin{cases} c_0 = f_0 + 2^{32} g_0 \\ c_1 = f_1 + 2^{32} g_1 \end{cases}$$
This simplifies the inner loop to:
while a != 0 && p + q <= 30 {
a >>= q;
c1 <<= q;
p += q;
q = (a - b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(c0, c1) = (c1, c0);
} else {
(a, b) = (a - b, b);
}
c0 -= c1;
}
Just like $u$ and $v$, $c_0$ and $c_1$ take $p + 2$ bits, so we limit $p$ by $32 - 2 = 30$. But with care, we can squeeze out one more bit. Recall the inequalities:

$$\begin{cases} -2^{p+1} < u < 2^{p+1} \\ -2^p < v \le 2^p \end{cases}$$

Only $u$ takes $p + 2$ bits. $v$ fits in $p + 1$, if barely: signed integer types represent the range $[-2^p; 2^p - 1]$, while this is $[-2^p + 1; 2^p]$, but the number of distinct values is the same. So even if we run out of the 30-bit limit, we can shift $v$ once more. This affects the code after the inner loop:
// 31 would be 30 without this optimization
a >>= 31 - p;
c1 <<= 31 - p;
q -= 31 - p;
let (f0, g0) = parse_coefficients(c0);
let (f1, g1) = parse_coefficients(c1);
let f0 = redc31(f0);
let g0 = redc31(g0);
let f1 = redc31(f1);
let g1 = redc31(g1);
(u0, v0) = ((f0 * u0 + g0 * v0) % b0, (f1 * u0 + g1 * v0) % b0);
Note that the inner loop is still limited by 30, since it not only shifts v, but also subtracts from u, which could cause an overflow with a limit of 31.
Parsing the coefficients from $c_i$ is slightly tricky due to the unusual signed integer format, but not impossibly so:

$$\mathrm{int}(x) = \begin{cases} x & \text{if } x \le 2^{31} \\ x - 2^{32} & \text{if } x > 2^{31} \end{cases}$$

$$\begin{cases} f_i = \mathrm{int}(c_i \bmod 2^{32}) \\ g_i = \mathrm{int}\!\left(\left\lfloor \frac{c_i + 2^{31} - 1}{2^{32}} \right\rfloor\right) \end{cases}$$

This assumes that $c_i$ is stored in an unsigned type.
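A direct translation of these formulas into Rust might look like this (a sketch; the library’s actual helper may differ):

```rust
// Maps a wrapped 32-bit value in [0, 2^32) back to the signed range
// (-2^31, 2^31] used by the coefficients.
fn int(x: u64) -> i64 {
    if x <= 1 << 31 { x as i64 } else { x as i64 - (1 << 32) }
}

// Unpacks c = f + 2^32 * g (mod 2^64) into the signed coefficients (f, g).
fn parse_coefficients(c: u64) -> (i64, i64) {
    let f = int(c & 0xffff_ffff);
    // Adding 2^31 - 1 before taking the high half compensates for the borrow
    // that a negative f produces there.
    let g = int(c.wrapping_add((1 << 31) - 1) >> 32);
    (f, g)
}
```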
With packed coefficients, the inner loop is similar to the unoptimized version, differing only in $c_0, c_1$ vs $u, v$. This allows us to cheaply combine the two approaches: track the true values $u, v$ until $p$ reaches the 62-bit limit, and only then switch to coefficients. It’s faster than relying on coefficients alone because it recalculates $u_0, v_0$ less often.
The final implementation looks something like this:
let mut u0 = 1;
let mut v0 = 0;
let mut q = a.trailing_zeros();
let mut is_first_iteration = true;
while a != 0 {
// Either coefficients in SWAR format, or the values u/v, depending on the iteration.
let mut c0 = 1;
let mut c1 = if is_first_iteration { 0 } else { 1 << 32 };
let mut p_left = if is_first_iteration { 63 } else { 31 };
while a != 0 && q < p_left { // < instead of <= is load-bearing
a >>= q;
c1 <<= q;
p_left -= q;
q = (a - b).trailing_zeros();
if a < b {
(a, b) = (b - a, a);
(c0, c1) = (c1, c0);
} else {
(a, b) = (a - b, b);
}
c0 -= c1;
}
a >>= p_left;
c1 <<= p_left;
q -= p_left;
if is_first_iteration {
u0 = redc63(c0);
v0 = redc63(c1);
} else {
        let (f0, g0) = parse_coefficients(c0);
        let (f1, g1) = parse_coefficients(c1);
        let f0 = redc31(f0);
        let g0 = redc31(g0);
        let f1 = redc31(f1);
        let g1 = redc31(g1);
        (u0, v0) = ((f0 * u0 + g0 * v0) % b0, (f1 * u0 + g1 * v0) % b0);
}
is_first_iteration = false;
}
assert!(b == 1, "not invertible");
return v0;
We store p_left instead of p so that p_left -= q and q < p_left can be computed with a single instruction.
The 32-bit and 64-bit cases can use the same implementation, as replacing q < p_left with true makes it identical to the 32-bit algorithm, and compilers recognize this.
redc31(x) can be implemented as redc63(x << 32).
And that’s it! You now know a cool way to compute 64-bit modular inverses.
To support a variable $b_0$, we can compute $j = b_0^{-1} \bmod 2^{64}$ at runtime. This can be done very quickly with an algorithm by Jeffrey Hurchalla.
$j$ only exists if $b_0$ is odd. If it’s even, swap $a_0$ and $b_0$. If both are even, divide them by their common power of two and choose whichever becomes odd as $b_0$.
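For completeness, here’s a sketch of that inversion modulo $2^{64}$, following Hurchalla’s approach (the function name is mine):

```rust
// Computes b0^(-1) mod 2^64 for odd b0 with a Newton iteration.
fn inverse_mod_2_64(b0: u64) -> u64 {
    // Hurchalla's initial guess: (3 * b0) XOR 2 is correct in the low 5 bits.
    let mut x = b0.wrapping_mul(3) ^ 2;
    // Each step x = x * (2 - b0 * x) doubles the number of correct low bits:
    // 5 -> 10 -> 20 -> 40 -> 80 >= 64.
    for _ in 0..4 {
        x = x.wrapping_mul(2u64.wrapping_sub(b0.wrapping_mul(x)));
    }
    x
}
```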
To replace the extended Euclidean algorithm, we need to find integers $x, y$ such that:

$$a_0 x + b_0 y = \gcd(a_0, b_0)$$

Luckily, our $v$ is no longer a fraction, but rather a remainder modulo $b_0$, so we can substitute $x = v \bmod b_0$. $y$ can then be computed from the equation:

$$y = \frac{\gcd(a_0, b_0) - a_0 x}{b_0} = \frac{b - a_0 x}{b_0}$$

Since this division is exact, it can be calculated with a multiplication by $j$:

$$y = j \cdot (b - a_0 x) \pmod{2^{64}}$$
Despite this complexity, I believe this method can be faster than the extended Euclidean algorithm, since the auxiliary logic takes constant time, except for computing $j$ in $\mathcal{O}(\log k) = \mathcal{O}(\log \log a)$ time, which is still pretty good.
As a reminder, you can find my code on GitHub. The latency-optimized GCD comes from this post. Using coefficients to reset the bit lengths of $u, v$ comes from this paper, which also covers the case when the values don’t fit in general-purpose registers.
Thanks to many friends of mine for contributing to the benchmarking results, to Ian Qvist for the motivation to complete this post and editorial comments, and to Yuki for saving me from going insane over unexplainable performance phenomena.