PyPy, an alternative runtime for Python, uses a specially created JIT compiler to yield potentially massive speedups over CPython, the conventional Python runtime.
But PyPy’s exemplary performance has often come at the cost of compatibility with the rest of the Python ecosystem, particularly C extensions. And while those issues are improving, the PyPy runtime itself often lags in keeping up to date with the latest Python releases.
Meanwhile, the most recent releases of CPython included the first editions of a JIT compiler native to CPython. The long-term promise there is better performance, and in some workloads, you can already see significant improvements. CPython also has a new alternative build that eliminates the GIL to allow fully free-threaded operations—another avenue of significant performance gains.
Could CPython be on track to displace PyPy for better performance? We ran PyPy and the latest JIT-enabled and no-GIL CPython builds side by side on the same benchmarks, with intriguing results.
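Before comparing numbers, it's worth confirming which of these features a given interpreter actually has turned on. Here's a small sketch of how to check (the `sys._jit` and `sys._is_gil_enabled` hooks are recent CPython additions and absent elsewhere, hence the `hasattr` guards):

```python
import sys
import sysconfig

def runtime_report() -> dict:
    """Summarize which of the newer CPython performance features are active."""
    report = {
        "implementation": sys.implementation.name,  # 'cpython' or 'pypy'
        # Py_GIL_DISABLED is set when this is the free-threaded (no-GIL) build.
        "free_threaded_build": bool(sysconfig.get_config_var("Py_GIL_DISABLED")),
    }
    # sys._is_gil_enabled() exists on recent CPython; on a free-threaded build
    # it reports whether the GIL has been re-enabled at runtime.
    if hasattr(sys, "_is_gil_enabled"):
        report["gil_enabled"] = sys._is_gil_enabled()
    # sys._jit arrived with the experimental JIT; absent on older builds and PyPy.
    if hasattr(sys, "_jit"):
        report["jit_enabled"] = sys._jit.is_enabled()
    return report

print(runtime_report())
```

Running this under each build you're testing removes any doubt about which configuration produced which timing.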
PyPy still kills it at raw math
CPython has always performed poorly in simple numerical operations, due to all the indirection and abstraction required. There’s no such thing in CPython as a primitive, machine-level integer, for instance.
As a result, benchmarks like this one tend to perform quite poorly in CPython:
def transform(n: int):
    q = 0
    for x in range(0, n * 500):
        q += x
    return q

def main():
    return [transform(x) for x in range(1000)]

main()
On a Ryzen 5 3600 with six cores, Python 3.14 takes about 9 seconds to run this benchmark. But PyPy chews through it in around 0.2 seconds.
This also isn’t the kind of workload that benefits from Python’s JIT, at least not yet. With the JIT enabled in 3.14, the time drops only slightly, to around 8 seconds.
But what happens if we use a multi-threaded version of the same code, and throw the no-GIL version of Python at it?
from concurrent.futures import ThreadPoolExecutor

def transform(n: int):
    q = 0
    for x in range(0, n * 500):
        q += x
    return q

def main():
    result = []
    with ThreadPoolExecutor() as pool:
        for x in range(1000):
            result.append(pool.submit(transform, x))
    return [f.result() for f in result]

main()
The difference is dramatic, to say the least. Python 3.14 completes this job in 1.7 seconds. Still not the sub-second results of PyPy, but a big enough jump to make using threads and no-GIL worth it.
What about PyPy and threading? Ironically, running the multithreaded version on PyPy slows it down drastically, with the job taking around 2.1 seconds to run. Blame that on PyPy still having a GIL-like locking mechanism, and therefore no full parallelism across threads. Its JIT compilation is best exploited by running everything in a single thread.
If you’re wondering whether swapping in a process pool for the thread pool would help, the answer is: somewhat, but not enough. A process pool version of the above does speed things up on PyPy, to 1.3 seconds, but that’s still far from its 0.2-second single-threaded time. Process pools and multiprocessing are simply not as well optimized on PyPy as they are in CPython.
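For reference, the process-pool variant is a one-line swap of the pool class. A sketch (not the exact benchmark script; note that `ProcessPoolExecutor` pickles the submitted callable, so `transform` must be defined at module level and the pool created under the `__main__` guard):

```python
from concurrent.futures import ProcessPoolExecutor

def transform(n: int):
    q = 0
    for x in range(0, n * 500):
        q += x
    return q

def main():
    result = []
    # Same structure as the threaded version; only the pool class changes.
    with ProcessPoolExecutor() as pool:
        for x in range(1000):
            result.append(pool.submit(transform, x))
    return [f.result() for f in result]

if __name__ == "__main__":
    main()
```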
To recap, here are the results for the single-threaded script on “vanilla” Python 3.14:
- No JIT, GIL: 9 seconds
- With JIT, GIL: 8 seconds
- No JIT, no-GIL: 9.5 seconds
The no-GIL build is still slightly slower than the regular build for single-threaded operations. The JIT helps a little here, but not much.
Now, consider the same breakdown for Python 3.14 and a process pool:
- No JIT, GIL: 1.75 seconds
- With JIT, GIL: 1.5 seconds
- No JIT, no-GIL: 2 seconds
How about for Python 3.14, using other forms of the script?
- Threaded version with no-GIL: 1.7 seconds
- Multiprocessing version with GIL: 2.3 seconds
- Multiprocessing version with GIL and JIT: 2.4 seconds
- Multiprocessing version with no-GIL: 2.1 seconds
And here’s a summary of how PyPy fares:
- Single-threaded script: 0.2 seconds
- Multithreaded script: 2.1 seconds
- Multiprocessing script: 1.3 seconds
The n-body problem
Another common math-heavy benchmark that vanilla Python is notoriously bad at is the “n-body” simulation. This is also the kind of problem that’s hard to speed up with parallel computation. It is possible, just not simple, so the easiest implementations are single-threaded.
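The benchmark itself isn't reproduced here, but its shape is easy to sketch: an O(n²) pairwise gravity update repeated over many timesteps, exactly the tight numeric looping CPython struggles with. A minimal sketch of the core loop (symplectic Euler; an illustration, not the exact benchmark code):

```python
import math

def advance(positions, velocities, masses, dt, steps):
    """Advance an n-body system `steps` timesteps with symplectic Euler.

    positions/velocities are lists of [x, y, z]; masses is a list of floats.
    The O(n^2) pairwise inner loop is where the interpreter spends its time.
    """
    n = len(masses)
    for _ in range(steps):
        for i in range(n):
            for j in range(i + 1, n):
                dx = positions[i][0] - positions[j][0]
                dy = positions[i][1] - positions[j][1]
                dz = positions[i][2] - positions[j][2]
                d2 = dx * dx + dy * dy + dz * dz
                mag = dt / (d2 * math.sqrt(d2))
                # Equal and opposite impulses keep total momentum conserved.
                velocities[i][0] -= dx * masses[j] * mag
                velocities[i][1] -= dy * masses[j] * mag
                velocities[i][2] -= dz * masses[j] * mag
                velocities[j][0] += dx * masses[i] * mag
                velocities[j][1] += dy * masses[i] * mag
                velocities[j][2] += dz * masses[i] * mag
        for i in range(n):
            positions[i][0] += dt * velocities[i][0]
            positions[i][1] += dt * velocities[i][1]
            positions[i][2] += dt * velocities[i][2]
```

Every iteration is plain attribute-free float arithmetic on boxed Python objects, which is why a tracing JIT like PyPy's pays off so dramatically here.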
If I run the n-body benchmark for 1,000,000 repetitions, I get the following results:
- Python 3.14, no JIT: 7.1 seconds
- Python 3.14, JIT: 5.7 seconds
- Python 3.15a4, no JIT: 7.6 seconds
- Python 3.15a4, JIT: 4.2 seconds
That’s an impressive showing for the JIT-capable editions of Python. But then we see that PyPy chews through the same benchmark in 0.7 seconds—as-is.
Computing pi
Sometimes even PyPy struggles with math-heavy Python programs. Consider this naive implementation to calculate digits of pi. This is another example of a task that can’t be parallelized much, if at all, so we’re using a single-threaded test.
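The exact script used isn't shown, but a typical naive choice (an assumption on my part) is Gibbons' unbounded spigot algorithm, which leans entirely on Python's arbitrary-precision integers, the kind of big-int arithmetic where a JIT has much less room to help:

```python
def pi_digits(n: int) -> list[int]:
    """First n decimal digits of pi via Gibbons' unbounded spigot algorithm.

    The state variables grow into very large Python ints, so the runtime is
    dominated by arbitrary-precision arithmetic rather than tight float loops.
    """
    q, r, t, k, m, x = 1, 0, 1, 1, 3, 3
    digits: list[int] = []
    while len(digits) < n:
        if 4 * q + r - t < m * t:
            digits.append(m)
            q, r, m = 10 * q, 10 * (r - m * t), (10 * (3 * q + r)) // t - 10 * m
        else:
            q, r, t, k, m, x = (
                q * k,
                (2 * q + r) * x,
                t * x,
                k + 1,
                (q * (7 * k + 2) + r * x) // (t * x),
                x + 2,
            )
    return digits

print(pi_digits(10))  # [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
```

Since both runtimes delegate big-int math to similar long-integer code paths, the JIT's usual advantage mostly evaporates, consistent with the results below.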
When run for 20,000 digits, here’s what came out:
- Python 3.14, no JIT: 13.6 seconds
- Python 3.14, JIT: 13.5 seconds
- Python 3.15, no JIT: 13.7 seconds
- Python 3.15, JIT: 13.5 seconds
- PyPy: 19.1 seconds
It’s uncommon, but hardly impossible, for PyPy’s performance to be worse than regular Python’s. What’s surprising is to see it happen in a scenario where you’d expect PyPy to excel.
CPython is getting competitive for other kinds of work
Another benchmark I’ve used often with Python is a variant of the Google n-gram benchmark, which processes a multi-megabyte CSV file and generates some statistics about it. That makes it more I/O-bound than the previous, CPU-bound benchmarks, but it still yields useful information about the speed of the runtime.
I’ve written three incarnations of this benchmark: single-threaded, multi-threaded, and multi-process. Here’s the single-threaded version:
import collections
import time
import gc
import sys

try:
    print("JIT enabled:", sys._jit.is_enabled())
except Exception:
    ...

def main():
    line: str
    fields: list[str]
    sum_by_key: dict = {}
    start = time.time()
    with open("ngrams.tsv", encoding="utf-8", buffering=2 << 24) as file:
        for line in file:
            try:
                fields = line.split("\t", 3)
            except Exception:
                continue
            try:
                sum_by_key[fields[1]] += int(fields[2])
            except KeyError:
                sum_by_key[fields[1]] = int(fields[2])
    summation = collections.Counter(sum_by_key)
    max_entry = summation.most_common(1)
    stop = time.time()
    print(stop - start)
    if len(max_entry) == 0:
        print("No entries")
    else:
        print("max_key:", max_entry[0][0], "sum:", max_entry[0][1])

try:
    gc.freeze()
    gc.disable()
except Exception:
    ...

main()
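The multi-threaded incarnation isn't reproduced in full, but its shape can be sketched (an assumed structure, not the exact script): split the lines across worker threads, let each build a private dict so no locking is needed, and merge the partial tallies at the end.

```python
import collections
from concurrent.futures import ThreadPoolExecutor

def count_chunk(lines: list[str]) -> dict:
    """Tally the numeric column per key for one chunk of TSV lines."""
    local: dict = {}
    for line in lines:
        fields = line.split("\t", 3)
        try:
            local[fields[1]] = local.get(fields[1], 0) + int(fields[2])
        except (IndexError, ValueError):
            continue
    return local

def main(path: str = "ngrams.tsv", workers: int = 4):
    with open(path, encoding="utf-8") as file:
        lines = file.readlines()
    # One slice per worker; each thread fills a private dict, so the only
    # shared-state work is the final Counter merge.
    step = max(1, len(lines) // workers)
    chunks = [lines[i:i + step] for i in range(0, len(lines), step)]
    total: collections.Counter = collections.Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(count_chunk, chunks):
            total.update(partial)
    return total.most_common(1)
```

On the free-threaded build, those workers genuinely run in parallel; on a GIL build (or PyPy), they mostly take turns, which is why the same script produces such different numbers below.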
Here’s how Python 3.14 handles this benchmark with different versions of the script:
- Single-threaded, GIL: 4.2 seconds
- Single-threaded, JIT, GIL: 3.7 seconds
- Multi-threaded, no-GIL: 1.05 seconds
- Multi-processing, GIL: 2.42 seconds
- Multi-processing, JIT, GIL: 2.4 seconds
- Multi-processing, no-GIL: 2.1 seconds
And here’s the same picture with PyPy:
- Single-threaded: 2.75 seconds
- Multi-threaded: 14.3 seconds (not a typo!)
- Multi-processing: 8.7 seconds
In other words, for this scenario, the CPython no-GIL multithreaded version beats even PyPy at its most optimal. As yet, there is no build of CPython that enables the JIT and uses free threading, but such a version is not far away and could easily change the picture even further.
Conclusion
In sum, PyPy running the most basic, unoptimized version of a math-heavy script still outperforms CPython. But CPython gets drastic relative improvements from using free-threading and even multiprocessing, where possible.
While PyPy cannot take advantage of those built-in features, its base speed is fast enough that using threading or multiprocessing for some jobs isn’t really required. For instance, the n-body problem is hard to parallelize well, and computing pi can hardly be parallelized at all, so it’s a boon to be able to run single-threaded versions of those algorithms fast.
What stands out most from these tests is that PyPy’s benefits are neither universal nor consistent. They vary widely depending on the workload, and even within a single program, different sections can behave very differently. Some programs run tremendously fast under PyPy, but it’s not easy to tell in advance which ones. The only way to know is to benchmark your application.
Something else to note is that one of the major avenues toward better performance and parallelism for Python generally—free-threading—isn’t currently available for PyPy. Multiprocessing doesn’t work well in PyPy either, due to it having a much slower data serialization mechanism between processes than CPython does.
As fast as PyPy can be, the benchmarks here show the benefits of true parallelism with threads in some scenarios. PyPy’s developers might find a way to implement that in time, but it’s unlikely they’d be able to do it by directly repurposing what CPython already has, given how different PyPy and CPython are under the hood.