As data scientists, ML engineers, and software developers, optimising every bit of performance from our codebases can be a crucial consideration. If you are a Python user, you'll be aware of some of its deficits in this respect. Python is considered a slow language, and you've probably heard that much of the blame lies with its Global Interpreter Lock (GIL) mechanism.
It is what it is, but what can we do about it? There are several ways we can ameliorate this issue when coding in Python, especially if you're using a reasonably up-to-date version.
- The very latest releases of Python have a way of running code without using the GIL.
- We can utilise high-performance third-party libraries, such as NumPy, to perform number crunching.
- There are also many methods for parallel and concurrent processing built into the language now.

One other method we can use is to call other high-performance languages from within Python for time-critical sections of our code. That's what we'll cover in this article as I show you how to call Mojo code from Python.
Have you heard of Mojo before? If not, here’s a quick history lesson.
Mojo is a relatively new systems-level language developed by Modular Inc. (an AI infrastructure company co-founded in 2022 by compiler-writing legend **Chris Lattner**, of LLVM and Swift fame, and former Google TPU lead Tim Davis) and first shown publicly in May 2023.
It was born from a simple pain point: the Python performance problem we discussed earlier. Mojo tackles this head-on by grafting a superset of Python's syntax onto an LLVM/MLIR-based compiler pipeline that delivers zero-cost abstractions, static typing, ownership-based memory management, automatic vectorisation, and seamless code generation for CPU and GPU accelerators.
Early benchmarks demoed at its launch ran kernel-dense workloads up to 35,000× faster than vanilla Python, proving that Mojo can match — or exceed — the raw throughput of C/CUDA while letting developers stay in familiar "pythonic" territory.
However, there is always a stumbling block, and in this case it's folks' reluctance to move wholesale to a new language. I'm one of those people, too, so I was delighted to read that, as of a few weeks ago, it is now possible to call Mojo code directly from Python.
Does this mean we get the best of both worlds: the simplicity of Python and the performance of Mojo?
To test the claims, we'll write three example programs in vanilla Python. For each, we'll also code a version using NumPy and, finally, a Python version that offloads some of its computation to a Mojo module. We'll then compare the various run times.
Will we see significant performance gains? Read on to find out.
Setting up a development environment
I'll be using Ubuntu under WSL2 on Windows for my development. Best practice is to set up a new development environment for each project you're working on. I usually use conda for this, but as everyone and their granny seems to be moving towards the new uv package manager, I'm going to give that a go instead. There are a couple of ways you can install uv.
$ curl -LsSf https://astral.sh/uv/install.sh | sh
or...
$ pip install uv
Next, initialise a project.
$ uv init mojo-test
Initialized project `mojo-test` at `/home/tom/projects/mojo-test`
$ cd mojo-test
$ uv venv
$ source .venv/bin/activate
(mojo-test) $ ls -al
total 28
drwxr-xr-x 3 tom tom 4096 Jun 27 09:20 .
drwxr-xr-x 15 tom tom 4096 Jun 27 09:20 ..
drwxr-xr-x 7 tom tom 4096 Jun 27 09:20 .git
-rw-r--r-- 1 tom tom 109 Jun 27 09:20 .gitignore
-rw-r--r-- 1 tom tom 5 Jun 27 09:20 .python-version
-rw-r--r-- 1 tom tom 0 Jun 27 09:20 README.md
-rw-r--r-- 1 tom tom 87 Jun 27 09:20 main.py
-rw-r--r-- 1 tom tom 155 Jun 27 09:20 pyproject.toml
Now, add the external libraries we need.
(mojo-test) $ uv pip install modular numpy matplotlib
How does calling Mojo from Python work?
Let’s assume we have the following simple Mojo function that takes a Python variable as an argument and adds two to its value. For example,
# mojo_func.mojo
#
fn add_two(py_obj: PythonObject) raises -> PythonObject:
    var n = Int(py_obj)
    return n + 2
When Python imports our module (with import mojo_func), it looks for a function called PyInit_mojo_func(). Within PyInit_mojo_func(), we have to declare all the Mojo functions and types that are callable from Python using PythonModuleBuilder. So, in fact, our Mojo code in its final form will resemble this.
from python import PythonObject
from python.bindings import PythonModuleBuilder
from os import abort

@export
fn PyInit_mojo_func() -> PythonObject:
    try:
        var m = PythonModuleBuilder("mojo_func")
        m.def_function[add_two]("add_two", docstring="Add 2 to n")
        return m.finalize()
    except e:
        return abort[PythonObject](String("Error creating Python Mojo module: ", e))

fn add_two(py_obj: PythonObject) raises -> PythonObject:
    var n = Int(py_obj)
    return n + 2
The Python code requires additional boilerplate code to function correctly, as shown here.
import max.mojo.importer
import sys
sys.path.insert(0, "")
import mojo_func
print(mojo_func.add_two(5))
# Should print 7
Code examples
For each of my examples, I’ll show three different versions of the code. One will be written in pure Python, one will utilise NumPy to speed things up, and the other will substitute calls to Mojo where appropriate.
Be warned that calling Mojo code from Python is in early development. You can expect significant changes to the API and ergonomics.
Example 1 — Calculating a Mandelbrot set
For our first example, we’ll compute and display a Mandelbrot set. This is quite computationally expensive, and as we’ll see, the pure Python version takes a considerable amount of time to complete.
We’ll need four files in total for this example.
1/ mandelbrot_pure_py.py
# mandelbrot_pure_py.py
def compute(width, height, max_iters):
    """Generates a Mandelbrot set image using pure Python."""
    image = [[0] * width for _ in range(height)]
    for row in range(height):
        for col in range(width):
            c = complex(-2.0 + 3.0 * col / width, -1.5 + 3.0 * row / height)
            z = 0
            n = 0
            while abs(z) <= 2 and n < max_iters:
                z = z*z + c
                n += 1
            image[row][col] = n
    return image
2/ mandelbrot_numpy.py
# mandelbrot_numpy.py
import numpy as np

def compute(width, height, max_iters):
    """Generates a Mandelbrot set using NumPy for vectorized computation."""
    x = np.linspace(-2.0, 1.0, width)
    y = np.linspace(-1.5, 1.5, height)
    c = x[:, np.newaxis] + 1j * y[np.newaxis, :]
    z = np.zeros_like(c, dtype=np.complex128)
    image = np.zeros(c.shape, dtype=int)
    for n in range(max_iters):
        not_diverged = np.abs(z) <= 2
        image[not_diverged] = n
        z[not_diverged] = z[not_diverged]**2 + c[not_diverged]
    image[np.abs(z) <= 2] = max_iters
    return image.T
3/ mandelbrot_mojo.mojo
# mandelbrot_mojo.mojo
from python import PythonObject, Python
from python.bindings import PythonModuleBuilder
from os import abort
from complex import ComplexFloat64

# This is the core logic that will run fast in Mojo
fn compute_mandel_pixel(c: ComplexFloat64, max_iters: Int) -> Int:
    var z = ComplexFloat64(0, 0)
    var n: Int = 0
    while n < max_iters:
        # abs(z) > 2 is the same as z.norm() > 4, which is faster
        if z.norm() > 4.0:
            break
        z = z * z + c
        n += 1
    return n

# This is the function that Python will call
fn mandelbrot_mojo_compute(width_obj: PythonObject, height_obj: PythonObject, max_iters_obj: PythonObject) raises -> PythonObject:
    var width = Int(width_obj)
    var height = Int(height_obj)
    var max_iters = Int(max_iters_obj)

    # We will build a Python list in Mojo to return the results
    var image_list = Python.list()
    for row in range(height):
        # We create a nested list to represent the 2D image
        var row_list = Python.list()
        for col in range(width):
            var c = ComplexFloat64(
                -2.0 + 3.0 * col / width,
                -1.5 + 3.0 * row / height
            )
            var n = compute_mandel_pixel(c, max_iters)
            row_list.append(n)
        image_list.append(row_list)
    return image_list

# This is the special function that "exports" our Mojo function to Python
@export
fn PyInit_mandelbrot_mojo() -> PythonObject:
    try:
        var m = PythonModuleBuilder("mandelbrot_mojo")
        m.def_function[mandelbrot_mojo_compute]("compute", "Generates a Mandelbrot set.")
        return m.finalize()
    except e:
        return abort[PythonObject]("error creating mandelbrot_mojo module")
4/ main.py
This will call the other three programs and also allow us to plot out the Mandelbrot graph in a Jupyter Notebook. I’ll only show the plot once. You’ll have to take my word that it was plotted correctly on all three runs of the code.
# main.py (Final version with visualization)
import time
import numpy as np
import sys
import matplotlib.pyplot as plt  # Now, import pyplot

# --- Mojo Setup ---
try:
    import max.mojo.importer
except ImportError:
    print("Mojo importer not found. Please ensure the MODULAR_HOME and PATH are set correctly.")
    sys.exit(1)

sys.path.insert(0, "")

# --- Import Our Modules ---
import mandelbrot_pure_py
import mandelbrot_numpy
import mandelbrot_mojo

# --- Visualization Function ---
def visualize_mandelbrot(image_data, title="Mandelbrot Set"):
    """Displays the Mandelbrot set data as an image using Matplotlib."""
    print(f"Displaying image for: {title}")
    plt.figure(figsize=(10, 8))
    # 'hot', 'inferno', and 'plasma' are all great colormaps for this
    plt.imshow(image_data, cmap='hot', interpolation='bicubic')
    plt.colorbar(label="Iterations")
    plt.title(title)
    plt.xlabel("Width")
    plt.ylabel("Height")
    plt.show()

# --- Test Runner ---
def run_test(name, compute_func, *args):
    """A helper function to run and time a test."""
    print(f"Running {name} version...")
    start_time = time.time()
    # The compute function returns the image data
    result_data = compute_func(*args)
    duration = time.time() - start_time
    print(f"-> {name} version took: {duration:.4f} seconds")
    # Return the data so we can visualize it
    return result_data

if __name__ == "__main__":
    WIDTH, HEIGHT, MAX_ITERS = 800, 600, 5000
    print("Starting Mandelbrot performance comparison...")
    print("-" * 40)

    # Run Pure Python Test
    py_image = run_test("Pure Python", mandelbrot_pure_py.compute, WIDTH, HEIGHT, MAX_ITERS)
    visualize_mandelbrot(py_image, "Pure Python Mandelbrot")
    print("-" * 40)

    # Run NumPy Test
    np_image = run_test("NumPy", mandelbrot_numpy.compute, WIDTH, HEIGHT, MAX_ITERS)
    # uncomment the below line if you want to see the plot
    # visualize_mandelbrot(np_image, "NumPy Mandelbrot")
    print("-" * 40)

    # Run Mojo Test
    mojo_list_of_lists = run_test("Mojo", mandelbrot_mojo.compute, WIDTH, HEIGHT, MAX_ITERS)
    # Convert Mojo's list of lists into a NumPy array for visualization
    mojo_image = np.array(mojo_list_of_lists)
    # uncomment the below line if you want to see the plot
    # visualize_mandelbrot(mojo_image, "Mojo Mandelbrot")
    print("-" * 40)

    print("Comparison complete.")
Finally, here is the output.
Timing results for the three Mandelbrot implementations (Image by Author)
Ok, so that’s an impressive start for Mojo. It was almost 20 times faster than the pure Python implementation and 5 times faster than the NumPy code.
Example 2 — Numerical integration
For this example, we will perform numerical integration using Simpson's rule to approximate the integral of sin(x) over the interval 0 to π. Recall that Simpson's rule is a method of calculating an approximate value for an integral and is defined as,
∫ f(x) dx ≈ (h/3) × [f(x₀) + 4f(x₁) + 2f(x₂) + 4f(x₃) + … + 2f(xₙ₋₂) + 4f(xₙ₋₁) + f(xₙ)]
Where:
- h is the width of each step.
- The weights are 1, 4, 2, 4, 2, …, 4, 1.
- The first and last points have a weight of 1.
- The interior points at odd indices have a weight of 4.
- The interior points at even indices have a weight of 2.

The true value of the integral we're trying to calculate is two. Let's see how accurate (and fast) our methods are.
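As a quick sanity check of the formula, take just n = 2 intervals, so h = π/2. The rule gives (π/6) × [sin(0) + 4·sin(π/2) + sin(π)] = 4π/6 ≈ 2.094, already within 5% of the true value. The implementations below use 100 million steps and get far closer.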
Once again, we need four files.
1/ integration_pure_py.py
# integration_pure_py.py
import math

def compute(start, end, n):
    """Calculates the definite integral of sin(x) using Simpson's rule."""
    if n % 2 != 0:
        n += 1  # Simpson's rule requires an even number of intervals
    h = (end - start) / n
    integral = math.sin(start) + math.sin(end)
    for i in range(1, n, 2):
        integral += 4 * math.sin(start + i * h)
    for i in range(2, n, 2):
        integral += 2 * math.sin(start + i * h)
    integral *= h / 3
    return integral
2/ integration_numpy.py
# integration_numpy.py
import numpy as np

def compute(start, end, n):
    """Calculates the definite integral of sin(x) using NumPy."""
    if n % 2 != 0:
        n += 1
    x = np.linspace(start, end, n + 1)
    y = np.sin(x)
    # Apply Simpson's rule weights: 1, 4, 2, 4, ..., 2, 4, 1
    integral = (y[0] + y[-1] + 4 * np.sum(y[1:-1:2]) + 2 * np.sum(y[2:-1:2]))
    h = (end - start) / n
    return integral * h / 3
3/ integration_mojo.mojo
# integration_mojo.mojo
from python import PythonObject, Python
from python.bindings import PythonModuleBuilder
from os import abort
from math import sin

# Note: The 'fn' keyword is used here as it's compatible with all versions.
fn compute_integral_mojo(start_obj: PythonObject, end_obj: PythonObject, n_obj: PythonObject) raises -> PythonObject:
    # Bridge crossing happens ONCE at the start.
    var start = Float64(start_obj)
    var end = Float64(end_obj)
    var n = Int(n_obj)

    if n % 2 != 0:
        n += 1

    var h = (end - start) / n

    # All computation below is on NATIVE Mojo types. No Python interop.
    var integral = sin(start) + sin(end)

    # First loop for the '4 * f(x)' terms
    var i_1: Int = 1
    while i_1 < n:
        integral += 4 * sin(start + i_1 * h)
        i_1 += 2

    # Second loop for the '2 * f(x)' terms
    var i_2: Int = 2
    while i_2 < n:
        integral += 2 * sin(start + i_2 * h)
        i_2 += 2

    integral *= h / 3

    # Bridge crossing happens ONCE at the end.
    return Python.float(integral)

@export
fn PyInit_integration_mojo() -> PythonObject:
    try:
        var m = PythonModuleBuilder("integration_mojo")
        m.def_function[compute_integral_mojo]("compute", "Calculates a definite integral in Mojo.")
        return m.finalize()
    except e:
        return abort[PythonObject]("error creating integration_mojo module")
4/ main.py
import time
import sys
import numpy as np

# --- Mojo Setup ---
try:
    import max.mojo.importer
except ImportError:
    print("Mojo importer not found. Please ensure your environment is set up correctly.")
    sys.exit(1)

sys.path.insert(0, "")

# --- Import Our Modules ---
import integration_pure_py
import integration_numpy
import integration_mojo

# --- Test Runner ---
def run_test(name, compute_func, *args):
    print(f"Running {name} version...")
    start_time = time.time()
    result = compute_func(*args)
    duration = time.time() - start_time
    print(f"-> {name} version took: {duration:.4f} seconds")
    print(f"   Result: {result}")

# --- Main Test Execution ---
if __name__ == "__main__":
    # Use a very large number of steps to highlight loop performance
    START = 0.0
    END = np.pi
    NUM_STEPS = 100_000_000  # 100 million steps

    print(f"Calculating integral of sin(x) from {START} to {END:.2f} with {NUM_STEPS:,} steps...")
    print("-" * 50)

    run_test("Pure Python", integration_pure_py.compute, START, END, NUM_STEPS)
    print("-" * 50)

    run_test("NumPy", integration_numpy.compute, START, END, NUM_STEPS)
    print("-" * 50)

    run_test("Mojo", integration_mojo.compute, START, END, NUM_STEPS)
    print("-" * 50)

    print("Comparison complete.")
And the results?
Calculating integral of sin(x) from 0.0 to 3.14 with 100,000,000 steps...
--------------------------------------------------
Running Pure Python version...
-> Pure Python version took: 4.9484 seconds
Result: 2.0000000000000346
--------------------------------------------------
Running NumPy version...
-> NumPy version took: 0.7425 seconds
Result: 1.9999999999999998
--------------------------------------------------
Running Mojo version...
-> Mojo version took: 0.8902 seconds
Result: 2.0000000000000346
--------------------------------------------------
Comparison complete.
It’s interesting that this time, the NumPy code was marginally faster than the Mojo code, and its final value was more accurate. This highlights a key concept in high-performance computing: the trade-off between vectorisation and JIT-compiled loops.
NumPy’s strength lies in its ability to vectorise operations. It allocates a large block of memory and then calls highly optimised, pre-compiled C code that leverages modern CPU features, such as SIMD, to perform the sin() function on millions of values simultaneously. This “burst processing” is incredibly efficient.
Mojo, on the other hand, takes our simple while loop and JIT-compiles it into highly efficient machine code. While this avoids the large initial memory allocation of NumPy, in this specific case, the raw power of NumPy’s vectorisation gave it a slight edge.
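To make that contrast concrete, here is a tiny, illustrative comparison of the two styles (it's not part of the benchmark, and exact timings will vary with your hardware, but the vectorised call is typically an order of magnitude or more faster than the interpreted loop):
import math
import time
import numpy as np

x = np.linspace(0.0, math.pi, 5_000_000)

# Vectorised: a single call into NumPy's pre-compiled C kernels
t0 = time.perf_counter()
vec_total = float(np.sin(x).sum())
print(f"vectorised sum: {vec_total:.4f} in {time.perf_counter() - t0:.3f} s")

# Interpreted loop: one Python-level math.sin() call per element
t0 = time.perf_counter()
loop_total = 0.0
for v in x:
    loop_total += math.sin(v)
print(f"looped sum:     {loop_total:.4f} in {time.perf_counter() - t0:.3f} s")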
Example 3 — The sigmoid function
The sigmoid function is an important concept in AI as it’s the cornerstone of binary classification.
Also known as the logistic function, it is defined as S(x) = 1 / (1 + e^(-x)).
The sigmoid function takes any real-valued input x and “squashes” it smoothly into the open interval (0,1). In simple terms, no matter what is passed to the sigmoid function, it will always return a value between 0 and 1.
So, for example,
S(-197865) ≈ 0
S(-2) ≈ 0.1192029
S(3) ≈ 0.9525741
S(10776.87) ≈ 1
This makes it perfect for representing certain things like probabilities.
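If you want to check those numbers yourself, a couple of lines of plain Python (not part of the benchmark) will do it:
import math

def sigmoid(x):
    # The logistic function: squashes any real x into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

print(sigmoid(-2))        # ≈ 0.1192029
print(sigmoid(3))         # ≈ 0.9525741
print(sigmoid(10776.87))  # exp(-10776.87) underflows to 0.0, so this prints 1.0
Note that the very large negative input above would overflow math.exp in this naive form, which is one reason numerical libraries use more stable formulations.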
Because the pure-Python and NumPy versions are so short, we can include them directly in the benchmarking script, so we only need two files this time.
sigmoid_mojo.mojo
from python import Python, PythonObject
from python.bindings import PythonModuleBuilder
from os import abort
from math import exp
from time import perf_counter

# ----------------------------------------------------------------------
# Fast Mojo routine (no Python calls inside)
# ----------------------------------------------------------------------
fn sigmoid_sum(n: Int) -> (Float64, Float64):
    # deterministic fill, sized once
    var data = List[Float64](length = n, fill = 0.0)
    for i in range(n):
        data[i] = (Float64(i) / Float64(n)) * 10.0 - 5.0  # [-5, +5]

    var t0: Float64 = perf_counter()
    var total: Float64 = 0.0
    for x in data:  # single tight loop
        total += 1.0 / (1.0 + exp(-x))
    var elapsed: Float64 = perf_counter() - t0

    return (total, elapsed)

# ----------------------------------------------------------------------
# Python-visible wrapper
# ----------------------------------------------------------------------
fn py_sigmoid_sum(n_obj: PythonObject) raises -> PythonObject:
    var n: Int = Int(n_obj)  # validates arg
    var (tot, secs) = sigmoid_sum(n)

    # safest container: build a Python list (auto-boxes scalars)
    var out = Python.list()
    out.append(tot)
    out.append(secs)
    return out  # -> PythonObject

# ----------------------------------------------------------------------
# Module initialiser (name must match: PyInit_sigmoid_mojo)
# ----------------------------------------------------------------------
@export
fn PyInit_sigmoid_mojo() -> PythonObject:
    try:
        var m = PythonModuleBuilder("sigmoid_mojo")
        m.def_function[py_sigmoid_sum](
            "sigmoid_sum",
            "Return [total_sigmoid, elapsed_seconds]"
        )
        return m.finalize()
    except e:
        # if anything raises, give Python a real ImportError
        return abort[PythonObject]("error creating sigmoid_mojo module")
sigmoid_bench.py
# sigmoid_bench.py
import time, math, numpy as np
N = 50_000_000
# --------------------------- pure-Python -----------------------------------
py_data = [(i / N) * 10.0 - 5.0 for i in range(N)]
t0 = time.perf_counter()
py_total = sum(1 / (1 + math.exp(-x)) for x in py_data)
print(f"Pure-Python : {time.perf_counter()-t0:6.3f} s - Σσ={py_total:,.1f}")
# --------------------------- NumPy -----------------------------------------
np_data = np.linspace(-5.0, 5.0, N, dtype=np.float64)
t0 = time.perf_counter()
np_total = float(np.sum(1 / (1 + np.exp(-np_data))))
print(f"NumPy : {time.perf_counter()-t0:6.3f} s - Σσ={np_total:,.1f}")
# --------------------------- Mojo ------------------------------------------
import max.mojo.importer # installs .mojo import hook
import sigmoid_mojo # compiles & loads shared object
mj_total, mj_secs = sigmoid_mojo.sigmoid_sum(N)
print(f"Mojo : {mj_secs:6.3f} s - Σσ={mj_total:,.1f}")
Here is the output.
$ python sigmoid_bench.py
Pure-Python : 1.847 s - Σσ=24,999,999.5
NumPy : 0.323 s - Σσ=25,000,000.0
Mojo : 0.150 s - Σσ=24,999,999.5
The Σσ=… outputs show the sum of all the calculated sigmoid values. In theory, this sum should approach N/2 as N grows.
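Why N/2? Because S(x) + S(-x) = 1/(1 + e^(-x)) + 1/(1 + e^(x)) = 1, and our inputs are (almost) symmetric about zero, so the samples pair up into roughly N/2 pairs that each sum to 1.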
But as we see, the Mojo implementation represents a decent uplift of over 2x on the already fast NumPy code and is over 12x faster than the base Python implementation.
Not too shabby.
Summary
This article explored the exciting new capability of calling high-performance Mojo code directly from Python to accelerate computationally intensive tasks. Mojo, a relatively new systems programming language from Modular, promises C-level performance with a familiar Pythonic syntax, aiming to solve Python’s historical speed limitations.
To test this promise, we benchmarked three computationally expensive scenarios (Mandelbrot set generation, numerical integration, and a large sigmoid summation), implementing each in pure Python, optimised NumPy, and a hybrid Python-Mojo approach.
The results reveal a nuanced performance landscape. For loop-heavy algorithms where the data can be processed entirely with native Mojo types, Mojo can significantly outperform both pure Python and even highly optimised NumPy code. However, we also saw that for tasks that align perfectly with NumPy's vectorised, pre-compiled C functions, NumPy can maintain a slight edge over Mojo.
This investigation demonstrates that while Mojo is a powerful new tool for Python acceleration, achieving maximum performance requires a thoughtful approach to minimising the “bridge-crossing” overhead between the two language runtimes.
As always, when considering performance enhancements to your code, test, test, test. That is the final arbiter as to whether it’s worthwhile or not.