What Lies Beneath
In my opinion, Apple shares a lot of design and engineering ethos with Mercedes-Benz. Both tend to overbuild their consumer products, offering far more performance and polish than we really need. Take the Mercedes G-Class. Originally a military vehicle, it can race up 60-degree rocky grades and ford streams, yet I only ever see them in urban parking garages. Similarly, the M-series chips in every Mac sold since 2023 are rarely put to work on scientific compute. The M series is a clean-sheet design with a single unified memory pool (up to 512GB!), CPU, GPU, and a novel NPU, the Apple Neural Engine. For basically any everyday task or entertainment, these chips don’t break a sweat.
Wait a minute, you might be thinking, isn’t this basically a high-performance CUDA-type setup with larger memory? It is! PyTorch already supports acceleration on the Apple GPU through its MPS backend, as in the sketch below. Is this slept on, by and large? Absolutely. Academic software like bioinformatics tends to stay in its Linux lane, and very few non-gaming devs seem to be interested in Apple for science.
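For instance, pointing PyTorch at the Apple GPU is a one-line device change via the MPS backend (a minimal sketch; the tensor sizes are arbitrary):

import torch

# Select Apple's GPU through the Metal Performance Shaders (MPS) backend
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

x = torch.randn(4096, 4096, device=device)
y = x @ x  # the matmul runs on the M-series GPU when MPS is available
print(y.device)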
Gamechanger: MLX
Apple’s MLX framework, a souped-up NumPy for M-series chips, is great for science workloads. Unlike PyTorch or TensorFlow, which treat Apple’s unified memory architecture as an afterthought, MLX was designed from the ground up to take advantage of it.
This matters for protein folding because models like AlphaFold are memory-intensive. Traditional frameworks shuttle data between CPU and GPU, creating bottlenecks. With MLX there is no RAM/VRAM split; it’s all one big pool.
Key Insight
Unified memory means the model weights and input data live in the same memory space. No copying, no waiting, just computation.
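Here’s a quick illustration (a minimal sketch; the array sizes are arbitrary). The same arrays can be handed to CPU or GPU operations with no device transfer in between:

import mlx.core as mx

# Arrays are allocated in unified memory; there is no .to(device) / .cuda() step.
w = mx.random.normal((8192, 8192))
x = mx.random.normal((8192, 512))

# The same buffers can be consumed by either the CPU or the GPU, no copies.
y_gpu = mx.matmul(w, x, stream=mx.gpu)
y_cpu = mx.matmul(w, x, stream=mx.cpu)

mx.eval(y_gpu, y_cpu)  # MLX is lazy; eval forces the computation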
The Porting
Here’s the issue: essentially all protein software, and most ML software in general, is heavily optimized for CUDA. For a long time this was fair; there was no alternative within an order of magnitude of RTX speed. However, I feel it’s time to branch out. ARM chips like Apple’s M series are extremely fast and use a fraction of the wattage a single CUDA card needs. Imagine how many kilowatt-plus GPU racks are spinning right now to produce scientific data, and it may make your head spin as well. Getting your hands on a Mac is also currently a lot easier than getting a Blackwell GPU...
The good news? I can use MLX to replace various CUDA enhancements. I was eager to try out the amazing OpenFold3, an open-source replica of AlphaFold3, and equally unwilling to pay for a GPU instance or mess with my desktop. I’ll port this to MLX with some careful code:
- For the audacity of running this on a laptop
- To see how long inference takes, if it works
Implementation Details
Getting OpenFold3 to work with MLX wasn’t straightforward. After digging into the components, I found multiple CUDA-specific ops to replace:
- DeepSpeed4Science implementation of Evoformer attention
- cuEquivariance for triangular attention
- Low-memory attention
Let’s examine the porting of one piece of AlphaFold’s “secret sauce”: triangle attention. In a nutshell, ordinary attention over the pair representation compares protein residue pairs; the triangle variant folds a third residue into each update, so every pair is refined in the context of triplets. That extra triplet context is essential for building consistent atomic position constraints in the new structure.
Triangle Attention
Here’s the CUDA path from OpenFold3, a thin wrapper around the cuEquivariance triangle attention kernel:
@torch.compiler.disable
def _cueq_triangle_attn(q, k, v, biases, scale):
    is_batched_input = False
    assert len(biases) == 2, (
        "CUEQ triangle attention kernel requires two bias terms: "
        "mask_bias and triangle_bias"
    )
    mask_bias, triangle_bias = biases

    # Handle high-dimensional inputs for template module
    if len(q.shape) > 5:
        assert len(q.shape) == 6, (
            "max number of dimensions for CUEQ triangle attention kernel is 6"
        )
        is_batched_input = True
        batch, n_tmpl, n_res, n_head, c_hidden = q.shape[:5]
        q = q.view(batch * n_tmpl, *q.shape[2:])
        k = k.view(batch * n_tmpl, *k.shape[2:])
        v = v.view(batch * n_tmpl, *v.shape[2:])
        mask_bias = mask_bias.view(batch * n_tmpl, *mask_bias.shape[2:])
        triangle_bias = triangle_bias.view(batch * n_tmpl, *triangle_bias.shape[2:])

    # The mask for the triangle attention kernel needs to be a
    # boolean mask - the default mask is an additive mask, where
    # 0 means no masking and -inf means masking. So we need to
    # convert this to a boolean mask where positions to keep are
    # True, and positions to mask are False.
    if mask_bias.dtype != torch.bool:
        mask_bias = mask_bias == 0

    o = triangle_attention(q, k, v, bias=triangle_bias, mask=mask_bias, scale=scale)

    # Handle dimension bugs in cuequivariance
    if len(q.shape) == 4:
        # There's a bug in cueq where if the input is missing the batch dim
        # the output adds it in, so we need to remove it here
        o = o.squeeze(0)

    if is_batched_input:
        o = o.view(batch, n_tmpl, *o.shape[1:])

    o = o.transpose(-2, -3)
    return o
And here is the corresponding MLX rewrite:
def mlx_triangle_attention(
    q: torch.Tensor,
    k: torch.Tensor,
    v: torch.Tensor,
    biases: List[torch.Tensor],
    scale: float
) -> torch.Tensor:
    """
    MLX-optimized triangle attention implementation.

    This provides a drop-in replacement for cuEquivariance triangle attention
    using MLX for computation on Apple Silicon.
    """
    if not MLX_AVAILABLE:
        raise ImportError(
            "MLX not available. Install MLX for Apple Silicon optimization: "
            "pip install mlx"
        )

    if len(biases) != 2:
        raise ValueError("Triangle attention requires exactly 2 bias terms: mask_bias and triangle_bias")

    mask_bias, triangle_bias = biases

    # Handle high-dimensional inputs (like cuEquivariance does)
    is_batched_input = False
    original_shape = None
    if len(q.shape) > 5:
        if len(q.shape) != 6:
            raise ValueError("Max number of dimensions for triangle attention is 6")
        is_batched_input = True
        original_shape = q.shape
        batch, n_tmpl = q.shape[:2]

        # Flatten batch and template dimensions
        q = q.view(batch * n_tmpl, *q.shape[2:])
        k = k.view(batch * n_tmpl, *k.shape[2:])
        v = v.view(batch * n_tmpl, *v.shape[2:])
        mask_bias = mask_bias.view(batch * n_tmpl, *mask_bias.shape[2:])
        triangle_bias = triangle_bias.view(batch * n_tmpl, *triangle_bias.shape[2:])

    # Convert mask from additive format to boolean format (like cuEquivariance does)
    if mask_bias.dtype != torch.bool:
        # Convert -inf masked positions to False, valid positions (0) to True
        mask_bias = mask_bias == 0

    # Convert boolean mask back to additive format for our MLX attention
    # True (valid) -> 0, False (masked) -> -inf
    mask_bias_additive = torch.where(
        mask_bias,
        torch.zeros_like(mask_bias, dtype=q.dtype),
        torch.full_like(mask_bias, -float('inf'), dtype=q.dtype)
    )

    # Call our MLX attention with the processed biases
    processed_biases = [mask_bias_additive, triangle_bias]
    output = mlx_evo_attention(q, k, v, processed_biases)

    # Handle dimension issues (like cuEquivariance does)
    if len(q.shape) == 4 and output.shape[0] == 1:
        # Remove spurious batch dimension if it was added
        output = output.squeeze(0)

    # Restore original batch/template dimensions if needed
    if is_batched_input:
        output = output.view(original_shape[0], original_shape[1], *output.shape[1:])

    # Apply transpose (like cuEquivariance does)
    output = output.transpose(-2, -3)
    return output
Here are some of the major changes:
Mask behavior
The CUDA kernel wants a boolean mask; the MLX kernel wants an additive mask. I perform a two-step conversion (additive → boolean → additive) purely to imitate cuEquivariance.
Bias handling differs at the API layer
CUDA hands `triangle_bias` directly into the native kernel. MLX must bundle both mask and triangle biases into a unified `processed_biases` list for `mlx_evo_attention`.
Dimensionality handling is more defensive in MLX
The CUDA path leans on the cueq kernel’s quirks and reshapes with minimal bookkeeping. The MLX path preserves the full `original_shape` and reconstructs it carefully to guarantee backend-agnostic behavior.
Backend expectations diverge
The CUDA path expects GPU kernels with fixed semantics and numerical quirks. The MLX path must behave consistently across CPU and GPU streams, so it uses explicit conversions and more predictable shaping.
Fun: Legacy bug emulation
cuEquivariance sometimes invents a batch dimension when one is missing; the CUDA wrapper compensates with a squeeze. The current MLX version imitates this quirk so the API behaves identically across backends, even though MLX itself doesn’t misbehave that way. Look for this to be trimmed in the future as Openfold3-MLX evolves.
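One note on the code above: `mlx_evo_attention` does the actual math but isn’t shown here. Purely as an illustration, here is a rough sketch of what a bias-summing attention helper on MLX could look like; the torch-to-MLX conversion path and the broadcasting assumptions are mine, not the actual OpenFold3-MLX implementation:

import numpy as np
import torch
import mlx.core as mx

def mlx_evo_attention(q, k, v, biases):
    # Hypothetical sketch: move torch tensors into MLX unified-memory arrays,
    # run bias-augmented softmax attention, and convert the result back.
    qm = mx.array(q.detach().to(torch.float32).cpu().numpy())
    km = mx.array(k.detach().to(torch.float32).cpu().numpy())
    vm = mx.array(v.detach().to(torch.float32).cpu().numpy())

    scale = 1.0 / (qm.shape[-1] ** 0.5)
    logits = (qm * scale) @ mx.swapaxes(km, -1, -2)

    # Additive biases (mask bias and triangle bias) are simply summed in,
    # relying on broadcasting to align their shapes with the logits.
    for b in biases:
        logits = logits + mx.array(b.detach().to(torch.float32).cpu().numpy())

    weights = mx.softmax(logits, axis=-1)
    out = weights @ vm
    mx.eval(out)  # force the lazy computation

    return torch.from_numpy(np.array(out)).to(q.dtype)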
Not too bad to adapt to, all things considered.
The Performance Numbers
On my M4 MacBook Air, I’m seeing:
- Small proteins (<200 residues): 20-30 seconds
- Medium proteins (200-400 residues): ~90 seconds
- Large proteins (400+ residues): ~3 minutes (inference time only, not model loading; stay tuned for speedups!)
Compare that to running the same models on a CPU-only PyTorch setup, where you’d be waiting quite a while.
Important Note
Make sure you’re using MLX 0.5.0 or later. Earlier versions have issues with certain activation functions that this port relies on.
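A quick way to check what’s installed (assuming MLX was installed via pip):

from importlib.metadata import version

# Report the installed MLX version; it should be 0.5.0 or newer
print(version("mlx"))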
What This Means for Research
The bottleneck for computational biology has shifted. It’s no longer about access to expensive hardware; it’s about software that actually uses the hardware we already have.
Your MacBook Air is one hell of an engineering feat. Time to start reaping the benefits!
Want to try it yourself? Download the OpenFold3-MLX beta and see how fast protein folding can be on your own Mac.