If you’re already familiar with SIMD, the table below is all you need.
And if you’re not, you will understand the table by the end of this article!
What’s SIMD? Why SIMD?
Hardware that does arithmetic is cheap, so any CPU made this century has plenty of it. But you still have only one instruction-decoding block, and it is hard to make that go faster, so the arithmetic hardware is vastly underutilized.
To get around the instruction decoding bottleneck, you can feed the CPU a batch of numbers all at once for a single arithmetic operation like addition. Hence the name: “single instruction, multiple data,” or SIMD for short.
Instead of adding two numbers together, you can add two batches or “vectors” of numbers and it takes about the same amount of time.
On recent x86 chips these batches can be up to 512 bits in size, so in theory you can get an 8x speedup for math on u64 or a 64x speedup on u8!
Instruction sets
Historically, SIMD instructions were added after the CPU architecture was already designed, so SIMD is an extension with its own marketing name on each architecture.
ARM calls theirs “NEON”, and all 64-bit ARM CPUs have it.
WebAssembly doesn’t have a marketing department, so they just call theirs “WebAssembly 128-bit packed SIMD extension”.
64-bit x86 shipped with one called “SSE2”, which has basic instructions for 128-bit vectors, but later a whole menagerie of extensions was added on top of that: SSE4.2 adding more operations, AVX and AVX2 adding 256-bit vectors, and AVX-512 adding 512-bit vectors.
The word “later” in the above paragraph creates a problem.
Does this CPU have that instruction?
If you’re running a program on an x86 CPU, it’s not a given that the CPU has any particular SIMD extension. So by default the compiler isn’t allowed to use instructions beyond SSE2 because that won’t work on all x86 CPUs.
There are two ways around this problem.
If you work for a company that only ever runs its binaries on its own servers or on a public cloud, you can simply assert that they are all recent enough to have at least AVX2, which was introduced over 10 years ago, and accept that the program will crash or misbehave if it ever runs on anything without it:
RUSTFLAGS='-C target-cpu=x86-64-v3' cargo build --release
However, if you are distributing the binaries for other people to run, that’s not really an option.
Instead you can do something called **function multiversioning**: compile the same function multiple times for different SIMD extensions, and when the program actually runs, check what features the CPU supports and select the appropriate version based on that.
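To make the idea concrete, here is a hand-rolled sketch of what multiversioning amounts to, assuming an x86_64 target and Rust 1.86 or later (the function names are made up for illustration):

```rust
// Two copies of the same logic: the compiler may emit AVX2 instructions for
// the first one, and only baseline SSE2 instructions for the second.
#[target_feature(enable = "avx2")]
fn sum_avx2(values: &[u32]) -> u64 {
    values.iter().map(|&v| u64::from(v)).sum()
}

fn sum_baseline(values: &[u32]) -> u64 {
    values.iter().map(|&v| u64::from(v)).sum()
}

pub fn sum(values: &[u32]) -> u64 {
    if std::is_x86_feature_detected!("avx2") {
        // SAFETY: we just verified that this CPU supports AVX2.
        unsafe { sum_avx2(values) }
    } else {
        sum_baseline(values)
    }
}
```

The crates discussed later in this article generate exactly this kind of boilerplate for you.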
Fortunately, this problem only exists on x86.
ARM made NEON mandatory in all 64-bit CPUs and then didn’t bother expanding the width beyond 128 bits. (Technically SVE exists, but in 2025 it is still mostly on paper, and Rust support for it is still in progress.)
WebAssembly makes you compile two different binaries, one with SIMD and one without, and use JavaScript to check if the browser supports SIMD.
Solution space
There are four approaches to SIMD in Rust, in ascending order of effort:
- Automatic vectorization
- Fancy iterators
- Portable SIMD abstractions
- Raw intrinsics
Automatic vectorization
The easiest approach to SIMD is letting the compiler do it for you.
It works surprisingly well, as long as you structure your code in a way that is amenable to vectorization.
You can check if it’s working with cargo-show-asm or godbolt.org, but your benchmarks are the ultimate judge of the results.
Sadly, there is a limit to the complexity of code that the compiler will vectorize, and that limit may change between compiler versions. If something vectorizes today, that doesn’t necessarily mean it still will a year from now.
The other drawback of this method is that the optimizer won’t even touch anything involving floats (f32 and f64 types). It’s not permitted to change any observable outputs of the program, and reordering float operations may alter the result due to precision loss. (There is a way to tell the compiler not to worry about precision loss, but it’s currently nightly-only).
So right now, **if you need to process floats, autovectorization is a no-go** unless you can use nightly builds of the Rust compiler.
(Floats are cursed even without SIMD. Something as simple as summing an array of them in a usable way turns out to be really hard).
There is no built-in way to multiversion functions, but the multiversion crate works great with autovectorization.
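For example, the hand-rolled dispatch sketched earlier collapses to something like this with the multiversion crate (a sketch based on its attribute syntax; check the crate’s docs for the exact form and the available target specifications):

```rust
use multiversion::multiversion;

// The macro compiles this function once per SIMD target plus a generic
// fallback, and dispatches at runtime based on CPU feature detection.
// The body itself is ordinary Rust that the autovectorizer can handle.
#[multiversion(targets = "simd")]
fn sum(values: &[u32]) -> u64 {
    values.iter().map(|&v| u64::from(v)).sum()
}
```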
Fancy iterators
Just like rayon lets you run your iterators in parallel by swapping .iter() with .par_iter(), there have been attempts to do the same for SIMD. After all, what is SIMD but another kind of parallelism?
This is the approach that the faster crate takes. That crate has been abandoned for years, and it doesn’t look like this approach has panned out.
Portable SIMD abstractions
The idea is to let you write your algorithm by explicitly operating on chunks of data, something like [f32; 8] but wrapped in a custom type, and then provide custom implementations of operations like + that compile down into SIMD instructions.
[std::simd](https://doc.rust-lang.org/stable/std/simd/index.html) is exactly that. It supports all instruction sets LLVM supports, so its platform support is unparalleled. It pairs well with the multiversion crate. Sadly it’s nightly-only and will remain such for the foreseeable future, so it’s unusable in most situations.
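To give a feel for the style, here is a minimal nightly-only sketch of summing a slice with std::simd, eight lanes at a time. (Note that this reorders the additions, so the result can differ slightly from a scalar sum.)

```rust
#![feature(portable_simd)]
use std::simd::prelude::*;

fn simd_sum(values: &[f32]) -> f32 {
    let mut acc = f32x8::splat(0.0);
    let mut chunks = values.chunks_exact(8);
    // Accumulate eight partial sums in parallel...
    for chunk in &mut chunks {
        acc += f32x8::from_slice(chunk);
    }
    // ...then reduce them and handle any leftover elements with scalar code.
    acc.reduce_sum() + chunks.remainder().iter().sum::<f32>()
}
```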
The wide crate is a mature, established option. It supports NEON, WASM and all the x86 instruction sets. But it doesn’t support multiversioning at all, save for very exotic and limited approaches like cargo-multivers.
The pulp crate is a great design with built-in multiversioning, and is quite mature and complete. It powers faer, so its performance is clearly proven. The drawbacks are that it doesn’t support WASM, and that on x86 it only supports targeting AVX2 and AVX-512 but not the older extensions. But AVX2 was introduced in 2012 and 95% of systems in the Steam hardware survey have it, so that might not be a big deal.
The macerator crate is a fork of pulp with vastly expanded instruction set support. It supports all x86 extensions, WASM, NEON, and even the LoongArch SIMD extensions. It’s used only by burn-ndarray, and even there it’s an optional dependency. It sounds great on paper, but it’s oddly obscure and therefore unproven. I’d probably write code using pulp, then replace it with macerator and see if everything still works and runs as fast as it should.
The fearless_simd crate is largely a copy of pulp’s design made for use in vello. It’s far less mature than pulp, but it’s under active development. As of this writing it supports NEON, WASM and SSE4.2, but not the newer x86 extensions. Seems too immature just yet, but something to keep an eye on.
simdeez is a rather old crate that supports all instruction sets except AVX-512 and comes with built-in multiversioning. What gives me pause is that despite existing for many years, it’s still barely used. Everyone else who needed SIMD built their own instead of using it. And its README says:
Currently things are well fleshed out for i32, i64, f32, and f64 types.
So I guess the other types aren’t complete?
TL;DR: use std::simd if you don’t mind nightly, wide if you don’t need multiversioning, and otherwise pulp or macerator.
If it’s not 2025 when you’re reading this, check out fearless_simd, because std::simd is still in nightly in your glorious future, isn’t it?
Raw intrinsics
If you want to get really close to the metal, there are always the raw intrinsics, just one step removed from the processor instructions.
The problem looming over any use of raw intrinsics is portability. Whereas std::simd or wide let you write your logic once and have it compiled down to assembly automatically, with intrinsics you have to write a separate implementation for every single platform and instruction set (SSE, AVX, NEON…) you care to support. That’s a lot of code!
It’s really not helped by the fact that they are all named something like _mm256_srli_epi32 and your code ends up as a long list of calls to these arcanely named functions. And wrappers that help readability introduce their own problems, such as clashes with multiversioning or unsafe code or arcane macros.
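For a taste, here is a minimal AVX2-only sketch that adds two batches of eight i32s with raw x86 intrinsics from std::arch:

```rust
use std::arch::x86_64::*;

// Requires Rust 1.86+ for a safe fn with #[target_feature]; the caller must
// still verify AVX2 support and use an unsafe block to call this.
#[target_feature(enable = "avx2")]
fn add_i32x8(a: &[i32; 8], b: &[i32; 8]) -> [i32; 8] {
    let mut out = [0i32; 8];
    // SAFETY: the loads and stores use raw pointers to 32-byte regions that
    // the array types guarantee are valid.
    unsafe {
        let va = _mm256_loadu_si256(a.as_ptr().cast::<__m256i>());
        let vb = _mm256_loadu_si256(b.as_ptr().cast::<__m256i>());
        let sum = _mm256_add_epi32(va, vb);
        _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>(), sum);
    }
    out
}
```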
You also have to build your own multiversioning. Or rather, you have to manually dispatch to the dedicated implementation you have written for each instruction set. The [std::is_x86_feature_detected!](https://doc.rust-lang.org/stable/std/macro.is_x86_feature_detected.html) macro takes care of the feature detection, but it is somewhat slow. In some cases it is beneficial to detect the available features exactly once and then cache the results, but you have to implement that manually too.
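One way to do that caching, sketched here with std::sync::OnceLock (the helper name is made up):

```rust
use std::sync::OnceLock;

// Runs CPU feature detection the first time it is called and reuses the
// cached answer afterwards (x86_64-only sketch).
fn has_avx2() -> bool {
    static HAS_AVX2: OnceLock<bool> = OnceLock::new();
    *HAS_AVX2.get_or_init(|| std::is_x86_feature_detected!("avx2"))
}
```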
On the bright side, this year writing intrinsics got markedly less awful. Most of them are no longer unsafe to call in Rust 1.86 and later, and the safe_unaligned_simd crate provides safe wrappers for the rest.
So at least this approach is no longer unsafe on top of all the other problems it has!
Which one is right for you?
The right tool for the job ultimately depends on the use case.
Want zero dependencies and little up-front hassle? Autovectorization. Porting existing C code or targeting very specific hardware? Intrinsics. Anything else? Portable SIMD abstraction.
And now that you’ve made it this far, you can understand the table at the top of the article, which will help guide your decision!