This is a follow-up to making the rav1d video decoder 1% faster, where we compared profiler snapshots of rav1d (the Rust implementation) and dav1d (the C baseline) to find specific functions that were slower in the Rust implementation1.
Today, we are going to pay off a small debt from that post: since dav1d and rav1d share the same hand-written assembly functions, we used them as anchors to navigate the different implementations - they, at least, should match exactly! And they did. Well, almost all of them did.
This, dear reader, is the story of the one function that didn’t.
An Overview
We’ll need to ask - and answer! - three ‘Whys’ today: Using the same techniques from last time, we’ll see that a specific assembly function is, indeed, slower in the Rust version.
- But why? ➡️ Because loading data in the Rust version is slower, which we discover using samply’s special asm view.
- But why? ➡️ Because the Rust version stores much more data on the stack, which we find by playing with some arguments and looking at the generated LLVM IR.
- But why? ➡️ Because the compiler cannot optimize away a specific Rust abstraction across function pointers!
Which we fix by switching to a more compiler-friendly version (PR).
Side note: again, we’ll be running all these benchmarks on a MacBook, so our tools are a tad limited and we’ll have to resort to some guesswork. Leave a comment if you know more - or, even better, write an article about profiling on macOS 🍎💨.

filter4_pri_edged_8bpc
Let’s rerun the benchmark after the previous post’s changes:
./rav1d $ git checkout cfd3f59 && cargo build --release
./rav1d $ sudo samply record ./target/release/dav1d -q -i Chimera-AV1-8bit-1920x1080-6736kbps.ivf -o /dev/null --threads 1
We’ll switch to the inverted call stack view and filter for the cdef_ functions, resulting in the following clippings2. The assembly functions are the ones with the _neon suffix.
On the left is dav1d (C), and on the right rav1d (Rust):
Looking at the sample count, most of the functions match (to within ~10%)3, except the highlighted cdef_filter4_pri_edged_8bpc_neon which is 30% slower. We see a difference of 350 samples. Sampling at 1000 Hz, this corresponds to 0.35 seconds, or ~0.5% of the total runtime.
This is very sus: obviously this is the exact same function, and barring a logical bug in the implementation, it must process the exact same data.
So how can this be?
Looking at the Opcodes
Luckily for us, samply has exactly what we need here: we can get into the asm view by double-clicking on the function, which shows a per-instruction sample count.
And it seems that fortune favors the bold, because we find the entire difference in a single instruction less than 25 lines into the call.
Let’s look at the ld1 {v0.s}[2], [x13] line, highlighted below in yellow.
It appears in 10 samples in the dav1d run (C), but in 441 (!) samples in the rav1d run (Rust):
At this point, you might be wondering: what is ld1? What’s that {v0.s}[2] syntax? And… why is x13 that different from x2, x12, or x14?
ld1
Let’s try to decode what ld1 {v0.s}[2], [x13] means.
A quick search leads us to the LD1 page in the Arm A-profile A64 Instruction Set Architecture documentation, which helpfully says the following:
LD1 - Load one single-element structure to one lane of one register
This instruction loads a single-element structure from memory and writes the result to the specified lane of the SIMD&FP register
It also explains that v0 is a SIMD register, and .s is its 32-bit variant.
So, TL;DR: this instruction loads data from the address in the x13 register into lane 2 of the v0 SIMD register.
Which means that the three adjacent instructions do almost exactly the same thing, each loading into a different lane of v0.
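If it helps to see this in Rust terms, here is a rough, self-contained sketch (not rav1d code) using the NEON intrinsics from core::arch::aarch64 - vld1q_lane_u32::<2> is essentially the intrinsic spelling of ld1 {v0.s}[2], [x13]:
#[cfg(target_arch = "aarch64")]
fn load_into_lane_2(x: &u32) -> u32 {
    use core::arch::aarch64::*;
    // SAFETY: `x` is a valid reference, and NEON is always available on aarch64.
    unsafe {
        let v0 = vdupq_n_u32(0); // v0 = [0, 0, 0, 0]
        let v0 = vld1q_lane_u32::<2>(x, v0); // ld1 {v0.s}[2], [x] - load *x into lane 2
        vgetq_lane_u32::<2>(v0) // read lane 2 back out, just to show where the value landed
    }
}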
A Good Guess
Ignoring the start of the function, let’s look at the lines that appear right before the load instructions:
add x12, x2, #0x8
add x13, x2, #0x10
add x14, x2, #0x18
ld1 {v0.s}[0], [x2] ; Fast - 20 samples.
ld1 {v0.s}[1], [x12] ; Fast - 16 samples.
ld1 {v0.s}[2], [x13] ; Slow - 441 samples.
ld1 {v0.s}[3], [x14] ; Fast as well!
PSA: if you don’t see syntax highlighting, disable the 1Password extension.
Seems simple enough - we load 32-bit values from the addresses at x2 + {0,8,16,24} into v0. But what address is stored in x2?
On AArch64, integer and pointer parameters are passed in x0 through x7, and sure enough, looking at the extern "C" fn definition, we find:
unsafe extern "C" fn filter(
dst: *mut DynPixel, // x0
dst_stride: ptrdiff_t, // x1
tmp: *const MaybeUninit<u16>, // x2
// ...
) -> ()
Our old friend tmp! We saw in the previous post that these assembly functions are dispatched from a function called cdef_filter_neon_erased. This function defines tmp on the stack as a buffer of (uninitialized) u16s, and partially fills it using a padding function which is also written in assembly.
So, why would reading from a contiguous smallish buffer be slow for one particular part of that buffer?
At this point, we are going to take a guess (leave a comment if you know more!): there’s likely a caching issue somewhere that causes the CPU to stall for that particular load.
But why? Maybe it’s something in the way data is written to the buffer? Time to take a closer look. In particular, there’s something a bit unexpected in the arguments of the cdef_filter_neon_erased function:
unsafe extern "C" fn cdef_filter_neon_erased<BD: BitDepth, ..>(
dst: *mut DynPixel,
stride: ptrdiff_t,
left: *const [LeftPixelRow2px<DynPixel>; 8],
top: *const DynPixel,
bottom: *const DynPixel,
..,
_dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
_top: *const FFISafe<CdefTop>,
_bottom: *const FFISafe<CdefBottom>,
) {
let mut tmp_buf = Align16([MaybeUninit::uninit(); TMP_LEN]);
let tmp = &mut tmp_buf.0[..];
padding::Fn::neon::<BD, W>().call::<BD>( //
tmp, // <--- Fills tmp by calling a `cdef_padding_XYZ_neon` function.
dst, stride, left, top, bottom, .. //
);
filter::Fn::neon::<BD, W>().call( // <--- Calls the specific `cdef_filter_XYZ_neon` function.
dst,
stride,
tmp,
..
)
}
This is… a bit much, but as you can imagine, dav1d doesn’t have the last 3 arguments (a leading _ in Rust marks a parameter as intentionally unused). Looking around some more, they are only used in a function called cdef_filter_block_c_erased, which is - despite the name - a pure-Rust fallback in case the asm functions are unavailable.
I wonder what will happen if we… if we just remove them?
A “Fix”
If we do remove them:
- _dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
- _top: *const FFISafe<CdefTop>,
- _bottom: *const FFISafe<CdefBottom>,
+ // _dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
+ // _top: *const FFISafe<CdefTop>,
+ // _bottom: *const FFISafe<CdefBottom>,
and (temporarily) replace cdef_filter_block_c_erased with a stub:
unsafe extern "C" fn cdef_filter_block_c_erased<BD: BitDepth, const W: usize, const H: usize>(
_dst_ptr: *mut DynPixel,
...
edges: CdefEdgeFlags,
bitdepth_max: c_int,
// dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
// top: *const FFISafe<CdefTop>,
// bottom: *const FFISafe<CdefBottom>,
) {
todo!()
}
When we re-run our benchmark, we see something cool:
Our dear cdef_filter4_pri_edged_8bpc_neon, which accounted for 1,562 samples before, is now down to 1,268 samples (within 5% of dav1d’s 1,199), and all our ld1 (memory load) instructions are down to dav1d levels! No more stalling.
Huzzah! Or… Huzzah?

An Elegant Weapon for a More Civilized Age
Let’s recap: On the one hand, we found a meaningful slowdown between the Rust and the C versions, and we even managed to create a “fixed” version that doesn’t exhibit the same problem.
On the other hand, we only have vibes about what the problem is (memory is haunted?), nothing about the fix makes sense (removing unused stuff helps how?), and we can’t use this code because we removed an important fallback.
The only silver lining is that because we have a faster version, we can try to compare it to the original and find out what changed, and that might lead us to the real fix.
Which is where cargo asm comes into play.
Our theory is that something is different in the way the memory is laid out between the versions.
We’ll guess that it’s probably something with the stack, because (a) removing arguments made a difference, and arguments are (sometimes) passed on the stack - on AArch64, integer and pointer arguments beyond the first eight go on the stack - and (b) all the heap data structures closely follow the original dav1d ones, and there aren’t that many of them anyway.
So what can cargo asm tell us?
Peeking Under the Hood
We can compare cdef_filter_neon_erased, using either --asm or --llvm modes4, but long story short, there don’t seem to be any differences between the baseline and the faster version. Which at least makes sense - we didn’t change anything about this function, because it wasn’t using those arguments in the first place!
But what if we go one level up? _erased is called from a function named rav1d_cdef_brow (which we also briefly saw in the last post), which is a very complex, 300-line behemoth. However, it seems like this function receives its data via a few nice structs, which means that either one of them is messed up - which is relatively easy to check - or that the problem is somewhere inside this function.
fn rav1d_cdef_brow<BD: BitDepth>(
c: &Rav1dContext,
tc: &mut Rav1dTaskContext,
f: &Rav1dFrameData,
p: [Rav1dPictureDataComponentOffset; 3],
// .. a few simple arguments ..
) { ... }
And this time, cargo asm5 lights up like a Christmas tree 🎄.
Here’s our faster version:
; rav1d::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::cdef_apply::rav1d_cdef_brow(...) {
start:
%dst.i = alloca [16 x i8], align 8
%variance = alloca [4 x i8], align 4
%lr_bak = alloca [96 x i8], align 16
%_17 = icmp sgt i32 %by_start, 0
%. = select i1 %_17, i32 12, i32 8
...
}
And here’s the baseline version:
; rav1d::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::cdef_apply::rav1d_cdef_brow(...) {
start:
%top.i400 = alloca [16 x i8], align 8
%dst.i401 = alloca [16 x i8], align 8
%top.i329 = alloca [16 x i8], align 8
%dst.i330 = alloca [16 x i8], align 8
%top.i = alloca [16 x i8], align 8
%dst.i317 = alloca [16 x i8], align 8
%dst.i = alloca [16 x i8], align 8
%bot5 = alloca [24 x i8], align 8
%bot = alloca [24 x i8], align 8
%variance = alloca [4 x i8], align 4
%lr_bak = alloca [96 x i8], align 16
%_17 = icmp sgt i32 %by_start, 0
%. = select i1 %_17, i32 12, i32 8
...
}
Which means that somehow, the baseline version allocates on the stack - using alloca - 144 bytes more than the faster version: six extra 16-byte slots plus two 24-byte ones. It would also seem that all these extra allocations are for multiple instances of dst, top, and bot (i.e., bottom), which matches the arguments we removed in the faster version.
So now we only need to… not do that, I guess?
From Top to Bottom
Our revised but incomplete theory is thus:
(1) cdef_filter4_pri_edged_8bpc_neon reads data from or via dst, top and/or bot, which ends up affecting the third ld1 line.
More
The calls to the filter functions are defined like this:
unsafe extern "C" fn filter(
dst: *mut DynPixel,
dst_stride: ptrdiff_t,
tmp: *const MaybeUninit<u16>,
// ..
) { .. }
and the assembly function template is located in src/arm/64/cdef.S:
// void cdef_filterX_edged_8bpc_neon(pixel *dst, ptrdiff_t dst_stride,
// const uint8_t *tmp, int pri_strength,
// int sec_strength, int dir, int damping,
// int h);
.macro filter_func_8 w, pri, sec, min, suffix
function cdef_filter\w\suffix\()_edged_8bpc_neon
// ..
ld1 {v0.s}[0], [x2] // px
ld1 {v0.s}[1], [x12] // px
ld1 {v0.s}[2], [x13] // px
ld1 {v0.s}[3], [x14] // px
(2) cdef_filter_neon_erased accepts two sets of these: one as raw pointers for the asm version, and one as *const FFISafe<..> pointers that are only used in the pure-Rust version.
More
The assembly dispatch function (_erased) only uses the *mut DynPixel versions:
pub unsafe extern "C" fn cdef_filter_neon_erased<BD: BitDepth, const W: usize, const H: usize, .. >(
dst: *mut DynPixel,
stride: ptrdiff_t,
left: *const [LeftPixelRow2px<DynPixel>; 8],
top: *const DynPixel,
bottom: *const DynPixel,
// ..
_dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
_top: *const FFISafe<CdefTop>,
_bottom: *const FFISafe<CdefBottom>,
) {
// ...
padding::Fn::neon::<BD, W>().call::<BD>(tmp, dst, stride, left, top, bottom, H, edges);
filter::Fn::neon::<BD, W>().call(dst, stride, tmp, pri_strength, sec_strength, dir, damping, H, edges, bd);
}
While the pure Rust version uses only the fully typed and safe [BD::Pixel] versions:
fn cdef_filter_block_rust<BD: BitDepth>(
dst: Rav1dPictureDataComponentOffset,
left: &[LeftPixelRow2px<BD::Pixel>; 8],
top: CdefTop,
bottom: CdefBottom,
// ...
) { .. }
(3) rav1d_cdef_brow sets up all of these in a few different ways, probably for the different variations of cdef_filter4_{pri_edged,pri_sec_edged,sec_edged,sec,pri}_8bpc_neon.
More
For example, this is a small unedited part of rav1d_cdef_brow. See how top and bot have a non-trivial setup:
let (top, bot) = top_bot.unwrap_or_else(|| {
let top = WithOffset {
data: &f.lf.cdef_line_buf,
offset: f.lf.cdef_line[tf as usize][0],
} + have_tt as isize * (sby * 4) as isize * y_stride
+ (bx * 4) as isize;
let bottom = bptrs[0] + (8 * y_stride);
(top, WithOffset::pic(bottom))
});
if y_pri_lvl != 0 {
let adj_y_pri_lvl = adjust_strength(y_pri_lvl, variance);
if adj_y_pri_lvl != 0 || y_sec_lvl != 0 {
f.dsp.cdef.fb[0].call::<BD>(
bptrs[0],
&lr_bak[bit as usize][0],
top,
bot,
adj_y_pri_lvl,
y_sec_lvl,
dir,
damping,
edges,
bd,
);
}
}
Having the two sets of pointers prevents the compiler from performing some optimizations, and it just so happens that this results in a layout that causes the CPU to stall.
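To make that a bit more concrete, here is a minimal sketch - hypothetical names, not rav1d’s actual types - of the pattern: once we take the address of a local and hand it to an opaque extern "C" function pointer, the compiler has to materialize that local on the stack at every call site, even if the callee never touches it:
#[derive(Clone, Copy)]
struct WithOffset<T> {
    data: T,
    offset: usize,
}

// Stand-in for the runtime-selected filter function pointer.
type FilterFn = unsafe extern "C" fn(raw: *const u8, extra: *const WithOffset<*const u8>);

unsafe extern "C" fn asm_like_filter(raw: *const u8, _extra: *const WithOffset<*const u8>) {
    // The asm path only ever reads `raw`; `_extra` exists for the Rust fallback.
    let _ = unsafe { core::ptr::read_volatile(raw) };
}

fn call_site(filter: FilterFn, buf: &[u8], offset: usize) {
    let view = WithOffset { data: buf.as_ptr(), offset };
    // `&view` escapes into an opaque function pointer, so `view` has to live
    // in a real stack slot here - one such slot per call site like this one.
    unsafe { filter(view.data.add(view.offset), &view) };
}

fn main() {
    let buf = [0u8; 32];
    call_site(asm_like_filter, &buf, 16);
}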
There’s so much more going on here, but let’s keep our focus and try to actually fix the issue at hand.
Why is it FFISafe-ed?
Simplified, rav1d_cdef_brow sets up top like so:
let cdef_line_buf: AlignedVec64<u8>;
let top = WithOffset {
data: &cdef_line_buf,
offset,
} + ... as isize;
with dst and bottom following similar patterns.
Checking WithOffset, we see that it’s a utility for accessing a buffer using an index:
#[derive(Clone, Copy)]
pub struct WithOffset<T> {
pub data: T,
pub offset: usize,
}
impl<T> AddAssign<usize> for WithOffset<T> { .. }
impl<T> SubAssign<usize> for WithOffset<T> { .. }
// A few more impl like this.
impl<P: Pixels> WithOffset<P> {
pub fn as_ptr<BD: BitDepth>(&self) -> *const BD::Pixel {
self.data.as_ptr_at::<BD>(self.offset)
}
// A few more of these as well.
}
Looking at this struct, we start to see what’s going on: WithOffset is, on a 64-bit architecture, the size of T plus 8 bytes, which matches the alloca calls of 16 and 24 bytes we saw before.
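As a quick sanity check on that size math, here is a standalone sketch with a stand-in type (same field layout as rav1d’s WithOffset, but not its real definition; the two-pointer-wide data is only a guess at what bottom’s PicOrBuf looks like):
use std::mem::size_of;

// Stand-in with the same field layout as WithOffset<T>.
#[allow(dead_code)]
struct WithOffset<T> {
    data: T,
    offset: usize,
}

fn main() {
    // A single reference (8 bytes) plus the 8-byte offset: 16 bytes,
    // matching the `alloca [16 x i8]` slots for `dst` and `top`.
    assert_eq!(size_of::<WithOffset<&u8>>(), 16);
    // If `data` is two pointers wide (a stand-in for `bottom`'s PicOrBuf),
    // we get 24 bytes, matching the `alloca [24 x i8]` slots for `bot`.
    assert_eq!(size_of::<WithOffset<(*const u8, *const u8)>>(), 24);
}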
It is also not “FFI-safe”, which means that passing it as an argument in an extern "C" function - such as our asm functions - is somewhat controversial, and rav1d gets around that by having this special FFISafe struct that makes this problem magically6 go away.
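The real thing lives in ffi_safe.rs; a simplified sketch of the idea (not the actual rav1d definition) looks roughly like this - erase the reference into an opaque pointer type that is trivially FFI-safe, and recover the typed reference on the Rust side:
use std::marker::PhantomData;

// Opaque marker type: `*const FFISafe<T>` is FFI-safe even when `T` itself is not.
struct FFISafe<T>(PhantomData<T>);

impl<T> FFISafe<T> {
    /// Erase a reference into an opaque pointer that can cross an `extern "C"` boundary.
    fn new(value: &T) -> *const Self {
        (value as *const T).cast()
    }

    /// Recover the reference on the Rust side.
    /// SAFETY: `ptr` must come from `FFISafe::new` and the referent must still be alive.
    unsafe fn get<'a>(ptr: *const Self) -> &'a T {
        unsafe { &*ptr.cast::<T>() }
    }
}

fn main() {
    let x = 42u32;
    let erased = FFISafe::new(&x); // this is what crosses the `extern "C"` boundary
    // SAFETY: `erased` came from `FFISafe::new` above and `x` is still alive.
    let back: &u32 = unsafe { FFISafe::get(erased) };
    assert_eq!(*back, 42);
}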
Because WithOffset is a buffer-access utility, it can be used to create raw pointers into the underlying buffer. But because the safe Rust fallback doesn’t want raw pointers, we end up having both versions when we call either the asm or the Rust version of the function:
let top_ptr: *mut DynPixel = top.as_ptr::<BD>().cast();
let bottom_ptr: *mut DynPixel = bottom.wrapping_as_ptr::<BD>().cast();
let top = FFISafe::new(&top);
let bottom = FFISafe::new(&bottom);
// We're simplifying here, and we also ignore the differences between u8, DynPixel and BD::Pixel.
pub type CdefTop<'a> = WithOffset<&'a u8>;
// A function pointer to the best available impl...
let callback: extern "C" fn(
..,
top_ptr: *mut DynPixel,
..,
top: *const FFISafe<CdefTop>,
) = /* ... selected at runtime */;
// Maybe end up in Rust, maybe in assembly, who knows!
callback(.., top_ptr, bottom_ptr, .., top, bottom);
OK! Phew! Wow! This is great (or, sorry that happened to you), but what can we do about this?
Switch It Up
Move it up, down, left, right, oh - Switch it up like Nintendo ~ S. A. Carpenter
Because we have this *const FFISafe<WithOffset<..>> at an extern "C" function boundary, the compiler is more limited in what it can do with the values of top, bottom, and dst.
What if we switched it up?
We can make WithOffset FFI-safe by slapping a #[repr(C)] on it, as long as T is FFI-safe:
#[derive(Clone, Copy)]
#[repr(C)] // <- New!
pub struct WithOffset<T> {
pub data: T,
pub offset: usize,
}
Then, we can change each variable from *const FFISafe<WithOffset<..>> to WithOffset<*const FFISafe<..>>.
For example, before we had something like:
top: *const FFISafe<WithOffset<&'a u8>>
We can change that to:
top: WithOffset<*const FFISafe<&'a u8>>
The key difference is that now, instead of creating an FFI-safe pointer to our arguments, we actually destructure them and create new instances of WithOffset:
let top: WithOffset<&'a u8> = /* an argument */;
// Used to be `let top = FFISafe::new(&top)`.
let top = WithOffset {
data: FFISafe::new(&top.data),
offset: top.offset,
};
This should - in theory - let the compiler see we only use a single instance of each parameter at any given time.
But does it?
Will It Blend?
We can use the same shtick for dst and bot, and the final diff turns out shorter than this article 🫠.
Click to see the full diff
Avoid using `FFISafe<WithOffset<..>>` across FFI boundary
---
src/cdef.rs | 61 ++++++++++++++++++++++++++++++++--------------
src/with_offset.rs | 1 +
2 files changed, 44 insertions(+), 18 deletions(-)
diff --git a/src/cdef.rs b/src/cdef.rs
index 3b58d2e6..46302f73 100644
--- a/src/cdef.rs
+++ b/src/cdef.rs
@@ -19,6 +19,7 @@ use crate::include::common::bitdepth::LeftPixelRow2px;
use crate::include::common::bitdepth::BPC;
use crate::include::common::intops::apply_sign;
use crate::include::common::intops::iclip;
+use crate::include::dav1d::picture::Rav1dPictureDataComponent;
use crate::include::dav1d::picture::Rav1dPictureDataComponentOffset;
use crate::pic_or_buf::PicOrBuf;
use crate::strided::Strided as _;
@@ -55,9 +56,9 @@ wrap_fn_ptr!(pub unsafe extern "C" fn cdef(
damping: c_int,
edges: CdefEdgeFlags,
bitdepth_max: c_int,
- _dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
- _top: *const FFISafe<CdefTop>,
- _bottom: *const FFISafe<CdefBottom>,
+ _dst: WithOffset<*const FFISafe<Rav1dPictureDataComponent>>,
+ _top: WithOffset<*const FFISafe<DisjointMut<AlignedVec64<u8>>>>,
+ _bottom: WithOffset<*const FFISafe<PicOrBuf<'_, AlignedVec64<u8>>>>,
) -> ());
pub type CdefTop<'a> = WithOffset<&'a DisjointMut<AlignedVec64<u8>>>;
@@ -87,12 +88,23 @@ impl cdef::Fn {
let left = ptr::from_ref(left).cast();
let top_ptr = top.as_ptr::<BD>().cast();
let bottom_ptr = bottom.wrapping_as_ptr::<BD>().cast();
- let top = FFISafe::new(&top);
- let bottom = FFISafe::new(&bottom);
let sec_strength = sec_strength as c_int;
let damping = damping as c_int;
let bd = bd.into_c();
- let dst = FFISafe::new(&dst);
+
+ let dst = WithOffset {
+ data: FFISafe::new(dst.data),
+ offset: dst.offset,
+ };
+ let top = WithOffset {
+ data: FFISafe::new(top.data),
+ offset: top.offset,
+ };
+ let bottom = WithOffset {
+ data: FFISafe::new(&bottom.data),
+ offset: bottom.offset,
+ };
+
// SAFETY: Rust fallback is safe, asm is assumed to do the same.
unsafe {
self.get()(
@@ -385,18 +397,31 @@ unsafe extern "C" fn cdef_filter_block_c_erased<BD: BitDepth, const W: usize, co
damping: c_int,
edges: CdefEdgeFlags,
bitdepth_max: c_int,
- dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
- top: *const FFISafe<CdefTop>,
- bottom: *const FFISafe<CdefBottom>,
+ dst: WithOffset<*const FFISafe<Rav1dPictureDataComponent>>,
+ top: WithOffset<*const FFISafe<DisjointMut<AlignedVec64<u8>>>>,
+ bottom: WithOffset<*const FFISafe<PicOrBuf<'_, AlignedVec64<u8>>>>,
) {
- // SAFETY: Was passed as `FFISafe::new(_)` in `cdef_dir::Fn::call`.
- let dst = *unsafe { FFISafe::get(dst) };
+ let dst = WithOffset {
+ // SAFETY: Was passed as `FFISafe::new(_)` in `cdef::Fn::call`.
+ data: unsafe { FFISafe::get(dst.data) },
+ offset: dst.offset,
+ };
+
// SAFETY: Reverse of cast in `cdef::Fn::call`.
let left = unsafe { &*left.cast() };
- // SAFETY: Was passed as `FFISafe::new(_)` in `cdef::Fn::call`.
- let top = *unsafe { FFISafe::get(top) };
- // SAFETY: Was passed as `FFISafe::new(_)` in `cdef::Fn::call`.
- let bottom = *unsafe { FFISafe::get(bottom) };
+
+ let top = WithOffset {
+ // SAFETY: Was passed as `FFISafe::new(_)` in `cdef::Fn::call`.
+ data: unsafe { FFISafe::get(top.data) },
+ offset: top.offset,
+ };
+
+ let bottom = WithOffset {
+ // SAFETY: Was passed as `FFISafe::new(_)` in `cdef::Fn::call`.
+ data: *unsafe { FFISafe::get(bottom.data) },
+ offset: bottom.offset,
+ };
+
let bd = BD::from_c(bitdepth_max);
cdef_filter_block_rust(
dst,
@@ -632,9 +657,9 @@ mod neon {
damping: c_int,
edges: CdefEdgeFlags,
bitdepth_max: c_int,
- _dst: *const FFISafe<Rav1dPictureDataComponentOffset>,
- _top: *const FFISafe<CdefTop>,
- _bottom: *const FFISafe<CdefBottom>,
+ _dst: WithOffset<*const FFISafe<Rav1dPictureDataComponent>>,
+ _top: WithOffset<*const FFISafe<DisjointMut<AlignedVec64<u8>>>>,
+ _bottom: WithOffset<*const FFISafe<PicOrBuf<'_, AlignedVec64<u8>>>>,
) {
use crate::align::Align16;
diff --git a/src/with_offset.rs b/src/with_offset.rs
index b84c4bd2..06c8bc69 100644
--- a/src/with_offset.rs
+++ b/src/with_offset.rs
@@ -7,6 +7,7 @@ use std::ops::Sub;
use std::ops::SubAssign;
#[derive(Clone, Copy)]
+#[repr(C)]
pub struct WithOffset<T> {
pub data: T,
pub offset: usize,
--
Now we can run cargo asm again:
; rav1d::cdef_apply::rav1d_cdef_brow
; Function Attrs: nounwind
define internal fastcc void @rav1d::cdef_apply::rav1d_cdef_brow(...) {
start:
%dst.i = alloca [16 x i8], align 8
%bot5 = alloca [24 x i8], align 8
%bot = alloca [24 x i8], align 8
%variance = alloca [4 x i8], align 4
%lr_bak = alloca [96 x i8], align 16
%_17 = icmp sgt i32 %by_start, 0
%. = select i1 %_17, i32 12, i32 8
It’s not perfect - we didn’t have these extra bot and bot5 in our original fix - but it’s much better! Let’s run the profiler 🎶 one last time 🎶.
Remember: cdef_filter4_pri_edged_8bpc_neon had 1,562 samples in the slow Rust baseline, vs. 1,199 in dav1d.
Yes! We are down from 1,562 samples to 1,260 (which is within 5% of dav1d), the ld1 lines are no longer slow, and the pure-Rust fallback works as expected.
Huzzah!
1. Given a specific function that is known to be slow, one can usually compare the implementations and find the culprit - we found an unneeded zero-initialization of a large buffer and missing optimized equality comparisons for a number of small structs. We actually also saw that one of these structs, Mv, was not 4-byte aligned, which I was too quick to dismiss - @daxtens got an additional 1% improvement by fixing this in this PR. ↩︎
2. Again, these are non-interactive clippings, created using the excellent Save Page WE extension and creative use of Delete element. See more in the previous post’s profiling section. ↩︎
3. dav1d_cdef_padding4_edged_8bpc_neon is also a bit slower - but it’s a smaller function overall, so we’re going to ignore that. ↩︎
4. Using cargo asm -p rav1d --lib --asm cdef_filter_neon_erased 2 and cargo asm -p rav1d --lib --llvm cdef_filter_neon_erased 2. ↩︎
5. Specifically, cargo asm -p rav1d --lib --llvm rav1d_cdef_brow 1. ↩︎
6. See ffi_safe.rs ↩︎