In June 2024 I observed a regression that caused Wayland-based compositors to stop working on one of my systems. Later that year, in October, I dedicated a weekend to see if I could figure out what went wrong and where.
Table of Contents
- The Issue
- Starting My Search in the Kernel
- The Problem is Somewhere in Userland
- Using bpftrace(8) to Narrow Down Search
- Finding the Regression in Mesa
- Reporting My Findings
- Project Timeline
- Conclusion
- Thank You!
The Issue
The affected system is my 2005 Fujitsu Lifebook S2110, which is running Arch Linux. The primary symptom is a garbled framebuffer when running a Wayland compositor, such as sway(1), seen here:
The effect doesn’t show up in a screenshot, so I had to snap a photo. Everything else seems to still work. I can see a blob of pixels moving if I mouse around, and I can just about make out the floating terminal I spawned, and htop(1) running within it. So while I can sort of see what’s happening, the system is clearly unusable in this state.
If I exit sway(1) and return to the Linux console, the screen goes back to normal, and I can continue working. The i3(1) window manager, which is based on X11 instead of Wayland, still works as expected, which seems interesting.
A while later, another clue: I happened to notice that this error appears in the kernel ring buffer whenever I start sway(1):
$ sudo dmesg | grep ERROR
[14496.202817] [drm:radeon_crtc_do_set_base [radeon]] *ERROR* trying to scanout microtiled buffer
At this point, I don’t have much else to go off of. It had been a while since I had run Sway on this system, so I’m not sure which specific system upgrade was the one that started this.1 I also don’t know if this problem is actually in the kernel, or if this error occurs due to bad input from somewhere in Sway.
Starting My Search in the Kernel
Since I do have that error message, which I’m fairly certain I haven’t seen before, I decided a good first step would be to find where in the kernel it’s coming from, and why.
$ cd ~/linux
$ rg 'trying to scanout microtiled buffer'
drivers/gpu/drm/radeon/radeon_legacy_crtc.c
467: DRM_ERROR("trying to scanout microtiled buffer\n");
So it seems to be coming from the kernel Radeon GPU driver. The GPU in this laptop is an ATI RS480M integrated in its ATI Mobility Radeon Xpress 200 chipset, so that tracks.
Before looking at the code, I skimmed the kernel git log to see if there might be an obvious recent change that could be the culprit:
$ git log --pretty=format:"%h %an %ad" drivers/gpu/drm/radeon/radeon_legacy_crtc.c | head -n5
f7d17cd4e16a Thomas Zimmermann Mon Jan 16 14:12:27 2023 +0100
da7faee2a158 Thomas Zimmermann Wed Jan 11 14:02:06 2023 +0100
98e3f08f6198 Thomas Zimmermann Wed Jan 11 14:02:05 2023 +0100
720cf96d8fec Ville Syrjälä Tue Jun 14 12:54:49 2022 +0300
27b4118d5c1b Thomas Zimmermann Thu Jan 23 14:59:31 2020 +0100
Okay, no recent changes here, which suggests that the regression is somewhere else. It could be some other change elsewhere in the radeon driver code. I spent some time skimming and grepping commit messages for the entirety of drivers/gpu/drm/radeon/, but nothing obvious jumped out there.
So if we assume this regression isn’t caused by a change in this driver, it must then be caused by a change in the input this driver receives from somewhere else in the system. I’ll study this driver more closely to understand what causes that error message to be emitted.
The Linux Radeon GPU driver is vastly more complex than the simple platform device driver shown in the last post, but that shouldn’t be much of a problem. The skill of finding the start of and then staying firmly on the ‘golden path’ of code of interest is really valuable when studying a large system. Most of my early attempts at studying kernel code failed specifically because I didn’t have a narrow enough area of focus, which then quickly overwhelms the brain. In this case, the goal is to find the specific chain of events leading up to this error being printed. All unrelated code is noise that should be ignored for now.
That error is printed in radeon_legacy_crtc.c:467, which is in this radeon_crtc_do_set_base() function:
int radeon_crtc_do_set_base(struct drm_crtc *crtc,
                            struct drm_framebuffer *fb,
                            int x, int y, int atomic)
{
    /* ... */
    struct drm_framebuffer *target_fb;
    struct drm_gem_object *obj;
    struct radeon_bo *rbo;
    /* ... */
    uint32_t tiling_flags;
    /* ... */
    if (atomic)
        target_fb = fb;
    else
        target_fb = crtc->primary->fb;
    /* ... */
    obj = target_fb->obj[0];
    rbo = gem_to_radeon_bo(obj);
    /* ... */
    radeon_bo_get_tiling_flags(rbo, &tiling_flags, NULL);
    radeon_bo_unreserve(rbo);
    if (tiling_flags & RADEON_TILING_MICRO)
        DRM_ERROR("trying to scanout microtiled buffer\n");
    /* ... */
}
Now, at this point, I have no idea what most of this code is doing, but I can see the error is printed if that tiling_flags bitfield has its RADEON_TILING_MICRO bit set. Figuring out where this value comes from might provide clues for where to look next.
I can see it’s calling radeon_bo_get_tiling_flags() to read the value into a local tiling_flags variable from this rbo struct, which I think stands for “radeon buffer object”. Where does the value in that struct get set? Grepping for RADEON_TILING_MICRO doesn’t turn up any assignments in kernel code, but it does show that this define lives in include/uapi/drm/radeon_drm.h:
/* ... */
#define RADEON_TILING_MACRO 0x1
#define RADEON_TILING_MICRO 0x2
#define RADEON_TILING_SWAP_16BIT 0x4
/* ... */
uapi in that path is short for “userspace API”, meaning this is a value that some userspace program would pass to the kernel. Let’s figure out where that happens. Looking at radeon_bo_get_tiling_flags() next:
void radeon_bo_get_tiling_flags(struct radeon_bo *bo,
                                uint32_t *tiling_flags,
                                uint32_t *pitch)
{
    dma_resv_assert_held(bo->tbo.base.resv);
    if (tiling_flags)
        *tiling_flags = bo->tiling_flags;
    if (pitch)
        *pitch = bo->pitch;
}
It just copies the values for tiling_flags and pitch from the corresponding fields in that buffer object struct bo. clangd lists 7 references to the tiling_flags field of struct radeon_bo, only one of which is a write, here at the very end of radeon_bo_set_tiling_flags() in radeon_object.c:604:
int radeon_bo_set_tiling_flags(struct radeon_bo *bo,
                               uint32_t tiling_flags, uint32_t pitch)
{
    /* ... */
    r = radeon_bo_reserve(bo, false);
    if (unlikely(r != 0))
        return r;
    bo->tiling_flags = tiling_flags;
    bo->pitch = pitch;
    radeon_bo_unreserve(bo);
    return 0;
}
clangd tells me this function is called from two places:
- radeon_gem_set_tiling_ioctl() in radeon_gem.c:555
- radeon_fbdev_create_pinned_object() in radeon_fbdev.c:55
I think it’s safe to ignore that fbdev2 path, since I’m pretty sure that Wayland uses drm3.
In the gem4 path, it does some housekeeping, and then calls radeon_bo_set_tiling_flags() to set the flags and pitch:
int radeon_gem_set_tiling_ioctl(struct drm_device *dev, void *data,
                                struct drm_file *filp)
{
    struct drm_radeon_gem_set_tiling *args = data;
    struct drm_gem_object *gobj;
    struct radeon_bo *robj;
    int r = 0;

    DRM_DEBUG("%d \n", args->handle);
    gobj = drm_gem_object_lookup(filp, args->handle);
    if (gobj == NULL)
        return -ENOENT;
    robj = gem_to_radeon_bo(gobj);
    r = radeon_bo_set_tiling_flags(robj, args->tiling_flags, args->pitch);
    drm_gem_object_put(gobj);
    return r;
}
This function, as the name suggests, is called when a userspace program invokes the RADEON_GEM_SET_TILING ioctl(2) on this device. We can see it in the list of ioctls this radeon driver registers in radeon_drv.c:545:
static const struct drm_ioctl_desc radeon_ioctls_kms[] = {
    /* ... */
    DRM_IOCTL_DEF_DRV(RADEON_INFO, radeon_info_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
    DRM_IOCTL_DEF_DRV(RADEON_GEM_SET_TILING, radeon_gem_set_tiling_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
    DRM_IOCTL_DEF_DRV(RADEON_GEM_GET_TILING, radeon_gem_get_tiling_ioctl, DRM_AUTH|DRM_RENDER_ALLOW),
    /* ... */
};
Userspace programs generally communicate with device drivers by calling open(2) on the appropriate device file and then issuing I/O syscalls on the file descriptor it returned. The ioctl(2) syscall is the main interface for invoking arbitrary driver-specific commands from userspace. It takes a file descriptor, an integer indicating the command/op to call, and a variable number of arguments. The driver ioctl handler then processes the request and returns a status code back to userspace.
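To make that dispatch idea a bit more concrete, here is a minimal userspace sketch of a command table that routes an integer command to a handler, which then reads its arguments through a void pointer. Everything prefixed with fake_ or FAKE_ is made up for illustration, the command numbers are arbitrary, and the “buffer objects” are just a static array. It only mimics the shape of the driver code quoted above and is not how the radeon driver is actually implemented.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a buffer object: only the two fields we care about. */
struct fake_bo {
    uint32_t tiling_flags;
    uint32_t pitch;
};

/* Same three fields as struct drm_radeon_gem_set_tiling. */
struct fake_tiling_args {
    uint32_t handle;
    uint32_t tiling_flags;
    uint32_t pitch;
};

#define FAKE_MAX_HANDLES 4
static struct fake_bo fake_objects[FAKE_MAX_HANDLES];

/* Handles are 1-based in this sketch; anything else has no object behind it. */
static struct fake_bo *fake_lookup(uint32_t handle)
{
    if (handle == 0 || handle > FAKE_MAX_HANDLES)
        return NULL;
    return &fake_objects[handle - 1];
}

/* Copy flags and pitch from the caller's arguments into the object. */
static int fake_set_tiling(void *data)
{
    struct fake_tiling_args *args = data;
    struct fake_bo *bo;

    assert(data != NULL);
    bo = fake_lookup(args->handle);
    if (bo == NULL)
        return -ENOENT;
    bo->tiling_flags = args->tiling_flags;
    bo->pitch = args->pitch;
    return 0;
}

/* Copy flags and pitch from the object back into the caller's arguments. */
static int fake_get_tiling(void *data)
{
    struct fake_tiling_args *args = data;
    struct fake_bo *bo;

    assert(data != NULL);
    bo = fake_lookup(args->handle);
    if (bo == NULL)
        return -ENOENT;
    args->tiling_flags = bo->tiling_flags;
    args->pitch = bo->pitch;
    return 0;
}

/* Command numbers invented for this sketch. */
enum { FAKE_CMD_SET_TILING = 0, FAKE_CMD_GET_TILING = 1, FAKE_CMD_COUNT };

typedef int (*fake_ioctl_fn)(void *data);

/* The table plays the role of the driver's list of registered ioctls. */
static const fake_ioctl_fn fake_ioctls[FAKE_CMD_COUNT] = {
    [FAKE_CMD_SET_TILING] = fake_set_tiling,
    [FAKE_CMD_GET_TILING] = fake_get_tiling,
};

/* Look up the handler for cmd and call it, or reject unknown commands. */
int fake_ioctl(unsigned int cmd, void *data)
{
    if (cmd >= FAKE_CMD_COUNT)
        return -EINVAL;
    return fake_ioctls[cmd](data);
}
```

Calling fake_ioctl(FAKE_CMD_SET_TILING, &args) from a small main() with handle 1, tiling_flags 2 and pitch 32, followed by FAKE_CMD_GET_TILING on the same handle, should hand the same two values back. The real thing differs mostly in that the kernel copies the argument struct across the user/kernel boundary and checks permissions before the handler ever runs.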
The Problem is Somewhere in Userland
So now we know the culprit is some code in userland calling the RADEON_GEM_SET_TILING ioctl(2) with the RADEON_TILING_MICRO flag, which this hardware evidently doesn’t support. That hardly narrows it down! I’m not particularly familiar with the Linux graphics stack, so this ioctl could be coming from any number of places. It could be coming from the compositor directly, or more likely, some library it’s using to talk to the GPU.
To narrow down to the specific bit of userland code responsible for setting these tiling flags, I need a way to track ioctl calls to the kernel, and where they originate. There are many ways to do this. I could attach a debugger, set a breakpoint on ioctl and look at backtraces, or maybe grep strace(1) output combined with some extra logging. These approaches could work, but today, I’ll show a really powerful method, using bpftrace(8).
Using bpftrace(8) to Narrow Down Search
If you write and debug software on Linux, but haven’t used bpftrace(8) yet, you’re about to have your mind blown. Using it here might be slight overkill, but I really think more people need to know about this tool.
bpftrace(8) lets us instrument any code running on the system, both in the kernel and in userland processes, without the overhead of stopping for a breakpoint in a debugger. I’ll use it here to hook up some instrumentation to show me the call stack leading up to this radeon_gem_set_tiling_ioctl() call in the kernel radeon driver.
Starting a graphical environment will take over the Fujitsu’s display, so rather than running bpftrace(8) in a second virtual console and jumping between screens, I’ll instead run it in a remote shell on another machine:
$ ssh s2110.local
$ sudo bpftrace -e 'kprobe:radeon_gem_set_tiling_ioctl { printf("PID %d (%s)\nkernel stack:%suserspace stack:%s", pid, comm, kstack, ustack); }'
Attaching 1 probe...
This invocation attached a tiny eBPF program (just that single printf() call here) to a kernel function named radeon_gem_set_tiling_ioctl. It actually modified the live kernel code in that function to insert a kprobe that calls our little printf() program. That code was analyzed, JIT’d to native machine code, and set up to run in the kernel when it’s called by the probe in radeon_gem_set_tiling_ioctl(). In other words, we just dynamically modified kernel code while it’s running, just like that!
This is like swapping a tire on a car while it’s doing 80MPH on a highway, only way safer!
With that probe attached, I’ll go back to the laptop and run a Wayland compositor. Since this issue isn’t specific to sway(1), I’ll instead use a much simpler compositor called cage(1) for testing. It’s intended for use in kiosks, so it just runs a single graphical application in fullscreen. I’ll run a terminal here, but the specific program doesn’t really matter:
$ cage xfce4-terminal
While cage is starting up, this output appears on my remote shell:
PID 2728 (cage)
kernel stack:
radeon_gem_set_tiling_ioctl+5
drm_ioctl_kernel+171
drm_ioctl+667
radeon_drm_ioctl+78
__x64_sys_ioctl+148
do_syscall_64+129
entry_SYSCALL_64_after_hwframe+118
userspace stack:
ioctl+61
drmIoctl+49
drmCommandWriteRead+33
0x7f9d438ee48c
0x7f9d438a4bd0
0x7f9d438a905d
0x7f9d430edaef
0x7f9d4303deb7
driCreateContextAttribs+795
0x7f9d45dfa24f
0x7f9d45dec2bd
0x7f9d46aa9b4c
wlr_egl_create_with_drm_fd+655
wlr_gles2_renderer_create_with_drm_fd+19
0x7f9d46aaa569
0x564e979da3eb
0x7f9d46827675
__libc_start_main+137
0x564e979db605
< 12 similar backtraces omitted >
How cool is that?! This shows us the complete chain of function calls, across the system call boundary into kernel space, and the kernel call chain leading up to our function of interest.
We can do much better than this, though! I’ll extend that oneliner into a small script, and have it also extract and print the arguments to this ioctl. Recall that the ioctl handler accesses these arguments with a struct drm_radeon_gem_set_tiling:
int radeon_gem_set_tiling_ioctl(struct drm_device *dev, void *data,
                                struct drm_file *filp)
{
    struct drm_radeon_gem_set_tiling *args = data;
    /* ... */
I can just copy that struct definition into my new bpftrace program and use it to access the arguments, like this:5
#!/usr/bin/bpftrace
// copied from include/uapi/drm/radeon_drm.h
// I just changed the types from __u32 -> uint32_t
struct drm_radeon_gem_set_tiling {
    uint32_t handle;
    uint32_t tiling_flags;
    uint32_t pitch;
};

kprobe:radeon_gem_set_tiling_ioctl {
    // arg1 is the second argument to the function, `void *data`.
    $a = (struct drm_radeon_gem_set_tiling *)arg1;
    printf("PID %d (%s) (handle: %d, tiling_flags: %d, pitch: %d)\nkernel stack:%suserspace stack:%s",
           pid, comm, $a->handle, $a->tiling_flags, $a->pitch, kstack, ustack(perf));
}
I also switched to ustack(perf), which has nicer output that shows the corresponding shared library for each symbol in the userland stack trace.
I saved this script in tiling_ioctl.bt. I can then run it in my remote shell, like before:
$ chmod +x tiling_ioctl.bt
$ sudo --preserve-env=DEBUGINFOD_URLS ./tiling_ioctl.bt
Attached 1 probe
Most of the userland call frames in that previous example showed up as memory addresses, because those libraries don’t have debugging information in them. bpftrace(8) supports debuginfod(8), which allows it to fetch that information from a server to give us a nice fully symbolicated backtrace. I just have to ensure bpftrace(8) can see the DEBUGINFOD_URLS environment variable, so I tell sudo to pass it along here.
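If you’re wondering what “symbolicating” a frame actually involves, here is a rough sketch of the core lookup: given a table of function start addresses sorted in ascending order, find the last function that starts at or before the address, and print its name plus the distance into it. The table contents and addresses below are hypothetical, and this sketch assumes every address past a symbol’s start belongs to that symbol. Real tools read symbol tables and debug info from the binaries and also check symbol sizes, so treat this as an illustration of the idea rather than of what bpftrace(8) does internally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

struct sym {
    uint64_t start;   /* address of the first byte of the function */
    const char *name;
};

/* Hypothetical symbol table, sorted by start address. */
static const struct sym symtab[] = {
    { 0x1000, "drmIoctl" },
    { 0x1040, "drmCommandWriteRead" },
    { 0x2000, "ioctl" },
};

/*
 * Binary search for the last symbol whose start address is <= addr.
 * Returns NULL if addr lies below the first symbol in the table.
 */
const struct sym *sym_lookup(const struct sym *tab, size_t n, uint64_t addr)
{
    const struct sym *best = NULL;
    size_t lo = 0, hi = n; /* remaining candidates are in [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;

        if (tab[mid].start <= addr) {
            best = &tab[mid];
            lo = mid + 1;
        } else {
            hi = mid;
        }
    }
    return best;
}

/*
 * Format addr as "name+offset" with a decimal offset, or as a raw hex
 * address when no symbol covers it. Returns what snprintf() returns.
 */
int sym_format(char *buf, size_t len, const struct sym *tab, size_t n,
               uint64_t addr)
{
    const struct sym *s;

    assert(buf != NULL && len > 0);
    s = sym_lookup(tab, n, addr);
    if (s == NULL)
        return snprintf(buf, len, "0x%llx", (unsigned long long)addr);
    return snprintf(buf, len, "%s+%llu", s->name,
                    (unsigned long long)(addr - s->start));
}
```

With this table, formatting the address 0x203d yields ioctl+61, since 0x203d is 61 bytes past 0x2000. An address below 0x1000 falls through to the raw hex form, which is roughly what an unsymbolicated frame looks like.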
Then I’ll run cage on the laptop again, and after a brief delay, this appears in my remote shell:
PID 1032 (cage) (handle: 1, tiling_flags: 0, pitch: 32)
kernel stack:
radeon_gem_set_tiling_ioctl+5
drm_ioctl_kernel+171
drm_ioctl+667
radeon_drm_ioctl+79
__x64_sys_ioctl+148
do_syscall_64+129
entry_SYSCALL_64_after_hwframe+118
userspace stack:
7f5bd5b1674d ioctl+61 (/usr/lib/libc.so.6)
7f5bd58e2691 drmIoctl+49 (/usr/lib/libdrm.so.2.126.0)
7f5bd58e5d91 drmCommandWriteRead+33 (/usr/lib/libdrm.so.2.126.0)
7f5bd2aeeacc radeon_bo_set_metadata+316 (/usr/lib/libgallium-25.2.4-arch1.2.so)
7f5bd2aa5210 r300_texture_create_object+624 (/usr/lib/libgallium-25.2.4-arch1.2.so)
7f5bd2aa969d r300_create_context+3869 (/usr/lib/libgallium-25.2.4-arch1.2.so)
7f5bd22edb6f st_api_create_context+111 (/usr/lib/libgallium-25.2.4-arch1.2.so)
7f5bd223def7 dri_create_context+631 (/usr/lib/libgallium-25.2.4-arch1.2.so)
7f5bd2241f7b driCreateContextAttribs+795 (/usr/lib/libgallium-25.2.4-arch1.2.so)
7f5bd500a2ef dri2_create_context+671 (/usr/lib/libEGL_mesa.so.0.0.0)
7f5bd4ffc2bd eglCreateContext+317 (/usr/lib/libEGL_mesa.so.0.0.0)
7f5bd5cc3e7a egl_init+218 (/usr/lib/libwlroots-0.19.so)
7f5bd5cc429b wlr_egl_create_with_drm_fd+763 (/usr/lib/libwlroots-0.19.so)
7f5bd5cc4673 wlr_gles2_renderer_create_with_drm_fd+19 (/usr/lib/libwlroots-0.19.so)
7f5bd5cc4a77 renderer_autocreate.lto_priv.0+887 (/usr/lib/libwlroots-0.19.so)
5636c94413e6 main+838 (/usr/bin/cage)
7f5bd5a27675 __libc_start_call_main+117 (/usr/lib/libc.so.6)
7f5bd5a27729 __libc_start_main_alias_1+137 (/usr/lib/libc.so.6)
5636c94427d5 _start+37 (/usr/bin/cage)
< 12 similar backtraces omitted >
Nice! This instantly narrows down our search to just a few libraries. We can see that cage calls into libwlroots, which then eventually calls into libEGL_mesa, which calls into libgallium, and so on. The first line now also shows the arguments to this ioctl.6
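The tiling_flags argument is printed as a plain integer, so it helps to remember that it’s a bitfield. Here is a small sketch that decodes such a value using the three flag values from include/uapi/drm/radeon_drm.h quoted earlier. The helper names is_microtiled() and describe_tiling() are my own inventions for this example, and any bits other than those three are simply ignored.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Values as quoted from include/uapi/drm/radeon_drm.h earlier in this post. */
#define RADEON_TILING_MACRO      0x1
#define RADEON_TILING_MICRO      0x2
#define RADEON_TILING_SWAP_16BIT 0x4

/* The same bit test the driver performs before printing its error. */
int is_microtiled(uint32_t tiling_flags)
{
    return (tiling_flags & RADEON_TILING_MICRO) != 0;
}

/*
 * Write the names of the known bits set in tiling_flags into buf,
 * separated by '|', or "none" if none of the three known bits is set.
 * Returns the number of characters written, or -1 if buf is too small.
 */
int describe_tiling(char *buf, size_t len, uint32_t tiling_flags)
{
    static const struct {
        uint32_t bit;
        const char *name;
    } names[] = {
        { RADEON_TILING_MACRO, "MACRO" },
        { RADEON_TILING_MICRO, "MICRO" },
        { RADEON_TILING_SWAP_16BIT, "SWAP_16BIT" },
    };
    size_t used = 0;

    assert(buf != NULL && len > 0);
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof(names) / sizeof(names[0]); i++) {
        int w;

        if (!(tiling_flags & names[i].bit))
            continue;
        w = snprintf(buf + used, len - used, "%s%s",
                     used ? "|" : "", names[i].name);
        if (w < 0 || (size_t)w >= len - used)
            return -1; /* output did not fit */
        used += (size_t)w;
    }
    if (used == 0) {
        int w = snprintf(buf, len, "none");

        if (w < 0 || (size_t)w >= len)
            return -1;
        used = (size_t)w;
    }
    return (int)used;
}
```

For the tiling_flags: 0 shown in the trace above, is_microtiled() returns 0 and describe_tiling() produces “none”. For a value of 2, is_microtiled() returns 1, which is exactly the condition that makes the kernel print its “trying to scanout microtiled buffer” error.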
I might do an entire post about bpftrace(8) at some point, because both how it works and the things it can do are absolutely incredible. :]
It looks like the bulk of GPU-specific logic is happening in libgallium, which is a component of Mesa, the standard 3D graphics library on Linux:
$ pacman -Qo /usr/lib/libEGL_mesa.so.0.0.0
/usr/lib/libEGL_mesa.so.0.0.0 is owned by mesa 1:25.2.4-2
$ pacman -Qo /usr/lib/libgallium-25.2.4-arch1.2.so
/usr/lib/libgallium-25.2.4-arch1.2.so is owned by mesa 1:25.2.4-2
It’s still possible that the regression is in libwlroots or libdrm, but I’ll start by taking a closer look at mesa, since that seems to occupy the majority of this stack trace.
Finding the Regression in Mesa
To confirm that the regression is actually in Mesa and not some other package, I’ll first check to see if downgrading it fixes the problem. Since I run Arch btw on my main desktop PC as well, I can use its much faster CPU to build a package that I can then scp(1) to the Fujitsu for testing.
I’ll start by preparing the build environment. For this first test, I chose an arbitrary Mesa commit, 2c48ce81, from about a year before this issue started:
$ git clone https://aur.archlinux.org/mesa-git.git
Cloning into 'mesa-git'...
remote: Enumerating objects: 994, done.
remote: Counting objects: 100% (994/994), done.
remote: Compressing objects: 100% (682/682), done.
remote: Total 994 (delta 341), reused 940 (delta 311), pack-reused 0 (from 0)
Receiving objects: 100% (994/994), 398.42 KiB | 9.96 MiB/s, done.
Resolving deltas: 100% (341/341), done.
$ cd mesa-git
$ time makepkg --syncdeps --nobuild
$ pushd src/mesa/
$ git checkout 2c48ce81a82436d2aff3e0d6b9169d83e33038bf
$ popd
$ time makepkg -efs
< ... >
==> WARNING: Using existing $srcdir/ tree
==> Starting pkgver()...
==> Updated version: mesa-git 23.2.0_devel.174122.2c48ce81a82.d41d8cd-1
==> Starting build()...
The Meson build system
Version: 1.8.2
Source dir: /home/vkoskiv/projects/FINISHED/fujitsu_microtile/mesa-git/src/mesa
Build dir: /home/vkoskiv/projects/FINISHED/fujitsu_microtile/mesa-git/src/_build
Build type: native build
mesa/meson.build:21:0: ERROR: Values "softpipe, llvmpipe" for option "gallium-drivers" are not in allowed choices: "auto, kmsro, radeonsi, r300, r600, nouveau, freedreno, swrast, v3d, vc4, etnaviv, tegra, i915, svga, virgl, panfrost, iris, lima, zink, d3d12, asahi, crocus"
A full log can be found at /home/vkoskiv/projects/FINISHED/fujitsu_microtile/mesa-git/src/_build/meson-logs/meson-log.txt
==> ERROR: A failure occurred in build().
Aborting...
Looks like this older checkout of Mesa doesn’t support some options this newer Arch PKGBUILD file is setting, so I’ll remove those:
diff --git a/PKGBUILD b/PKGBUILD
index ecc36c4..9b4efe3 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -256,7 +256,7 @@ build () {
-D b_ndebug=true
-D b_lto=false
-D egl=enabled
- -D gallium-drivers=r300,r600,radeonsi,nouveau,virgl,svga,softpipe,llvmpipe,i915,iris,crocus,zink
+ -D gallium-drivers=r300,r600,radeonsi,nouveau,virgl,svga,i915,iris,crocus,zink
-D gallium-extra-hud=true
-D gallium-rusticl=${_rusticl}
-D gallium-va=enabled
Then try building again:
$ time makepkg -efs
< ... >
mesa/meson.build:21:0: ERROR: Option "glvnd" value enabled is not boolean (true or false).
Okay, I think I can start to see a pattern here. After 8 more rounds of fixing errors one at a time, it finally builds:
$ time makepkg -efs
< ... >
==> Finished making: mesa-git 23.2.0_devel.174122.2c48ce81a82.d41d8cd-1 (Sat 18 Oct 2025 04:01:39 PM EEST)
real 4m59.269s
user 31m8.683s
sys 3m4.703s
And here is the final PKGBUILD diff to get this revision of Mesa to build:
diff --git a/PKGBUILD b/PKGBUILD
index ecc36c4..562a033 100644
--- a/PKGBUILD
+++ b/PKGBUILD
@@ -256,24 +256,23 @@ build () {
-D b_ndebug=true
-D b_lto=false
-D egl=enabled
- -D gallium-drivers=r300,r600,radeonsi,nouveau,virgl,svga,softpipe,llvmpipe,i915,iris,crocus,zink
+ -D gallium-drivers=r300,r600,nouveau,virgl,svga,i915,iris,crocus,zink
-D gallium-extra-hud=true
- -D gallium-rusticl=${_rusticl}
+ -D gallium-rusticl=false
-D gallium-va=enabled
-D gbm=enabled
-D gles1=disabled
-D gles2=enabled
- -D glvnd=enabled
+ -D glvnd=true
-D glx=dri
-D libunwind=enabled
- -D llvm=enabled
+ -D llvm=disabled
-D lmsensors=enabled
-D microsoft-clc=disabled
-D platforms=x11,wayland
-D valgrind=disabled
- -D video-codecs=all
- -D vulkan-drivers=amd,intel,intel_hasvk,swrast,virtio,nouveau
- -D vulkan-layers=device-select,intel-nullhw,overlay,anti-lag
+ -D vulkan-drivers=amd,intel,intel_hasvk,virtio
+ -D vulkan-layers=device-select,intel-nullhw,overlay
-D tools=[]
-D zstd=enabled
-D buildtype=plain
@@ -281,7 +280,6 @@ build () {
--force-fallback-for=syn,paste,rustc-hash
-D prefix=/usr
-D sysconfdir=/etc
- -D legacy-x11=dri2
)
# Build only minimal debug info to reduce size
Next, I’ll transfer the package to the S2110 and install it:
$ scp mesa-git-23.2.0_devel.174122.2c48ce81a82.d41d8cd-1-x86_64.pkg.tar.zst s2110.local:
$ ssh s2110.local
$ sudo pacman -U mesa-git-23.2.0_devel.174122.2c48ce81a82.d41d8cd-1-x86_64.pkg.tar.zst
Installing this custom mesa-git package replaces the upstream mesa from the Arch repository, but I can easily switch back to it with pacman -S mesa.
I then fired up sway(1) on the laptop, and it seems to work as expected! Confirming with glxinfo to ensure it’s actually running the expected version:
$ glxinfo | grep 'OpenGL version'
OpenGL version string: 2.1 Mesa 23.2.0-devel (git-2c48ce81a8)
So now we know for sure that the regression is somewhere in Mesa, and that it appeared at some point between this version, 23.2.0 (2c48ce81a8) from July 2023, and the version where I noticed the problem, 24.2.3-1 from September 2024. My report about this regression to the Mesa developers would be much more useful if I could point to a specific commit that broke Mesa on my hardware, so I’ll do a git bisect next.
The Arch Wiki has good instructions for bisecting Arch packages. If you aren’t familiar with it, git bisect finds commits that break something or introduce a bug. It first asks for a “known good” and a “known bad” commit, and then performs a binary search to find the first bad commit. In each step, it asks if the commit it’s checked out is good or bad, so here I’ll fiddle with the build options, build and install mesa-git, check if sway(1) works, mark that commit good/bad, then rinse and repeat.
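The binary search itself is simple enough to sketch in a few lines. The version below assumes a strictly linear history, numbered from oldest to newest, where every commit before the first bad one is good and every commit from there on is bad. Real git history has merges, and git bisect deals with those too, so this is only the core idea. The function names and the is_bad callback are hypothetical. In my case, “calling is_bad” meant building Mesa, installing it on the laptop and starting Sway.

```c
#include <assert.h>
#include <stddef.h>

/* Callback that reports whether the commit at position idx is bad. */
typedef int (*is_bad_fn)(size_t idx, void *ctx);

/*
 * Find the first bad commit in a linear history.
 * good_idx must be a known good commit, bad_idx a known bad one, and
 * good_idx < bad_idx. If steps is non-NULL, the number of commits that
 * had to be tested is stored there.
 */
size_t bisect_first_bad(size_t good_idx, size_t bad_idx,
                        is_bad_fn is_bad, void *ctx, unsigned *steps)
{
    unsigned tested = 0;

    assert(good_idx < bad_idx);
    /* Invariant: good_idx is good, bad_idx is bad. */
    while (bad_idx - good_idx > 1) {
        size_t mid = good_idx + (bad_idx - good_idx) / 2;

        if (is_bad(mid, ctx))
            bad_idx = mid;  /* first bad commit is at mid or earlier */
        else
            good_idx = mid; /* first bad commit is after mid */
        tested++;
    }
    if (steps != NULL)
        *steps = tested;
    return bad_idx;
}

/* Example predicate: ctx points at the index of the commit that broke things. */
int bad_from(size_t idx, void *ctx)
{
    return idx >= *(const size_t *)ctx;
}
```

Each test halves the remaining range, so a gap of around eight thousand commits between the good and bad endpoints needs at most 13 tests. That is what makes bisecting feasible even when every single test involves a full rebuild.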
Unfortunately, as that initial build already hinted, the build system the Mesa developers chose, Meson, doesn’t appear to be very bisect-friendly. I have no idea if this is a problem with Meson itself, or with how the Mesa project is using it, but to cut a long story short, this was a pretty painful bisect.
Pretty much every step of the bisect required a bunch of manual fiddling of the build configuration to fix multiple errors one at a time, often just changing obvious things like some_option=enabled -> some_option=true or vice versa. There was a discussion about this on the mesa-dev mailing list, but that was after I had reported my findings.
Here is the log of my bisect:
$ git bisect log
git bisect start
# status: waiting for both good and bad commits
# bad: [b339c525f449f19f6515201509d8a7455d239195] Revert "ci/lima: Temporarily disable"
git bisect bad b339c525f449f19f6515201509d8a7455d239195
# status: waiting for good commit(s), bad commit known
# good: [03ecd8b0a517e528d790a3d252194f7b898d0e8f] VERSION: bump for 24.0.0
git bisect good 03ecd8b0a517e528d790a3d252194f7b898d0e8f
# good: [2fab92ed9adcb6fb4e3d1480d5eae23f6e82a09b] docs: close the 23.2 cycle
git bisect good 2fab92ed9adcb6fb4e3d1480d5eae23f6e82a09b
# bad: [3d6957268b24a74519adda1a93d3653df55d4961] aco: use new common helpers for building buffer descriptors
git bisect bad 3d6957268b24a74519adda1a93d3653df55d4961
# good: [7c480c206622b525713a7caa53abe01d736bb8b5] nouveau/ci: only trigger jobs for relevant changes
git bisect good 7c480c206622b525713a7caa53abe01d736bb8b5
# bad: [4408aff896a87fca231e3856f4ac154a1bfc35ac] tu: Fix missing implementation of creating images from swapchains
git bisect bad 4408aff896a87fca231e3856f4ac154a1bfc35ac
# bad: [eb693cfec6c3b6d264e6684b0cfbef6c780e385c] radeonsi/vcn: use num_instances from radeon_info
git bisect bad eb693cfec6c3b6d264e6684b0cfbef6c780e385c
# bad: [561fae6845479b81d8f41f23376c469524004166] nvk: fix valve segfault from setting a descriptor set from NULL
git bisect bad 561fae6845479b81d8f41f23376c469524004166
# bad: [08e899852b61a85e73caa2a5372f697bd3c96c6b] isaspec: Remove not used isa_decode_hook
git bisect bad 08e899852b61a85e73caa2a5372f697bd3c96c6b
# bad: [72c3769437926906b679ee61e22a9cc0685b1ec2] v3dv: add helper to check if we need to use a draw for a depth/stencil clear
git bisect bad 72c3769437926906b679ee61e22a9cc0685b1ec2
# bad: [b48a101d8f54ac835c4d988ea56216fd435bbd8a] aco/builder: improve v_mul_imm for negative imm
git bisect bad b48a101d8f54ac835c4d988ea56216fd435bbd8a
# good: [5bbb279e7d6bc844a98621dd27f38f95e826d30d] radeonsi: set the lower_mediump_io callback for GLSL
git bisect good 5bbb279e7d6bc844a98621dd27f38f95e826d30d
# bad: [f424ef18010751aae1e70ebda363ada0bed82bda] r300: enable tiling for scanout to fix DRI3 performance
git bisect bad f424ef18010751aae1e70ebda363ada0bed82bda
# good: [70fd817278d101e1cca8ca062d90d3db8073e9ea] st/mesa: skip a few NIR passes that don't work with lowered IO
git bisect good 70fd817278d101e1cca8ca062d90d3db8073e9ea
$ git bisect good
Bisecting: 0 revisions left to test after this (roughly 1 step)
[58b773bd9a4c108ee7c2b8a1405f832fa147b13a] r300: port scanout pitch alignment from the DDX to fix DRI3
$ git bisect good
f424ef18010751aae1e70ebda363ada0bed82bda is the first bad commit
commit f424ef18010751aae1e70ebda363ada0bed82bda
Author: Marek Olšák <maraeo@gmail.com>
Date: Wed Mar 13 17:24:28 2024 -0400
r300: enable tiling for scanout to fix DRI3 performance
Also don't use square tiling for scanout because the DDX doesn't use it
either.
Reviewed-off-by: Pavel Ondračka <pavel.ondracka@gmail.com>
Part-of: <https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/28209>
src/gallium/drivers/r300/ci/r300-rv530-nohiz-fails.txt | 1 -
src/gallium/drivers/r300/r300_texture.c | 4 ++--
src/gallium/drivers/r300/r300_texture_desc.c | 4 +++-
3 files changed, 5 insertions(+), 4 deletions(-)
This seems promising, since the commit message mentions something about tiling!7 I reverted this commit from the latest version, installed it, and:
It works again!
Reporting My Findings
This should be enough information for a bug report, so I gathered all my findings and emailed them to the mesa-dev mailing list. A short while later, a Mesa maintainer replied asking me to submit an issue on GitLab with pictures, so I submitted issue 12010 a few days later after registering an account on the FreeDesktop GitLab.
A few months after my report, Mesa version 24.3.3 containing the fix landed in the Arch package repository. I ran pacman -Syu on the S2110, and it happily runs Sway again! :]
Project Timeline
- 2024-10-05: I spend a few hours investigating and conclude the regression must be in Mesa.
- 2024-10-08: I finish a git bisect of Mesa, and report my findings on the mesa-dev mailing list.
- 2024-10-12: I file an issue on GitLab with photos showing the corrupted framebuffer, per the request of a Mesa maintainer.
- 2024-12-14: Merge Request 32638 is filed to fix the regression.
- 2024-12-30: After two reports that the fix is working, the MR is merged.
- 2025-01-03: Mesa 24.3.3 is released, and contains the fix for this regression.
- 2025-01-08: I upgrade Arch on the S2110, and Sway now works again! :]
Conclusion
This hardware is absolutely ancient, so I was a bit worried about burdening the Mesa maintainers with my bug report. I did consider submitting a patch directly, but I wasn’t convinced that simply reverting the offending commit was the correct approach. Eventually, a few other people reported that they were having this same issue on their systems, so it was a relief to find out I wasn’t the only affected user. Needless to say, I want to extend a huge thank you to the maintainers for their hard work on maintaining and improving Mesa (and for taking the time to address regressions affecting 20+ year old hardware :])
As I mentioned in my previous post, I still regularly use this laptop despite its age, so it’s nice to be able to use Sway on it again. I later found out that the reason i3(1) was unaffected by this regression likely has to do with the fact that, unlike Sway, it’s actually not using GPU acceleration at all! I haven’t been able to figure out why it prefers llvmpipe. If you have ideas, let me know!
Thank You!
Thank you for reading! The response to my first post was so far beyond anything I could have ever imagined. It’s been a little over a week now, and I’m still floored. Over 600 points on Hacker News? That really happened?! I don’t think I’ve ever made anything that has been seen by this many people. And numbers aside, all the kind words and encouragement I’ve received in comments and messages is something I don’t think I’ll ever be able to forget. I can’t thank all of you enough! :’]
As before, I’d really appreciate feedback on how to improve my writing, so if you have any questions, corrections, suggestions or other thoughts, do get in touch!
The topic for my next post will be either electronics, or vintage computing, depending on which draft I finish up first. I’m quite excited about both! Release date is TBD, but you can subscribe to my RSS feed to be notified when these are published.
1. The distribution running on this system, Arch Linux, is a rolling-release distribution, so updates to different packages are released very frequently.
4. Graphics Execution Manager
5. Kernel datatypes are actually usually available automatically via BPF Type Format (BTF) metadata, but I copied the struct from the kernel header here for clarity.
6. I didn’t record this output when I was doing my investigation in 2024, so this output is for the latest versions of these libraries at time of writing (October 2025). We’d also be able to see the invalid tiling_flags value of 2 here, if I were running Mesa <24.3.3.
7. Before bisecting, I had actually already marked this same commit as “suspicious” from just skimming git logs. I even tried reverting it, but my testing methodology was flawed. This was before I had set up the full Arch Build System, so instead of installing the Mesa libraries, I tried the LD_PRELOAD trick and evidently neglected to verify that it was actually using these injected libraries.