Beginners coming into our little corner of the programming world have it rough. Normal CPU-centric programming tends to start out with a “Hello World” sample, which can be completed in mere minutes. It takes far longer to simply download the toolchains and set them up. If you’re on a developer-friendly OS, even that can be completed in seconds.
However, in the graphics world, young whippersnappers cut their teeth at rendering the elusive “Hello Triangle” to demonstrate that yes, we can indeed do what our forebears accomplished 40 years ago, except with 20x the effort and 100000x the performance.
There’s no shortage of examples of beginners rendering a simple triangle (or a cube), and with the new APIs having completely displaced the oxygen of older APIs, there is a certain expectation of ridiculous complexity and raw grit required to tickle some pixels on the display. 1000 lines of code, two weeks of grinding, debugging black screens, etc. Something is obviously wrong here, and it’s not going to get easier.
I would argue that trying to hammer through the brick wall of graphics is the wrong approach in 2025. Graphics itself is less and less relevant for any hopeful new GPU programmer. Notice I wrote “GPU programmer”, not graphics programmer, because most interesting work these days happens with compute shaders, not traditional “graphics” rendering.
Instead, I would argue we should start teaching compute with a debugger/profiler first mindset, building up the understanding of how GPUs execute code, and eventually introduce the fixed-function rasterization pipeline as a specialization once all the fundamentals are already in place. The raster pipeline was simple enough to teach 20 years ago, but those days are long gone, and unless you plan to hack on pre-historic games as a hobby project, it’s an extremely large API surface to learn.
When compute is the focus, there are a lot of APIs we could ponder, like CUDA and OpenCL, but I think Vulkan compute is the best compute-focused API to start out with. I’m totally not biased, obviously 😀 The end goal is of course to also understand the graphics pipeline, and pure compute APIs will not help you there.
Goal for this blog
I don’t intend to write a big book here that has all the answers on how to become a competent GPU programmer. Instead, I want to try outlining some kind of “meta-tutorial” that could be fleshed out further. I’ve been writing compute shaders since the release of OpenGL 4.3 ages ago and I still learn new things.
To abstract or not to abstract
For this exercise, I will rely on a mid-level API abstraction like my own Granite. I don’t think throwing developers into the raw API is the best idea to begin with, but there must be some connection with the underlying API, i.e., not the kind of multi-API abstraction that barely resembles the actual API, which is what you’ll typically find in large production engines. Granite is a pure Vulkan abstraction I’ve been chiseling away at for years for my own needs (I ship it in tons of side projects and stuff), but it’s not really an API I’ve intended others to actively use and ship software on. Migrating away from the training wheels quickly is important though, and compute makes that fairly painless. Granite is just one of many approaches to tackling Vulkan, and the intent is not to present this as the one true way.
The debugging first approach
Getting something to show up on screen is important to keep the dopamine juices flowing. Fortunately, we can actually do this without messing around with graphics directly. With RenderDoc captures we get debugging + something visual in the same package, and learning the tooling early is critical to being effective. Debugging sufficiently interesting GPU code is impossible without it.
Shading language
The debug flow I propose with RenderDoc will rely on a lot of shader replacements and roundtrips via SPIRV-Cross’ GLSL backend, so Vulkan GLSL is the appropriate language to start with. It’s more or less a dead language at this point, but it’s also the most documented language that has support for buffer device addresses, which I will introduce right away to avoid hitting the brick wall of descriptors and binding models up front. This is a very compute-centric move, but it makes other parts of the API easier to grasp later.
HLSL from the Direct3D ecosystem is a popular option, but as a compute language, HLSL is weaker than Vulkan GLSL in my experience, since it lacks a lot of features that come up in more interesting compute workloads. Being bilingual in this area is unavoidable these days, though. No matter which language you use, someone will call you a filthy degenerate anyway. :v
Deferring synchronization
Being debugger-centric, we can avoid poking at explicit synchronization for a very long time, and once we get there, we can simplify a ton. You can do a lot of interesting things with a single dispatch, after all.
Writing the first program
Here’s a very basic program that copies some data around. It should build trivially on Linux or Windows with the usual compilers. Make sure to clone or symlink Granite so that it can be picked up by the CMake build.
If you try to run this, the output might look something like this:
...
[INFO]: Enabling VK_LAYER_KHRONOS_validation.
[INFO]: Enabling instance extension: VK_EXT_debug_utils.
[INFO]: Found Vulkan GPU: AMD Radeon RX 6800 (RADV NAVI21)
[INFO]: API: 1.4.328
[INFO]: Driver: 25.2.99
[INFO]: Found Vulkan GPU: NVIDIA GeForce RTX 4070
[INFO]: API: 1.4.312
[INFO]: Driver: 580.328.576
[INFO]: Using Vulkan GPU: AMD Radeon RX 6800 (RADV NAVI21)
[INFO]: Enabling device extension: VK_KHR_external_semaphore_fd.
[INFO]: Enabling device extension: VK_KHR_external_memory_fd.
[INFO]: Enabling device extension: VK_EXT_external_memory_dma_buf.
[INFO]: Enabling device extension: VK_EXT_image_drm_format_modifier.
[INFO]: Enabling device extension: VK_KHR_calibrated_timestamps.
[INFO]: Enabling device extension: VK_EXT_conservative_rasterization.
[INFO]: Enabling device extension: VK_KHR_compute_shader_derivatives.
[INFO]: Enabling device extension: VK_KHR_performance_query.
[INFO]: Enabling device extension: VK_EXT_memory_priority.
[INFO]: Enabling device extension: VK_EXT_memory_budget.
[INFO]: Enabling device extension: VK_EXT_device_generated_commands.
[INFO]: Enabling device extension: VK_EXT_mesh_shader.
[INFO]: Enabling device extension: VK_EXT_external_memory_host.
[INFO]: Enabling device extension: VK_KHR_fragment_shader_barycentric.
[INFO]: Enabling device extension: VK_EXT_image_compression_control.
[INFO]: Graphics queue: family 0, index 0.
[INFO]: Compute queue: family 1, index 0.
[INFO]: Transfer queue: family 1, index 1.
[INFO]: Detected attached tool:
[INFO]: Name: Khronos Validation Layer
[INFO]: Description: Khronos Validation Layer
[INFO]: Version: 327
[INFO]: Detected tool which cares about debug markers.
[INFO]: Allocating 64.0 MiB on heap #1 (mode #3), before allocating budget: (0.0 MiB / 15096.3 MiB) [0.0 / 16368.0].
[INFO]: Allocating 64.0 MiB on heap #0 (mode #0), before allocating budget: (0.0 MiB / 64216.8 MiB) [0.0 / 64372.4].
[ERROR]: Failed to load RenderDoc, make sure RenderDoc started the application in capture mode.
[INFO]: Allocating 64.0 MiB on heap #1 (mode #1), before allocating budget: (64.6 MiB / 15096.3 MiB) [64.0 / 16368.0].
Capturing Vulkan code in RenderDoc
The code we just wrote executes on the GPU, but we have no easy way to observe the code actually running on the device. This is where RenderDoc comes in. Point it at the executable we built.
After launching, the capture happens automatically, and when the process terminates, the capture should appear.
Clicking on the copy command and double-clicking the destination buffer, we can see the raw contents:
The zero-initialization flag we passed into buffer creation was technically not needed, but it helped make the capture a little easier to understand. That clear happened automagically inside Granite. Normally, memory is not assumed to be zero-cleared on allocation.
Running some actual shaders
Instead of using copies, we can create our own little memcpy. Here's an updated sample gist. To keep things simple, we can use shaderc's method of compiling GLSL into a C header file.
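The gist isn't reproduced in full here, but the interesting part is the shader. A copy.comp built on buffer device addresses looks roughly like the following sketch (the names and push constant layout are mine, not necessarily the gist's):

#version 460
#extension GL_EXT_buffer_reference : require

layout(local_size_x = 16) in;

// A buffer_reference block behaves like a typed GPU pointer.
layout(buffer_reference, std430) buffer WordBuffer
{
    uint words[];
};

// The raw 64-bit device addresses are passed straight in as push constants.
layout(push_constant) uniform Registers
{
    WordBuffer dst;
    WordBuffer src;
};

void main()
{
    uint index = gl_GlobalInvocationID.x;
    dst.words[index] = src.words[index];
}

The command below turns the shader into a header we can embed directly in the C++ code: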
glslc -o copy.h -mfmt=c --target-env=vulkan1.2 copy.comp
Vulkan 1.2 is used here since that introduced buffer device addresses in core. Building and capturing again gives us:
To inspect the push constants, look under the uniform buffers section:
RenderDoc understands how to resolve pointers in buffers into links that open the relevant buffer.
If you click on an event before the dispatch you’ll see the writes disappear.
Shader replacement workflow with SPIRV-Cross and debug prints
This workflow is extremely powerful for difficult debugging scenarios and I cannot do my job without it. It’s imperative to learn this early. The Debug option gives you a more traditional step debugger. Depending on the bug, that may be the correct tool (e.g. inspecting a particular broken pixel), but in my experience, when working with a ton of GPU threads in parallel, it’s often necessary to study them in aggregate to see what is going on, since you might not even know which thread is at fault to begin with.
For now, select Edit -> Decompile with SPIRV-Cross.
The Vulkan API uses the SPIR-V intermediate representation and SPIRV-Cross converts this back to equivalent GLSL. Fortunately, this result looks very similar to our original shader. This is one of the main reasons I prefer working with GLSL since the translation back and forth to SPIR-V is the least lossy compared to alternatives.
E.g. we can hack in some debug prints, hit Apply, and the dispatch will have messages attached to it. RenderDoc implements debugPrintfEXT by rewriting the SPIR-V to write the printf values back to the host. The Vulkan driver itself does not understand how to printf; it will simply ignore the SPIR-V instruction.
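In its smallest form, a debug print needs only the extension directive and one call; in practice you would splice the call into the decompiled shader rather than replace it wholesale. A stripped-down sketch:

#version 460
#extension GL_EXT_debug_printf : require

layout(local_size_x = 16) in;

void main()
{
    // Every invocation logs its own ID; RenderDoc collects the output per dispatch.
    debugPrintfEXT("Hello from invocation %u\n", gl_GlobalInvocationID.x);
}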
Shader replacement like this is not just for debug prints; you can modify the code and see the results without having to recompile the entire program and recapture.
For fun, debug print “Hello World” instead, and you have your checkbox ticked off.
Inspecting ISA
If you’re on a driver that exposes it, you can study the machine code. For this purpose, I highly recommend AMD GPUs with the RADV driver on Linux. The ISA is arguably the most straightforward to read. If you don’t have an AMD card lying around, all Mesa drivers should give you ISA no matter the graphics card.
; s6 = WorkgroupID.x
; v0 = LocalInvocationID.x
; GlobalInvocationID.x =
; (WorkgroupID.x << 4) + LocalInvocationID.x
v_lshl_add_u32 v0, s6, 4, v0
; v1 = int(v0) >> 31, i.e. a simple sign extend
; s4, s5: Holds src GPU pointer
; This is a slower 64-bit address computation path which
; only shows up for raw device addresses
v_ashrrev_i32_e32 v1, 31, v0
v_lshlrev_b64 v[0:1], 2, v[0:1]
v_add_co_u32 v2, vcc_lo, s4, v0
v_add_co_ci_u32_e32 v3, vcc_lo, s5, v1, vcc_lo
; Load 32-bits from pointer
global_load_dword v2, v[2:3], off
; In parallel, compute dest address with 64-bit math.
v_add_co_u32 v4, vcc_lo, s2, v0
v_add_co_ci_u32_e32 v5, vcc_lo, s3, v1, vcc_lo
; Wait until the read request completes
s_waitcnt vmcnt(0)
; Store
global_store_dword v[4:5], v2, off
s_endpgm
Understanding how a compute dispatch is organized
Before trying to make sense of this, we need a mental model for how GPU compute executes on the device. This model was more or less introduced by CUDA in 2007 and has remained effectively unchanged since, neat!
At the highest level, on the CPU side, we dispatch a 3D grid of workgroups. In this sample, we just have a 1x1x1 grid of workgroups, i.e. a single workgroup:
cmd->dispatch(1, 1, 1);
For every workgroup, there is another 3D grid of invocations. Multiple invocations work together and are able to efficiently communicate with each other. Communicating across workgroups is possible in some situations, but requires some careful juggling.
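Inside the shader, this hierarchy is exposed through the standard compute built-ins, and the relationship below is exactly the address math we saw the ISA do by hand earlier:

// All of these are uvec3 built-ins available in any compute shader:
//   gl_NumWorkGroups       how many workgroups were dispatched (the CPU-side grid)
//   gl_WorkGroupID         which workgroup this invocation belongs to
//   gl_WorkGroupSize       the local_size declared in the shader
//   gl_LocalInvocationID   where this invocation sits inside its workgroup
//
// The flattened ID we indexed the buffers with is derived from them:
//   gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID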
Why is there a two level hierarchy?
GPUs are extremely parallel machines. To get optimal performance, we have to map the very scalar-looking code to SIMT. The model employed for essentially all modern GPUs is that one lane in the vector units maps to one thread.
Inside the workgroup, the invocations are split up into subgroups. The mental model to understand the distinction is:
- Workgroup -> runs concurrently on the same shader core
- Subgroup -> runs in lock-step in a SIMT fashion

For a workgroup to be running well, the number of invocations in it should be an integer multiple of the subgroup size, otherwise there will be lanes doing nothing, and that’s no fun.
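These quantities are also visible from inside a shader through the subgroup built-ins; the sample doesn’t use them, but for orientation:

// Requires: #extension GL_KHR_shader_subgroup_basic : require
//   gl_SubgroupSize          number of lanes running in lock-step
//   gl_NumSubgroups          how many subgroups the workgroup was split into
//   gl_SubgroupID            which subgroup this invocation landed in
//   gl_SubgroupInvocationID  which lane within the subgroup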
The subgroup sizes in the wild vary quite a lot, but there is an upper legal limit of 128. In practice, these are the values you can expect to find in the wild:
- 4: Really old Mali Bifrost, old iPhones
- 8: Intel Arc Alchemist
- 16: Intel Arc Battlemage, Mali Valhall, Intel (runs slower if not Battlemage)
- 32: AMD RDNA, NVIDIA, Intel upper limit (runs even slower)
- 64: Adreno, AMD GCN + RDNA
- 128: Adreno

Some vendors support multiple subgroup sizes. Usually we don’t have to care too much about this until we graduate to the more hardcore levels of compute shader programming, but Vulkan gives you control to force subgroup sizes when need be.
For desktop use cases, catering to the range of 16 to 64 is reasonable. In the example we’ve been looking at, the workgroup size is just 16, so this is not optimal. On mobile GPUs, you might need to consider a wider range of hardware.
The rule of thumb (for desktop) is to use one of three constellations for local_size:
- (64, 1, 1) for 1D
- (8, 8, 1) for 2D
- (4, 4, 4) for 3D

Integer multiples of these are fine too. This should make almost any GPU happy. The maximum limit is 1024 invocations, but I never recommend going that high unless there are very good reasons to.
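In GLSL, these constellations are declared in the shader like so (pick one per shader):

// 1D work, e.g. flat buffers:
layout(local_size_x = 64) in;

// 2D work, e.g. images:
// layout(local_size_x = 8, local_size_y = 8) in;

// 3D work, e.g. volumes:
// layout(local_size_x = 4, local_size_y = 4, local_size_z = 4) in;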
AMD specifics
For the AMD case, v_ instructions are vector instructions, meaning that while it looks like a simple register, there are multiple instances of it, one for every invocation in the subgroup.
s_ instructions are scalar. They run once per subgroup, in parallel with the vector units. Taking advantage of subgroup uniform code can be very powerful.
One way to think of this is that v_ instructions look more like SIMD, except that the SIMD width is much larger than in CPU instruction sets:
addps xmm0, xmm1
where scalar instructions look more like regular CPU instructions:
add eax, ebx
Vector load-store looks more like gather/scatter, and scalar load-store is more like normal CPU load-store.
Shader replacement by modifying SPIR-V assembly
In rare cases, it’s useful to do minor in-place modifications to the SPIR-V itself. As a very ad-hoc sample, we can attempt to clean up the awkward 64-bit pointer math by using OpInBoundsAccessChain instead of OpAccessChain. Select Edit -> Decompile with spirv-dis.
Now replace OpAccessChain with OpInBoundsAccessChain and apply. While keeping the shader tab open, go back to the Pipeline State viewer and look at the GPU ISA:
OpInBoundsAccessChain tells the compiler that we cannot index outside the array, which means negative indices and massively large indices are not allowed, and this allows the compiler to emit the u64 base + u32 offset addressing form. Sure looks much nicer now. This is way beyond what a beginner should care about, but the point is to demonstrate that we can replace raw SPIR-V too, and it also shows how easily you can inspect the SPIR-V of a shader.
Introducing descriptors
We can do a lot with simple buffer device addresses to process data, but there is a limit to how far we can get with that approach if the end goal is game rendering. There are things GPUs can do that raw pointers cannot:
- Efficiently sample textures
- Automatic format conversions
- “Free” bounds checking

CPU ISAs do none of these. In this updated memcpy sample, I introduce two descriptor types, STORAGE_BUFFER and UNIFORM_TEXEL_BUFFER. Using descriptors like this is the “normal” way to use Vulkan, and should be preferred when feasible.
For pragmatic reasons, it’s easier to debug and validate descriptors compared to raw pointers. Raw pointers are also prone to causing GPU crashes, which are very painful and annoying to debug. Unlike CPU debugging, we cannot just catch a SIGSEGV in a debugger and call it a day.
Already, the resources show up in a more convenient way. Even though the shader didn’t specify 8-bit inputs anywhere, it just works. Typed formats have up to 4 components. This is enough to cover RGB and Alpha channels, which of course has its origins in graphics concepts. texelFetch cannot know if we have R8_UINT or R16G16B16A16_UINT for example, so we have to select the .x component in the shader.
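Sketching the shader side of it (binding numbers chosen to match the capture; the actual gist may differ in details):

#version 460

layout(local_size_x = 16) in;

// Plain storage buffer: a STORAGE_BUFFER descriptor.
layout(set = 0, binding = 0, std430) writeonly buffer Dst
{
    uint words[];
};

// Typed, read-only view of a buffer: a UNIFORM_TEXEL_BUFFER descriptor.
// The element format (R8_UINT, R16G16B16A16_UINT, ...) lives in the descriptor, not here.
layout(set = 0, binding = 1) uniform usamplerBuffer src_view;

void main()
{
    uint index = gl_GlobalInvocationID.x;
    // texelFetch always returns a 4-component vector, so grab .x.
    words[index] = texelFetch(src_view, int(index)).x;
}

With descriptors in play, the RADV ISA changes character accordingly: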
s_mov_b32 s0, s3
; On RADV, descriptors live in a fixed 32-bit VA region:
; 0xffff8000'xxxxxxxx
; Since this is known by compiler, we only need to pass down 32-bits
; and synthesize the upper half with s_movk_i32 (0x8000 is sign-extended).
s_movk_i32 s3, 0x8000
; STORAGE_BUFFER and TEXEL_BUFFER are both 16 bytes.
; Both are loaded here in one go into scalar registers.
; Notice the s_load. These loads go into the constant cache,
; and are almost "free". The same load is shared by all threads
; in the subgroup.
s_load_dwordx8 s[8:15], s[2:3], null
v_lshl_add_u32 v0, s0, 4, v0
; Wait until the scalar load completes so we can use the descriptor.
s_waitcnt lgkmcnt(0)
; Typed load. The descriptor holds information about e.g. R8_UINT.
; Bounds checking is also free.
buffer_load_format_x v1, v0, s[12:15], 0 idxen
v_lshlrev_b32_e32 v0, 2, v0
s_waitcnt vmcnt(0)
; Store, but uses a descriptor instead.
; This is automatically bounds checked and is free on AMD.
buffer_store_dword v1, v0, s[8:11], 0 offen
s_endpgm
The basic Vulkan binding model
Granite implements a rather old school binding model, but I think this model is overall very easy to understand and use. More modern bindless design is introduced later when it becomes relevant.
In the shader, I declare things like layout(set = 0, binding = 1). In Granite, this simply means that we have to bind a resource of the appropriate type before dispatching, e.g.:
cmd->set_buffer_view(/* set */ 0, /* binding */ 1, *buffer_view);
This papers over a ton of concepts, and makes it very easy and convenient to program. In reality, there are a lot of API objects in play here. When the compiler sees layout(set = 0, binding = 1) for example, it needs to check against the VkPipelineLayout provided at pipeline creation. A set denotes a group of resources that are bound together as one contiguous entity. In the ISA, the descriptor set for set = 0 ends up being handed to the shader in a particular pair of scalar registers:
s_load_dwordx8 s[8:15], s[2:3], null
The VkPipelineLayout also contains information about what e.g. binding = 1 means. In this case, the driver happened to decide that binding = 0 is at offset 0, and binding = 1 is at offset 16. Since these descriptors are adjacent in memory we got a lucky optimization where we load 32 bytes at once.
On the API side, we need a compatible VkPipelineLayout object when recording the command buffer to ensure that everything lines up. Granite does this automatically, through shader reflection, which synthesizes a working layout for us.
Based on the contained VkDescriptorSetLayout inside the pipeline layout, it knows how to allocate a VkDescriptorSet from a VkDescriptorPool and write descriptors to it. Then it can bind the descriptor set to the command buffer before dispatching. We can see all of this in effect in the capture. Turn off the Filter and we get:
The descriptor set is updated, then later bound. In reality, vkCmdBindDescriptorSets just ends up being a 32-bit push constant, which the shader reads through the s[2:3] register pair in the ISA above.
Deeper understanding with VK_EXT_descriptor_buffer
Managing descriptors is always a point of contention in Vulkan programming if you’re writing against the raw API. There’s a ton of concepts to juggle and it’s mostly pretty dull stuff.
As an extension of the original old school model I outlined above, it’s possible to treat a descriptor set as raw memory, which gets rid of a ton of jank. Granite supports this model by opting in to it. Change the sample to opt in and recapture:
if (!ctx.init_instance_and_device(
nullptr, 0, nullptr, 0,
Vulkan::CONTEXT_CREATION_ENABLE_ROBUSTNESS_2_BIT |
Vulkan::CONTEXT_CREATION_ENABLE_DESCRIPTOR_BUFFER_BIT))
return EXIT_FAILURE;
Make sure to build the test in release mode and not a debug build, otherwise descriptor buffers are disabled. Note that this requires a recent build of RenderDoc; the latest stable v1.40 release supports descriptor buffers.
Now we explicitly tell the driver that the descriptor set lives at offset 0 from the bound buffer. If we then inspect the bound descriptor buffer …
Now we can see the raw guts of the storage buffer and texel buffer being encoded. You can even see the 0x40 and 0x10 being encoded there, which correspond to the sizes of the descriptors.
Porting ShaderToy shaders to compute
To get something interesting on screen to end this bringup exercise, we could port some shadertoy shaders. These are super convenient since many of them don’t require anything fancy to run, like external textures. I picked a shadertoy arbitrarily. Store it to mandelbulb.glsl, and then we replace our shader with a mandelbulb.comp that calls into the shadertoy code:
#version 460
// We're writing 2D images now, so this makes more sense.
layout(local_size_x = 8, local_size_y = 8) in;
layout(set = 0, binding = 0) writeonly uniform image2D output_image;
// Constants used by shadertoys.
layout(push_constant) uniform Registers
{
vec2 iResolution;
float iTime;
};
#include "mandelbulb.glsl"
void main()
{
vec4 color;
mainImage(color, vec2(gl_GlobalInvocationID.xy) + 0.5);
// Stores the result to texture.
imageStore(output_image, ivec2(gl_GlobalInvocationID.xy), color);
}
On the API side, we simply need to create a storage texture and bind it to the shader.
Just with this simple setup, you can go completely nuts and play around with the more math-heavy side of graphics if you want.
Where to go from here?
From here, I think the natural evolution is to learn about:
- Atomics
  - Living in a world without mutexes: lockless programming with millions of threads
- Shared memory
- Subgroup operations
  - Case study: Scalarization
  - Case study: Bindless and non-uniform indexing of descriptors
- Texture sampling and mip-mapping
  - The bread and butter of graphics
  - Case study: do some simple image processing with simple filters
- Memory coherency and how to communicate with other workgroups
  - Case study: Single pass down-sampling
- If relevant, start porting over the code to more shading languages
- API synchronization and how to keep the CPU and GPU pipelined
- … and maybe only then start looking at getting some images on screen (with compute)
- Bring up your own Vulkan code from scratch to get rid of the training wheels and make sure you understand how everything comes together

After that, it’s a matter of learning the common algorithms that show up all the time, like parallel scans, classification, binning, etc. This naturally leads to indirect dispatches, and once these concepts are in place, we can design a very simple compute shader rasterizer that renders some simple glTF models. Only when those concepts land do we consider the graphics pipeline.