18 Dec, 2025
TL;DR: My CUDA toolkit (13.1) was newer than the CUDA version supported by my driver (13.0), so the driver couldn’t JIT the PTX into GPU machine code. Compiling with my GPU’s compute capability (-arch=sm_86) made nvcc embed native SASS for my card, so the driver could run it directly and my kernels started working.
I was writing a color inversion CUDA kernel today on leetgpu and wanted to try out the kernel on my local system.
Figure 1: Color inversion kernel source code (leetgpu)
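For reference, a minimal sketch of such a kernel (the RGBA layout, the names, and the 255-minus-value inversion are my assumptions here, not the leetgpu original) might look like:

__global__ void invert_colors(unsigned char* image, int num_pixels) {
    // One thread per byte; image is assumed to be RGBA, 4 bytes per pixel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < num_pixels * 4 && i % 4 != 3) {   // skip the alpha channel
        image[i] = 255 - image[i];
    }
}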
Specs:
- GPU: RTX 3050 (Ampere architecture), compute capability 8.6
- OS: Fedora 43
- Driver version: 580.105.08 (CUDA version: 13.0)
- nvcc and related tools version: 13.1
I wrote the program and compiled as usual:
nvcc -o <binary_name> <file_name>.cu
./<binary_name>
Output:
Figure 2: Failed output - kernel "ran" but produced no change
Clearly the output is unchanged from the input, and surprisingly the program ran without any errors or warnings. I became suspicious, so I checked whether my kernel was wrong, but it wasn’t. I ran another file just to check, and here is the output.
Figure 3: Hello world test also failed silently
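In hindsight, the failure was only silent because I never checked the CUDA error state. A quick check after the kernel launch (a sketch; the kernel name and launch configuration are placeholders) would have surfaced the problem immediately:

invert_colors<<<num_blocks, threads_per_block>>>(d_image, num_pixels);
cudaError_t err = cudaGetLastError();      // errors from the launch itself
if (err == cudaSuccess) {
    err = cudaDeviceSynchronize();         // errors raised while the kernel runs
}
if (err != cudaSuccess) {
    printf("CUDA error: %s\n", cudaGetErrorString(err));
}

Presumably this would have reported something like an unsupported-PTX-version or invalid-device-function error instead of silently doing nothing.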
At this point I realized something was wrong, so I asked an LLM about it and figured it out.
Problem
My driver is 580.105.08 and nvidia-smi reports CUDA Version: 13.0, while my nvcc is from CUDA 13.1. nvcc first compiles the code to PTX (Parallel Thread Execution), an intermediate, architecture-independent representation. Since the PTX was generated by the newer 13.1 toolchain, the driver’s PTX JIT compiler (which only supports PTX up to CUDA 13.0) could not understand it and couldn’t turn it into runnable GPU machine code.
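You can spot this kind of mismatch by comparing the two versions yourself:

nvidia-smi       # the "CUDA Version" shown here is the highest CUDA version the driver supports (13.0 for me)
nvcc --version   # the toolkit version that generates the PTX (13.1 for me)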
Under the hood, PTX is meant to be JIT-compiled by the driver into SASS, the architecture-specific machine code that actually runs on the GPU. Because the PTX version was too new for the driver’s JIT, the PTX → SASS step effectively failed, so the GPU side of my program did no useful work even though the code compiled and ran without obvious errors.
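If you want to see what nvcc actually embedded in the executable, cuobjdump (shipped with the toolkit) can dump both forms; what you get depends on your toolkit’s default architecture:

cuobjdump --dump-ptx <binary_name>    # the PTX stored in the fat binary
cuobjdump --dump-sass <binary_name>   # any precompiled SASS and the sm_XX it targets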
I thought, wow, this is a bummer, and was considering downgrading to an older CUDA toolkit since my driver was already the latest. However, there is a smarter way around it.
Solution
Every NVIDIA GPU has a compute capability. Passing it to nvcc tells the compiler to generate native machine code (SASS) for that specific GPU architecture and embed it directly into the binary. That way, the driver can just load and run that SASS without needing to JIT-compile newer PTX at runtime.
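If you are not sure what your GPU’s compute capability is, you can look it up on NVIDIA’s website or query it directly; a small sketch using cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                               // device 0
    printf("Compute capability: %d.%d\n", prop.major, prop.minor);   // 8.6 on my RTX 3050
    return 0;
}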
My RTX 3050 has compute capability 8.6, so I used that in the compile command, and this time the program ran successfully and produced the desired output:
nvcc -arch=sm_86 -o <binary_name> <file_name>.cu
./<binary_name>
Notice the new flag, -arch=sm_86. It specifies the compute capability of my GPU and makes nvcc generate SASS for that architecture and embed it in the executable, so the driver no longer has to understand the newer PTX.
Figure 4: Fixed output after `nvcc -arch=sm_86`
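Two related nvcc options are worth knowing (both standard flags, though I only tested -arch=sm_86 myself): -gencode lets you embed SASS for your card plus PTX for forward compatibility with future GPUs and drivers, and on recent toolkits -arch=native detects the GPU in the machine and targets it automatically:

nvcc -gencode arch=compute_86,code=sm_86 -gencode arch=compute_86,code=compute_86 -o <binary_name> <file_name>.cu
nvcc -arch=native -o <binary_name> <file_name>.cu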
That’s it. Hope you learned something new. Thanks and cheers.