Yusheng Zheng=, Tong Yu=, Yiwei Yang=
As a revolutionary technology that provides programmability in the kernel, eBPF has achieved tremendous success in CPU observability, networking, and security. However, for the increasingly important field of GPU computing, we also need a flexible and efficient means of observation. Currently, most GPU performance analysis tools are limited to observing from the CPU side through drivers/user-space APIs or vendor-specific performance analysis interfaces (like CUPTI), making it difficult to gain deep insights into the internal execution of the GPU. To address this, bpftime provides GPU support through its CUDA/SYCL attachment implementation, enabling eBPF programs to execute within GPU kernels on NVIDIA and AMD GPUs. This brings eBPF’s programmability, observability, and customization capabilities to GPU computing workloads. By doing so, it enables real-time profiling, debugging, and runtime extension of GPU applications without source code modification, addressing a gap in the current observability landscape.
Note: GPU support is still experimental. For questions or suggestions, open an issue or contact us.
The Problem: GPU Observability Challenges
The Core Challenge: An Opaque Execution Model
GPUs have become the dominant accelerators for machine learning, scientific computing, and high-performance computing workloads, but their SIMT (Single Instruction, Multiple Thread) execution model introduces significant observability and extensibility challenges. Modern GPUs organize thousands of threads into warps (groups of 32 threads that execute in lockstep) scheduled across many Streaming Multiprocessors, and much of that execution is invisible to host-side tools.

2. Vendor Profilers: Nsight Systems and Nsight Compute

NVIDIA's Nsight tools provide detailed timelines and hardware counter data, but they have several limitations for this purpose:

- No Programmable Collection Logic: There is no way to express custom, on-device conditions such as “only collect data when kernel execution exceeds 100ms.”
- Intrusive Profiling Sessions: These tools require special profiling sessions that perturb workload behavior through counter multiplexing and replay mechanisms, making them unsuitable for always-on continuous telemetry in production; replay-based collection also misses transient anomalies and rare events.
- Lack of On-Device Filtering and Aggregation: Nsight lacks in-situ filtering and aggregation, forcing all raw data to be exported and then post-processed. Large applications with deep async pipelines produce multi-GB traces, and there is no programmable way to adapt sampling logic on the fly based on observed state.
- Limited System Integration: Nsight cannot attach dynamic probes to persistent kernels without a restart, lacks integration with Linux eBPF infrastructure (kprobes/uprobes/tracepoints), and cannot share data structures (maps) across CPU and GPU instrumentation, making it extremely difficult to stitch together causality chains like page fault (host) → delayed launch enqueue → warp stall spike.
- Vendor Lock-in: These are NVIDIA-only tools with no clear path to vendor-neutral deployment across AMD, Intel, or other accelerators in heterogeneous systems.
In practice, developers face slow, iterative root-cause analysis over large traces, miss production issues that don’t reproduce under profiling overhead, and cannot correlate GPU events with their existing production observability stacks (perf, bpftrace, custom eBPF agents) without complex mode-switching into special “profiling sessions.”
3. Research Tools and Interfaces for Fine-Grained Analysis
When deeper visibility than Nsight is required, the industry has explored tools based on binary instrumentation or lower-level interfaces, such as NVIDIA CUPTI, NVBit, and NEUTRINO.
CUPTI (CUDA Profiling Tools Interface): As a mature interface, CUPTI is well-suited for collecting high-level, kernel-granularity metrics (such as start/end times) and hardware performance counters. Projects like xpu-perf have demonstrated its effectiveness in correlating CPU and GPU data. However, when it comes to understanding “why” a kernel is slow, the high-level metrics CUPTI provides are often insufficient.
Binary Instrumentation Tools (NVBit, NEUTRINO): These tools achieve fine-grained, instruction-level observability by instrumenting at the assembly or PTX (NVIDIA GPUs’ intermediate language) level. For instance, NEUTRINO, a contemporaneous research system, uses assembly-level probes to gather data. However, this typically requires programming directly in assembly, which is not only complex but also lacks the safety and portability offered by eBPF. These tools are also usually independent of CPU profilers, which makes unified, cross-layer, multi-device visibility hard to achieve: correlating events across layers and handling clock drift is challenging, and the correlation process may involve multiple data copies that add overhead. NEUTRINO is also session-based rather than designed for always-on monitoring, and it generates large volumes of data that must be post-processed afterwards.
In summary, while existing tools are powerful in certain aspects, they are either too high-level or too cumbersome and isolated. Developers face a difficult trade-off between iteration speed, production safety, and the depth of problem diagnosis, and they cannot easily correlate GPU events with existing CPU observability stacks (like perf, bpftrace).
The Solution: Extending eBPF to the GPU
To overcome the limitations of existing tools, we need a solution that can unify CPU and GPU observability while providing programmable, low-overhead monitoring. The eBPF technology and its implementation in the bpftime project make this possible.
Why eBPF?
To understand why eBPF is the right tool for this challenge, it’s helpful to look at its impact on the CPU world. eBPF (extended Berkeley Packet Filter) is a revolutionary technology in the Linux kernel that allows sandboxed programs to be dynamically loaded to extend kernel capabilities safely. On the CPU side, eBPF has become a cornerstone of modern observability, networking, and security due to its unique combination of programmability, safety, and performance. It enables developers to attach custom logic to thousands of hook points, collecting deep, customized telemetry with minimal overhead. The core idea behind bpftime is to bring this same transformative power to the traditionally opaque world of GPU computing.
By running eBPF programs natively inside GPU kernels, bpftime provides safe, programmable, unified observability and extensibility across the entire stack.

- Unified Cross-Layer Observability: The architecture treats CPU and GPU probes as peers in a unified control plane. Shared BPF maps and ring buffers enable direct data exchange, and dynamic instrumentation works without recompilation or restart. Integration with existing eBPF infrastructure (perf, bpftrace, custom agents) requires no mode-switching. Developers can simultaneously trace CPU-side CUDA API calls via uprobes, kernel driver interactions via kprobes, and GPU-side kernel execution via CUDA probes, all using the same eBPF toolchain and correlating events across the host-device boundary (a minimal sketch follows this list). Example questions now become answerable: “Did the CPU syscall delay at T+50μs cause the GPU kernel to stall at T+150μs?” or “Which CPU threads are launching the kernels that exhibit high warp divergence?” This cross-layer visibility enables root-cause analysis that spans the entire heterogeneous execution stack, from userspace application logic through kernel drivers to GPU hardware behavior, without leaving the production observability workflow.
- Low-Overhead Production Monitoring: Unlike session-based profilers, it enables always-on production monitoring with dynamic load/unload of probes and device-side predicate filtering to reduce overhead.
- Restoring Asynchronous Visibility: It recovers async-mode visibility with per-phase timestamps (H→D at T+200μs, kernel at T+206μs, D→H at T+456μs), exposes GPU-internal details with nanosecond-granularity telemetry for warp execution and memory patterns, and correlates CPU and GPU events without the heavyweight overhead of traditional separate profilers.
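As a minimal sketch of what this unified model can look like, the fragment below defines one BPF map that is updated both from a CPU-side uprobe on cudaLaunchKernel and from a GPU-side probe injected into a CUDA kernel. The attach strings, the map layout, and the assumption that the usual map helpers and atomics behave the same way inside a device probe are illustrative simplifications, not bpftime's exact conventions.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

// One eBPF object, two attach points, one shared map.
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 2); // slot 0: CPU-side launches, slot 1: GPU-side probe hits
    __type(key, u32);
    __type(value, u64);
} counters SEC(".maps");

static __always_inline void bump(u32 slot)
{
    u64 *v = bpf_map_lookup_elem(&counters, &slot);
    if (v)
        __sync_fetch_and_add(v, 1);
}

// CPU side: an ordinary uprobe on the CUDA runtime (attach target is illustrative).
SEC("uprobe/libcudart.so:cudaLaunchKernel")
int on_cpu_launch(struct pt_regs *ctx)
{
    bump(0);
    return 0;
}

// GPU side: a probe injected at the entry of a CUDA kernel; every GPU thread that
// reaches the probe point increments the shared counter (attach string is illustrative).
SEC("kprobe/_Z9vectorAddPKfS0_Pfi")
int on_gpu_entry(void *ctx)
{
    bump(1);
    return 0;
}
```

User space can then read both slots from the same map with standard libbpf calls, so CPU-side and GPU-side telemetry land in one place without format conversion or mode switching.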
bpftime: Running eBPF Natively on the GPU
bpftime’s approach bridges this gap by extending eBPF’s programmability and customization model directly into GPU execution contexts, enabling eBPF programs to run natively inside GPU kernels alongside application workloads. It employs a PTX injection technique, dynamically injecting eBPF programs into the intermediate assembly language (PTX) of NVIDIA GPUs, allowing for direct hooking of GPU threads.
This approach allows us to obtain extremely fine-grained, in-kernel runtime information that is difficult to access with high-level APIs like CUPTI. For example, with bpftime, we can:
- Trace memory access patterns of individual threads or thread blocks: Understand exactly how memory access instructions are executed and identify issues like uncoalesced accesses.
- Observe the scheduling behavior of Streaming Multiprocessors (SMs): See how warps are scheduled and executed on the SMs.
- Analyze in-kernel control flow: Identify the specific branches causing warp divergence and quantify their impact.
bpftime’s PTX injection is not intended to replace CUPTI but to complement its capabilities. When developers need to dive deeper than what high-level APIs expose, an eBPF program running inside the GPU kernel can collect data directly on the device. For example, the following probe body keeps a per-thread execution counter in a GPU-resident BPF array map:

```c
// GPU-side eBPF probe: per-thread execution counter.
// thread_counts is a GPU-resident BPF array map declared elsewhere;
// bpf_get_thread_idx() stands in for bpftime's helper that exposes threadIdx.
int count_thread_executions(void *ctx)
{
    u64 tid_x, tid_y, tid_z;
    bpf_get_thread_idx(&tid_x, &tid_y, &tid_z);
    // Per-thread counter in GPU array map
    u64 *count = bpf_map_lookup_elem(&thread_counts, &tid_x);
    if (count)
        __atomic_add_fetch(count, 1, __ATOMIC_SEQ_CST); // Thread N executed once more
    return 0;
}
```
launchlate measures the time between cudaLaunchKernel() on the CPU and the actual start of kernel execution on the GPU. It reveals hidden queue delays, stream dependencies, and scheduling overhead that make fast kernels slow in production.
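A minimal sketch of how such a measurement can be wired together under bpftime's model is shown below. It assumes a single traced process with one outstanding launch at a time; the attach strings and the bpf_get_globaltimer() helper are illustrative assumptions, and a real tool would also calibrate the offset between the host clock and the GPU's global timer.

```c
#include <vmlinux.h>
#include <bpf/bpf_helpers.h>

char LICENSE[] SEC("license") = "GPL";

// Assumed bpftime GPU helper exposing the device's nanosecond global timer
// (declared here as an extern purely for this sketch).
extern u64 bpf_get_globaltimer(void);

// Timestamp of the most recent cudaLaunchKernel() call (single-slot simplification).
struct {
    __uint(type, BPF_MAP_TYPE_ARRAY);
    __uint(max_entries, 1);
    __type(key, u32);
    __type(value, u64);
} launch_ts SEC(".maps");

// CPU side: record when the launch is enqueued.
SEC("uprobe/libcudart.so:cudaLaunchKernel")
int on_launch(struct pt_regs *ctx)
{
    u32 key = 0;
    u64 ts = bpf_ktime_get_ns();
    bpf_map_update_elem(&launch_ts, &key, &ts, BPF_ANY);
    return 0;
}

// GPU side: at kernel entry, report how long the launch waited before it started running.
SEC("kprobe/_Z9vectorAddPKfS0_Pfi") // illustrative CUDA kernel attach point
int on_kernel_entry(void *ctx)
{
    u32 key = 0;
    u64 *start = bpf_map_lookup_elem(&launch_ts, &key);
    if (!start)
        return 0;
    u64 now = bpf_get_globaltimer();
    bpf_printk("launch-to-start latency: %llu ns", now - *start);
    return 0;
}
```

A production version would restrict reporting to one thread per kernel, key the map by stream or launch ID, and push records through a ring buffer rather than bpf_printk.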
Use case: Each of your kernels executes in 100μs, but users report 50ms latency. launchlate shows 200-500μs of launch latency per kernel because each launch waits for the previous kernel and its memory transfers to complete; for a sequence of ten such kernels, total time is 5ms, not 1ms. You switch to CUDA graphs, batching all the launches, and latency drops to 1.2ms.
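For reference, the remediation in that scenario, capturing the fixed launch sequence once and replaying it as a single CUDA graph, looks roughly like the sketch below; the kernels being enqueued and all error handling are omitted, and run_batched() is just an illustrative wrapper.

```c
#include <cuda_runtime.h>

// Capture the per-iteration launch sequence once, then replay it as one graph launch.
void run_batched(cudaStream_t stream, int iterations)
{
    cudaGraph_t graph;
    cudaGraphExec_t exec;

    cudaStreamBeginCapture(stream, cudaStreamCaptureModeGlobal);
    // ... enqueue the same kernels and async memcpys that were previously
    //     launched one by one into `stream` ...
    cudaStreamEndCapture(stream, &graph);

    // CUDA 11-style signature; newer toolkits take a single flags argument instead.
    cudaGraphInstantiate(&exec, graph, NULL, NULL, 0);

    for (int i = 0; i < iterations; i++)
        cudaGraphLaunch(exec, stream); // one enqueue replays the whole captured batch
    cudaStreamSynchronize(stream);

    cudaGraphExecDestroy(exec);
    cudaGraphDestroy(graph);
}
```

Because the graph is validated and scheduled as a unit, the per-launch overhead that launchlate exposed largely disappears.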