CUDA Profiler for Production Inference

Discussed on Hacker News

Why development-time CUDA profilers don't fit production inference, and what a profiler built for it looks like: low-overhead kernel attribution, visibility into host synchronization waits, and integrated telemetry.
