Using Nsight Compute with large codebases - Part 2: Profiling large codebases

tldr;

This section covers how to combine NVTX filtering, targeted metric selection, and kernel selection to reduce ncu profiling overhead on vLLM from 18+ hours to under 1 second, while still extracting roofline metrics for specific kernels without isolating or reproducing them.

What causes `ncu`’s high profiling overheads

The overhead of running ncu on a large codebase like vLLM comes directly from three sources:

The sheer number of kernels profiled.
The number of execution replay passes run per kernel.
The cost of each replay pass.
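As a rough way to see how these three sources interact, the sketch below treats total overhead as their product. This multiplicative model is a simplifying assumption and not something ncu reports, and every number in it is hypothetical, chosen only to show how shrinking each factor compounds.

```python
def estimated_overhead_s(kernels_profiled: int,
                         passes_per_kernel: int,
                         seconds_per_pass: float) -> float:
    """Simplified model: each profiled kernel is replayed once per pass."""
    return kernels_profiled * passes_per_kernel * seconds_per_pass


# Hypothetical inputs (not measurements from vLLM).
unfiltered = estimated_overhead_s(kernels_profiled=100_000,
                                  passes_per_kernel=40,
                                  seconds_per_pass=0.01)
filtered = estimated_overhead_s(kernels_profiled=10,
                                passes_per_kernel=4,
                                seconds_per_pass=0.01)

print(f"unfiltered estimate: {unfiltered / 3600:.1f} hours")
print(f"filtered estimate:   {filtered:.2f} seconds")
```

Under this model, cutting any single factor helps, but the large reductions come from cutting several at once.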

This post will go over how we can use the flags that ncu exposes to reduce each of these costs.
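To make the three techniques concrete, here is a minimal sketch that assembles an ncu command line combining an NVTX range filter, a kernel-name filter, and an explicit metric list. The flags are standard ncu options, but the helper `build_ncu_command`, the range name `decode_step`, the kernel regex, the report name, and the workload script are all hypothetical placeholders. The two metrics are examples only and are not the roofline set used in this post.

```python
import shlex


def build_ncu_command(workload, nvtx_range, kernel_regex,
                      metrics, launch_count, report):
    """Assemble an ncu invocation that profiles only a narrow slice of a run."""
    cmd = [
        "ncu",
        # Only profile kernels launched inside this NVTX push/pop range.
        "--nvtx", "--nvtx-include", f"{nvtx_range}/",
        # Only profile kernels whose name matches the regex.
        "--kernel-name", f"regex:{kernel_regex}",
        # Collect an explicit metric list instead of full sections.
        "--metrics", ",".join(metrics),
        # Stop profiling after this many matching launches.
        "--launch-count", str(launch_count),
        "--export", report,
    ]
    return cmd + list(workload)


cmd = build_ncu_command(
    workload=["python", "serve_model.py"],      # hypothetical script
    nvtx_range="decode_step",                   # hypothetical NVTX range
    kernel_regex="attention",                   # hypothetical kernel filter
    metrics=["gpu__time_duration.sum", "dram__bytes.sum"],
    launch_count=1,
    report="filtered_profile",
)
print(shlex.join(cmd))
```

The NVTX and kernel-name filters reduce how many kernels are profiled, while the short metric list is what keeps the number of replay passes per kernel low.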

What is execution replay

When ncu is profiling a workload it creates sub-group…
