## Introduction
The NVIDIA official utility nvidia-smi provides a lot of useful information about the GPU. It is built on top of the NVIDIA Management Library (NVML), which provides a set of APIs for monitoring the statistics of NVIDIA GPUs. In practice, sometimes we would like to monitor the GPU statistics in our custom applications.
In this blog post, I would like to discuss how to use the NVIDIA NVML library to monitor GPU statistics and replicate `nvidia-smi dmon` in a custom C++ application.
## NVIDIA NVML GPU Statistics
### NVIDIA-SMI DMON
`nvidia-smi dmon` will display basic GPU statistics, including power (`pwr`), GPU temperature (`gtemp`), memory temperature (`mtemp`), GPU utilization (`sm`) (the percentage of time that at least one SM is being used), memory utilization (`mem`), encoder utilization (`enc`), decoder utilization (`dec`), JPEG utilization (`jpg`), OFA utilization (`ofa`), memory clock (`mclk`), and graphics clock (`pclk`). The following is an example of `nvidia-smi dmon` output:
```
$ nvidia-smi dmon
# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec   jpg   ofa   mclk   pclk
# Idx     W      C      C     %     %     %     %     %     %    MHz    MHz
    0     8     42      -     1    11     0     0     0     0    405    502
    0    14     43      -     0     1     0     0     0     0   7001   1492
    0    15     43      -     0     1     0     0     0     0   7001   1492
```
In addition, `nvidia-smi dmon` can display GPU Performance Metrics (GPM) for Hopper and later GPUs. The following example shows how to display the GPM metrics for GPU activity (`gract`) (same as the `sm` metric), SM utilization (`smutil`) (the percentage of SMs that are actively being used), and FP16 activity (`fp16`).
```
$ nvidia-smi dmon --gpm-metrics 1,2,13
# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec   jpg   ofa   mclk   pclk  gract smutil   fp16
# Idx     W      C      C     %     %     %     %     %     %    MHz    MHz  GPM:%  GPM:%  GPM:%
    0    12     43      -     0     1     0     0     0     0   7001    682      -      -      -
    0    12     43      -     0     1     0     0     0     0    810    532      -      -      -
    0    11     43      -     2     7     0     0     0     0    810    495      1      1      0
    0     9     43      -     1     7     0     0     0     0    810    502      9      8      0
```
The other GPM metrics can be queried using `nvidia-smi dmon --help`.
```
$ nvidia-smi dmon --help

    GPU statistics are displayed in scrolling format with one line
    per sampling interval. Metrics to be monitored can be adjusted
    based on the width of terminal window. Monitoring is limited to
    a maximum of 16 devices. If no devices are specified, then up to
    first 16 supported devices under natural enumeration (starting
    with GPU index 0) are used for monitoring purpose.
    It is supported on Tesla, GRID, Quadro and limited GeForce products
    for Kepler or newer GPUs under x64 and ppc64 bare metal Linux.

    Note: On MIG-enabled GPUs, querying the utilization of encoder,
    decoder, jpeg, ofa, gpu, and memory is not currently supported.

    Usage: nvidia-smi dmon [options]

    Options include:
    [-i | --id]:          Comma separated Enumeration index, PCI bus ID or UUID
    [-d | --delay]:       Collection delay/interval in seconds [default=1sec]
    [-c | --count]:       Collect specified number of samples and exit
    [-s | --select]:      One or more metrics [default=puc]
                          Can be any of the following:
                              p - Power Usage and Temperature
                              u - Utilization
                              c - Proc and Mem Clocks
                              v - Power and Thermal Violations
                              m - FB, Bar1 and CC Protected Memory
                              e - ECC Errors and PCIe Replay errors
                              t - PCIe Rx and Tx Throughput
    [N/A | --gpm-metrics]: Comma-separated list of GPM metrics (no space in between) to watch
                          Available metrics:
                              Graphics Activity = 1
                              SM Activity = 2
                              SM Occupancy = 3
                              Integer Activity = 4
                              Tensor Activity = 5
                              DFMA Tensor Activity = 6
                              HMMA Tensor Activity = 7
                              IMMA Tensor Activity = 9
                              DRAM Activity = 10
                              FP64 Activity = 11
                              FP32 Activity = 12
                              FP16 Activity = 13
                              PCIe TX = 20
                              PCIe RX = 21
                              NVDEC 0-7 Activity = 30-37
                              NVJPG 0-7 Activity = 40-47
                              NVOFA 0 Activity = 50
                              NVLink Total RX = 60
                              NVLink Total TX = 61
                              NVLink L0-17 RX = 62,64,66,...,96
                              NVLink L0-17 TX = 63,65,67,...,97
                              C2C TOTAL TX = 100
                              C2C TOTAL RX = 101
                              C2C DATA TX = 102
                              C2C DATA RX = 103
                              C2C LINK0-13 TOTAL TX = 104,108,112,...,156
                              C2C LINK0-13 TOTAL RX = 105,109,113,...,157
                              C2C LINK0-13 DATA TX = 106,110,114,...,158
                              C2C LINK0-13 DATA RX = 107,111,115,...,159
                              HOSTMEM CACHE HIT = 160
                              HOSTMEM CACHE MISS = 161
                              PEERMEM CACHE HIT = 162
                              PEERMEM CACHE MISS = 163
                              DRAM CACHE HIT = 164
                              DRAM CACHE MISS = 165
                              NVENC 0-3 Activity = 166-169
                              GR0-7 CTXSW CYCLES ELAPSED = 170,175,180,...,205
                              GR0-7 CTXSW CYCLES ACTIVE = 171,176,181,...,206
                              GR0-7 CTXSW REQUESTS = 172,177,182,...,207
                              GR0-7 CTXSW ACTIVE AVERAGE = 173,178,183,...,208
                              GR0-7 CTXSW ACTIVE PERCENT = 174,179,184,...,209
    [N/A | --gpm-options]: options of which level of GPM metrics to monitor:
                              d - Display Device level GPM Metrics only
                              m - Display MIG level GPM Metrics only
                              dm - Display both Device and MIG level GPM Metrics only
                              md - Display both Device and MIG level GPM Metrics only
    [-o | --options]:     One or more from the following:
                              D - Include Date (YYYYMMDD) in scrolling output
                              T - Include Time (HH:MM:SS) in scrolling output
    [-f | --filename]:    Log to a specified file, rather than to stdout
    [-h | --help]:        Display help information
    [N/A | --format]:     Output format specifiers:
                              csv - Format dmon output as a CSV
                              nounit - Remove units line from dmon output
                              noheader - Remove heading line from dmon output
```
### GPU Stats Using NVIDIA NVML
It turns out that we can query the basic GPU statistics, including `sm`, `mem`, `enc`, `dec`, `jpg`, and `ofa`, using the [nvmlDeviceGetProcessesUtilizationInfo](https://docs.nvidia.com/deploy/nvml-api/group__nvmlDeviceQueries.html#group__nvmlDeviceQueries_1g0fe806054e54932231596396ea8f9c12) API, and the additional GPM statistics, including every GPM metric listed in `nvidia-smi dmon --help`, using the [nvmlGpmMetricsGet](https://docs.nvidia.com/deploy/nvml-api/group__nvmlGpmFunctions.html#group__nvmlGpmFunctions_1g0f0408fc31522711493960dd7b47ba44) API. All the GPM metric IDs can be found in the [nvmlGpmMetricId_t](https://docs.nvidia.com/deploy/nvml-api/group__nvmlGpmEnums.html) definition. For example, the GPM metric ID for `gract` is `NVML_GPM_METRIC_GRAPHICS_UTIL = 1`, for `smutil` it is `NVML_GPM_METRIC_SM_UTIL = 2`, and for `fp16` it is `NVML_GPM_METRIC_FP16_UTIL = 13`.
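Before looking at the full program, the sketch below shows the core GPM flow in isolation: allocate two samples, take them a short interval apart, and then ask NVML to compute a metric over that interval. This is only a minimal sketch with error handling omitted; it assumes NVML has already been initialized, that `device` is a valid handle for a Hopper-or-later GPU, and the 100 ms sampling interval is an arbitrary choice for illustration.

```
// Minimal sketch of the two-sample GPM query flow (error handling omitted).
// Assumes nvmlInit() has already succeeded and `device` is a valid handle
// for a Hopper-or-later GPU; see the full gpu_stats program below for a
// robust version.
#include <chrono>
#include <thread>

#include <nvml.h>

double queryGraphicsActivity(nvmlDevice_t device)
{
    nvmlGpmSample_t sample1{}, sample2{};
    nvmlGpmSampleAlloc(&sample1);
    nvmlGpmSampleAlloc(&sample2);

    // GPM metrics are computed over the interval bounded by two samples.
    nvmlGpmSampleGet(device, sample1);
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    nvmlGpmSampleGet(device, sample2);

    nvmlGpmMetricsGet_t metricsGet{};
    metricsGet.version = NVML_GPM_METRICS_GET_VERSION;
    metricsGet.numMetrics = 1;
    metricsGet.sample1 = sample1;
    metricsGet.sample2 = sample2;
    metricsGet.metrics[0].metricId = NVML_GPM_METRIC_GRAPHICS_UTIL; // gract
    nvmlGpmMetricsGet(&metricsGet);

    double const gract{metricsGet.metrics[0].value};
    nvmlGpmSampleFree(sample1);
    nvmlGpmSampleFree(sample2);
    return gract;
}
```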
The following `gpu_stats` program demonstrates the usage of the NVIDIA NVML library APIs mentioned above and produces the same output format as `nvidia-smi dmon`. The source code is also available in the ["NVIDIA NVML GPU Statistics"](https://github.com/leimao/NVIDIA-NVML-GPU-Statistics) repository on GitHub.
gpu_stats.cpp
```
#include <algorithm>
#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iomanip>
#include <iostream>
#include <map>
#include <memory>
#include <sstream>
#include <string>
#include <thread>
#include <vector>

#include <nvml.h>

// Map from GPM metric ID to the short column name used by nvidia-smi dmon.
std::map<int, std::string> gpmMetricNames = {
    {1, "gract"},   {2, "smutil"}, {3, "smoccu"}, {4, "intact"}, {5, "tenact"},
    {6, "dfmact"},  {7, "hmmact"}, {9, "immact"}, {10, "dramac"}, {11, "fp64"},
    {12, "fp32"},   {13, "fp16"},  {20, "pcitx"}, {21, "pcirx"}, {30, "nvd0"},
    {31, "nvd1"},   {32, "nvd2"},  {33, "nvd3"},  {34, "nvd4"},  {35, "nvd5"},
    {36, "nvd6"},   {37, "nvd7"},  {40, "nvj0"},  {41, "nvj1"},  {42, "nvj2"},
    {43, "nvj3"},   {44, "nvj4"},  {45, "nvj5"},  {46, "nvj6"},  {47, "nvj7"},
    {50, "ofa0"},   {60, "nvlrx"}, {61, "nvltx"}};

struct GPUStats
{
    unsigned int power;
    unsigned int gpuTemp;
    int memTemp;
    unsigned int smUtil;
    unsigned int memUtil;
    unsigned int encUtil;
    unsigned int decUtil;
    unsigned int jpgUtil;
    unsigned int ofaUtil;
    unsigned int memClock;
    unsigned int smClock;
    std::map<int, double> gpmMetrics;
};

void printError(char const* func, nvmlReturn_t const result)
{
    std::cerr << "Error in " << func << ": " << nvmlErrorString(result)
              << std::endl;
}

bool getUtilization(nvmlDevice_t const device, GPUStats& stats)
{
    stats.smUtil = 0;
    stats.memUtil = 0;
    stats.encUtil = 0;
    stats.decUtil = 0;
    stats.jpgUtil = 0;
    stats.ofaUtil = 0;

    // Prefer per-process utilization, which reports sm, mem, enc, dec, jpg,
    // and ofa in one call. The first call returns the required buffer size.
    nvmlProcessesUtilizationInfo_t procUtilInfo{};
    memset(&procUtilInfo, 0, sizeof(procUtilInfo));
    procUtilInfo.version = nvmlProcessesUtilizationInfo_v1;
    procUtilInfo.lastSeenTimeStamp = 0;
    nvmlReturn_t result{
        nvmlDeviceGetProcessesUtilizationInfo(device, &procUtilInfo)};
    if (result == NVML_ERROR_INSUFFICIENT_SIZE &&
        procUtilInfo.processSamplesCount > 0)
    {
        std::vector<nvmlProcessUtilizationInfo_v1_t> procUtilArray(
            procUtilInfo.processSamplesCount);
        procUtilInfo.procUtilArray = procUtilArray.data();
        result = nvmlDeviceGetProcessesUtilizationInfo(device, &procUtilInfo);
        if (result == NVML_SUCCESS)
        {
            for (unsigned int i{0}; i < procUtilInfo.processSamplesCount; ++i)
            {
                stats.smUtil = std::max(stats.smUtil, procUtilArray[i].smUtil);
                stats.memUtil =
                    std::max(stats.memUtil, procUtilArray[i].memUtil);
                stats.encUtil =
                    std::max(stats.encUtil, procUtilArray[i].encUtil);
                stats.decUtil =
                    std::max(stats.decUtil, procUtilArray[i].decUtil);
                stats.jpgUtil =
                    std::max(stats.jpgUtil, procUtilArray[i].jpgUtil);
                stats.ofaUtil =
                    std::max(stats.ofaUtil, procUtilArray[i].ofaUtil);
            }
            return true;
        }
    }

    // Fall back to device-level utilization rates.
    nvmlUtilization_t utilization{};
    result = nvmlDeviceGetUtilizationRates(device, &utilization);
    if (result == NVML_SUCCESS)
    {
        stats.smUtil = utilization.gpu;
        stats.memUtil = utilization.memory;
    }

    unsigned int encoderUtil{}, encoderSamplingPeriod{};
    result = nvmlDeviceGetEncoderUtilization(device, &encoderUtil,
                                             &encoderSamplingPeriod);
    if (result == NVML_SUCCESS)
    {
        stats.encUtil = encoderUtil;
    }

    unsigned int decoderUtil{}, decoderSamplingPeriod{};
    result = nvmlDeviceGetDecoderUtilization(device, &decoderUtil,
                                             &decoderSamplingPeriod);
    if (result == NVML_SUCCESS)
    {
        stats.decUtil = decoderUtil;
    }

    return true;
}

bool getGPUStats(nvmlDevice_t const device, GPUStats& stats)
{
    nvmlReturn_t result{};

    // Power is reported in milliwatts; convert to watts.
    result = nvmlDeviceGetPowerUsage(device, &stats.power);
    if (result != NVML_SUCCESS)
    {
        stats.power = 0;
    }
    else
    {
        stats.power /= 1000;
    }

    result = nvmlDeviceGetTemperature(device, NVML_TEMPERATURE_GPU,
                                      &stats.gpuTemp);
    if (result != NVML_SUCCESS)
    {
        stats.gpuTemp = 0;
    }

    // Memory temperature is not queried; it is printed as "-".
    stats.memTemp = -1;

    getUtilization(device, stats);

    result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_MEM, &stats.memClock);
    if (result != NVML_SUCCESS)
    {
        stats.memClock = 0;
    }
    result = nvmlDeviceGetClockInfo(device, NVML_CLOCK_SM, &stats.smClock);
    if (result != NVML_SUCCESS)
    {
        stats.smClock = 0;
    }

    return true;
}

bool getGPMMetrics(nvmlDevice_t const device,
                   std::vector<int> const& metricIds, GPUStats& stats)
{
    if (metricIds.empty())
    {
        return true;
    }

    // GPM samples are opaque handles that must be released with
    // nvmlGpmSampleFree.
    auto gpmSampleDeleter = [](nvmlGpmSample_t* sample)
    {
        if (sample && *sample)
        {
            nvmlGpmSampleFree(*sample);
        }
        delete sample;
    };

    std::unique_ptr<nvmlGpmSample_t, decltype(gpmSampleDeleter)> sample1(
        new nvmlGpmSample_t{}, gpmSampleDeleter);
    nvmlReturn_t result{nvmlGpmSampleAlloc(sample1.get())};
    if (result != NVML_SUCCESS)
    {
        for (int id : metricIds)
        {
            stats.gpmMetrics[id] = -1.0;
        }
        return false;
    }

    std::unique_ptr<nvmlGpmSample_t, decltype(gpmSampleDeleter)> sample2(
        new nvmlGpmSample_t{}, gpmSampleDeleter);
    result = nvmlGpmSampleAlloc(sample2.get());
    if (result != NVML_SUCCESS)
    {
        for (int id : metricIds)
        {
            stats.gpmMetrics[id] = -1.0;
        }
        return false;
    }

    // GPM metrics are computed over the interval between two samples.
    result = nvmlGpmSampleGet(device, *sample1);
    if (result != NVML_SUCCESS)
    {
        for (int id : metricIds)
        {
            stats.gpmMetrics[id] = -1.0;
        }
        return false;
    }
    std::this_thread::sleep_for(std::chrono::milliseconds(100));
    result = nvmlGpmSampleGet(device, *sample2);
    if (result != NVML_SUCCESS)
    {
        for (int id : metricIds)
        {
            stats.gpmMetrics[id] = -1.0;
        }
        return false;
    }

    nvmlGpmMetricsGet_t metricsGet{};
    memset(&metricsGet, 0, sizeof(metricsGet));
    metricsGet.version = NVML_GPM_METRICS_GET_VERSION;
    metricsGet.numMetrics = metricIds.size();
    metricsGet.sample1 = *sample1;
    metricsGet.sample2 = *sample2;
    for (size_t i{0}; i < metricIds.size() && i < 210; ++i)
    {
        metricsGet.metrics[i].metricId =
            static_cast<nvmlGpmMetricId_t>(metricIds[i]);
    }

    result = nvmlGpmMetricsGet(&metricsGet);
    if (result == NVML_SUCCESS)
    {
        for (size_t i{0}; i < metricIds.size(); ++i)
        {
            stats.gpmMetrics[metricIds[i]] = metricsGet.metrics[i].value;
        }
    }
    else
    {
        for (int id : metricIds)
        {
            stats.gpmMetrics[id] = -1.0;
        }
    }

    return result == NVML_SUCCESS;
}

void printHeader(std::vector<int> const& gpmMetricIds)
{
    std::cout << "# gpu    pwr  gtemp  mtemp     sm    mem    enc    dec"
                 "    jpg    ofa   mclk   pclk";
    for (int id : gpmMetricIds)
    {
        if (gpmMetricNames.find(id) != gpmMetricNames.end())
        {
            std::cout << std::setw(11) << gpmMetricNames[id];
        }
    }
    std::cout << std::endl;
    std::cout << "# Idx      W      C      C      %      %      %      %"
                 "      %      %    MHz    MHz";
    for (size_t i{0}; i < gpmMetricIds.size(); ++i)
    {
        std::cout << " GPM:%";
    }
    std::cout << std::endl;
}

void printStats(unsigned int const deviceId, GPUStats const& stats,
                std::vector<int> const& gpmMetricIds)
{
    std::cout << std::setw(5) << deviceId;
    std::cout << std::setw(7) << stats.power;
    std::cout << std::setw(7) << stats.gpuTemp;
    if (stats.memTemp >= 0)
    {
        std::cout << std::setw(7) << stats.memTemp;
    }
    else
    {
        std::cout << std::setw(7) << "-";
    }
    std::cout << std::setw(7) << stats.smUtil;
    std::cout << std::setw(7) << stats.memUtil;
    std::cout << std::setw(7) << stats.encUtil;
    std::cout << std::setw(7) << stats.decUtil;
    std::cout << std::setw(7) << stats.jpgUtil;
    std::cout << std::setw(7) << stats.ofaUtil;
    std::cout << std::setw(7) << stats.memClock;
    std::cout << std::setw(7) << stats.smClock;
    for (int id : gpmMetricIds)
    {
        if (stats.gpmMetrics.find(id) != stats.gpmMetrics.end())
        {
            double const value{stats.gpmMetrics.at(id)};
            if (value < 0)
            {
                std::cout << std::setw(11) << "-";
            }
            else
            {
                std::cout << std::setw(11) << static_cast<int>(value);
            }
        }
        else
        {
            std::cout << std::setw(11) << "-";
        }
    }
    std::cout << std::endl;
}

std::vector<int> parseGpmMetrics(std::string const& str)
{
    std::vector<int> metrics{};
    std::stringstream ss{str};
    std::string token{};
    while (std::getline(ss, token, ','))
    {
        try
        {
            metrics.push_back(std::stoi(token));
        }
        catch (...)
        {
            std::cerr << "Invalid GPM metric ID: " << token << std::endl;
        }
    }
    return metrics;
}

int main(int argc, char* argv[])
{
    std::vector<int> gpmMetricIds{};
    int delay{1};
    int count{-1};

    for (int i{1}; i < argc; ++i)
    {
        std::string const arg{argv[i]};
        if (arg == "--gpm-metrics" && i + 1 < argc)
        {
            gpmMetricIds = parseGpmMetrics(argv[++i]);
        }
        else if ((arg == "-d" || arg == "--delay") && i + 1 < argc)
        {
            delay = std::atoi(argv[++i]);
        }
        else if ((arg == "-c" || arg == "--count") && i + 1 < argc)
        {
            count = std::atoi(argv[++i]);
        }
        else if (arg == "-h" || arg == "--help")
        {
            std::cout << "Usage: " << argv[0] << " [options]" << std::endl
                      << "Options:" << std::endl
                      << "  --gpm-metrics <ids>  Comma-separated list of GPM "
                         "metric IDs"
                      << std::endl
                      << "  -d, --delay <sec>    Collection delay/interval in "
                         "seconds [default=1]"
                      << std::endl
                      << "  -c, --count <n>      Collect specified number of "
                         "samples and exit"
                      << std::endl
                      << "  -h, --help           Display this help"
                      << std::endl;
            return 0;
        }
    }

    nvmlReturn_t result{nvmlInit()};
    if (result != NVML_SUCCESS)
    {
        printError("nvmlInit", result);
        return 1;
    }

    unsigned int deviceCount{};
    result = nvmlDeviceGetCount(&deviceCount);
    if (result != NVML_SUCCESS)
    {
        printError("nvmlDeviceGetCount", result);
        nvmlShutdown();
        return 1;
    }
    if (deviceCount == 0)
    {
        std::cerr << "No NVIDIA GPUs found" << std::endl;
        nvmlShutdown();
        return 1;
    }

    std::vector<nvmlDevice_t> devices(deviceCount);
    for (unsigned int i{0}; i < deviceCount; ++i)
    {
        result = nvmlDeviceGetHandleByIndex(i, &devices[i]);
        if (result != NVML_SUCCESS)
        {
            printError("nvmlDeviceGetHandleByIndex", result);
            nvmlShutdown();
            return 1;
        }
    }

    printHeader(gpmMetricIds);

    int iteration{0};
    while (count < 0 || iteration < count)
    {
        for (unsigned int i{0}; i < deviceCount; ++i)
        {
            GPUStats stats{};
            getGPUStats(devices[i], stats);
            if (!gpmMetricIds.empty())
            {
                getGPMMetrics(devices[i], gpmMetricIds, stats);
            }
            printStats(i, stats, gpmMetricIds);
        }
        iteration++;
        if (count < 0 || iteration < count)
        {
            std::this_thread::sleep_for(std::chrono::seconds(delay));
        }
    }

    nvmlShutdown();
    return 0;
}
```
The `gpu_stats` program can be built and run using the following commands.
```
$ g++ -o gpu_stats gpu_stats.cpp -I/usr/local/cuda/include -L/usr/local/cuda/lib64 -lnvidia-ml
$ ./gpu_stats --gpm-metrics 1,2,13
# gpu   pwr  gtemp  mtemp    sm   mem   enc   dec   jpg   ofa   mclk   pclk  gract smutil   fp16
# Idx     W      C      C     %     %     %     %     %     %    MHz    MHz  GPM:%  GPM:%  GPM:%
    0    19     34      -     3     2     0     0     0     0   7001    712     13     11      0
    0    19     34      -     3     2     0     0     0     0   7001    637     11      8      0
    0    18     34      -     3     2     0     0     0     0   7001    667      9     10      0
    0    19     34      -     3     2     0     0     0     0   7001    577      8     10      0
    0    18     34      -     3     2     0     0     0     0   7001    615      7      8      0
```
## References
- [NVIDIA Management Library](https://docs.nvidia.com/deploy/nvml-api/index.html)
- [NVML GPM API](https://docs.nvidia.com/deploy/nvml-api/group__GPM.html#group__GPM)