Non-Uniform L2 Cache Latency Across the Streaming Multiprocessors of an NVIDIA L40 (opens in new tab)

The NVIDIA L40 exposes a 96 MiB L2 cache usually modeled as one uniform pool with a single hit latency. We show this is wrong at the granularity a kernel sees: L2-hit latency depends strongly and reproducibly on which physical streaming multiprocessor (SM) issues the load. A turn-serialized, %smid-resolved probe maps the hit latency across all 142 SMs in one launch; it is not a constant near 279 cycles but spans 222-339 cycles (a 52% range), w...

Read the original article