Compression is a common technique to reduce storage costs and accelerate input/output transfer times across databases, data-center communications, high-performance computing, deep learning, and more. But decompressing that data often introduces latency and consumes valuable compute resources, slowing overall performance.
To address these challenges, NVIDIA introduced the hardware **Decompression Engine (DE)** in the NVIDIA Blackwell architecture and paired it with the nvCOMP library. Together, they offload decompression from general-purpose compute, accelerate widely used formats like Snappy, and make adoption seamless.
This post walks through how the DE and nvCOMP work, the usage guidelines, and the performance benefits they unlock for data-intensive workloads.
How the Decompression Engine works
The new DE in the Blackwell architecture is a fixed-function hardware block designed to accelerate decompression of Snappy, LZ4, and Deflate-based streams. By handling decompression in hardware, the DE frees up valuable streaming multiprocessor (SM) resources for compute, rather than burning cycles on data movement.
Integrated as part of the copy engine, the DE eliminates the need for sequential host-to-device copies followed by software decompression. Instead, compressed data can be transferred directly across PCIe or C2C and decompressed in transit, reducing a major I/O bottleneck.
Beyond raw throughput, the DE enables true concurrency of data movement and compute. Multi-stream workloads can issue decompression operations in parallel with SM kernels, keeping the GPU fully utilized. In practice, this means data-intensive applications such as training LLMs, analyzing massive genomics datasets, or running HPC simulations can keep pace with the bandwidth of next-generation Blackwell GPUs without stalling on I/O.
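To make this overlap concrete, here is a minimal sketch that enqueues decompression on one CUDA stream while an unrelated kernel runs on another. The kernel, the helper function, and the commented-out decompress call are hypothetical placeholders, not from this post.

// Hypothetical kernel standing in for real compute work.
__global__ void my_compute_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

void overlap_decompress_and_compute(float* d_data, int n) {
    cudaStream_t decompress_stream, compute_stream;
    CUDA_CHECK(cudaStreamCreate(&decompress_stream));
    CUDA_CHECK(cudaStreamCreate(&compute_stream));

    // Enqueue decompression on its own stream (for example, an nvCOMP batched
    // decompress call). On DE-capable GPUs this work runs off the SMs.
    // nvcompBatchedSnappyDecompressAsync(..., decompress_stream);  // placeholder

    // Independent compute proceeds concurrently on the other stream.
    my_compute_kernel<<<(n + 255) / 256, 256, 0, compute_stream>>>(d_data, n);

    // Synchronize before consuming results from either stream.
    CUDA_CHECK(cudaStreamSynchronize(decompress_stream));
    CUDA_CHECK(cudaStreamSynchronize(compute_stream));
    CUDA_CHECK(cudaStreamDestroy(decompress_stream));
    CUDA_CHECK(cudaStreamDestroy(compute_stream));
}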
The benefits of nvCOMP’s GPU-accelerated compression
The NVIDIA nvCOMP library provides GPU-accelerated compression and decompression routines. It supports a wide range of standard formats, along with formats that NVIDIA has optimized for the best possible GPU performance.
In the case of standard formats, CPUs and fixed-function hardware frequently have architectural advantages over the GPU because these formats expose limited parallelism. The DE is our solution to this problem for a range of workloads. The following sections discuss how to leverage nvCOMP to use the DE.
How to use DE and nvCOMP
It’s best for developers to leverage the DE through nvCOMP APIs. Since the DE is only available on select GPUs (as of now, the B200, B300, GB200, and GB300), using nvCOMP lets developers write portable code that scales and works across GPUs as the DE footprint evolves over time. When the DE is available, nvCOMP will make use of it without changes to user code. If not, nvCOMP will fall back to its accelerated SM-based implementations.
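For orientation, the sketch below shows what a batched Snappy decompression call through nvCOMP’s low-level C API looks like. It assumes the pointer and size arrays have already been prepared (as discussed later in this post), and the function name and parameter order follow earlier nvCOMP releases; check the nvCOMP headers and samples for the exact signature in nvCOMP 5.0. The same call is used whether the DE or the SM path ends up executing the work.

#include <cuda_runtime.h>
#include <nvcomp/snappy.h>  // nvCOMP low-level batched Snappy API

// Illustrative wrapper: the caller is assumed to have prepared all of the
// device-accessible pointer/size arrays and the temp workspace.
nvcompStatus_t decompress_snappy_batch(
    const void* const* device_compressed_ptrs,       // compressed chunk pointers
    const size_t* device_compressed_bytes,           // compressed chunk sizes
    const size_t* device_uncompressed_bytes,         // expected decompressed sizes
    size_t* device_actual_uncompressed_bytes,        // actual decompressed sizes (output)
    size_t batch_size,
    void* device_temp_ptr, size_t temp_bytes,        // scratch workspace
    void* const* device_uncompressed_ptrs,           // output chunk pointers
    nvcompStatus_t* device_statuses,                 // per-chunk status codes (output)
    cudaStream_t stream)
{
    // nvCOMP decides internally whether the DE or the SM path runs this batch,
    // based on GPU support and the buffer requirements discussed below.
    return nvcompBatchedSnappyDecompressAsync(
        device_compressed_ptrs, device_compressed_bytes,
        device_uncompressed_bytes, device_actual_uncompressed_bytes,
        batch_size, device_temp_ptr, temp_bytes,
        device_uncompressed_ptrs, device_statuses, stream);
}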
There are a few things you need to do to ensure this behavior on DE-enabled GPUs. nvCOMP generally allows input and output buffers of any type that is accessible to the device, but the DE has stricter requirements. If your buffers do not meet these requirements, nvCOMP will execute the decompression on the SMs instead. See Table 1 for a description of the allowed allocation types and their intended usages.
| Allocation API | Intended usage | Memory location |
| --- | --- | --- |
| cudaMalloc | Standard device-only allocation | Device |
| cudaMallocFromPoolAsync | Easy-to-use pool-based allocations with more flexibility | Host/device |
| cuMemCreate | Low-level control of host/device allocations | Host/device |

Table 1. Allowed allocation types and their intended usages
cudaMalloc allocations can be used as normal for device-to-device decompression. Host-to-device or even host-to-host decompression is possible when using cudaMallocFromPoolAsync or cuMemCreate, but care must be taken to set up the allocators properly.
The following sections provide worked examples of how to use these different allocators. Note that in both cases, the only difference from standard use of these APIs is the addition of the cudaMemPoolCreateUsageHwDecompress and CU_MEM_CREATE_USAGE_HW_DECOMPRESS flags. In both examples, the allocations are placed on the first CPU NUMA node.
Using cudaMallocFromPoolAsync
The code example below shows how to create a pinned host memory pool with the cudaMemPoolCreateUsageHwDecompress flag, enabling allocations compatible with the DE.
// Describe a pinned host memory pool on the first CPU NUMA node that is
// eligible for hardware decompression.
cudaMemPoolProps props = {};
props.location.type = cudaMemLocationTypeHostNuma;
props.location.id = 0;
props.allocType = cudaMemAllocationTypePinned;
props.usage = cudaMemPoolCreateUsageHwDecompress;

// Create the pool and allocate a DE-compatible buffer from it.
cudaMemPool_t mem_pool;
CUDA_CHECK(cudaMemPoolCreate(&mem_pool, &props));
char* mem_pool_ptr;
CUDA_CHECK(cudaMallocFromPoolAsync(&mem_pool_ptr, 1024, mem_pool, stream));
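When the buffer is no longer needed, it can be returned to the pool and the pool destroyed with the standard stream-ordered allocator calls; a brief sketch:

// Return the allocation to the pool in stream order, then destroy the pool
// once all allocations from it have been freed.
CUDA_CHECK(cudaFreeAsync(mem_pool_ptr, stream));
CUDA_CHECK(cudaStreamSynchronize(stream));
CUDA_CHECK(cudaMemPoolDestroy(mem_pool));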
Using cuMemCreate
This example demonstrates how to use the low-level CUDA driver API (cuMemCreate) to allocate pinned host memory with the CU_MEM_CREATE_USAGE_HW_DECOMPRESS flag, ensuring the buffer is compatible with the DE.
CUdeviceptr mem_create_ptr;
CUmemGenericAllocationHandle allocHandle;
CUmemAllocationProp props = {};
props.location.type = CU_MEM_LOCATION_TYPE_HOST_NUMA;
props.location.id = 0;
props.type = CU_MEM_ALLOCATION_TYPE_PINNED;
props.allocFlags.usage = CU_MEM_CREATE_USAGE_HW_DECOMPRESS;
size_t granularity;
CU_CHECK(cuMemGetAllocationGranularity(&granularity, &props, CU_MEM_ALLOC_GRANULARITY_MINIMUM));
// Create the allocation handle
CU_CHECK(cuMemCreate(&allocHandle, granularity, &props, 0));
// Reserve virtual address space
CU_CHECK(cuMemAddressReserve(&mem_create_ptr, granularity, 0, 0, 0));
// Map the physical memory to the virtual address
CU_CHECK(cuMemMap(mem_create_ptr, granularity, 0, allocHandle, 0));
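Note that mapping alone does not make the memory usable by the GPU; in a complete program you would typically also grant the device access to the mapped range, and later unmap and release it. A brief sketch of those steps, assuming device 0:

// Grant the GPU read/write access to the mapped range (assuming device 0).
CUmemAccessDesc access_desc = {};
access_desc.location.type = CU_MEM_LOCATION_TYPE_DEVICE;
access_desc.location.id = 0;
access_desc.flags = CU_MEM_ACCESS_FLAGS_PROT_READWRITE;
CU_CHECK(cuMemSetAccess(mem_create_ptr, granularity, &access_desc, 1));

// ... use the buffer for decompression ...

// Teardown: unmap, release the physical allocation, and free the address range.
CU_CHECK(cuMemUnmap(mem_create_ptr, granularity));
CU_CHECK(cuMemRelease(allocHandle));
CU_CHECK(cuMemAddressFree(mem_create_ptr, granularity));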
Best practices for buffer batching
For best performance, the batch of buffers used for decompression (input/output/sizes) should be pointers offset into the same allocations. If the batch is built from many separate allocations, host-side driver launch overhead can become significant.
uint8_t* d_decompressed_buffer;
CUDA_CHECK(cudaMalloc(&d_decompressed_buffer, total_decompressed_size));
// Create pinned host arrays for device decompression pointers
uint8_t** h_d_decompressed_ptrs;
CUDA_CHECK(cudaHostAlloc(&h_d_decompressed_ptrs, actual_num_buffers * sizeof(uint8_t*), cudaHostAllocDefault));
// Fill the pinned host pointer arrays for device decompression using offsets
// into the single d_decompressed_buffer allocation (input_sizes[i] holds the
// expected decompressed size of chunk i).
size_t decompressed_offset = 0;
for (int i = 0; i < actual_num_buffers; ++i) {
    h_d_decompressed_ptrs[i] = d_decompressed_buffer + decompressed_offset;
    decompressed_offset += input_sizes[i];
}
Note that due to synchronization requirements associated with the DE, nvCOMP’s asynchronous APIs will synchronize with the calling stream. nvCOMP generally still returns before the decompression finishes, so if you are decompressing to host memory, you need to synchronize the calling stream again before using the result. For device-side access, the decompressed result is available in normal stream order.
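For example, when decompressing into host memory (as the pool-based and cuMemCreate allocations above allow), the host must wait on the stream before touching the output; a minimal sketch, with the actual decompress call left as a placeholder:

// Enqueue the batched decompression on `stream` here (nvCOMP call omitted).
// Device-side consumers can simply be launched later on the same stream and
// will see the decompressed data in normal stream order.

// Host-side consumers must wait for the stream to drain first.
CUDA_CHECK(cudaStreamSynchronize(stream));
// It is now safe to read the decompressed data from host output buffers.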
On B200, if any buffer is larger than 4 MB, nvCOMP will fall back to an SM-based implementation. This limit might change in the future, and can be queried by the following code:
int max_supported_size = 0;
CUresult res = cuDeviceGetAttribute(&max_supported_size,
                                    CU_DEVICE_ATTRIBUTE_MEM_DECOMPRESS_MAXIMUM_LENGTH,
                                    device_id);
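The queried limit can then be checked against the batch’s chunk sizes to anticipate whether the DE path will be used (chunk_sizes and batch_size are placeholder names):

// Check whether every chunk in the batch fits under the hardware limit.
// If any chunk exceeds it, expect nvCOMP to use the SM fallback instead.
bool de_eligible = true;
for (size_t i = 0; i < batch_size; ++i) {
    if (chunk_sizes[i] > static_cast<size_t>(max_supported_size)) {
        de_eligible = false;
        break;
    }
}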
How SM performance compares to DE
The DE provides faster decompression while freeing the SMs for other work. It has dozens of execution units, compared with the thousands of warps available across the SMs. Each DE execution unit is much faster than an SM at decompression, but in some workloads SM throughput approaches the DE’s when the SM resources are fully saturated. Either the SMs or the DE can execute using host pinned data as input, enabling zero-copy decompression.
The following figure shows SM versus DE performance on the Silesia benchmark for the LZ4, Deflate, and Snappy algorithms. Note that Snappy is newly optimized in nvCOMP 5.0, and further software optimization opportunities remain for Deflate and LZ4.
The performance measurement is done for 64 KiB and 512 KiB chunk sizes using “small” and “large” datasets. The large dataset is the full Silesia corpus, while the small dataset is the first ~50 MB of Silesia.tar.
Figure 1. Comparing the performance of streaming multiprocessors to the Decompression Engine, as shown in six examples.
Get started
The Decompression Engine in Blackwell makes it much easier to deal with one of the biggest challenges in data-heavy workloads: fast, efficient decompression. By moving this work to dedicated hardware, applications not only see faster results but also free up GPU compute for other tasks.
With nvCOMP handling the integration automatically, developers can take advantage of these improvements without changing their code, leading to smoother pipelines and better performance.
To get started with these new features, explore the following resources:
- Learn more about nvCOMP and the hardware Decompression Engine and how to easily integrate them into your existing workflows.
- Learn more about the latest nvCOMP API examples and benchmarks.
- Download the latest version of nvCOMP to get started.