Large Language Models (LLMs) are powerful, but their performance can be bottlenecked by the immense NVIDIA GPU memory footprint of the Key-Value (KV) Cache. This cache, crucial for speeding up LLM inference by storing Key (K) and Value (V) matrices, directly impacts context length, concurrency, and overall system throughput. Our primary goal is to maximize the KV Cache hit ratio by intelligently expanding NVIDIA GPU High Bandwidth Memory (HBM) with a tiered node-local storage solution.
Our collaboration with the LMCache team (Kuntai Du, Jiayi Yao, and Yihua Cheng from Tensormesh) has led to the development of an innovative solution on Google Kubernetes Engine (GKE).
Tiered Storage: Expanding the KV Cache Beyond HBM
LMCache extends the KV Cache from the NVIDIA GPU’s fast HBM (Tier 1) to larger, more cost-effective tiers like CPU RAM and local SSDs. This dramatically increases the total cache size, leading to a higher hit ratio and improved inference performance by keeping more data locally on the accelerator node. For GKE users, this means accommodating models with massive context windows while maintaining excellent performance.
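For example, the tier budgets can be expressed in a small LMCache configuration file. The snippet below is an illustrative sketch assuming LMCache's YAML configuration keys (local_cpu, max_local_cpu_size, local_disk, max_local_disk_size) and the tier sizes used in our experiments later in this post; verify the exact keys and units against the LMCache documentation for your release.

```yaml
# lmcache.yaml -- illustrative tiering sketch (verify key names against the
# LMCache docs for the version you deploy; v0.3.3 in this post)
chunk_size: 256                        # tokens per KV Cache chunk
local_cpu: true                        # enable the CPU RAM tier (Tier 2)
max_local_cpu_size: 1024               # CPU RAM tier budget, in GB
local_disk: "file:///local/kv-cache/"  # local SSD tier (Tier 3); an emptyDir mount in our setup
max_local_disk_size: 5120              # local SSD tier budget, in GB
```

vLLM picks this file up through the LMCACHE_CONFIG_FILE environment variable when it is launched with LMCache's KV-connector integration enabled.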
Performance Benchmarking and Results
We designed tests to measure the performance of this tiered KV Cache by configuring workloads to fill each storage layer (HBM, CPU RAM, Local SSD). We benchmarked these configurations using various context lengths (1k, 5k, 10k, 50k, and 100k tokens), representing diverse use cases such as:
1k - 5k tokens: High-fidelity personas and complex instructions
10k tokens: Average user prompts (small RAG) or web page/article content
50k tokens: Prompt stuffing (packing extensive reference material into a single prompt)
100k tokens: Content equivalent to a long book
Our primary performance indicators were Time to First Token (TTFT), token input throughput, and end-to-end latency. The results highlight the best-performing storage setup for each KV Cache size and the performance improvements achieved.
Experiment Setup
We deployed a vLLM server on an A3 Mega machine, leveraging local SSD for ephemeral storage via emptyDir (a minimal manifest sketch follows the configuration list below).
Hardware: 8 × NVIDIA H100 80GB GPUs (GKE accelerator type nvidia-h100-mega-80gb)
Model: Llama-3.3-70B-Instruct
LMCache version: v0.3.3
Cache Configuration:
HBM only
HBM + CPU RAM
HBM + CPU RAM + Local SSD
Storage Resources: HBM: 640Gi, CPU RAM: 1Ti, Local SSD: 5Ti
Benchmark Tool: SGLang bench_serving
Requests: Tests were conducted with system prompt lengths of 1k, 5k, 10k, 50k, and 100k tokens. Each system prompt provided a shared context for a batch of 20 inference requests, with individual requests consisting of a unique 256-token input and generating a 512-token output.
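The manifest below is a minimal, illustrative sketch of how such a Pod could be defined, assuming the A3 Mega node pool was provisioned with local SSD-backed ephemeral storage (for example via the --ephemeral-storage-local-ssd node pool option) so that emptyDir volumes land on local SSD; the image tag, resource values, and paths are placeholders rather than the exact manifest we used.

```yaml
# Illustrative Pod sketch -- assumes the node pool exposes local SSD as
# ephemeral storage, so the emptyDir volume below is backed by local SSD.
apiVersion: v1
kind: Pod
metadata:
  name: vllm-lmcache
spec:
  containers:
  - name: vllm
    image: vllm/vllm-openai:latest       # placeholder image tag
    env:
    - name: LMCACHE_CONFIG_FILE
      value: /config/lmcache.yaml        # tiering config (see sketch above)
    resources:
      limits:
        nvidia.com/gpu: 8                # Tier 1: 8 x H100 80GB HBM
        memory: 1200Gi                   # headroom for the 1Ti CPU RAM tier
    volumeMounts:
    - name: kv-cache
      mountPath: /local/kv-cache         # matches LMCache's local_disk path (Tier 3)
    - name: lmcache-config
      mountPath: /config
  volumes:
  - name: kv-cache
    emptyDir:
      sizeLimit: 5Ti                     # local SSD tier budget
  - name: lmcache-config
    configMap:
      name: lmcache-config               # placeholder ConfigMap holding lmcache.yaml
```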
Example Command:
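Below is an illustrative sketch of one bench_serving invocation (the 10k-token system prompt case), assuming SGLang's generated-shared-prefix dataset and its gsp-* flags; the host, port, and group count are placeholders, and flag names should be verified against the SGLang version in use.

```bash
# Illustrative bench_serving run for the 10k-token system prompt case:
# each group shares one system prompt across 20 requests, with a 256-token
# question and a 512-token output per request. The number of groups is
# scaled so the total KV Cache reaches the target tier size.
python3 -m sglang.bench_serving \
  --backend vllm \
  --host 127.0.0.1 --port 8000 \
  --model meta-llama/Llama-3.3-70B-Instruct \
  --dataset-name generated-shared-prefix \
  --gsp-num-groups 400 \
  --gsp-prompts-per-group 20 \
  --gsp-system-prompt-len 10000 \
  --gsp-question-len 256 \
  --gsp-output-len 512
```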
Benchmark Results
Our tests explored three total KV Cache sizes, each sized to exercise a different depth of the storage hierarchy. For each, the results below highlight the optimal storage setup and the performance improvements achieved:
Test 1: Cache (1.1M - 1.3M tokens) fits entirely within HBM
Results: In this scenario, adding slower storage tiers provided no advantage, making an HBM-only configuration the optimal setup.
Test 2: Cache (4.0M - 4.3M tokens) exceeds HBM capacity but fits within HBM + CPU RAM
| System Prompt Length (tokens) | Best-performing Storage Setup | Mean TTFT Change (%) vs. HBM only | Input Throughput Change (%) vs. HBM only | Mean End-to-End Latency Change (%) vs. HBM only |
| --- | --- | --- | --- | --- |
| 1000 | HBM | 0% | 0% | 0% |
| 5000 | HBM + CPU RAM | -18% | +16% | -14% |
| 10000 | HBM + CPU RAM | -44% | +50% | -33% |
| 50000 | HBM + CPU RAM + Local SSD | -68% | +179% | -64% |
| 100000 | HBM + CPU RAM + Local SSD | -79% | +264% | -73% |
Test 3: Large cache (12.6M - 13.7M tokens) saturates HBM and CPU RAM, spilling to Local SSD
| System Prompt Length (tokens) | Best-performing Storage Setup | Mean TTFT Change (%) vs. HBM only | Input Throughput Change (%) vs. HBM only | Mean End-to-End Latency Change (%) vs. HBM only |
| --- | --- | --- | --- | --- |
| 1000 | HBM + CPU RAM | +5% | +1% | -1% |
| 5000 | HBM + CPU RAM | -6% | +27% | -21% |
| 10000 | HBM + CPU RAM | +121% | +23% | -19% |
| 50000 | HBM + CPU RAM + Local SSD | +48% | +69% | -41% |
| 100000 | HBM + CPU RAM + Local SSD | -3% | +130% | -57% |
Summary
These results clearly demonstrate that a tiered storage solution significantly improves LLM inference performance by leveraging node-local storage, especially in scenarios with long system prompts that generate large KV Caches.
Optimizing LLM inference is a complex challenge requiring the coordinated effort of multiple infrastructure components (storage, compute, networking). Our work is part of a broader initiative to enhance the entire end-to-end inference stack, from intelligent load balancing at the Inference Gateway to advanced caching logic within the model server.
We are actively exploring further enhancements by integrating additional remote storage solutions with LMCache.

Next Steps
Get started with the same setup mentioned above on GKE.
Keep up to date on the LLM-D Inference Stack.