Show HN: KV Marketplace – share LLM attention caches across GPUs like memcached

Cross-GPU KV Cache Marketplace

Abstract

We propose a distributed inference runtime that enables cross-GPU reuse of transformer attention states. In autoregressive language models, each token produces per-layer key and value (KV) tensors that are cached to accelerate subsequent decoding. Today, every process or GPU recomputes these KV tensors independently, even when multiple requests share identical or overlapping prefixes, a major source of redundant computation and GPU memory waste.
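To make the redundancy concrete, here is a minimal toy sketch (PyTorch; not the project's code, and the weight names and shapes are illustrative): two requests that share a token prefix produce identical per-layer key/value rows for that prefix, so a second GPU recomputing them duplicates work.

```python
# Minimal sketch: identical prefixes yield identical KV tensors.
# Toy single-head projection; real models do this per layer and per head.
import torch

def attention_kv(x, w_k, w_v):
    """Project token embeddings into key/value tensors ([seq, d_head] each)."""
    return x @ w_k, x @ w_v

torch.manual_seed(0)
d_model, d_head = 16, 8
w_k = torch.randn(d_model, d_head)
w_v = torch.randn(d_model, d_head)

prefix = torch.randn(5, d_model)                       # shared 5-token prefix
req_a = torch.cat([prefix, torch.randn(2, d_model)])   # prefix + 2 new tokens
req_b = torch.cat([prefix, torch.randn(3, d_model)])   # prefix + 3 new tokens

k_a, v_a = attention_kv(req_a, w_k, w_v)
k_b, v_b = attention_kv(req_b, w_k, w_v)

# The first 5 rows of K and V are identical across the two requests, so
# recomputing them elsewhere is pure waste -- the redundancy cross-GPU
# reuse is meant to eliminate.
assert torch.equal(k_a[:5], k_b[:5]) and torch.equal(v_a[:5], v_b[:5])
```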

The Cross-GPU KV Cache Marketplace treats these attention states as first-class, shareable artifacts. Each process exports completed prefix caches, indexed by a hash of the input token sequence and model version, into a distributed registry. Other processes with matching prefixes can import these cached tensors instead of recomputing the shared prefix.
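Below is a minimal sketch of what such a registry could look like, assuming an in-process dict as the backing store. The names `KVRegistry`, `export`, and `import_prefix` are illustrative, not the project's actual API; a real deployment would use a distributed store and zero-copy GPU transport.

```python
# Hypothetical registry sketch: keys are a hash of model version + token ids,
# values are serialized per-layer KV blobs. Not the project's real interface.
import hashlib
from typing import Optional

class KVRegistry:
    """Maps (model_version, token_prefix) -> exported KV cache blobs."""

    def __init__(self):
        self._store: dict[str, bytes] = {}

    @staticmethod
    def _key(model_version: str, token_ids: list[int]) -> str:
        h = hashlib.sha256()
        h.update(model_version.encode())
        h.update(str(token_ids).encode())
        return h.hexdigest()

    def export(self, model_version: str, token_ids: list[int], kv_blob: bytes) -> None:
        """Publish a completed prefix cache under its content-derived key."""
        self._store[self._key(model_version, token_ids)] = kv_blob

    def import_prefix(self, model_version: str, token_ids: list[int]) -> Optional[bytes]:
        """Return the longest cached prefix of token_ids, or None on a miss."""
        for end in range(len(token_ids), 0, -1):
            blob = self._store.get(self._key(model_version, token_ids[:end]))
            if blob is not None:
                return blob
        return None

# Usage: one worker exports after prefill, another imports before decoding.
reg = KVRegistry()
reg.export("llama-3-8b@v1", [1, 42, 7, 99], b"...serialized per-layer KV...")
hit = reg.import_prefix("llama-3-8b@v1", [1, 42, 7, 99, 5])  # 4-token prefix hit
assert hit is not None
```

Hashing on the exact token sequence plus model version keeps lookups content-addressed: caches from a different model or tokenization can never be confused for a match.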
