⚡ KV Cache - linbolin1230 · Scour

PagedAttention is more than virtual memory

🧠LLM Inference

thecomputersciencebook.com·Hacker News·Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

🧠LLM Inference Academic

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

🧠LLM Inference Blog

thecybersidekick.beehiiv.com·DEV

67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

🧠LLM Inference Blog

anyscale.com·Hacker News

llama.cpp vs. vLLM: Choosing the right local LLM inference engine

🧠LLM Inference

developers.redhat.com·Covers 7 stories

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

🧠LLM Inference Blog

cloud.google.com

The Transformer Pipeline: A Complete Mathematical and Visual Guide

🔢Vector DBs Blog

Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK

networkworld.com

massimo92/spark: CLI tool for serving LLMs with vLLM on NVIDIA DGX Spark. One file, zero friction.

🧠LLM Inference Code

github.com·Hacker News·Covers: Just ran CC on my Mac remotely from my Phone - while sitting in a Taxi!

Two Qwen3 models on one DGX Spark: the residency math

🧠LLM Inference News

devashish.me·Hacker News

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

🧠LLM Inference

vettedconsumer.com·Hacker News·Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention, DeepSeek-V2: A Strong, Economical, and Efficient MOE Language Model

A brief history of KV cache compression developments

🧠LLM Inference Blog

martinalderson.com·Covers: TurboQuant: Redefining AI efficiency with extreme compression

Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…

🧠LLM Inference Blog

Less-relevant results

KV Cache in LLMs: From Zero to Production

🧠LLM Inference Blog

carnotresearch.medium.com

RAG Observability with Langfuse, vLLM, and FAISS

pyimagesearch.com

Why GPUs Became the Foundation of AI: A GPU Primer for K8s Veterans

🔧MLOps Blog

KV Cache Explained: Why LLMs Recompute Everything and How We Stop It

🧠LLM Inference Blog

DFlash and Spec V2 Decoding (14 minute read)

🧠LLM Inference Blog

lmsys.org·Covers: Looking for a self-hosted alternative to Modal.com for running ML workloads, MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS +2 more

yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

huggingface.co·Covers: GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inferen...

Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

🧠LLM Inference Blog

rocm.blogs.amd.com·Hacker News
