⚡ KV Cache - linbolin1230 · Scour

PagedAttention is more than virtual memory

🧠LLM Inference

thecomputersciencebook.com · Hacker News · Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

🧠LLM Inference Academic

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

🧠LLM Inference Blog

thecybersidekick.beehiiv.com · DEV

67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X

🧠LLM Inference Blog

anyscale.com · Hacker News

llama.cpp vs. vLLM: Choosing the right local LLM inference engine

🧠LLM Inference

developers.redhat.com · Covers 7 stories

The Transformer Pipeline: A Complete Mathematical and Visual Guide

🔢Vector DBs Blog

Cosmicgpt – A GPT-in-space simulator to research SpaceX AI satellite viability

💬LLMs Code

github.com · Hacker News

Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK

networkworld.com

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

🧠LLM Inference

vettedconsumer.com · Hacker News · Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention, DeepSeek-V2: A Strong, Economical, and Efficient MOE Language Model

A brief history of KV cache compression developments

🧠LLM Inference Blog

martinalderson.com · Covers: TurboQuant: Redefining AI efficiency with extreme compression

Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…

🧠LLM Inference Blog

Less-relevant results

KV Cache in LLMs: From Zero to Production

🧠LLM Inference Blog

carnotresearch.medium.com

RAG Observability with Langfuse, vLLM, and FAISS

pyimagesearch.com

Why GPUs Became the Foundation of AI: A GPU Primer for K8s Veterans

🔧MLOps Blog

KV Cache Explained: Why LLMs Recompute Everything and How We Stop It

🧠LLM Inference Blog

How Public AI delivers sovereign LLM inference on AWS and Intel

🧠LLM Inference Blog

aws.amazon.com · Covers: Hugging Face – Fun chat with your own Artificial Intelligence, vLLM +1 more

DFlash and Spec V2 Decoding (14 minute read)

🧠LLM Inference Blog

lmsys.org · Covers: Looking for a self-hosted alternative to Modal.com for running ML workloads, MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS +2 more

Run a local coding model with pi and LM Studio

🧠LLM Inference

zarar.dev · Covers: Pi.dev: There are many coding agents, but this one is mine, Opencode – open-source alternative to Claude Code +3 more

DiffusionGemma’s 4x Speedup Is a GPU Utilization Trick, Not a Model Breakthrough

🗄️Storage Engines Blog

yuxinlu1/gemma-4-12B-coder-fable5-composer2.5-v1-GGUF

huggingface.co · Covers: GitHub here. You can follow the build instructions below as well. Change -DGGML_CUDA=ON to -DGGML_CUDA=OFF if you don't have a GPU or just want CPU inferen...
