💾 Prompt Caching
Context Reuse, KV Cache, Inference Optimization, Token Efficiency
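The tags above name the core trick behind this feed: a transformer's attention keys and values for a prompt prefix depend only on that prefix, so a server can compute them once and reuse them across requests that share it. The sketch below illustrates the idea in miniature; the `PrefixKVCache` class, its `forward` method, and the hash standing in for real K/V tensors are all illustrative inventions, not any particular serving stack's API. Production systems (vLLM's prefix caching, SGLang's RadixAttention) match prefixes per block with a radix structure rather than caching every prefix outright, which this toy does for clarity at O(n^2) memory cost.

```python
import hashlib

class PrefixKVCache:
    """Toy in-memory store: tuple(prefix tokens) -> list of fake KV entries."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _fake_kv(token):
        # Stand-in for the expensive per-token key/value projection.
        return hashlib.sha1(token.encode()).hexdigest()[:8]

    def forward(self, tokens):
        """Return KV entries for `tokens`, reusing the longest cached prefix.

        Also returns how many leading tokens were served from the cache.
        """
        cut = 0
        for i in range(len(tokens), 0, -1):       # longest cached prefix wins
            if tuple(tokens[:i]) in self._store:
                cut = i
                break
        kv = list(self._store.get(tuple(tokens[:cut]), []))
        for j in range(cut, len(tokens)):         # compute only the suffix...
            kv.append(self._fake_kv(tokens[j]))
            self._store[tuple(tokens[: j + 1])] = list(kv)  # ...and cache it
        return kv, cut

cache = PrefixKVCache()
system = "You are a helpful assistant .".split()
_, hit = cache.forward(system + "What is a KV cache ?".split())
print(hit)  # 0 -- cold cache, every token computed
_, hit = cache.forward(system + "Summarize this paper .".split())
print(hit)  # 6 -- the shared system prompt is served from the cache
```

On the second call the six shared system-prompt tokens come straight from the cache, and only the fresh suffix is computed. The first result below reports a 7.5x speedup from applying this in exactly that regime: long shared prompts, short responses.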
Scoured 28,848 posts in 70.9 ms
Prompt caching but for RL – 7.5x speedup on long-prompt/short-response workloads
🧠 LLM Inference · castform.com · 23h · Hacker News

Understanding KV Cache in LLMs and How It Affects Inference
🧠 LLM Inference · pub.towardsai.net · 4d

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
🧠 LLM Inference · arxiv.org · 1d

Structural Prompt Preservation: Keeping AI Agents on Track Across Long Sessions
🧠 Agent Memory · leithdocs.com · 6h

Compressing KV caches with a related model
🔬 RaBitQ · fergusfinn.com · 4d · Hacker News

Codex in Chrome 🤖, inside Chinese labs 🇨🇳, improving token efficiency 🛠️
🇨🇳 Chinese AI · tldr.tech · 4d

fluffypony/dothething: an autonomous AI agent: you describe the thing, it does the thing.
🔧 Agent Tooling · github.com · 3d

LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
🧠 LLM Inference · arxiv.org · 1d

Improving token efficiency in GitHub Agentic Workflows
💰 Tokenomics · github.blog · 4d

Pinning a Local LLM to an RTX 5090: Five Hours, Several Faceplants, One Solid Setup
🏗️ LLM Infrastructure · buraak.com · 6d · Hacker News

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
🔬 RaBitQ · arxiv.org · 16h

RateQuant: Optimal Mixed-Precision KV Cache Quantization via Rate-Distortion Theory
🔬 RaBitQ · arxiv.org · 1d

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
🧠 LLM Inference · arxiv.org · 1d

Is 3-Bit KV Cache the Holy Grail? A Reality Check on Google’s TurboQuant
🔬 RaBitQ · pub.towardsai.net · 3d

How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
🗜️ Vector Compression · arxiv.org · 1d

A Queueing-Theoretic Framework for Stability Analysis of LLM Inference with KV Cache Memory Constraints
🧠 LLM Inference · arxiv.org · 5d

Tutti: Making SSD-Backed KV Cache Practical for Long-Context LLM Serving
🏗️ LLM Infrastructure · arxiv.org · 6d

AdapShot: Adaptive Many-Shot In-Context Learning with Semantic-Aware KV Cache Reuse
📦 Batch Embeddings · arxiv.org · 6d

RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
💨 Cache-Friendly Algorithms · arxiv.org · 5d

Memory Inception: Latent-Space KV Cache Manipulation for Steering LLMs
🧠 LLM Inference · arxiv.org · 4d