Scour
💾 Prompt Caching
Context Reuse, KV Cache, Inference Optimization, Token Efficiency
Scoured 186665 posts in 55.0 ms

Claude: How prompt caching actually works
⏳ Lazy Loading · mager.co · 2d

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🧠 LLM Inference · arxiv.org · 3d

AmSach/kvquant: Drop-in KV cache compressor for local LLM inference - Run 70B models on 8GB RAM
🧠 LLM Inference · github.com · 17h · DEV

KV Cache Locality: The Hidden Variable in Your LLM Serving Cost
⚡ Prefetching · ranvier.systems · 1d · Hacker News

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max
🖥️ Hardware Architecture · llmkube.com · 3d · r/LocalLLaMA

$38k AWS Bedrock bill caused by a simple prompt caching miss
🌐 Pingora · news.ycombinator.com · 2d · Hacker News

DeepSeek V4 Cuts KV Cache by 90% at 1M Tokens, But Aggressive Compression Could Risk ‘Needle in a Haystack’ Failures
🧠 LLM Inference · wccftech.com · 6d

not much happened today
🏗️ LLM Infrastructure · news.smol.ai · 2d

Microsoft updates VS Code to 1.118 and adds remote control for Copilot CLI
🔧 Developer tools · neowin.net · 1d

Speculative Decoding vs MoE: 3.2x Cost Gap on Llama 3
📊 Model Serving Economics · tildalice.io · 3d

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
⚡ Prefetching · localbench.substack.com · 6d · r/LocalLLaMA

GPT-5.5 is here: The price doubled, but 40% fewer tokens means it’s actually a ~20% hike. Here’s the honest TL;DR.
🖥 GPUs · mindwiredai.com · 6d · r/PromptEngineering, r/SideProject

Google splits AI chips into training and inference TPUs, signaling shift toward workload-specialized AI infrastructure
📱 Edge AI Optimization · digitimes.com · 6d

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
🧠 LLM Inference · arxiv.org · 1d

Update: GPT-5.5 and GPT-5.5 Pro are now available in the API.
🌐 Web Standards · twitter.macworks.dev · 6d

I got a $134 Cloudflare D1 bill. Here's how I cut it 95%
☁️ Cloudflare D1 · fullstacksveltekit.com · 3d · Hacker News

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🧠 LLM Inference · arxiv.org · 2d

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
🔄 Cache Coherence · arxiv.org · 2d

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
⚡ Hardware Acceleration · arxiv.org · 2d

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
🧠 Memory Hierarchy Design · arxiv.org · 3d