Scour
💾 Prompt Caching
Context Reuse, KV Cache, Inference Optimization, Token Efficiency
Scoured 186665 posts in 55.0 ms

Claude: How prompt caching actually works
⏳ Lazy Loading · mager.co · 2d

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🧠 LLM Inference · arxiv.org · 3d

AmSach/kvquant: Drop-in KV cache compressor for local LLM inference - Run 70B models on 8GB RAM
🧠 LLM Inference · github.com · 17h · DEV

KV Cache Locality: The Hidden Variable in Your LLM Serving Cost
⚡ Prefetching · ranvier.systems · 1d · Hacker News

Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max
🖥️ Hardware Architecture · llmkube.com · 3d · r/LocalLLaMA

$38k AWS Bedrock bill caused by a simple prompt caching miss
🌐 Pingora · news.ycombinator.com · 2d · Hacker News

DeepSeek V4 Cuts KV Cache by 90% at 1M Tokens, But Aggressive Compression Could Risk ‘Needle in a Haystack’ Failures
🧠 LLM Inference · wccftech.com · 6d

not much happened today
🏗️ LLM Infrastructure · news.smol.ai · 2d

Microsoft updates VS Code to 1.118 and adds remote control for Copilot CLI
🔧 Developer tools · neowin.net · 1d

Speculative Decoding vs MoE: 3.2x Cost Gap on Llama 3
📊 Model Serving Economics · tildalice.io · 3d

Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results
⚡ Prefetching · localbench.substack.com · 6d · r/LocalLLaMA

GPT-5.5 is here: The price doubled, but 40% fewer tokens means it’s actually a ~20% hike. Here’s the honest TL;DR.
🖥 GPUs · mindwiredai.com · 6d · r/PromptEngineering, r/SideProject

Google splits AI chips into training and inference TPUs, signaling shift toward workload-specialized AI infrastructure
📱 Edge AI Optimization · digitimes.com · 6d

Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
🧠 LLM Inference · arxiv.org · 1d

Update: GPT-5.5 and GPT-5.5 Pro are now available in the API.
🌐 Web Standards · twitter.macworks.dev · 6d

I got a $134 Cloudflare D1 bill. Here's how I cut it 95%
☁️ Cloudflare D1 · fullstacksveltekit.com · 3d · Hacker News

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🧠 LLM Inference · arxiv.org · 2d

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration
🔄 Cache Coherence · arxiv.org · 2d

Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
⚡ Hardware Acceleration · arxiv.org · 2d

Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
🧠 Memory Hierarchy Design · arxiv.org · 3d