Scour
💾 Prompt Caching
Context Reuse, KV Cache, Inference Optimization, Token Efficiency
Scoured 186652 posts in 57.8 ms
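For context on the feed's topic: prompt caching lets an inference server reuse work already done for a shared prompt prefix (typically the per-layer attention key/value tensors) instead of re-encoding it on every request. A minimal sketch of the idea, using a string-keyed lookup as a stand-in for real KV tensors — all names here are hypothetical, not any vendor's API:

```python
import hashlib

class PrefixKVCache:
    """Toy prefix cache: maps a hash of the prompt prefix to a
    precomputed state. Real servers cache per-layer KV tensors."""

    def __init__(self):
        self._store = {}

    def _key(self, tokens):
        # Hash the exact token sequence; any change to the prefix misses.
        return hashlib.sha256("\x00".join(tokens).encode()).hexdigest()

    def get_or_compute(self, tokens, compute_fn):
        k = self._key(tokens)
        if k in self._store:
            return self._store[k], True   # hit: prefix encoding skipped
        state = compute_fn(tokens)        # miss: pay full prefix cost once
        self._store[k] = state
        return state, False

cache = PrefixKVCache()
system_prompt = ["You", "are", "a", "helpful", "assistant."]
_, hit1 = cache.get_or_compute(system_prompt, lambda t: {"n_tokens": len(t)})
_, hit2 = cache.get_or_compute(system_prompt, lambda t: {"n_tokens": len(t)})
print(hit1, hit2)  # first request misses, second request hits
```

Hashing the exact token sequence is why caching is prefix-sensitive: several of the posts below (e.g. the AWS Bedrock billing story) turn on requests that differ early in the prompt, which forces a miss for everything after the first divergent token.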
Claude: How prompt caching actually works · ⏳ Lazy Loading · mager.co · 2d
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference · 🧠 LLM Inference · arxiv.org · 6d
AmSach/kvquant: Drop-in KV cache compressor for local LLM inference - Run 70B models on 8GB RAM · 🧠 LLM Inference · github.com · 13h · DEV
$38k AWS Bedrock bill caused by a simple prompt caching miss · 🌐 Pingora · news.ycombinator.com · 2d · Hacker News
Qwen 3.6-35B-A3B KV cache bench: f16 vs q8_0 vs turbo3 vs turbo4 from 0 to 1M context on M5 Max · 🖥️ Hardware Architecture · llmkube.com · 3d · r/LocalLLaMA
not much happened today · 🏗️ LLM Infrastructure · news.smol.ai · 2d
DeepSeek V4 Cuts KV Cache by 90% at 1M Tokens, But Aggressive Compression Could Risk 'Needle in a Haystack' Failures · 🧠 LLM Inference · wccftech.com · 6d
Microsoft updates VS Code to 1.118 and adds remote control for Copilot CLI · 🔧 Developer tools · neowin.net · 1d
Speculative Decoding vs MoE: 3.2x Cost Gap on Llama 3 · 📊 Model Serving Economics · tildalice.io · 3d
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective · 🧠 LLM Inference · arxiv.org · 21h
Gemma 4 and Qwen 3.6 with q8_0 and q4_0 KV cache: KL divergence results · ⚡ Prefetching · localbench.substack.com · 6d · r/LocalLLaMA
GPT-5.5 is here: The price doubled, but 40% fewer tokens means it's actually a ~20% hike. Here's the honest TL;DR. · 🖥 GPUs · mindwiredai.com · 6d · r/PromptEngineering, r/SideProject
Google splits AI chips into training and inference TPUs, signaling shift toward workload-specialized AI infrastructure · 📱 Edge AI Optimization · digitimes.com · 6d
Update: GPT-5.5 and GPT-5.5 Pro are now available in the API. · 🌐 Web Standards · twitter.macworks.dev · 6d
I got a $134 Cloudflare D1 bill. Here's how I cut it 95% · ☁️ Cloudflare D1 · fullstacksveltekit.com · 3d · Hacker News
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference · 🧠 LLM Inference · arxiv.org · 2d
PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference · 🧠 LLM Inference · arxiv.org · 1d
CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration · 🔄 Cache Coherence · arxiv.org · 1d
Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing · 🧠 Memory Hierarchy Design · arxiv.org · 2d
Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding · ⚡ Hardware Acceleration · arxiv.org · 1d
Page 2 »