💾 Prompt Caching
Context Reuse, KV Cache, Inference Optimization, Token Efficiency
Scoured 24,869 posts in 211.7 ms
Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck
🧠 LLM Inference · pub.towardsai.net · 1d
How we cut our agent's API costs by 10x with prompt caching
⏳ Durable Execution · kern-ai.com · 4d · Hacker News
KV Cache Offloading for Context-Intensive Tasks
⚡ Prefetching · arxiv.org · 15h
Low-Rank Key Value Attention: Reducing KV Cache Memory and Maintaining Head Diversity
🧠 LLM Inference · fin.ai · 1d · Hacker News
g023/turboquant: Standalone TurboQuant KV Cache Inference for https://huggingface.co/g023/Qwen3-1.77B-g023
🕯️ Candle · github.com · 6d · Hacker News
GPU Memory for LLM Inference: Why Llama-70B Doesn't Fit
🏗️ LLM Infrastructure · darshanfofadiya.com · 4d · Hacker News
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention
🧠 LLM Inference · arxiv.org · 15h
TurboQuant - Extreme KV Cache Quantization · ggml-org/llama.cpp
🔬 RaBitQ · github.com · 3d · r/LocalLLaMA
Breaking the Memory Wall: TurboQuant KV Cache Quantization on Apple Silicon
🖥️ Hardware Architecture · pub.towardsai.net · 1d
SAT: Balancing Reasoning Accuracy and Efficiency with Stepwise Adaptive Thinking
🧮 SMT Solvers · arxiv.org · 15h
Comparative Characterization of KV Cache Management Strategies for LLM Inference
🧠 LLM Inference · arxiv.org · 2d
HybridKV: Hybrid KV Cache Compression for Efficient Multimodal Large Language Model Inference
🧠 LLM Inference · arxiv.org · 2d
From Whiteboard to IDE: Implementing Google's TurboQuant KV Cache Compression in Python
🔬 RaBitQ · pub.towardsai.net · 3d
AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
🧠 LLM Inference · arxiv.org · 1d
ForkKV: Scaling Multi-LoRA Agent Serving via Copy-on-Write Disaggregated KV Cache
🌍 Distributed Systems · arxiv.org · 1d
TRAPTI: Time-Resolved Analysis for SRAM Banking and Power Gating Optimization in Embedded Transformer Inference
🖥️ Hardware Architecture · arxiv.org · 1d
Cognitive Loop of Thought: Reversible Hierarchical Markov Chain for Efficient Mathematical Reasoning
🧠 LLM Inference · arxiv.org · 1d
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing
🗳️ Raft Consensus · arxiv.org · 4d
JoyAI-LLM Flash: Advancing Mid-Scale LLMs with Token Efficiency
🏗️ LLM Infrastructure · arxiv.org · 4d
FluxMoE: Decoupling Expert Residency for High-Performance MoE Serving
🏗️ LLM Infrastructure · arxiv.org · 4d