Scour
🧠 LLM Inference · Quantization, Attention Mechanisms, Batch Processing, KV Caching
186562
posts in
60.5
ms
DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 3d

The Inference Economy: Token Use
💰 Tokenomics · frontierai.substack.com · 11h · Substack

Adaptive Thinking: Large Language Models Know When to Think in Latent Space
🏗️ LLM Infrastructure · machinelearning.apple.com · 2d

AmSach/kvquant: Drop-in KV cache compressor for local LLM inference - Run 70B models on 8GB RAM
🏗️ LLM Infrastructure · github.com · 17h · DEV

DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
🧠 Inference Serving · lmsys.org · 5d · Hacker News

Speculative Decoding vs MoE: 3.2x Cost Gap on Llama 3
📊 Model Serving Economics · tildalice.io · 3d

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
🏗️ LLM Infrastructure · arxiv.org · 2d · Hacker News

shreyansh26/Speculative-Decoding: Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch
📊 Model Serving Economics · github.com · 4d · r/LLM, r/LocalLLaMA

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 2d

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d

Efficient, VRAM-Constrained xLM Inference on Clients
🏗️ LLM Infrastructure · arxiv.org · 1d

QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention
🔬 RaBitQ · arxiv.org · 2d

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
📱 Edge AI Optimization · arxiv.org · 3d

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study
🏗️ LLM Infrastructure · arxiv.org · 2d

Select to Think: Unlocking SLM Potential with Local Sufficiency
🏗️ LLM Infrastructure · arxiv.org · 1d

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
🏗️ LLM Infrastructure · arxiv.org · 3d

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
🏗️ LLM Infrastructure · arxiv.org · 1d

Anchored Variational Inference for Personalized Sequential Latent-State Models
🏗️ LLM Infrastructure · arxiv.org · 3d

One Refiner to Unlock Them All: Inference-Time Reasoning Elicitation via Reinforcement Query Refinement
🏗️ LLM Infrastructure · arxiv.org · 2d