Scour
🧠 LLM Inference: Quantization, Attention Mechanisms, Batch Processing, KV Caching
29024
posts in
55.6
ms
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 6d

The Inference Economy: Token Use
💰 Tokenomics · frontierai.substack.com · 5h · Substack

Adaptive Thinking: Large Language Models Know When to Think in Latent Space
🏗️ LLM Infrastructure · machinelearning.apple.com · 1d

DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
🧠 Inference Serving · lmsys.org · 4d · Hacker News

DeepSeek V4 in vLLM: Efficient Long-Context Attention
🏗️ LLM Infrastructure · vllm-website-pdzeaspbm-inferact-inc.vercel.app · 6d · Hacker News

shreyansh26/Speculative-Decoding: Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch
📊 Model Serving Economics · github.com · 4d · r/LLM, r/LocalLLaMA

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 2d

Inside the Softmax Bottleneck: Engineering Hardware-Aware Attention Mechanisms
🏗️ LLM Infrastructure · pub.towardsai.net · 6d

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
📱 Edge AI Optimization · arxiv.org · 2d

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 19h

Efficient, VRAM-Constrained xLM Inference on Clients
🏗️ LLM Infrastructure · arxiv.org · 19h

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 19h

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
🏗️ LLM Infrastructure · arxiv.org · 1d · Hacker News

Select to Think: Unlocking SLM Potential with Local Sufficiency
🏗️ LLM Infrastructure · arxiv.org · 19h

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
🏗️ LLM Infrastructure · arxiv.org · 19h

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
🏗️ LLM Infrastructure · arxiv.org · 2d

QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention
🔬 RaBitQ · arxiv.org · 1d

Scaling Multi-Node Mixture-of-Experts Inference Using Expert Activation Patterns
🧩 MoE · arxiv.org · 2d

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 6d