🧠 LLM Inference
Specific: Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 187,389 posts in 50.3 ms
SparKV: Overhead-Aware KV Cache Loading for Efficient On-Device LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 6d
The Inference Economy: Token Use
💰 Tokenomics · frontierai.substack.com · 9h · Substack
Adaptive Thinking: Large Language Models Know When to Think in Latent Space
🏗️ LLM Infrastructure · machinelearning.apple.com · 2d
AmSach/kvquant: Drop-in KV cache compressor for local LLM inference - Run 70B models on 8GB RAM
🏗️ LLM Infrastructure · github.com · 15h · DEV
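For orientation on what a drop-in KV-cache compressor does, below is a minimal sketch of symmetric per-tensor int8 quantization, the general idea behind such tools. The NumPy functions and names here are illustrative assumptions, not the kvquant repository's actual scheme or API.

```python
# Generic sketch of KV-cache compression via symmetric int8 quantization.
# Illustrative only; not the kvquant repository's actual algorithm or API.
import numpy as np

def quantize_kv(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Compress an fp32 KV tensor to int8 plus one fp scale (4x smaller at rest)."""
    scale = float(np.abs(x).max()) / 127.0
    scale = scale if scale > 0 else 1.0          # guard against all-zero tensors
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_kv(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate fp32 tensor at attention time."""
    return q.astype(np.float32) * scale

# Usage: compress a cached key tensor of shape (heads, seq_len, head_dim).
k_cache = np.random.randn(8, 1024, 128).astype(np.float32)
q, s = quantize_kv(k_cache)
print("max reconstruction error:", np.abs(dequantize_kv(q, s) - k_cache).max())
```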
DeepSeek-V4 on Day 0: From Fast Inference to Verified RL with SGLang and Miles
🧠 Inference Serving · lmsys.org · 5d · Hacker News

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 2d
shreyansh26/Speculative-Decoding: Speculative Decoding Implementations: EAGLE-3, Medusa-1, PARD, Draft Models, N-gram and Suffix Decoding from scratch
📊 Model Serving Economics · github.com · 4d · r/LLM, r/LocalLLaMA
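As background for the techniques collected in that repo, here is a simplified sketch of the vanilla draft-and-verify loop behind speculative decoding. It drafts greedily and replaces the full residual-distribution resampling step with a greedy fallback; `draft_model` and `target_model` are hypothetical callables mapping a token sequence to a next-token distribution, not code from the repository.

```python
# Simplified sketch of vanilla draft-and-verify speculative decoding.
# `draft_model` / `target_model` are hypothetical: sequence -> {token: prob}.
import random

def speculative_step(prefix, draft_model, target_model, k=4):
    """Draft k tokens with the cheap model, then verify with the target model."""
    seq = list(prefix)
    drafted = []                                  # (token, draft_prob) pairs
    for _ in range(k):                            # 1. cheap autoregressive drafting
        dist = draft_model(seq)
        tok = max(dist, key=dist.get)             # greedy draft for simplicity
        drafted.append((tok, dist[tok]))
        seq.append(tok)

    accepted, seq = [], list(prefix)
    for tok, q_prob in drafted:                   # 2. verify (one batched pass in practice)
        p_prob = target_model(seq).get(tok, 0.0)
        if random.random() < min(1.0, p_prob / max(q_prob, 1e-12)):
            accepted.append(tok)                  # target agrees often enough: keep it
            seq.append(tok)
        else:
            # Full method resamples from the normalized residual max(0, p - q);
            # we fall back to the target's greedy choice to keep the sketch short.
            dist = target_model(seq)
            accepted.append(max(dist, key=dist.get))
            break
    return accepted                               # at least 1 token per target pass
```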
DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 23h

Network Edge Inference for Large Language Models: Principles, Techniques, and Opportunities
📱 Edge AI Optimization · arxiv.org · 2d

Efficient, VRAM-Constrained xLM Inference on Clients
🏗️ LLM Infrastructure · arxiv.org · 23h

Incompressible Knowledge Probes: Estimating Black-Box LLM Parameter Counts via Factual Capacity
🏗️ LLM Infrastructure · arxiv.org · 1d · Hacker News

DUAL-BLADE: Dual-Path NVMe-Direct KV-Cache Offloading for Edge LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 23h

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d

Select to Think: Unlocking SLM Potential with Local Sufficiency
🏗️ LLM Infrastructure · arxiv.org · 23h

MCAP: Deployment-Time Layer Profiling for Memory-Constrained LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 6d

Hybrid JIT-CUDA Graph Optimization for Low-Latency Large Language Model Inference
🏗️ LLM Infrastructure · arxiv.org · 2d

Exploring the Efficiency of 3D-Stacked AI Chip Architecture for LLM Inference with Voxel
🏗️ LLM Infrastructure · arxiv.org · 23h

Anchored Variational Inference for Personalized Sequential Latent-State Models
🏗️ LLM Infrastructure · arxiv.org · 2d
FairyFuse: Multiplication-Free LLM Inference on CPUs via Fused Ternary Kernels
🕯️ Candle · arxiv.org · 6d
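To make "multiplication-free" concrete: with weights constrained to {-1, 0, +1}, a matrix-vector product reduces to adds and subtracts. The NumPy sketch below illustrates that general trick under those assumptions; it is not FairyFuse's fused-kernel implementation.

```python
# Generic sketch of a multiplication-free ternary matvec (weights in {-1, 0, +1}).
# Illustrates the idea behind ternary kernels; not FairyFuse's actual code.
import numpy as np

def ternary_matvec(w: np.ndarray, x: np.ndarray) -> np.ndarray:
    """y = W @ x without multiplies: add x where w=+1, subtract where w=-1."""
    plus = np.where(w > 0, x, 0.0).sum(axis=1)    # sum of x at +1 positions
    minus = np.where(w < 0, x, 0.0).sum(axis=1)   # sum of x at -1 positions
    return plus - minus

w = np.sign(np.random.randn(4, 8)).astype(np.int8)  # toy ternary weight matrix
x = np.random.randn(8).astype(np.float32)
assert np.allclose(ternary_matvec(w, x), w.astype(np.float32) @ x)
```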
QFlash: Bridging Quantization and Memory Efficiency in Vision Transformer Attention
🔬 RaBitQ · arxiv.org · 1d