Scour
🧠 LLM Inference
Specific
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 201 posts in 30.8 ms
LLM Inference
🏗️ LLM Infrastructure · iop.systems · 49m
SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips
🏗️ LLM Infrastructure · supercomputing-system-ai-lab.github.io · 2d · Hacker News
chiennv2000/orthrus: Fast, lossless LLM inference via dual-view diffusion decoding.
🏗️ LLM Infrastructure · github.com · 5d · Hacker News
SpecSA: Bridging Speculative Decoding and Sparse Attention for Efficient LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 23h
InferenceBench: A Benchmark for Open-Ended Inference Optimization by AI Agents
🏗️ LLM Infrastructure · inferencebench.ai · 4h · Hacker News
Your LLM Server Is Wasting 80% of Its GPU Memory — Here’s How vLLM Fixes That
🏗️ LLM Infrastructure · pub.towardsai.net · 1d
I tried 4 LLM speedup techniques on CPU. Three made it slower.
⚙️ Mechanical Sympathy · deemwar-products.github.io · 8h · Hacker News
https://www.together.ai/blog/coding-agent-benchmarks
💻 Coding Agents · together.ai · 5d
KV Cache Is Becoming the Memory Hierarchy of Inference
💾 Prompt Caching · touchdown-labs.com · 2d
GPU Memory Math for LLMs: Formula That Tells You What Fits on Your GPU
🏗️ LLM Infrastructure · theahmadosman.substack.com · 6h · Substack, r/LocalLLaMA
Benchmarking llama.cpp's brand-new MTP support on Strix Halo
⚡ Glommio · calebcoffie.com · 2d · Hacker News
How LLM Inference Works
🏗️ LLM Infrastructure · arpitbhayani.me · 6d · Hacker News
DrBearJew/llama.cpp at tbq4-rdna3-experiment
🏗️ LLM Infrastructure · github.com · 6d · r/LocalLLaMA
FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration
📊 Model Serving Economics · arxiv.org · 23h
Let AI Agents Write Your Serving Stack with VibeServe
🏗️ LLM Infrastructure · syfi.cs.washington.edu · 6d · Hacker News
VeriCache: Turning Lossy KV Cache into Lossless LLM Inference
🏗️ LLM Infrastructure · arxiv.org · 1d
RedToasty/llama.cpp_qts: Fixing --split-mode tensor, with different KV cache quantization types.
🏗️ LLM Infrastructure · github.com · 3d · r/LocalLLaMA
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
🔢 BitNet Inference · arxiv.org · 23h
I've updated my glorified Llama fork (LLM Inference Server) for P40's to utilise MTP + TurboQuant + DFlash
🏗️ LLM Infrastructure · github.com · 4d · r/LocalLLaMA
OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
🔬 RaBitQ · arxiv.org · 1d