Systems-level optimizations for LLM serving
Scoured 139 posts in 7.5 ms
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🧠 Large Language Models (LLMs)
Content type: Code
github.com · 5 days ago · Hacker News, r/LLM
Big Blue’s Redbook on Storage Scale KV Cache management
📊 AI Performance Profiling
Content type: News
blocksandfiles.com · 2 days ago
Intelligent inference scheduling with llm-d on Red Hat AI
🚀 LLM serving frameworks
developers.redhat.com · 23 hours ago
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
✨ Model optimizations in LLMs
Content type: Academic
arxiv.org · 1 day ago
DiffusionGemma: Discrete diffusion in a large language model
🧠 Large Language Models (LLMs)
idlemachines.co.uk · 1 hour ago · Hacker News
PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference
🚀 LLM serving frameworks
Content type: Blog
medium.com · 3 days ago
Google's DiffusionGemma generates 256 tokens in parallel and self-corrects as it goes
🚀 LLM serving frameworks
venturebeat.com · 8 hours ago
DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
🚀 LLM serving frameworks
Content type: News
newsletter.semianalysis.com · 2 days ago · Hacker News
How we fight GPU scarcity without compromise
🧠 Large Language Models (LLMs)
Content type: Blog
equixly.com · 6 days ago · Hacker News
Anatomy of a high-performance EP kernel
📊 AI Performance Profiling
Content type: Blog
fergusfinn.com · 1 day ago · Hacker News
Less-relevant results
Making FlashAttention-4 faster for inference
📊 AI Performance Profiling
Content type: Blog
modal.com · 11 hours ago · Hacker News
MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent
🧠 Large Language Models (LLMs)
Content type: Blog
bric.pe.kr · 2 days ago · DEV
2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP
🌐 Distributed LLM Systems
Content type: Blog
dnhkng.github.io · 4 days ago
Report: GKE Inference Gateway delivers up to 92% faster AI responses
🧠 Large Language Models (LLMs)
Content type: Blog
cloud.google.com · 2 days ago · Hacker News
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
✨ Model optimizations in LLMs
Content type: News, Blog
kaitchup.substack.com · 6 days ago · r/LocalLLaMA
Stop Treating Your Models Like Microservices
⚙️ AI Infrastructure Automation
cloudnativenow.com · 5 hours ago
The Inference Alpha: Maximizing Frontier Models on AMD
✨ Model optimizations in LLMs
Content type: Blog
digitalocean.com · 1 day ago
massimo92/spark: CLI tool for serving LLMs with vLLM on NVIDIA DGX Spark. One file, zero friction.
🚀 LLM serving frameworks
Content type: Code
github.com · 4 hours ago · Hacker News
A system programmer’s guide to LLM inference
✨ Model optimizations in LLMs
Content type: Blog
blog.xiangpeng.systems · 3 days ago · Hacker News
Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!
✨ Model optimizations in LLMs
gizchina.com · 2 days ago