Systems-level optimizations for LLM serving
Scoured 138 posts in 6.5 ms
DiffusionGemma: The Developer Guide
🚀 LLM serving frameworks
Content type: Blog
developers.googleblog.com · 2 days ago · Hacker News
From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure
📊 AI Performance Profiling
Content type: Blog
jimmysong.io · 2 days ago
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
🔢 Quantization of LLMs
vettedconsumer.com · 5 days ago · Hacker News
Context compression finally works in production: new research cuts LLM input 16x without the accuracy hit
🧠 Large Language Models (LLMs)
venturebeat.com · 7 hours ago
WEKA software speeds long context AI inferencing on Oracle’s public cloud
📊 AI Performance Profiling
Content type: News
blocksandfiles.com · 1 day ago
MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent
🧠 Large Language Models (LLMs)
Content type: Blog
bric.pe.kr · 3 days ago · DEV
Apple Shows How to Run AI Agents Locally on Mac With MLX [Video]
🤖 Agents using LLMs
Content type: News
iclarified.com · 9 hours ago
Report: GKE Inference Gateway delivers up to 92% faster AI responses
🧠 Large Language Models (LLMs)
Content type: Blog
cloud.google.com · 3 days ago · Hacker News
Massive AI Storage Demand Creates a New Memory Wall
🔍 Retrieval-augmented generation
Content type: News
eetimes.com · 1 day ago
The economics of speculative decoding
📊 AI Performance Profiling
Content type: Blog
fergusfinn.com · 4 days ago · Hacker News
VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
💬 Prompt optimizations for LLM serving
Content type: Academic
arxiv.org · 21 hours ago
Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
📊 AI Performance Profiling
Content type: Blog
tilert.ai · 3 days ago · Hacker News
For whom the door-bell tolls
🧠 Large Language Models (LLMs)
ceph.io · 1 day ago
gist:5b74b8c31e934ff50ce57aa653a343d5
🧠 Large Language Models (LLMs)
gist.github.com · 21 hours ago · r/LocalLLaMA
BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster
🧠 Large Language Models (LLMs)
sleepingrobots.com · 5 days ago
NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving
💬 Prompt optimizations for LLM serving
Content type: Code
github.com · 18 hours ago · Hacker News
Valkey: Unlocked Seattle: The Best Systems Let You Sleep At Night
🌐 Distributed LLM Systems
Content type: Blog
valkey.io · 1 day ago
What Arm-based innovations happened in May 2026?
🤖 Agents using LLMs
Content type: Blog
newsroom.arm.com · 6 days ago
Qwen 3.6 27B AutoRound GGUF, need your feedback
✨ Model optimizations in LLMs
huggingface.co · 2 days ago · r/LocalLLaMA
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
📊 AI Performance Profiling
local-llm.utop.workers.dev · 4 days ago · Hacker News