LLM Serving
LLM inference, vLLM, model serving, TensorRT-LLM
Scoured 210 posts in 15.9 ms
🔬 Deep Learning · GitHub · 4 days ago
I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.
Discussed on r/LLM
🕸️ Multi-Agent Systems · supercomputing-system-ai-lab.github.io · 2 hours ago
VoltanaLLM: Energy-Efficient LLM Serving
Discussed on Hacker News
🖥️ GPU Computing · NVIDIA Technical Blog · 16 hours ago
Boost Inference Performance up to 15x on NVIDIA Blackwell Using DFlash Speculative Decoding
Covers 3 stories, including NVIDIA/TensorRT-LLM
🖥️ GPU Computing · Red Hat Developer · 2 days ago
Designing distributed AI inference: Core concepts and scaling dimensions
📈 LLM Scaling · arXiv · 3 hours ago
CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation
🏗️ Systems Design · primeintellect.ai · 23 hours ago
RL at 1T Scale: prime-rl Performance Deep Dive
Covers 6 stories, including Kimi K2.7-Code: open-source coding model with better token efficiency
🏗️ Systems Design · Anyscale blog posts · 5 days ago
High Performance Distributed Inference with Ray Serve LLM
Covered by Google Cloud Blog
Discussed on Hacker News
🖥️ GPU Computing · medium.com · 2 days ago
Debugging Deployments with Gemma 12B, TPU v6e-1, MCP, and Antigravity CLI
🔬 Deep Learning · ubuntu.com · 2 days ago
Developing web apps with local LLM inference
📈 LLM Scaling · IBM Research · 13 hours ago
Running AI on mixed hardware for speed and affordability
Covers: Introduction to llm-d Open-source Kubernetes-native Framework for Distributed LLM Inference | Ep 140 #cloudnativefm
⚙️ MLOps · thecybersidekick.beehiiv.com · 5 days ago
AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
Discussed on DEV
⚙️ MLOps · Modal · 12 hours ago
Modal Auto Endpoints: Optimized inference you own
Covers 2 stories, including Statement on the US government directive to suspend access to Fable 5 and Mythos 5
Discussed on Hacker News
🔬 Deep Learning · Fergus's blog · 2 days ago
Adaptive speculative decoding: picking draft lengths at runtime
Covers 4 stories, including Looking for a self-hosted alternative to Modal.com for running ML workloads
Discussed on Hacker News
🏗️ Systems Design · Google Cloud Blog · 6 days ago
Scaling Ray Serve LLM on GKE: Performance without losing the developer experience
📊 Machine Learning · Gradient Ascent · 2 days ago
Groq on Endless Compute, Inside Claude's Mind, and GLM-5.2 Open Weights - The Tokenizer Edition #32
Covers 3 stories, including alibaba/open-code-review: Battle-tested at Alibaba's scale. Hybrid architecture code review tool: deterministic pipelines + LLM Agent, precise line-level comments, built-in fine-tuned ruleset (NPE, thread-safety, XSS, SQL injection), OpenAI & Anthropic compatible.
🔬 Deep Learning · YouTube · Content type: Video · 6 days ago
Token Injection: Crashing LLM Inference With Special Tokens
🖥️ GPU Computing · Baseten · 1 day ago
We built the fastest API for GLM-5.2 (280 TPS)
Covers: GLM-5.2 (6 minute read)
Discussed on Hacker News
🧠 Transformer Architecture · fitservers.com · 4 days ago
The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server
📈 LLM Scaling · arXiv · 3 hours ago
CompressKV: Semantic-Retrieval-Guided KV-Cache Compression for Resource-Efficient Long-Context LLM Inference
📈 LLM Scaling · Network World · 6 days ago
Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK