Scour
⚡ LLM Serving
Specific
LLM inference, vLLM, model serving, TensorRT-LLM
Scoured 215 posts in 12.8 ms
🤖 AI Agents · medium.com · 3 days ago
The Context Budget That Will Decide Everyday AI
Less-relevant results
🏗️ Systems Design · blocksandfiles · 1 day ago
Dell and data physics
📈 LLM Scaling · arXiv · 19 hours ago
Human-Less LLM Serving: Quantifying the Human Tax on Throughput
📊 Machine Learning · jimmysong.io · 6 days ago
Why GPUs Became the Foundation of AI: A GPU Primer for K8s Veterans
📈 LLM Scaling · medium.com · 1 day ago
LLM Inference Optimization: The Difference Between an AI Demo and an AI Business
🖥️ GPU Computing · Hugging Face · 6 days ago
GLM-5.2: Built for Long-Horizon Tasks
Covers 5 stories, including: New model GLM-Experimental is quite good (not local so far)
Covered by 3 sources, including: vettedconsumer.com, DEV Community
Discussed on: Hacker News and r/LocalLLaMA
🔬 Deep Learning · GitHub · 20 hours ago
100+ t/s on Qwen3.6-27B Q8 across a 5090 + 3090 Ti — switching to tensor split-mode got me from 70 to 100+
Covered by: imil.net, NVIDIA Technical Blog
Discussed on: r/LocalLLaMA
⚙️ MLOps · AWS · 5 days ago
Monitor and debug generative AI inference with SageMaker detailed metrics and Insights dashboard on CloudWatch
📈 LLM Scaling · arXiv · 19 hours ago
Geometry-Aware Online Scheduling for LLM Serving: From Theoretical Bound to System Practice
🖥️ GPU Computing · digitalocean.com · 4 days ago
Efficient LLM Compression with SparseGPT and Wanda on GPU Cloud
Covers: NVIDIA Triton Inference Server
🏗️ Systems Design · ServeTheHome · 1 day ago
This is the Storage of Spaceborne Computer 4 Bringing AI Compute to the Moon
🔬 Deep Learning · certdepot.net · 5 days ago
Recent Technical
🖥️ GPU Computing · arXiv · 19 hours ago
HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models
🧠 Transformer Architecture · news.smol.ai · 5 days ago
not much happened today | AINews
🔬 Deep Learning · GitHub · 1 day ago
DeepSeek V4 Flash optimized framework and model variants for DGX Spark
Covers: Nvidia RTX Spark
Discussed on: Hacker News
⚙️ MLOps · alternativeto.net · 5 days ago
Z.ai debuts GLM-5.2 with stable 1M-token context and top coding scores
🧠 Transformer Architecture · arXiv · 19 hours ago
Delay-Adaptive Speculation Control for Low-Latency Edge-Cloud LLM Inference
🧠 Transformer Architecture · XDA · 5 days ago
I tested Google's new Gemma 4 12B on my 8GB GPU, and now I don't want to go back to smaller models
Discussed on: Hacker News
🔬 Deep Learning · together.ai · 23 hours ago
ParallelKernelBench: Frontier LLMs can't write fast multi-GPU kernels (yet)
Covers 3 stories, including: Show HN: Mini-swe-agent achieves 65% on SWE-bench in 100 lines of python
🧠 Transformer Architecture · Martin Alderson · 1 day ago
Expert-aware quantisation: near-Q4 quality at near-Q2 size?
Discussed on: Hacker News