Fast AI Inference
Cerebras, Groq, fast LLM tokens
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🧠 LLM Inference · Content type: Code · github.com · 2 days ago · Hacker News
Free vLLM Course: Inference, Compression, Benchmarks
🧠 Inference Serving · deeplearning.ai · 5 days ago · Hacker News, r/selfhosted
Breaking the Ice: Analyzing Cold Start Latency in vLLM
🏗️ LLM Infrastructure · Content type: Academic · arxiv.org · 19 hours ago
LLM Inference Handbook 2026
🤖 AI · pub.towardsai.net · 3 hours ago
Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
🏗️ LLM Infrastructure · Content type: Blog · tilert.ai · 6 hours ago · Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🤖 AI · local-llm.utop.workers.dev · 1 day ago · Hacker News
Fast and Efficient LLM Inference with vLLM: A New Course with Deeplearning.ai
🧠 Inference Serving · Content type: Blog · vllm.ai · 5 days ago · Hacker News
google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
🏗️ LLM Infrastructure · huggingface.co · 11 hours ago · r/LocalLLaMA
Why I care so much about energy per token
🤖 AI · Content type: Blog · ziraph.com · 1 day ago · Hacker News
NVIDIA releases Nemotron 3 Ultra, claiming five times the speed and 30 percent lower costs than prior models
The model delivers 300 tokens per second on benchmar...
🗄️ Web Datasets · digg.com · 4 days ago
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
🧩 MoE · Content type: Blog · mimo.xiaomi.com · 23 hours ago · Hacker News, r/LocalLLaMA
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
💾 Prompt Caching · Content type: Code · github.com · 1 day ago · r/LocalLLaMA
Making Local LLM Go Brrr
🤖 AI · seanpedersen.github.io · 5 days ago
Gemma 4 12B: A unified, encoder-free multimodal model
🤖 AI · Content type: Discussion · news.ycombinator.com · 1 day ago · Hacker News
Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
🤖 AI · Content type: Academic · arxiv.org · 19 hours ago
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🤖 AI · deemwar-products.github.io · 3 days ago · Hacker News
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
🤖 AI · vettedconsumer.com · 2 days ago · Hacker News
Experimenting with TPUs, GKE Managed DRANET, and Multi-cluster Inference Gateway
🌍 Distributed Systems · Content type: Blog · cloud.google.com · 6 days ago
KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
🏗️ LLM Infrastructure · Content type: Code · github.com · 1 day ago · Hacker News
How we fight GPU scarcity without compromise
🏗️ LLM Infrastructure · Content type: Blog · equixly.com · 3 days ago · Hacker News