LLM Inference
inference serving, vLLM, TensorRT, model serving, token generation
MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent
🤖 LLMs · Content type: Blog · bric.pe.kr · 3 days ago · DEV
Less-relevant results
Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks
📈 Performance Engineering · aarushgupta.io · 2 days ago · Lobsters, Hacker News
Quantization Was Never About the Bits
🤖 LLMs · Content type: Blog · medium.com · 9 hours ago
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
🤖 LLMs · Content type: News, Blog · kaitchup.substack.com · 6 days ago · r/LocalLLaMA
DiffusionGemma: Discrete diffusion in a large language model
✍️ Prompt Engineering · idlemachines.co.uk · 8 hours ago · Hacker News
Intelligent inference scheduling with llm-d on Red Hat AI
✍️ Prompt Engineering · developers.redhat.com · 1 day ago
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
🤖 LLMs · vettedconsumer.com · 5 days ago · Hacker News
How To Start Building Edge-Native AI
📈 Performance Engineering · semiengineering.com · 23 hours ago
AI Serving Platform That Adapts to Your Model
📈 Performance Engineering · Content type: Blog · databricks.com · 1 day ago
Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit
🤖 LLMs · huggingface.co · 8 hours ago · r/LocalLLaMA
HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS, and what it costs
🧮 Vector Databases · Content type: Blog · elastic.co · 3 days ago
Optimal Post-Training Quantization Scales and Where to Find Them
🤖 LLMs · Content type: Academic · arxiv.org · 2 days ago
Model2vec-zig: static text embeddings in pure Zig, in a single binary
🤖 LLMs · ziggit.dev · 10 hours ago
The economics of speculative decoding
📈 Performance Engineering · Content type: Blog · fergusfinn.com · 4 days ago · Hacker News
vLLM Transformers Backend: Bridging Hugging Face Compatibility and High-Performance Inference
🤖 LLMs · Content type: Blog · odsc.medium.com · 8 hours ago
2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP
🤖 LLMs · Content type: Blog · dnhkng.github.io · 4 days ago
DiffusionGemma: The Developer Guide
🤖 LLMs · Content type: Blog · developers.googleblog.com · 2 days ago · Hacker News
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
📈 Performance Engineering · Content type: Blog · mimo.xiaomi.com · 4 days ago · Hacker News, r/LocalLLaMA
AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support
🤖 LLMs · phoronix.com · 1 day ago · r/artificial
Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss
🤖 LLMs · Content type: News · digg.com · 6 days ago