Inference Optimization
Specific
Quantization, Model Compression, KV Cache, Speculative Decoding
Scoured 296 posts in 6.2 ms

What's in the Box? A Field Guide to AI Models
📐 Linear Algebra · Content type: Blog · iankduncan.com · 2 days ago

How we fight GPU scarcity without compromise
💾 KV Cache · Content type: Blog · equixly.com · 5 days ago · Hacker News

DiffusionGemma: The Developer Guide
💾 KV Cache · Content type: Blog · developers.googleblog.com · 1 day ago

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization
⚡ FlashAttention · Content type: Academic · arxiv.org · 3 hours ago

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks
🔲 TPU Architecture · aarushgupta.io · 1 day ago · Lobsters, Hacker News

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving
💾 KV Cache · Content type: Code · github.com · 42 minutes ago · Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP
💾 KV Cache · Content type: Blog · dnhkng.github.io · 3 days ago

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!
🔄 Transformers · gizchina.com · 2 days ago

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss
🔥 PyTorch Internals · Content type: News · digg.com · 5 days ago

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
⚡ CUDA · Content type: News, Blog · developer.nvidia.com · 2 days ago

Making LLMs faster and more efficient across multiple languages
🔧 MLIR · techxplore.com · 6 days ago

TFLite Edge Model Quantizer Snippet
🔥 PyTorch Internals · itsevilduck.gumroad.com · 2 days ago · DEV

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
🎭 Mixture of Experts · Content type: Blog · mimo.xiaomi.com · 3 days ago · Hacker News, r/LocalLLaMA

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
💾 KV Cache · huggingface.co · 2 days ago · r/LocalLLaMA

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🔄 Transformers · deemwar-products.github.io · 5 days ago · Hacker News

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB
📊 LLM Evaluation · Content type: Blog · ziraph.com · 5 days ago · Hacker News

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
💾 KV Cache · Content type: Academic · arxiv.org · 3 hours ago

A system programmer’s guide to LLM inference
🎭 Mixture of Experts · Content type: Blog · blog.xiangpeng.systems · 3 days ago · Hacker News

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.
💾 KV Cache · Content type: Code · github.com · 1 day ago

Where to Host Your Open-Source Model (Under 10B Parameters)
💾 KV Cache · digitalocean.com · 6 days ago