Inference Optimization

What's in the Box? A Field Guide to AI Models

 📐Linear Algebra  Content type: Blog
iankduncan.com

How we fight GPU scarcity without compromise

 💾KV Cache  Content type: Blog
equixly.com · Hacker News

DiffusionGemma: The Developer Guide

 💾KV Cache  Content type: Blog

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

 FlashAttention  Content type: Academic
arxiv.org

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

 🔲TPU Architecture

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving

 💾KV Cache  Content type: Code
github.com · Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

 💾KV Cache  Content type: Blog
dnhkng.github.io

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

 🔄Transformers
gizchina.com

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

 🔥PyTorch Internals  Content type: News
digg.com

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

 CUDA  Content type: News, Blog
developer.nvidia.com

Making LLMs faster and more efficient across multiple languages

 🔧MLIR
techxplore.com

TFLite Edge Model Quantizer Snippet

 🔥PyTorch Internals

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

 🎭Mixture of Experts  Content type: Blog

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

 💾KV Cache

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

 🔄Transformers

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

 📊LLM Evaluation  Content type: Blog
ziraph.com · Hacker News

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

 💾KV Cache  Content type: Academic
arxiv.org

A system programmer’s guide to LLM inference

 🎭Mixture of Experts  Content type: Blog

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

 💾KV Cache  Content type: Code
github.com

Where to Host Your Open-Source Model (Under 10B Parameters)

 💾KV Cache
digitalocean.com
