⚡ Inference Optimization - moyutianzun

💾KV Cache Blog

equixly.com · Hacker News

DiffusionGemma: The Developer Guide

💾KV Cache Blog

developers.googleblog.com

NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization

⚡FlashAttention Academic

arxiv.org

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

🔲TPU Architecture

aarushgupta.io · Lobsters, Hacker News

NetX-lab/Frontier: Frontier: A Discrete-Event Simulator for Modern LLM Serving

💾KV Cache Code

github.com · Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

💾KV Cache Blog

dnhkng.github.io

Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!

🔄Transformers

gizchina.com

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

🔥PyTorch Internals News

digg.com

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

⚡CUDA News Blog

developer.nvidia.com

Making LLMs faster and more efficient across multiple languages

🔧MLIR

techxplore.com

TFLite Edge Model Quantizer Snippet

🔥PyTorch Internals

itsevilduck.gumroad.com · DEV

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

🎭Mixture of Experts Blog

mimo.xiaomi.com · Hacker News, r/LocalLLaMA

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

💾KV Cache

huggingface.co · r/LocalLLaMA

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

🔄Transformers

deemwar-products.github.io · Hacker News

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

📊LLM Evaluation Blog

ziraph.com · Hacker News

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

💾KV Cache Academic

arxiv.org

A system programmer’s guide to LLM inference

🎭Mixture of Experts Blog

blog.xiangpeng.systems · Hacker News

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.

💾KV Cache Code

github.com

Where to Host Your Open-Source Model (Under 10B Parameters)

💾KV Cache

digitalocean.com

What's in the Box? A Field Guide to AI Models

How we fight GPU scarcity without compromise
