⚡ Inference Optimization - touyou · Scour

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🤖LLM Inference Blog

dnhkng.github.io

Less-relevant results

TFLite Edge Model Quantizer Snippet

🤖LLM Inference

itsevilduck.gumroad.com · DEV

Domain-Specific Small Language Models (Manning)

🤖LLM Inference

i-programmer.info

Efficient and accurate neural-field reconstruction using resistive memory

👁️Multimodal LLMs Academic

Unsloth Gemma 4 QAT

🤖LLM Inference

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

🤖LLM Inference Blog

towardsai.net

A system programmer’s guide to LLM inference

🤖LLM Inference Blog

blog.xiangpeng.systems · Hacker News

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

🤖LLM Inference News

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

⚙️AI Infrastructure

huggingface.co · r/LocalLLaMA

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🤖LLM Inference Academic

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

🤖LLM Inference Code

github.com · Hacker News

Anthropic's most powerful model comes with a kill switch aimed at you

🔄Agentic Systems

boingboing.net

Google releases Gemma 4 QAT models for local AI on enterprise laptops

⚙️AI Infrastructure

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

🤖LLM Inference Blog

How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops

⚙️AI Infrastructure Video

Token4Token — pay-per-token inference on Gnosis + Swarm

⚙️AI Infrastructure

t4t.eth.link · Hacker News

Optimal Post-Training Quantization Scales and Where to Find Them

🎯Post-Training Academic

China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)

🤖LLM Inference News

Re-quantizing a local LLM 14x faster by skipping the tensors that didn't change

🤖LLM Inference News Blog

andreaborio.substack.com · Substack

OpenAI govt stake 🇺🇸, Google compute deal 🚀, Microsoft Scout launch 🤖

⚙️AI Infrastructure
