🤖 Inference - kelvinyu1117 · Scour

Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good

🧠LLMs Blog

towardsai.net

146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb

🧠LLMs Blog

adambien.blog

DeskDash - a free Windows tool to easily manage your GGUF files

gerry7.itch.io · r/LocalLLaMA

Here's a llama.cpp CLI Command builder.

⚙️Systems Programming

llamabuilding.com · r/LocalLLaMA

Speculators v0.5.0: DFlash support and online training

developers.redhat.com

Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

🧠LLMs Academic

TFLite Edge Model Quantizer Snippet

itsevilduck.gumroad.com · DEV

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

deemwar-products.github.io · Hacker News

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

🧠LLMs News Blog

developer.nvidia.com

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

🏗️MLSys News

google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation

huggingface.co · r/LocalLLaMA

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

androidauthority.com

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

🛠️Compilers Blog

mimo.xiaomi.com · Hacker News, r/LocalLLaMA

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

🧠LLMs Code

github.com · Hacker News

PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference

⚙️Systems Programming Blog

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

🔧Hardware Blog

ziraph.com · Hacker News

Token4Token — pay-per-token inference on Gnosis + Swarm

t4t.eth.link · Hacker News

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🛠️Compilers

local-llm.utop.workers.dev · Hacker News

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

🤖AI Blog Discussion

Optimal Post-Training Quantization Scales and Where to Find Them

🧠LLMs Academic
