⚡ Inference Optimization - touyou · Scour

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency

🤖LLM Inference News Blog

blog.google··Hacker News

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

🤖LLM Inference Academic

Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%

⚙️AI Infrastructure

zozo123.github.io··Hacker News

Qwen 3.6 27B AutoRound GGUF, need your feedback

🤖LLM Inference

huggingface.co··r/LocalLLaMA

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.

🤖LLM Inference Code

github.com··Hacker News

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🤖LLM Inference News

newsletter.semianalysis.com··Hacker News

Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work

🤖LLM Inference Blog Discussion

HNSW vs LSH: How Elasticsearch hits 0.99 recall@10 at 15,000 QPS — and what it costs

🔍Retrieval-Augmented Generation Blog

Less-relevant results

AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support

🤖LLM Inference

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

🤖LLM Inference

androidauthority.com·

What's in the Box? A Field Guide to AI Models

🤖LLM Inference Blog

iankduncan.com·

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🤖LLM Inference

vettedconsumer.com··Hacker News

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

🤖LLM Inference Blog

blogs.nvidia.com·

Speculators v0.5.0: DFlash support and online training

🤖LLM Inference

developers.redhat.com·

Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks

🤖LLM Inference

aarushgupta.io··Lobsters, Hacker News

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🤖LLM Inference News Blog

kaitchup.substack.com··r/LocalLLaMA

UniSVQ: 2-bit Unified Scalar-Vector Quantization

🤖LLM Inference Academic

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

🤖LLM Inference News Blog

developer.nvidia.com·

DiffusionGemma: The Developer Guide

🎯Post-Training Blog

developers.googleblog.com·

MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS

🤖LLM Inference Blog

mimo.xiaomi.com··Hacker News, r/LocalLLaMA
