⚡ Quantization - zongyuzhang · Scour

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

🔓Open-source Models

vettedconsumer.com · Hacker News

Quality Is Not a Safety Proxy Under Quantization

🔓Open-source Models Academic

Ollama 0.30 GPU Boost: Faster local Qwen inference on NVIDIA

🔓Open-source Models

everylocalai.com · DEV

lightmetal: GPU LLM Inference From a Single Java 25 JAR

🧠LLMs Blog

adambien.blog

alexziskind1/model-shelf: Model Shelf is a local-first model resolver that helps AI agents and scripts find model weights on your own storage before downloading from Hugging Face. Point it at an internal SSD, NAS, external SSD, or Thunderbolt DAS, and it returns the best local path for GGUF, MLX, safetensors, Ollama, vLLM, and other local AI workflows.

🔓Open-source Models Code

DiffusionGemma 26B A4B results on my 5090

🔓Open-source Models

huggingface.co · r/LocalLLaMA

Improved performance and model support with GGUF

🧠LLMs Blog

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🔓Open-source Models News Blog

kaitchup.substack.com · r/LocalLLaMA

Less-relevant results

MTP Isn't Always a Win: 1.95x on My 3090, but Speculative Decoding Is Hardware-Dependent

🧠LLMs Blog

bric.pe.kr · DEV

Orchestrate your LLM pipeline. Locally

🔓Open-source Models

llmforge.app · Hacker News

DeskDash - a free Windows tool to easily manage your GGUF files

gerry7.itch.io · r/LocalLLaMA

local llm on laptop 780M GPU using llama + gemma 4 qat

🧠LLMs Blog

alper.bearblog.dev

Unsloth Gemma 4 QAT

🔓Open-source Models

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

🖥️Inference Compute Academic

146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb

🕵️AI Agents Blog

adambien.blog

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

deemwar-products.github.io · Hacker News

Qwen 3.6 27B AutoRound GGUF, need your feedback

🔓Open-source Models

huggingface.co · r/LocalLLaMA

A system programmer’s guide to LLM inference

🔓Open-source Models Blog

blog.xiangpeng.systems · Hacker News

2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

🧠LLMs Blog

dnhkng.github.io

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

alternativeto.net
