📉 Model Quantization - minezone

ComfyUI NVFP4 in 2026: 3× Faster Image Generation on RTX 50-Series (and the Right Format for RTX 40-Series)

🧩LLM Integration Blog

dev.to··DEV

Ollama 0.30 GPU Boost: Faster local Qwen inference on NVIDIA

🧩LLM Integration

everylocalai.com··DEV

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

🦙Ollama

alternativeto.net

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

🧩LLM Integration News Blog

kaitchup.substack.com··r/LocalLLaMA

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

🧩LLM Integration News

newsletter.semianalysis.com··Hacker News

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

🧩LLM Integration Code

github.com··Hacker News

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever (35.7 → 80.2 tok/s)

🦙Ollama Blog

dev.to··DEV

Less-relevant results

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

🦙Ollama Code

github.com··Hacker News

local llm on laptop 780M GPU using llama + gemma 4 qat

🦙Ollama Blog

alper.bearblog.dev

Unsloth Gemma 4 QAT

🦙Ollama

unsloth.ai

Here's a llama.cpp CLI Command builder.

🦙Ollama

llamabuilding.com··r/LocalLLaMA

DeskDash - a free Windows tool to easily manage your GGUF files

🧩LLM Integration

gerry7.itch.io··r/LocalLLaMA

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

🦙Ollama Code

github.com··r/LocalLLaMA

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

🦙Ollama

deemwar-products.github.io··Hacker News

How to Tune llama.cpp --n-gpu-layers: A Practical VRAM Guide (2026)

🧩LLM Integration Blog

dev.to··DEV

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

🧩LLM Integration

local-llm.utop.workers.dev··Hacker News

Why Quantized Models and Distilled Models Run Differently on Your Computer

📱Edge AI Blog

medium.com

How LLM Quantization Works: INT8, INT4, GPTQ, and AWQ Explained

🧩LLM Integration

pub.towardsai.net

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

Qwen 3.6 27B AutoRound GGUF, need your feedback
