Model Quantization

GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)

 🧩LLM Integration

Qwen 3.6 27B AutoRound GGUF, need your feedback

 🦙Ollama

ComfyUI NVFP4 in 2026: 3× Faster Image Generation on RTX 50-Series (and the Right Format for RTX 40-Series)

 🧩LLM Integration  Content type: Blog
dev.to··DEV

Ollama 0.30 GPU Boost: Faster local Qwen inference on NVIDIA

 🧩LLM Integration
everylocalai.com··DEV

Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support

 🦙Ollama
alternativeto.net

MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better

 🧩LLM Integration  Content type: News  Content type: Blog

DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200

 🧩LLM Integration  Content type: News

Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script

 🧩LLM Integration  Content type: Code
github.com··Hacker News

Doubling Qwen3.6-27B on One RTX 3090: ollama llama.cpp + MTP, Lever by Lever (35.7 to 80.2 tok/s)

 🦙Ollama  Content type: Blog
dev.to··DEV
Less-relevant results

KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.

 🦙Ollama  Content type: Code
github.com··Hacker News

local llm on laptop 780M GPU using llama + gemma 4 qat

 🦙Ollama  Content type: Blog
alper.bearblog.dev

Unsloth Gemma 4 QAT

 🦙Ollama
unsloth.ai

Here's a llama.cpp CLI Command builder.

 🦙Ollama

DeskDash - a free Windows tool to easily manage your GGUF files

 🧩LLM Integration

bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss

 🦙Ollama  Content type: Code
github.com··r/LocalLLaMA

Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM

 🦙Ollama

How to Tune llama.cpp --n-gpu-layers: A Practical VRAM Guide (2026)

 🧩LLM Integration  Content type: Blog
dev.to··DEV

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 🧩LLM Integration

Why Quantized Models and Distilled Models Run Differently on Your Computer

 📱Edge AI  Content type: Blog
medium.com

How LLM Quantization Works: INT8, INT4, GPTQ, and AWQ Explained

 🧩LLM Integration
pub.towardsai.net
