Quantization of LLMs

A system programmer’s guide to LLM inference

 🔧Systems-level optimizations for LLM serving  Content type: Blog

Quality Is Not a Safety Proxy Under Quantization

 Model optimizations in LLMs  Content type: Academic
arxiv.org·
Less-relevant results

Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit

 🚀LLM serving frameworks

Running Qwen 35B MoE at 450k Context on a Single 32GB GPU

 📊AI Performance Profiling

Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB

 🚀LLM serving frameworks  Content type: Blog
ziraph.com··Hacker News

Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss

 Model optimizations in LLMs  Content type: News
digg.com·

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster

 🔧Systems-level optimizations for LLM serving
sleepingrobots.com·

Apple rebuilt its on-device AI stack at WWDC 2026

 📊AI Performance Profiling  Content type: Blog
ziraph.com··Hacker News

defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes

 🧠Large Language Models (LLMs)  Content type: Code
github.com··Hacker News

Gemma 4 12B: A unified, encoder-free multimodal model

 🚀LLM serving frameworks  Content type: Discussion

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

 Model optimizations in LLMs  Content type: Academic
arxiv.org·

Week Links [1st June 2026]

 Model optimizations in LLMs
jackharrington.xyz·

Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell

 🧠Large Language Models (LLMs)  Content type: News, Blog
developer.nvidia.com·

The latest Gemma 4 models use a training trick to slash their on-device memory footprint

 Model optimizations in LLMs
androidauthority.com·

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

 Model optimizations in LLMs  Content type: Academic
arxiv.org·

[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo

 🧠Large Language Models (LLMs)  Content type: News
latent.space·

Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent

 🧠Large Language Models (LLMs)  Content type: Blog
dnhkng.github.io·

I Processed 2.4 Billion Tokens Across 52 AI Models for $0.52. Here's the Full Breakdown.

 🤖Agents using LLMs
saintlex.sbs··DEV

6. Air-Gapped Claude Code - The Claude Code SRE Handbook

 🚀LLM serving frameworks
