Quantization of LLMs
Scoured 76 posts in 6.6 ms
A system programmer’s guide to LLM inference
🔧 Systems-level optimizations for LLM serving
Content type: Blog
blog.xiangpeng.systems · 3 days ago · Hacker News
Quality Is Not a Safety Proxy Under Quantization
✨ Model optimizations in LLMs
Content type: Academic
arxiv.org · 1 day ago
Less-relevant results
Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit
🚀 LLM serving frameworks
huggingface.co · 1 hour ago · r/LocalLLaMA
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
📊 AI Performance Profiling
local-llm.utop.workers.dev · 4 days ago · Hacker News
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB
🚀 LLM serving frameworks
Content type: Blog
ziraph.com · 6 days ago · Hacker News
Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss
✨ Model optimizations in LLMs
Content type: News
digg.com · 6 days ago
BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster
🔧 Systems-level optimizations for LLM serving
sleepingrobots.com · 4 days ago
Apple rebuilt its on-device AI stack at WWDC 2026
📊 AI Performance Profiling
Content type: Blog
ziraph.com · 2 days ago · Hacker News
defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes
🧠 Large Language Models (LLMs)
Content type: Code
github.com · 1 day ago · Hacker News
Gemma 4 12B: A unified, encoder-free multimodal model
🚀 LLM serving frameworks
Content type: Discussion
news.ycombinator.com · 4 days ago · Hacker News
Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs
✨ Model optimizations in LLMs
Content type: Academic
arxiv.org · 19 hours ago
Week Links [1st June 2026]
✨ Model optimizations in LLMs
jackharrington.xyz · 4 days ago
Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
🧠 Large Language Models (LLMs)
Content type: News, Blog
developer.nvidia.com · 3 days ago
The latest Gemma 4 models use a training trick to slash their on-device memory footprint
✨ Model optimizations in LLMs
androidauthority.com · 6 days ago
ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization
✨ Model optimizations in LLMs
Content type: Academic
arxiv.org · 2 days ago
[AINews] Open Models, Model Labs vs Agent Labs, and What's Untrainable — Sarah Guo
🧠 Large Language Models (LLMs)
Content type: News
latent.space · 20 hours ago
Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent
🧠 Large Language Models (LLMs)
Content type: Blog
dnhkng.github.io · 3 days ago
I Processed 2.4 Billion Tokens Across 52 AI Models for $0.52. Here's the Full Breakdown.
🤖 Agents using LLMs
saintlex.sbs · 20 hours ago · DEV
Ideogram4 GGUF is out!
💬 Prompt optimizations for LLM serving
huggingface.co · 4 days ago · r/StableDiffusion
6. Air-Gapped Claude Code - The Claude Code SRE Handbook
🚀 LLM serving frameworks
har-ki.github.io · 7 hours ago · Hacker News