Quantization
⚡ Quantization
Model Compression, INT8, Weight Quantization
Scoured 57 posts in 8.3 ms
UniSVQ: 2-bit Unified Scalar-Vector Quantization
📊 Vector Quantization
Content type: Academic
arxiv.org · 19 hours ago
[AINews] not much happened today
📉 Technical Analysis
Content type: News
latent.space · 4 days ago
mtmd : add video input support by ngxson · Pull Request #24269 · ggml-org/llama.cpp
🎮 Godot
Content type: Code
github.com · 2 days ago · r/LocalLLaMA
Trainable Smooth-Rotation Transforms with Learned Channel Scales for LLM Quantization
🎛️ Fine-tuning
Content type: Academic
arxiv.org · 19 hours ago
The Edge LLM Offload Story
🤖 AI
semiengineering.com · 6 days ago
LC-QAT: Data-Efficient 2-Bit QAT for LLMs via Linear-Constrained Vector Quantization
📊 Vector Quantization
Content type: Academic
arxiv.org · 19 hours ago
Show HN: Ext-Infer
💬 LLMs
infer.displace.tech · 3 days ago · Hacker News
Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
💬 LLMs
Content type: Academic
arxiv.org · 1 day ago
mtp: support for gemma-4 E2B and E4B assistants by max-krasnyansky · Pull Request #24282 · ggml-org/llama.cpp
💬 LLMs
Content type: Code
github.com · 2 days ago · r/LocalLLaMA
Google Gemma 4 12B: Architecture, Benchmarks, Access, and Hands-on Guide for Developers
💬 LLMs
Content type: Blog
analyticsvidhya.com · 5 days ago
FAIR-Calib: Frontier-Aware Instability-Reweighted Calibration for Post-Training Quantization of Diffusion Large Language Models
🎛️ Fine-tuning
Content type: Academic
arxiv.org · 2 days ago
Less-relevant results
[AINews] Reve 2 and Ideogram 4: Layouts in Imagegen
🎮 Reinforcement Learning
latent.space · 6 days ago
On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation
📊 Vector Quantization
Content type: Academic
arxiv.org · 1 day ago
defai-digital/ax-engine: Apple Silicon LLM runtime supporting Gemma 4 and Qwen 3.6 MTP modes
🤖 AI
Content type: Code
github.com · 22 hours ago · Hacker News
not much happened today | AINews
🔬 Anthropic
news.smol.ai · 6 days ago
ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization
💬 LLMs
Content type: Academic
arxiv.org · 1 day ago
bigattichouse/packed-twin-inference: PTI achieves ~2× throughput using a single quantized model (Q5_K_M or better) by running 4 generation streams in one batched decode call. The GPU loads model weights once per step and produces 4 predictions simultaneously. KV cache overhead is ~0.8 GiB total for all 4 streams. No draft model. No quality loss
💬 LLMs
Content type: Code
github.com · 1 day ago · r/LocalLLaMA
Sigma-Branch: Hierarchical Single-Path Network Reconstruction for Dynamic Inference with Reduced Active Parameters
🤖 AI
Content type: Academic
arxiv.org · 19 hours ago
FADA: Accessible fetal ultrasound interpretation and annotation with a selectively distilled unified vision-language model
🧠 Deep Learning
Content type: Academic
arxiv.org · 19 hours ago
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
🤖 AI
Content type: Code
github.com · 4 days ago · Hacker News