Inference Optimization
Quantization, Model Compression, KV Cache, Speculative Decoding
Scoured 313 posts in 24.1 ms
Where to Host Your Open-Source Model (Under 10B Parameters)
💾 KV Cache · digitalocean.com · 6 days ago
NSVQ: Mitigating Codebook Collapse by Stabilizing Encoder Drift in Vector Quantization
⚡ FlashAttention · Academic · arxiv.org · 4 hours ago
BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster
⚡ CUDA · sleepingrobots.com · 4 days ago
Google releases Gemma 4 QAT models for local AI on enterprise laptops
🔲 TPU Architecture · 4sysops.com · 4 days ago
How to Run Gemma 4 12B Locally - The Best AI For Consumer Laptops
💾 KV Cache · Video · youtube.com · 6 days ago
Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry
↩️ Backpropagation · Academic · arxiv.org · 4 hours ago
Improved performance and model support with GGUF
🔄 Transformers · Blog · ollama.com · 6 days ago
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
💾 KV Cache · Code · github.com · 5 days ago · Hacker News
Youssof Altoukhi (@Youssofal_)
💾 KV Cache · xcancel.com · 3 days ago · r/LocalLLaMA
Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression
🎭 Mixture of Experts · Academic · arxiv.org · 2 days ago
Gemma 4 12B: A unified, encoder-free multimodal model
⚡ FlashAttention · Discussion · news.ycombinator.com · 3 days ago · Hacker News
UniSVQ: 2-bit Unified Scalar-Vector Quantization
🔧 MLIR · Academic · arxiv.org · 1 day ago
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
💾 KV Cache · Code · github.com · 6 days ago · Hacker News
VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
📊 LLM Evaluation · Academic · arxiv.org · 4 hours ago
Quality Is Not a Safety Proxy Under Quantization
🔄 Transformers · Academic · arxiv.org · 1 day ago
fix(memory): move local llama.cpp runtime to provider plugin · openclaw/openclaw@3137110
🔧 MLIR · Code · github.com · 2 days ago
Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs
⚡ CUDA · Academic · arxiv.org · 4 hours ago
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
💾 KV Cache · Code · github.com · 4 days ago · Hacker News
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
💾 KV Cache · Academic · arxiv.org · 2 days ago
Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
💾 KV Cache · Academic · arxiv.org · 4 hours ago