🤖 Inference
Model Serving, Quantization, vLLM, TensorRT
Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good
🧠 LLMs · Content type: Blog · towardsai.net · 2 days ago
146th airhacks tv: Rust, Java 25, AI Agents, BCE, Web Components, zunit, zb
🧠 LLMs · Content type: Blog · adambien.blog · 11 hours ago
DeskDash - a free Windows tool to easily manage your GGUF files
🧠 LLMs · gerry7.itch.io · 3 days ago · r/LocalLLaMA
Here's a llama.cpp CLI Command builder.
⚙️ Systems Programming · llamabuilding.com · 1 day ago · r/LocalLLaMA
Speculators v0.5.0: DFlash support and online training
🧠 LLMs · developers.redhat.com · 6 days ago
Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
🧠 LLMs · Content type: Academic · arxiv.org · 10 hours ago
TFLite Edge Model Quantizer Snippet
🤖 AI · itsevilduck.gumroad.com · 1 day ago · DEV
Show HN: Run Llama.cpp In-Process from Java with Project Panama FFM
🧠 LLMs · deemwar-products.github.io · 5 days ago · Hacker News
Train Models Faster with JAX and MaxText Using NVFP4 on NVIDIA Blackwell
🧠 LLMs · Content type: News, Blog · developer.nvidia.com · 1 day ago
Google DeepMind releases Gemma 4 QAT, but Unsloth developer Daniel Han warns naive llama.cpp conversions suffer accuracy loss
🏗️ MLSys · Content type: News · digg.com · 4 days ago
google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
🧠 LLMs · huggingface.co · 2 days ago · r/LocalLLaMA
The latest Gemma 4 models use a training trick to slash their on-device memory footprint
🧠 LLMs · androidauthority.com · 4 days ago
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
🛠️ Compilers · Content type: Blog · mimo.xiaomi.com · 2 days ago · Hacker News, r/LocalLLaMA
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
🧠 LLMs · Content type: Code · github.com · 5 days ago · Hacker News
PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference
⚙️ Systems Programming · Content type: Blog · medium.com · 1 day ago
Apples to Apples: MLX vs. Llama.cpp for Gemma 4 12B on an M1 16GB
🔧 Hardware · Content type: Blog · ziraph.com · 4 days ago · Hacker News
Token4Token — pay-per-token inference on Gnosis + Swarm
🧠 LLMs · t4t.eth.link · 1 day ago · Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🛠️ Compilers · local-llm.utop.workers.dev · 3 days ago · Hacker News
Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work
🤖 AI · Content type: Blog, Discussion · tildalice.io · 4 days ago
Optimal Post-Training Quantization Scales and Where to Find Them
🧠 LLMs · Content type: Academic · arxiv.org · 10 hours ago