Scour
⚡ Inference
Specific
LLM inference, vLLM, model serving, inference optimization
Scoured 357 posts in 19.5 ms
DiffusionGemma: 4x Faster Text Generation
🧠 LLMs · Content type: News, Blog
blog.google · 2 hours ago · Hacker News, r/LocalLLaMA, r/singularity
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
🧠 LLMs · Content type: Code
github.com · 3 days ago · r/LocalLLaMA
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
🎛️ Fine-tuning · Content type: Blog
mimo.xiaomi.com · 2 days ago · Hacker News, r/LocalLLaMA
google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
🧠 LLMs
huggingface.co · 2 days ago · r/LocalLLaMA
Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work
🎛️ Fine-tuning · Content type: Blog, Discussion
tildalice.io · 5 days ago
China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)
🌐 Open Source AI · Content type: News
decrypt.co · 1 day ago
MLPerf and the rise of latency-aware LLM benchmarking
🧠 LLMs
edn.com · 5 days ago
TFLite Edge Model Quantizer Snippet
🧠 LLMs
itsevilduck.gumroad.com · 2 days ago · DEV
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
🎛️ Fine-tuning · Content type: Code
github.com · 2 hours ago · Hacker News
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🧠 LLMs
local-llm.utop.workers.dev · 3 days ago · Hacker News
UniSVQ: 2-bit Unified Scalar-Vector Quantization
🎛️ Fine-tuning · Content type: Academic
arxiv.org · 14 hours ago
On-device AI is a margin decision
🌐 Open Source AI · Content type: Blog
ziraph.com · 42 minutes ago · Hacker News
Youssof Altoukhi (@Youssofal_)
🧠 LLMs
xcancel.com · 2 days ago · r/LocalLLaMA
The latest Gemma 4 models use a training trick to slash their on-device memory footprint
🌐 Open Source AI
androidauthority.com · 4 days ago
Show HN: Ext-Infer
🧠 LLMs
infer.displace.tech · 3 days ago · Hacker News
What's in the Box? A Field Guide to AI Models
🌐 Open Source AI · Content type: Blog
iankduncan.com · 1 day ago
OpenCV 5.0 Computer Vision Library Released with Rewritten DNN Engine
🌐 Open Source AI
linuxiac.com · 2 days ago
A field journal on Ray Data and Daft for multimodal data lake (14 minute read)
🧠 LLMs · Content type: Blog
mehulbatra.medium.com · 6 days ago
AMD's Lemonade SDK For Local AI Adds NVIDIA CUDA Support
🌐 Open Source AI
phoronix.com · 2 hours ago
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
🧠 LLMs · Content type: Code
github.com · 6 days ago · Hacker News