⚡ Inference
LLM inference, model serving, vLLM, TensorRT, latency
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
🧠
LLMs
Content type:
Blog
mimo.xiaomi.com
·
2d
2 days ago
·
Hacker News
,
r/LocalLLaMA
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
📊
AI Models
vettedconsumer.com
·
4d
4 days ago
·
Hacker News
KaiFelixBennett/gemma4-turboquant-rdna4: Run Gemma-4-31B at full 256K context on a $1,400 AMD RDNA4 GPU (gfx1201): TurboQuant KV cache + HIP-graph-safe Flash-Attention for llama.cpp, fully measured on real hardware.
🧠
LLMs
Content type:
Code
github.com
·
4h
4 hours ago
·
Hacker News
RKSC: Reasoning-Aware KV Cache Sharing and Confident Early Exit for Multi-Step LLM Inference
🧠
LLMs
Content type:
Academic
arxiv.org
·
16h
16 hours ago
China's Xiaomi MiMo Is Now 15X Faster Than ChatGPT and Claude (4 minute read)
🤖
AI
Content type:
News
decrypt.co
·
1d
1 day ago
Running Qwen 35B MoE at 450k Context on a Single 32GB GPU
🤖
AI
local-llm.utop.workers.dev
·
3d
3 days ago
·
Hacker News
Unsloth Gemma 4 QAT
🤖
AI
unsloth.ai
·
5d
5 days ago
Here's a llama.cpp CLI Command builder.
🧠
LLMs
llamabuilding.com
·
1d
1 day ago
·
r/LocalLLaMA
Domain-Specific Small Language Models (Manning)
🤖
AI
i-programmer.info
·
5h
5 hours ago
What's in the Box? A Field Guide to AI Models
📊
AI Models
Content type:
Blog
iankduncan.com
·
1d
1 day ago
Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work
🤖
AI
Content type:
Blog, Discussion
tildalice.io
·
5d
5 days ago
google/gemma-4-31B-it · fix: chat template — null handling, reasoning preservation, turn-tag balance, input validation
🧠
LLMs
huggingface.co
·
2d
2 days ago
·
r/LocalLLaMA
The latest Gemma 4 models use a training trick to slash their on-device memory footprint
🤖
AI
androidauthority.com
·
5d
5 days ago
Ultrafast machine learning on FPGAs via Kolmogorov-Arnold Networks
🤖
AI
aarushgupta.io
·
1d
1 day ago
·
Lobsters
,
Hacker News
The hidden bottleneck in LLM inference and the impact on MLPerf benchmarking
🧠
LLMs
edn.com
·
6d
6 days ago
AI Serving Platform That Adapts to Your Model
🔧
MLOps
Content type:
Blog
databricks.com
·
4h
4 hours ago
Xiaomi MiMo-V2.5-Pro Just Hit 1,000 Tokens Per Second!
🤖
AI Agents
gizchina.com
·
1d
1 day ago
Running LLM Inference on Kubernetes: What It Actually Takes
🧠
LLMs
Content type:
Blog
fairwinds.com
·
5d
5 days ago
libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.
🤖
AI
Content type:
Code
github.com
·
20h
20 hours ago
UniSVQ: 2-bit Unified Scalar-Vector Quantization
🧠
LLMs
Content type:
Academic
arxiv.org
·
16h
16 hours ago