LLM Inference
Specific
Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 202 posts in 6.6 ms
Ollama 0.30 delivers faster NVIDIA GPU performance and wider hardware support
🤖 LLM · alternativeto.net · 2 days ago
Pruned YOLOv8 ONNX INT8 Fails: 3 Fixes That Work
🤖 AI · Content type: Blog, Discussion · tildalice.io · 5 days ago
ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
🤖 Agents · Content type: Academic · arxiv.org · 23 hours ago
Nvidia DGX Spark GB10 – AI Models and Guide with vLLM and Autonomous Script
⚡ Vllm · Content type: Code · github.com · 5 days ago · Hacker News
Building & Benchmarking: LLMs on a 16GB Jetson Orin NX for Hermes Agent
🤖 LLM · Content type: Blog · dnhkng.github.io · 2 days ago
Massive AI Storage Demand Creates a New Memory Wall
🤖 LLM · Content type: News · eetimes.com · 13 hours ago
Gemma 4 12B: A unified, encoder-free multimodal model
🤖 LLM · Content type: Discussion · news.ycombinator.com · 3 days ago · Hacker News
Breaking the Ice: Analyzing Cold Start Latency in vLLM
🤖 LLM · Content type: Academic · arxiv.org · 2 days ago · Hacker News
Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
🤖 Agents · Content type: Blog · tilert.ai · 2 days ago · Hacker News
MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS
⚡ Vllm · Content type: Blog · mimo.xiaomi.com · 3 days ago · Hacker News, r/LocalLLaMA
OpenCV 5 Debuts with Improved ONNX Support and Native AI Upgrades
🤖 AI · Content type: News · hackster.io · 12 hours ago
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
🤖 LLM · Content type: Code · github.com · 4 days ago · r/LocalLLaMA
Where to Host Your Open-Source Model (Under 10B Parameters)
🤖 LLM · digitalocean.com · 6 days ago
Google Shrank Gemma 4 by 72% and Unsloth Fixed the 4-Bit Bug Nobody Else Caught on One 4090, and 4-Bit Shouldn’t Be This Good
🤖 LLM · Content type: Blog · towardsai.net · 2 days ago
IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference
🤖 Agents · Content type: Academic · arxiv.org · 23 hours ago
DeepSeek V4, LeCun's Bet Against LLMs, and Lovable's Self-Improving Agent - The Tokenizer Edition #30
🤖 LLM · newsletter.artofsaience.com · 6 days ago
MoQ GGUFs and GSQ: Low-Bit GGUFs Are About to Get Much Better
🤖 LLM · Content type: News, Blog · kaitchup.substack.com · 5 days ago · r/LocalLLaMA
Report: GKE Inference Gateway delivers up to 92% faster AI responses
🤖 LLM · Content type: Blog · cloud.google.com · 2 days ago · Hacker News
huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
🤖 LLM · Content type: Code · github.com · 6 days ago · Hacker News
Anatomy of a high-performance EP kernel
⚡ Vllm · Content type: Blog · fergusfinn.com · 1 day ago · Hacker News