vLLM
Scoured 112 posts in 6.5 ms

Report: GKE Inference Gateway delivers up to 92% faster AI responses
LLM · Blog · cloud.google.com · 2 days ago · Hacker News

WEKA software speeds long context AI inferencing on Oracle’s public cloud
Agents · News · blocksandfiles.com · 11 hours ago

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
LLM Inference · Academic · arxiv.org · 1 day ago

What Arm-based innovations happened in May 2026?
LLM · Blog · newsroom.arm.com · 5 days ago

High Bandwidth Flash | A New Memory for AI Data Centers and Edge Computing | Sandisk
LLM Inference · ncnonline.net · 1 day ago

Gemma 4 QAT models: Optimizing model compression for mobile and laptop efficiency
LLM · News, Blog · blog.google · 5 days ago · Hacker News

libertywing/FlashMemory-Deepseek-V4: FlashMemory DS-V4 Retriever: a lightweight retriever that sparsifies DeepSeek-V4 CSA KV-cache. Weights available on Hugging Face.
LLM Inference · Code · github.com · 1 day ago

Where to Host Your Open-Source Model (Under 10B Parameters)
LLM · digitalocean.com · 6 days ago

Claude Fable 5 🚀, Gemini 3.5 Live Translate 📱, scaling test time compute 📈
LLM Inference · tldr.tech · 1 day ago

ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models
Agents · Academic · arxiv.org · 22 hours ago

OpenCV 5.0 Computer Vision Library Released with Rewritten DNN Engine
LLM Inference · linuxiac.com · 2 days ago

OpenCV 5 release - New DNN engine with enhanced ONNX and LLM/VLM support, Intel, Arm, and RISC-V hardware optimizations - CNX Software
LLM · News · cnx-software.com · 22 hours ago

Two Leaps to 1000 Tokens/s on a 1T-Parameter Model: On Inference Systems, Execution Boundaries, and Co-Design
LLM Inference · Blog · tilert.ai · 2 days ago · Hacker News

IntentKV: Cross-Turn Intent-Aware KV Cache Pruning for Agent Inference
Agents · Academic · arxiv.org · 22 hours ago

From GPU to Token: The 8-Layer Observability Stack for AI Infrastructure
LLM · Blog · jimmysong.io · 1 day ago

huawei-csl/KVarN: KVarN is a native vLLM KV-cache quantization backend for your agents: 3-5x more context, throughput above FP16, and FP16-level accuracy. Calibration-free, one flag.
LLM Inference · Code · github.com · 6 days ago · Hacker News

BeeLlama.cpp DFlash on Strix Halo: 2.7x Gemma 31B, But MTP Is Still Faster
LLM Inference · sleepingrobots.com · 4 days ago

A system programmer’s guide to LLM inference
LLM Inference · Blog · blog.xiangpeng.systems · 3 days ago · Hacker News

DeepSeek V4, LeCun's Bet Against LLMs, and Lovable's Self-Improving Agent - The Tokenizer Edition #30
LLM · newsletter.artofsaience.com · 6 days ago

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
LLM Inference · Academic · arxiv.org · 2 days ago