Scour
🤖 AI Inference
Model Serving, Inference Optimization, ONNX, Model Deployment
Scoured 140 posts in 21.1 ms
harshuljain13/llm-inference-at-scale: A Practitioner handbook for production llm serving.
⚡ Inference · Content type: Code · github.com · 6 days ago · Hacker News, r/LLM
Inferoa AI harness claimed 90% cache savings. We ran it and measured 97.8%
🧠 LLM Tooling · zozo123.github.io · 2 days ago · Hacker News
Metrics that Matter with Serverless Inference
☁️ Cloud Computing · digitalocean.com · 17 hours ago
How ERGO Hestia reduced time-to-market with Lakebase and Mosaic AI Model Serving
⚙️ ML Infrastructure · Content type: Blog · databricks.com · 1 day ago
DeepSeekV4 1.6T Day 0 to Day 43 Performance Over Time - Huawei, GB300 NVL72, MI355X, B200
🇨🇳 Chinese Technology · Content type: News · newsletter.semianalysis.com · 3 days ago · Hacker News · Cited by 1 article
OpenCV 5 Debuts with Improved ONNX Support and Native AI Upgrades
👁️ Computer Vision · Content type: News · hackster.io · 2 days ago
12B Gemma 4 QAT Deployment with NVIDIA L4, Cloud Run, MCP, and Antigravity CLI
🔧 MCP · Content type: Blog · medium.com · 1 day ago
vicharak-in/Gati: Gati Accelerates Your CNN Algorithms!
🤖 AI · Content type: Code · github.com · 17 hours ago · Hacker News
OpenCV 5.0 Computer Vision Library Released with Rewritten DNN Engine
👁️ Computer Vision · linuxiac.com · 4 days ago
Intelligent inference scheduling with llm-d on Red Hat AI
🧠 LLM · developers.redhat.com · 2 days ago
GGUF vs GPTQ vs AWQ: The Plain-English Guide to LLM Quantization (and Which One to Pick)
⚡ Quantization · vettedconsumer.com · 6 days ago · Hacker News
CoreML vs TFLite: iPhone 15 Pro GPU 2.3x Faster
📱 Edge AI · Content type: Blog, Discussion · tildalice.io · 6 days ago
2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP
⚡ Quantization · Content type: Blog · dnhkng.github.io · 5 days ago
massimo92/spark: CLI tool for serving LLMs with vLLM on NVIDIA DGX Spark. One file, zero friction.
🟩 Nvidia · Content type: Code · github.com · 1 day ago · Hacker News
PagedAttention vs Traditional KV Cache: How vLLM Reinvented GPU Memory for LLM Inference
⚡ Inference · Content type: Blog · medium.com · 4 days ago
OpenCV 5 release - New DNN engine with enhanced ONNX and LLM/VLM support, Intel, Arm, and RISC-V hardware optimizations - CNX Software
👁️ Computer Vision · Content type: News · cnx-software.com · 3 days ago
I wired a fully offline voice loop to Ollama + LM Studio — 100% CPU, no GPU, nothing leaves your machine (Silero VAD + Parakeet STT + Supertonic TTS 3)
🖥️ Local AI · Content type: Code · github.com · 2 days ago · r/LocalLLaMA · Cited by 1 article
AI Serving Platform That Adapts to Your Model
📊 Compute Markets · Content type: Blog · databricks.com · 2 days ago
KJLdefeated/RL.cu: RLVR training for LLM in CUDA/C++
🟩 Nvidia · Content type: Code · github.com · 6 days ago · Hacker News
heterodoxin/graphkv: Graph-guided KV cache compression for memory-efficient LLM inference.
💾 KV Cache · Content type: Code · github.com · 6 days ago · r/LocalLLaMA