Scour
🤖 LLM Inference
Specific: Model Serving, Quantization, vLLM, ONNX Runtime
Scoured 192 posts in 7.2 ms
llm-inference-at-scale/content/00_foundations/00.1_why_llm_inference_is_different/why_llm_inference_is_different.md at master · harshuljain13/llm-inference-at-scale
🔮 Speculative Decoding
github.com · 1d · Hacker News
Benchmarking AI inference on CPUs: A transparent blueprint for the enterprise
⚙️ Performance Profiling
next.redhat.com · 22h
Characterization of machine learning compilers for LLM inference on NVIDIA GPUs
🔮 Speculative Decoding
link.springer.com · 5d · Hacker News
RTP-LLM: High-Performance Alibaba LLM Inference Engine
🔮 Speculative Decoding
arxiv.org · 16h
Reliable LLM Inference at Scale
📡 Edge AI
databricks.com · 1d
Booming AI Revenues Boost Inference Startups to Decacorn Status
📡 Edge AI
newcomer.co · 5h
Databricks’ Model Units Redefine LLM Inference Economics, But Can Reliability Scale?
🔮 Speculative Decoding
futurumgroup.com · 6h
LoRA vs Adapter vs Prefix Tuning: PEFT Memory Comparison
🔮 Speculative Decoding
tildalice.io · 6d
The same 16 GPUs, twice the users: Inference-aware routing for LLM clusters
📡 Edge Computing
redhat.com · 2d
Why Enterprise AI Infrastructure Is Becoming a DevOps Problem
🎯 AI Agents
devops.com · 1h
OpenCode Now Supports DigitalOcean Inference Router for Intelligent Model Routing
📡 Edge AI
digitalocean.com · 23h
Real-time LLM Inference on Standard GPUs (3,000 tokens/s per request)
🔮 Speculative Decoding
blog.kog.ai · 1d · Hacker News
Running LLMs locally on a Mac
⚙️ MLOps
danmackinlay.name · 6d
Reachy Mini goes fully local
⏱️ Latency Engineering
huggingface.co · 23h · Hacker News
Local LLM Deployment: Ollama vs vLLM vs LM Studio Compared
🪟 Tauri
sitepoint.com · 1d
Nvidia Dynamo Snapshot: Fast Startup for Inference Workloads on Kubernetes
🚀 Performance
developer.nvidia.com · 1d · Hacker News
Reinforcement Learning is an Infrastructure Problem
📊 Performance Tools
modal.com · 20h
Cohere Open-Sourced Command A+, a 218B MoE Model Built for Enterprise Agents
🎯 AI Agents
firethering.com · 6d · Hacker News
Running AI inference on Rebellions ATOM NPU with Red Hat AI
🍓 Raspberry Pi Clusters
developers.redhat.com · 2d
Global Fixed Point DSP Market Size, Industry Share, Trends & Forecast 2026-2034
💾 Embedded Systems
verifiedmarketreports.com · 5d