🧠 Inference Engineering
model serving, inference optimization, LLM inference, throughput
Scoured 151,629 posts in 37.3 ms
Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
⚙️ MLOps · arxiv.org · 14h
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
⚙️ MLOps · vldb.org · 1d
LLM inference engine from scratch in C++
🚀 Speculative Decoding · anirudhsathiya.com · 4d · Hacker News
Presentation: Latency: The Race to Zero...Are We There Yet?
🕸️ Distributed Systems · infoq.com · 4h
The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of…
🔄 Continuous Batching · medium.com · 1d
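Only the headline is visible here, but the technique it names is easy to sketch: continuous batching admits and retires requests at every decode step instead of waiting for an entire batch to drain. A minimal illustration; Request, model_step, and MAX_BATCH are names assumed for the sketch, not taken from the article or any real engine:

```python
# Minimal sketch of continuous (in-flight) batching. Request,
# model_step, and MAX_BATCH are assumptions for illustration, not the
# API of any real serving engine.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 8  # assumed slot budget

@dataclass
class Request:
    max_new_tokens: int
    generated: list = field(default_factory=list)

def model_step(batch):
    """Stand-in for one forward pass: one new token per active request
    (a real engine would also read and extend each request's KV cache)."""
    return [0] * len(batch)

def serve(waiting: deque) -> None:
    running = []
    while waiting or running:
        # The key idea: admit new requests at every step, not only
        # when the whole batch has finished.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        for req, tok in zip(running, model_step(running)):
            req.generated.append(tok)
        # Retire finished requests immediately, freeing their slots.
        running = [r for r in running
                   if len(r.generated) < r.max_new_tokens]
```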
How to achieve P90 sub-microsecond latency in a C++ FIX engine
⚗️ Kernel Fusion · akinocal1.substack.com · 20h · Substack
Dockerizing ML Models: A Data Engineer’s Guide to Model Serving
🚀 Model Serving · medium.com · 4d
Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck
🚀 Speculative Decoding · pub.towardsai.net · 1d
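The prefill/decode split this title names is worth a toy illustration: prefill pushes the whole prompt through in one parallel, compute-bound pass, while decode re-reads the ever-growing KV cache once per generated token and is typically memory-bandwidth-bound. The attend function and cache layout below are assumptions for the sketch, not the article's code:

```python
# Toy prefill/decode split over a single-head KV cache. attend() and
# the (key, value) layout are assumptions for illustration only.
import numpy as np

D = 64  # assumed head dimension

def attend(q, K, V):
    w = np.exp(q @ K.T / np.sqrt(D))
    return (w / w.sum()) @ V

def prefill(prompt_states, kv_cache):
    # All prompt tokens enter the cache in one batched pass:
    # compute-bound, parallel over the sequence.
    kv_cache.extend((h, h) for h in prompt_states)  # toy: K == V == h

def decode_step(q, kv_cache):
    # One token per step, re-reading the entire cache each time:
    # this is the memory-bound decode bottleneck.
    K = np.stack([k for k, _ in kv_cache])
    V = np.stack([v for _, v in kv_cache])
    out = attend(q, K, V)
    kv_cache.append((q, q))
    return out

cache = []
prefill(list(np.random.randn(16, D)), cache)  # 16-token prompt
tok = decode_step(np.random.randn(D), cache)  # one decode step
```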
Understanding the Counterintuitive Relationship Between Completion Time, Throughput, and Latency in…
🔄 Continuous Batching · medium.com · 20h
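Whatever this article argues, the coupling between completion time, throughput, and concurrency is pinned down by Little's law: average in-flight requests L = throughput λ × latency W. A quick worked check, with all numbers assumed for illustration:

```python
# Little's law: in_flight = throughput * latency.
# Numbers are assumptions for illustration.
throughput_rps = 20.0   # requests completed per second
latency_s = 1.5         # mean time a request spends in the system
print(throughput_rps * latency_s)   # 30.0 requests in flight on average

# Rearranged: holding in-flight work fixed, latency and
# throughput move inversely.
for tput in (10.0, 20.0, 30.0):
    print(tput, 30.0 / tput)        # 3.0 s, 1.5 s, 1.0 s
```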
Thinking microscopes: agentic AI and the future of electron microscopy
🚀 Speculative Decoding · nature.com · 4h
AI agents aren’t failing. The coordination layer is failing
🕸️ Distributed Systems · infoworld.com · 9h
Apfel -- A CLI and HTTP server for the on-device Apple Intelligence LLM
🚀 Model Serving · discuss.privacyguides.net · 2d
Inside the LLM Black Box: The True Architecture of Latency and Cost
⚙️ MLOps · akanuri.medium.com · 6d
Things done to overcome latency pains
⚡ FlashAttention · http2-explained.haxx.se · 1d
GPU Memory for LLM Inference: Why Llama-70B Doesn't Fit
🚀 Speculative Decoding · darshanfofadiya.com · 4d · Hacker News
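The title's claim checks out on the back of an envelope: 70B parameters at 2 bytes each (FP16/BF16) is ~140 GB of weights before any KV cache or activations, against 80 GB on a single A100/H100. A worked check, with FP16 assumed:

```python
# Do Llama-70B FP16 weights fit on one 80 GB accelerator?
params = 70e9
bytes_per_param = 2                       # FP16/BF16 assumed
weight_gb = params * bytes_per_param / 1e9
print(weight_gb)                          # 140.0 GB of weights alone
print(weight_gb > 80)                     # True: it doesn't fit
# 4-bit quantization cuts this to 70e9 * 0.5 / 1e9 = 35 GB, which fits,
# though KV cache and activations still claim part of the budget.
```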
Advanced Prompt Caching at Scale
🔄 Continuous Batching · digitalocean.com · 2d
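Prompt (prefix) caching, as the headline calls it, generally keys previously computed KV state by the token prefix so a repeated system prompt skips prefill. A minimal sketch; the hashing scheme and the get_or_prefill name are assumptions, and production engines typically cache at fixed-size block granularity rather than whole prefixes:

```python
# Minimal prefix-cache sketch; names and whole-prefix granularity are
# assumptions for illustration.
import hashlib

_cache = {}

def _key(tokens):
    return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

def get_or_prefill(tokens, prefill_fn):
    k = _key(tokens)
    if k not in _cache:            # miss: pay the prefill cost once
        _cache[k] = prefill_fn(tokens)
    return _cache[k]               # hit: reuse the cached KV state
```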
KV Cache in LLM Inference: From PagedAttention (2023) to Reasoning Model Bottlenecks (2026)
💾 KV Cache · medium.com · 3d
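The KV-cache pressure this title points at falls directly out of model shape: bytes per token = 2 (K and V) × layers × KV heads × head dim × bytes per element. The worked numbers below use Llama-2-70B's published shape (80 layers, 8 KV heads under GQA, head dim 128) with FP16 assumed:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2  # Llama-2-70B, FP16
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(per_token)                  # 327680 bytes ≈ 320 KB per token
print(per_token * 4096 / 1e9)     # ≈ 1.34 GB for one 4096-token sequence
```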
Reducing P999 Latency in Distributed Databases with TiDB 8.5
🔄 Continuous Batching · pingcap.com · 1d
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
🚀 Model Serving · arxiv.org · 14h
Luce-Org/luce-megakernel: Megakernel to match Apple Silicon Efficiency at 2x the Throughput on a RTX 3090
⚡ Triton · github.com · 2d · Hacker News