Scour
⚡ Inference (Specific): LLM inference, vLLM, speculative decoding, latency, throughput
Scoured 151,409 posts in 12.9 ms
LLM inference engine from scratch in C++ · 🧠 LLMs · anirudhsathiya.com · 4d · Hacker News
The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of… · 🧠 LLMs · medium.com · 23h
Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck · 🧠 LLMs · pub.towardsai.net · 1d
AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention · 🧠 LLMs · arxiv.org · 12h
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference · 🧠 LLMs · vldb.org · 1d
Semidynamics Secures SK hynix Investment to Advance Memory-Centric AI Inference Architecture · 💾 Agent Memory · hpcwire.com · 5h · Hacker News
Overcoming inference challenges · 🧠 Reasoning Models · redhat.com · 3d
I Ran My KYB Engine at Three Quantization Levels. Accuracy Didn't Move. Cost Dropped 6x. · 📊 Model Evaluation · walsenburgtech.com · 23h · Hacker News
We Put a Gaming Box in the Inference Loop · 🧠 Reasoning Models · write.as · 2d
Prediction: The "Inference Supercycle" Could Be Bigger Than the Training Boom. 1 Growth Stock to Own. · 🧠 Reasoning Models · finance.yahoo.com · 23h
milanm/AutoGrad-Engine: A complete GPT language model (training and inference) in ~600 lines of pure C#, zero dependencies · 🧠 LLMs · github.com · 1d · Hacker News
Inside the LLM Black Box: The True Architecture of Latency and Cost · 🧠 LLMs · akanuri.medium.com · 6d
New course: Efficient Inference with SGLang: Text and Image Generation, built in partnership with LMSys @lmsysorg and RadixArk @radixark, and taught by Richard ... · 🧠 LLMs · twitter.macworks.dev · 23h
UCCL-EP: Portable Expert-Parallel Communication · 🔌 MCP · uccl-project.github.io · 2d · Hacker News
How to achieve P90 sub-microsecond latency in a C++ FIX engine · 🔌 MCP · akinocal1.substack.com · 19h · Substack
TurboQuant Is Quietly Solving LLM Inference’s Worst Memory Problem · 🧠 LLMs · medium.com · 5d
Attn-QAT: Making 4-Bit Attention Actually Work · 🎛️ Fine-tuning · haoailab.com · 1d
Better MoE model inference with warp decode · 🧠 LLMs · cursor.com · 4d · Hacker News
Building the Blueprint for Premium Inference · 🧪 Synthetic Data · sambanova.ai · 1d
GPU Memory for LLM Inference: Why Llama-70B Doesn't Fit · 🧠 LLMs · darshanfofadiya.com · 4d · Hacker News
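
That last headline (and the KV-cache one above) comes down to arithmetic you can verify yourself. A minimal back-of-envelope sketch follows; the fp16 weights and the Llama-3-70B-style attention config (80 layers, 8 KV heads, head_dim 128) are my assumptions, not figures from the linked articles:

```python
# Back-of-envelope memory math for serving a 70B-parameter model.
# Assumptions (mine, not from the linked posts): fp16/bf16 weights,
# Llama-3-70B-style GQA config of 80 layers, 8 KV heads, head_dim 128.
BYTES_FP16 = 2

# Weights alone: 70e9 params * 2 bytes ~= 140 GB, more than one 80 GB H100.
params = 70e9
weights_gb = params * BYTES_FP16 / 1e9
print(f"weights alone: {weights_gb:.0f} GB")

# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim = 80, 8, 128
kv_bytes_per_token = 2 * layers * kv_heads * head_dim * BYTES_FP16
ctx_len = 32_768  # one 32k-token sequence
kv_gb = kv_bytes_per_token * ctx_len / 1e9
print(f"KV cache for one {ctx_len}-token sequence: {kv_gb:.1f} GB")
```

The decode-phase point falls out of the same numbers: each generated token has to stream the weights plus the growing KV cache through memory, which is why decode is memory-bandwidth-bound and why batching and paging schemes like those in the posts above matter.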