Scour
🧠 LLM Inference
LLM serving, inference optimization, token generation, vLLM
Scoured 143 posts in 23.4 ms
llama.cpp vs. vLLM: Choosing the right local LLM inference engine
⚡ KV Cache · developers.redhat.com · 3 days ago · Covers 7 stories
67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X
⚡ KV Cache · Blog · anyscale.com · 2 days ago · Hacker News
AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
⚡ KV Cache · Blog · thecybersidekick.beehiiv.com · 1 hour ago · DEV
JetFlow: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting
⚡ KV Cache · Academic · arxiv.org · 9 hours ago
DFlash and Spec V2 Decoding (14 minute read)
⚡ KV Cache · Blog · lmsys.org · 2 days ago
Covers: Looking for a self-hosted alternative to Modal.com for running ML workloads; MiMo-v2.5-Pro-UltraSpeed: 1T model with 1000 TPS; +2 more
PagedAttention is more than virtual memory
⚡ KV Cache · thecomputersciencebook.com · 3 days ago · Hacker News
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Parallelize speculative decoding with P-EAGLE on Amazon SageMaker AI
⚡ KV Cache · Blog · aws.amazon.com · 1 day ago
ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).
⚡ KV Cache · Code · github.com · 3 days ago · Hacker News
Covers: uv
Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
💻 Software Engineering · Blog · rocm.blogs.amd.com · 1 day ago · Hacker News
Most people use Ollama or llama.cpp for local LLMs, but these are the tools I switch to when it gets serious
⚡ KV Cache · xda-developers.com · 4 days ago
Covers: vllm-project/vllm; sgl-project/sglang; +2 more
Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…
⚡ KV Cache · Blog · medium.com · 2 days ago
RAG Observability with Langfuse, vLLM, and FAISS
🔍 RAG · pyimagesearch.com · 3 days ago
vLLM Internalised: The Mechanics of Modern LLM Inference
⚡ KV Cache · Blog · medium.com · 4 days ago
Less-relevant results
GLM-5.2: Built for Long-Horizon Tasks
⚡ KV Cache · Blog · huggingface.co · 1 day ago · Hacker News, r/LocalLLaMA · Cited by 1 article
Covers: New model GLM-Experimental is quite good (not local so far); GLM Coding Plan for Claude Code
The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)
⚡ KV Cache · vettedconsumer.com · 3 days ago · Hacker News
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention; DeepSeek-V2: A Strong, Economical, and Efficient MOE Language Model
Speculative Decoding | LM Studio
💬 LLMs · lmstudio.ai · 4 days ago
Green AI: Speculative Decoding as an Environmental Necessity
🤖 AI Agents · towardsdeeplearning.com · 1 day ago
EfficientRollout: System-Aware Self-Speculative Decoding for RL Rollouts
🔧 MLOps · Academic · arxiv.org · 9 hours ago
Run a local coding model with pi and LM Studio
⚡ KV Cache · zarar.dev · 21 hours ago
Covers: Pi.dev: There are many coding agents, but this one is mine; Opencode – open-source alternative to Claude Code; +3 more
A brief history of KV cache compression developments
⚡ KV Cache · Blog · martinalderson.com · 3 days ago
Covers: TurboQuant: Redefining AI efficiency with extreme compression