Prompt Caching
Context Reuse, KV Cache, Inference Optimization, Token Efficiency
Scoured 123 posts in 127.0 ms
🔓 Open Source AI · Anyscale blog posts · 3 days ago
High Performance Distributed Inference with Ray Serve LLM
Covered by: Google Cloud Blog
Discussed on: Hacker News
🔓 Open Source AI · fitservers.com · 2 days ago
The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server
⚡ Fast AI Inference · arxiv.org · 2 days ago
CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference
🔓 Open Source AI · GitHub · 5 days ago
Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits
Discussed on: Hacker News
🏗️ LLM Infrastructure · medium.com · 1 day ago
vLLM, Function Calling, and World Models explained
🔄 LLM RAG Pipelines · pyimagesearch.com · 6 days ago
RAG Observability with Langfuse, vLLM, and FAISS
🔌 Claude Plugins · code.claude.com · 3 days ago
How Claude Code uses prompt caching
Covers: How I built a three-tier content quality ladder for programmatic directory ETL
Covered by: DEV Community
🧠 Memory Management · thecomputersciencebook.com · 6 days ago
PagedAttention is more than virtual memory
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🏗️ LLM Infrastructure · GitHub · 2 days ago
Profile(v2.1.4) physics-aware optimizer for vLLM (31→470 tok/s on A100)
Discussed on: Hacker News
🏗️ LLM Infrastructure · vettedconsumer.com · 6 days ago
The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)
Covers: 2 stories, including Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on: Hacker News
🧠 LLM Inference · arxiv.org · 2 days ago
UltraQuant: 4-bit KV Caching for Context-Heavy Agents
🏗️ LLM Infrastructure · abhishek.it · 3 days ago
Running GLM-5.2 5x faster at 500tps with limitation
Discussed on: Hacker News
⚡ Fast AI Inference · thecybersidekick.beehiiv.com · 3 days ago
AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
Discussed on: DEV
🏗️ LLM Infrastructure · GitHub · 2 days ago
I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.
Discussed on: r/LLM
🔓 Open Source AI · medium.com · 5 days ago
Deploying NVIDIA Nemotron-3 Ultra 550B, with B200 GPUs, vLLM on Google Kubernetes Engine — Football…
🏗️ LLM Infrastructure · Google Cloud Blog · 4 days ago
Scaling Ray Serve LLM on GKE: Performance without losing the developer experience
🔓 Open Source AI · GitHub · 1 day ago
datalab-to/lift: Extract structured data from documents quickly and accurately.
Covered by: habr.com
🏗️ LLM Infrastructure · Anyscale blog posts · 5 days ago
67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X
Discussed on: Hacker News
🏗️ LLM Infrastructure · arxiv.org · 4 days ago
Models Take Notes at Prefill: KV Cache Can Be Editable and Composable
🤖 AI · GitHub · 6 days ago
ahwurm/localharness: Model-agnostic agent harness for local LLMs — configure agents in YAML and run them on your own hardware (vLLM, Ollama, LM Studio, llama.cpp).
Covers: uv
Discussed on: Hacker News