⚡ Continuous Batching

GitHub · 2 days ago
I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.
Discussed on r/LLM

thecomputersciencebook.com · 6 days ago
PagedAttention is more than virtual memory
Covers: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on Hacker News

thecybersidekick.beehiiv.com · 3 days ago
AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm
Discussed on DEV

vettedconsumer.com · 6 days ago
The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)
Covers 2 stories, including: Efficient Memory Management for Large Language Model Serving with PagedAttention
Discussed on Hacker News

arxiv.org · 5 days ago
SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

Anyscale blog posts · 3 days ago
High Performance Distributed Inference with Ray Serve LLM
Covered by Google Cloud Blog
Discussed on Hacker News

pyimagesearch.com · 6 days ago
RAG Observability with Langfuse, vLLM, and FAISS

fitservers.com · 2 days ago
The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server

networkworld.com · 4 days ago
Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK

portal.neuralwatt.com · 10 hours ago
Neuralwatt: Energy-based pricing for AI inference. Efficient prompts cost less
Discussed on Hacker News

Google Cloud Blog · 4 days ago
Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

mstar.stanford.edu · 3 days ago
M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models
Discussed on Hacker News

devashish.me · 5 days ago
Two Qwen3 models on one DGX Spark: the residency math
Discussed on Hacker News

aws.amazon.com · 6 days ago
How Public AI delivers sovereign LLM inference on AWS and Intel
Covers 4 stories, including: Hugging Face – Fun chat with your own Artificial Intelligence

rocm.blogs.amd.com · 5 days ago
Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization
Discussed on Hacker News

medium.com · 6 days ago
KV Cache Explained: Why LLMs Recompute Everything and How We Stop It

digitalocean.com · 2 days ago
Efficient LLM Compression with SparseGPT and Wanda on GPU Cloud
Covers: NVIDIA Triton Inference Server

ubuntu.com · 17 hours ago
Developing web apps with local LLM inference

medium.com · 1 day ago
vLLM, Function Calling, and World Models explained

SiliconANGLE · 3 days ago
AI inference provider Baseten reportedly raising $1.5B in funding