LLM Inference
Quantization, Attention Mechanisms, Batch Processing, KV Caching
Scoured 186 posts in 44.1 ms
🏗️ LLM Infrastructure · GitHub · 2 days ago
Pipeline-parallel LLM inference across GPUs on separate machines
Discussed on Hacker News
🏗️ LLM Infrastructure · Anyscale blog posts · 5 days ago
67% Cost Savings with PD Disaggregation Using Ray and vLLM on AMD MI325X
Discussed on Hacker News
🤖 AI · GitHub · 1 day ago
Running a 35B MoE model on a 2017 AMD RX 580 8GB via Vulkan (no ROCm/CUDA)
Discussed on Hacker News
🧠 Inference Serving · Towards AI · 3 days ago
Continuous Batching: How to Keep Your GPU Actually Busy
Less-relevant results
🤖 AI · GitHub · 1 day ago
Show HN: Alloy – a PyTorch backend and inference engine for Apple Silicon
Discussed on Hacker News
📱 Edge AI Optimization · arxiv.org · 3 days ago
From Tokens to Energy Flexibility: Quantization-Enabled Demand Response for Data Centers with LLM Inference Workloads
🔓 Open Source AI · GitHub · 2 days ago
yifanfeng97/Hyper-Extract
Covered by 何夕2077的个人站
🏗️ LLM Infrastructure · arxiv.org · 5 days ago
SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing
🏗️ LLM Infrastructure · arxiv.org · 4 days ago
AnchorKV: Safety-Aware KV Cache Compression via Soft Penalty with a Refusal Anchor
🏗️ LLM Infrastructure · arxiv.org · 5 days ago
ReQAT: Achieving Full-Precision Reasoning Accuracy with 4-bit Floating-Point Quantization-Aware Training
🤖 AI · Towards AI · 3 days ago
How to Run NVIDIA’s Nemotron Locally on Your Laptop or Desktop
🤖 AI · GitHub · 4 days ago
Native Inference Engine for macOS 14 or newer
Discussed on Hacker News
🧠 Inference Serving · arxiv.org · 5 days ago
PolyKV: Heterogeneous Retention and Allocation for KV Cache Compression
💾 Prompt Caching · GitHub · 5 days ago
Sors: a Rust proxy that reorders prompts to maximize vLLM prefix cache hits
Discussed on Hacker News
🏗️ LLM Infrastructure · arxiv.org · 5 days ago
SMEPilot: Characterizing and Optimizing LLM Inference with Scalable Matrix Extensions
🏗️ LLM Infrastructure · GitHub · 4 days ago
Cosmicgpt – A GPT-in-space simulator to research SpaceX AI satellite viability
Discussed on Hacker News
🏆 LLM Benchmarking · arxiv.org · 4 days ago
Towards Distributed Inference of LLMs on a P2P Network
🤖 AI · GitHub · 6 days ago
robert-mcdermott/phlox: Phlox is a self-hostable chat application with an agentic harness, document RAG, code execution, and MCP integration, running over any model provider: AWS Bedrock or any OpenAI-compatible endpoint (OpenAI, Ollama, vLLM, LiteLLM, LM Studio, local models).
Covers 2 stories, including Ollama
Discussed on Hacker News
📱 Edge AI Optimization · arxiv.org · 6 days ago
Efficient On-Device Diffusion LLM Inference with Mobile NPU
🤖 AI · GitHub · 4 days ago
Show HN: Selora – local model for Home Assistant
Covers 4 stories, including Model Context Protocol And OAuth
Discussed on Hacker News