Systems-level optimizations for LLM serving
RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention
💬 Prompt optimizations for LLM serving
Content type: Academic · arxiv.org · 6 days ago
STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control
✨ Model optimizations in LLMs
Content type: Academic · arxiv.org · 2 days ago
TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs
🔢 Quantization of LLMs
Content type: Academic · arxiv.org · 19 hours ago
VIA-SD: Verification via Intra-Model Routing for Speculative Decoding
💬 Prompt optimizations for LLM serving
Content type: Academic · arxiv.org · 19 hours ago
APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 2 days ago
Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends
🚀 LLM serving frameworks
Content type: Academic · arxiv.org · 3 days ago
INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration
⚙️ AI Infrastructure Automation
Content type: Academic · arxiv.org · 19 hours ago
From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 2 days ago
Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving
💬 Prompt optimizations for LLM serving
Content type: Academic · arxiv.org · 6 days ago · Hacker News
SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 2 days ago
Less-relevant results
Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 19 hours ago
YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition
🚀 LLM serving frameworks
Content type: Academic · arxiv.org · 6 days ago
End-to-End Context Compression at Scale
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 2 days ago
AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving
💬 Prompt optimizations for LLM serving
Content type: Academic · arxiv.org · 2 days ago
Teaching Diffusion to Speculate Left-to-Right
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 19 hours ago
Rethinking LoRA Memory Through the Lens of KV Cache Compression
📊 AI Performance Profiling
Content type: Academic · arxiv.org · 6 days ago
FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 2 days ago · Hacker News
Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning
🧠 Large Language Models (LLMs)
Content type: Academic · arxiv.org · 19 hours ago
Still: Amortized KV Cache Compaction in a Single Forward Pass
🌐 Distributed LLM Systems
Content type: Academic · arxiv.org · 2 days ago
AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding
🔍 Retrieval-augmented generation
Content type: Academic · arxiv.org · 6 days ago