Systems-level optimizations for LLM serving

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

 💬Prompt optimizations for LLM serving  Content type: Academic
arxiv.org·

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

 Model optimizations in LLMs  Content type: Academic
arxiv.org·

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

 🔢Quantization of LLMs  Content type: Academic
arxiv.org·

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

 💬Prompt optimizations for LLM serving  Content type: Academic
arxiv.org·

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

 🚀LLM serving frameworks  Content type: Academic
arxiv.org·

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

 ⚙️AI Infrastructure Automation  Content type: Academic
arxiv.org·

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

 💬Prompt optimizations for LLM serving  Content type: Academic
arxiv.org · Hacker News

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·
Less-relevant results

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

 🚀LLM serving frameworks  Content type: Academic
arxiv.org·

End-to-End Context Compression at Scale

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

 💬Prompt optimizations for LLM serving  Content type: Academic
arxiv.org·

Teaching Diffusion to Speculate Left-to-Right

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

Rethinking LoRA Memory Through the Lens of KV Cache Compression

 📊AI Performance Profiling  Content type: Academic
arxiv.org·

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org · Hacker News

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

 🧠Large Language Models (LLMs)  Content type: Academic
arxiv.org·

Still: Amortized KV Cache Compaction in a Single Forward Pass

 🌐Distributed LLM Systems  Content type: Academic
arxiv.org·

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

 🔍Retrieval-augmented generation  Content type: Academic
arxiv.org·
