🔧 Systems-level optimizations for LLM serving - pleto · Scour

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

💬Prompt optimizations for LLM serving Academic

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

✨Model optimizations in LLMs Academic

TileFuse: A Fused Mixed-Precision Kernel Library for Efficient Quantized LLM Inference on AMD NPUs

🔢Quantization of LLMs Academic

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

💬Prompt optimizations for LLM serving Academic

APEX4: Efficient Pure W4A4 LLM Inference via Intra-SM Compute Rebalancing

🧠Large Language Models (LLMs) Academic

Clairvoyant: Predictive SJF Scheduling to Mitigate Head-of-Line Blocking in Serial LLM Backends

🚀LLM serving frameworks Academic

INFRAMIND: Infrastructure-Aware Multi-Agent Orchestration

⚙️AI Infrastructure Automation Academic

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

🧠Large Language Models (LLMs) Academic

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

💬Prompt optimizations for LLM serving Academic

arxiv.org · Hacker News

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

🧠Large Language Models (LLMs) Academic

Less-relevant results

Reroute, Don't Remove: Recoverable Visual Token Routing for Vision-Language Models

🧠Large Language Models (LLMs) Academic

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

🚀LLM serving frameworks Academic

End-to-End Context Compression at Scale

🧠Large Language Models (LLMs) Academic

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

💬Prompt optimizations for LLM serving Academic

Teaching Diffusion to Speculate Left-to-Right

🧠Large Language Models (LLMs) Academic

Rethinking LoRA Memory Through the Lens of KV Cache Compression

📊AI Performance Profiling Academic

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

🧠Large Language Models (LLMs) Academic

arxiv.org · Hacker News

Task-Aware Structured Memory for Dynamic Multi-modal In-Context Learning

🧠Large Language Models (LLMs) Academic

Still: Amortized KV Cache Compaction in a Single Forward Pass

🌐Distributed LLM Systems Academic

AdaPLD: Adaptive Retrieval and Reuse for Efficient Model-Free Speculative Decoding

🔍Retrieval-augmented generation Academic
