💬 Prompt optimizations for LLM serving - pleto · Scour

Characterizing Software Aging in GPU-Based LLM Serving Systems

🔧Systems-level optimizations for LLM serving Academic

RedKnot: Efficient Long-Context LLM Serving with Head-Aware KV Reuse and SegPagedAttention

🔧Systems-level optimizations for LLM serving Academic

SIFT: Selective-Index For Fast Compute of RAG Prefill by Exploiting Attention Invariance

🔍Retrieval-augmented generation Academic

SPEAR: A System for Post-Quantization Error-Adaptive Recovery Enabling Efficient Low-Bit LLM Serving

✨Model optimizations in LLMs Academic

Less-relevant results

What Should a Skill Remember? Quality-Cost Trade-offs in Cost-Aware Skill Rewriting for Language Model Agents

🧠Large Language Models (LLMs) Academic

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

🔧Systems-level optimizations for LLM serving Academic

arxiv.org · Hacker News

Fairness-Aware and Latency-Controllable Scheduling for Chunked-Prefill LLM Serving

🔧Systems-level optimizations for LLM serving Academic

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

🧠Large Language Models (LLMs) Academic

Vortex: Efficient and Programmable Sparse Attention Serving for AI Agents

🤖Agents using LLMs Academic

Agents All the Way Down; A Methodology for Building Custom AI Agents from Substrate to Production

🤖Agents using LLMs Academic

Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

⚡Real-time AI Systems Academic

AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

🔧Systems-level optimizations for LLM serving Academic

