🔧 Systems-level optimizations for LLM serving - pleto · Scour

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

⚡Real-time AI Systems Academic

AVIS: Adaptive Test-Time Scaling for Vision-Language Models

🧠Large Language Models (LLMs) Academic

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

🧠Large Language Models (LLMs) Academic

BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

🧠Large Language Models (LLMs) Academic

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

🧠Large Language Models (LLMs) Academic

TRADE: Transducer-Augmented Decoder for Speech LLM

⚡Real-time AI Systems Academic

Beyond tokens: a unified framework for latent communication in LLM-based multi-agent systems

🤖Agents using LLMs Academic

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

🔍Retrieval-augmented generation Academic

EinSort: Sorting is All We Need for Tensorizing LLM

✨Model optimizations in LLMs Academic

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

🧠Large Language Models (LLMs) Academic

Towards Tight Bounds for Streaming Attention

🧠Large Language Models (LLMs) Academic

InternVideo3: Agentify Foundation Models with Multimodal Contextual Reasoning

🔍Retrieval-augmented generation Academic

Latent Reasoning with Normalizing Flows

🧠Large Language Models (LLMs) Academic

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

⚡Real-time AI Systems Academic

Visual Para-Thinker++: A Single-Policy Multi-Agent Framework for Visual Reasoning

🤖Agents using LLMs Academic

