🚀 Inference - abhik · Scour

Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference 🔧MLOps

arxiv.org·6h

Overcoming inference challenges 🔧MLOps

redhat.com·3d

RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference 🧠LLMs

vldb.org·1d

The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of… 🧠LLMs

medium.com·17h

Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck 🧠LLMs

pub.towardsai.net·1d

Inference Arena – new benchmark of local inference and training 🔧MLOps

kvark.github.io·4d·Hacker News

LLM inference engine from scratch in C++ 🧠LLMs

anirudhsathiya.com·4d·Hacker News

The case for Model-as-a-Service over self-managed inference 🔧MLOps

news.ycombinator.com·3d·Hacker News

vLLM introduces memory optimizations for long-context inference 🧠LLMs

github.com·5d·Hacker News

Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC 🔧MLOps

arxiv.org·6h

Dockerizing ML Models: A Data Engineer’s Guide to Model Serving 🔧MLOps

medium.com·4d

TurboQuant Explained: Extreme AI Compression for Faster, Cheaper LLM Inference and Vector Search 🧠LLMs

medium.com·5d

benchmarking inference of popular models on consumer hardware 🔧MLOps

inferena.tech·5d·Hacker News

Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference 🧠LLMs

arxiv.org·6h

AsyncTLS: Efficient Generative LLM Inference with Asynchronous Two-level Sparse Attention 🧠LLMs

arxiv.org·6h

Scheduling the Unschedulable: Taming Black-Box LLM Inference at Scale 🔧MLOps

arxiv.org·1d

Multi-Turn Reasoning LLMs for Task Offloading in Mobile Edge Computing 🧠LLMs

arxiv.org·1d

Wiring the 'Why': A Unified Taxonomy and Survey of Abductive Reasoning in LLMs 🧠LLMs

arxiv.org·6h

Comparative Characterization of KV Cache Management Strategies for LLM Inference 🧠LLMs

arxiv.org·2d

LLM Evaluation as Tensor Completion: Low Rank Structure and Semiparametric Efficiency 🧠LLMs

arxiv.org·2d
