🧠 Inference Engineering
model serving, inference optimization, LLM inference, throughput
Scoured 151,629 posts in 37.3 ms
Fast Heterogeneous Serving: Scalable Mixed-Scale LLM Allocation for SLO-Constrained Inference
⚙️ MLOps · arxiv.org · 14h
RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference
⚙️ MLOps · vldb.org · 1d
LLM inference engine from scratch in C++
🚀 Speculative Decoding · anirudhsathiya.com · 4d · Hacker News
Presentation: Latency: The Race to Zero...Are We There Yet?
🕸️ Distributed Systems · infoq.com · 4h
The Engine Behind Modern LLM Inference, Part 1: Continuous Batching, PagedAttention, and the End of…
🔄 Continuous Batching · medium.com · 1d
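Only the headline is visible here, but the technique it names is easy to sketch: continuous batching admits and retires requests at every decode step instead of waiting for an entire batch to drain. A minimal illustration; Request, model_step, and MAX_BATCH are names assumed for the sketch, not taken from the article or any real engine:

```python
# Minimal sketch of continuous (in-flight) batching. Request,
# model_step, and MAX_BATCH are assumptions for illustration, not the
# API of any real serving engine.
from collections import deque
from dataclasses import dataclass, field

MAX_BATCH = 8  # assumed slot budget

@dataclass
class Request:
    max_new_tokens: int
    generated: list = field(default_factory=list)

def model_step(batch):
    """Stand-in for one forward pass: one new token per active request
    (a real engine would also read and extend each request's KV cache)."""
    return [0] * len(batch)

def serve(waiting: deque) -> None:
    running = []
    while waiting or running:
        # The key idea: admit new requests at every step, not only
        # when the whole batch has finished.
        while waiting and len(running) < MAX_BATCH:
            running.append(waiting.popleft())
        for req, tok in zip(running, model_step(running)):
            req.generated.append(tok)
        # Retire finished requests immediately, freeing their slots.
        running = [r for r in running
                   if len(r.generated) < r.max_new_tokens]
```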
How to achieve P90 sub-microsecond latency in a C++ FIX engine
⚗️ Kernel Fusion · akinocal1.substack.com · 20h · Substack
Dockerizing ML Models: A Data Engineer’s Guide to Model Serving
🚀 Model Serving · medium.com · 4d
Inside LLM Inference: KV Cache, Prefill, and the Decode Bottleneck
🚀 Speculative Decoding · pub.towardsai.net · 1d
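The prefill/decode split this title names is worth a toy illustration: prefill pushes the whole prompt through in one parallel, compute-bound pass, while decode re-reads the ever-growing KV cache once per generated token and is typically memory-bandwidth-bound. The attend function and cache layout below are assumptions for the sketch, not the article's code:

```python
# Toy prefill/decode split over a single-head KV cache. attend() and
# the (key, value) layout are assumptions for illustration only.
import numpy as np

D = 64  # assumed head dimension

def attend(q, K, V):
    w = np.exp(q @ K.T / np.sqrt(D))
    return (w / w.sum()) @ V

def prefill(prompt_states, kv_cache):
    # All prompt tokens enter the cache in one batched pass:
    # compute-bound, parallel over the sequence.
    kv_cache.extend((h, h) for h in prompt_states)  # toy: K == V == h

def decode_step(q, kv_cache):
    # One token per step, re-reading the entire cache each time:
    # this is the memory-bound decode bottleneck.
    K = np.stack([k for k, _ in kv_cache])
    V = np.stack([v for _, v in kv_cache])
    out = attend(q, K, V)
    kv_cache.append((q, q))
    return out

cache = []
prefill(list(np.random.randn(16, D)), cache)  # 16-token prompt
tok = decode_step(np.random.randn(D), cache)  # one decode step
```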
Understanding the Counterintuitive Relationship Between Completion Time, Throughput, and Latency in…
🔄 Continuous Batching · medium.com · 20h
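Whatever this article argues, the coupling between completion time, throughput, and concurrency is pinned down by Little's law: average in-flight requests L = throughput λ × latency W. A quick worked check, with all numbers assumed for illustration:

```python
# Little's law: in_flight = throughput * latency.
# Numbers are assumptions for illustration.
throughput_rps = 20.0   # requests completed per second
latency_s = 1.5         # mean time a request spends in the system
print(throughput_rps * latency_s)   # 30.0 requests in flight on average

# Rearranged: holding in-flight work fixed, latency and
# throughput move inversely.
for tput in (10.0, 20.0, 30.0):
    print(tput, 30.0 / tput)        # 3.0 s, 1.5 s, 1.0 s
```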
Thinking microscopes: agentic AI and the future of electron microscopy
🚀 Speculative Decoding · nature.com · 4h
AI agents aren’t failing. The coordination layer is failing
🕸️ Distributed Systems · infoworld.com · 9h
Apfel -- A CLI and HTTP server for the on-device Apple Intelligence LLM
🚀 Model Serving · discuss.privacyguides.net · 2d
Inside the LLM Black Box: The True Architecture of Latency and Cost
⚙️ MLOps · akanuri.medium.com · 6d
Things done to overcome latency pains
⚡ FlashAttention · http2-explained.haxx.se · 1d
GPU Memory for LLM Inference: Why Llama-70B Doesn't Fit
🚀 Speculative Decoding · darshanfofadiya.com · 4d · Hacker News
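The title's claim checks out on the back of an envelope: 70B parameters at 2 bytes each (FP16/BF16) is ~140 GB of weights before any KV cache or activations, against 80 GB on a single A100/H100. A worked check, with FP16 assumed:

```python
# Do Llama-70B FP16 weights fit on one 80 GB accelerator?
params = 70e9
bytes_per_param = 2                       # FP16/BF16 assumed
weight_gb = params * bytes_per_param / 1e9
print(weight_gb)                          # 140.0 GB of weights alone
print(weight_gb > 80)                     # True: it doesn't fit
# 4-bit quantization cuts this to 70e9 * 0.5 / 1e9 = 35 GB, which fits,
# though KV cache and activations still claim part of the budget.
```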
Advanced Prompt Caching at Scale
🔄 Continuous Batching · digitalocean.com · 2d
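Prompt (prefix) caching, as the headline calls it, generally keys previously computed KV state by the token prefix so a repeated system prompt skips prefill. A minimal sketch; the hashing scheme and the get_or_prefill name are assumptions, and production engines typically cache at fixed-size block granularity rather than whole prefixes:

```python
# Minimal prefix-cache sketch; names and whole-prefix granularity are
# assumptions for illustration.
import hashlib

_cache = {}

def _key(tokens):
    return hashlib.sha256(repr(tuple(tokens)).encode()).hexdigest()

def get_or_prefill(tokens, prefill_fn):
    k = _key(tokens)
    if k not in _cache:            # miss: pay the prefill cost once
        _cache[k] = prefill_fn(tokens)
    return _cache[k]               # hit: reuse the cached KV state
```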
KV Cache in LLM Inference: From PagedAttention (2023) to Reasoning Model Bottlenecks (2026)
💾 KV Cache · medium.com · 3d
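The KV-cache pressure this title points at falls directly out of model shape: bytes per token = 2 (K and V) × layers × KV heads × head dim × bytes per element. The worked numbers below use Llama-2-70B's published shape (80 layers, 8 KV heads under GQA, head dim 128) with FP16 assumed:

```python
# KV cache per token = 2 (K and V) * layers * kv_heads * head_dim * bytes.
layers, kv_heads, head_dim, dtype_bytes = 80, 8, 128, 2  # Llama-2-70B, FP16
per_token = 2 * layers * kv_heads * head_dim * dtype_bytes
print(per_token)                  # 327680 bytes ≈ 320 KB per token
print(per_token * 4096 / 1e9)     # ≈ 1.34 GB for one 4096-token sequence
```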
Reducing P999 Latency in Distributed Databases with TiDB 8.5
🔄 Continuous Batching · pingcap.com · 1d
Blink: CPU-Free LLM Inference by Delegating the Serving Stack to GPU and SmartNIC
🚀 Model Serving · arxiv.org · 14h
Luce-Org/luce-megakernel: Megakernel to match Apple Silicon Efficiency at 2x the Throughput on a RTX 3090
⚡ Triton · github.com · 2d · Hacker News