⚡ Continuous Batching

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Discussed on r/LLM

thecomputersciencebook.com

PagedAttention is more than virtual memory

Covers Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

thecybersidekick.beehiiv.com

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

Discussed on DEV

vettedconsumer.com

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

Covers 2 stories including Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

Anyscale blog posts

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

pyimagesearch.com

RAG Observability with Langfuse, vLLM, and FAISS

fitservers.com

The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server

networkworld.com

Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK

portal.neuralwatt.com

Neuralwatt: Energy-based pricing for AI inference. Efficient prompts cost less

Discussed on Hacker News

Google Cloud Blog

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

mstar.stanford.edu

M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

Discussed on Hacker News

Two Qwen3 models on one DGX Spark: the residency math

Discussed on Hacker News

aws.amazon.com

How Public AI delivers sovereign LLM inference on AWS and Intel

Covers 4 stories including Hugging Face – Fun chat with your own Artificial Intelligence

rocm.blogs.amd.com

Unlocking Extreme AMD Instinct Inference with Software-Hardware Co-Optimization

Discussed on Hacker News

KV Cache Explained: Why LLMs Recompute Everything and How We Stop It

digitalocean.com

Efficient LLM Compression with SparseGPT and Wanda on GPU Cloud

Covers NVIDIA Triton Inference Server

Developing web apps with local LLM inference

vLLM, Function Calling, and World Models explained

AI inference provider Baseten reportedly raising $1.5B in funding
