⚡ Continuous Batching

I got tired of not understanding how vLLM works under the hood, so I built my own mini inference engine from scratch.

Discussed on r/LLM

thecomputersciencebook.com·

PagedAttention is more than virtual memory

Covers Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

portal.neuralwatt.com·

Neuralwatt: Energy-based pricing for AI inference. Efficient prompts cost less

Discussed on Hacker News

Developing web apps with local LLM inference

fitservers.com·

The Complete Guide to Deploying DeepSeek R1 on a Dedicated Server

thecybersidekick.beehiiv.com·

AI Inference at the Edge: Running Real-Time LLMs in Kubernetes Without a GPU Farm

Discussed on DEV

digitalocean.com·

Efficient LLM Compression with SparseGPT and Wanda on GPU Cloud

Covers NVIDIA Triton Inference Server

vettedconsumer.com·

The KV Cache, Explained: Why Long Context Eats Your VRAM (and How to Fit More)

Covers 2 stories including Efficient Memory Management for Large Language Model Serving with PagedAttention

Discussed on Hacker News

vLLM, Function Calling, and World Models explained

SwiftCache: Efficient LLM Serving for Multi-turn Conversations with Heterogeneous KV Cache Sharing

The Context Budget That Will Decide Everyday AI

Anyscale blog posts·

High Performance Distributed Inference with Ray Serve LLM

Covered by Google Cloud Blog

Discussed on Hacker News

pyimagesearch.com·

RAG Observability with Langfuse, vLLM, and FAISS

networkworld.com·

Tether is shipping TurboQuant KV-cache quantization with Vulkan support into its QVAC SDK

[AINews] GLM > GPT? GLM-5.2 passes vibe check; Z.ai forecasts Open Fable by December

Show HN: Evaluating Local LLMs as language translators for my app

Discussed on Hacker News

Google Cloud Blog·

Scaling Ray Serve LLM on GKE: Performance without losing the developer experience

mstar.stanford.edu·

M* (M-Star): A Modular, Extensible, Serving System for Multimodal Models

Discussed on Hacker News

Two Qwen3 models on one DGX Spark: the residency math

Discussed on Hacker News

Hivekeep - self-host a team of AI agents in one container, with a real UI (MIT)
