vLLM is a fast and easy-to-use library for LLM inference and serving.
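To make the claim concrete, here is a minimal sketch of offline inference using vLLM's Python API, following its documented quickstart pattern; the model name and prompts are illustrative placeholders:

```python
from vllm import LLM, SamplingParams

# Illustrative prompts; any strings work here.
prompts = [
    "Hello, my name is",
    "The capital of France is",
]

# Sampling configuration for generation.
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# Load a model (model name is an example; any Hugging Face model
# supported by vLLM can be used).
llm = LLM(model="facebook/opt-125m")

# Generate completions for all prompts in a single batch.
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    print(f"Prompt: {output.prompt!r}")
    print(f"Generated: {output.outputs[0].text!r}")
```

For serving, vLLM also ships an OpenAI-compatible HTTP server that can be started from the command line.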