vllm-project/vllm · 24 articles covering this post

github.com · DEV, Hacker News

Gemma 4 dense by default: why your local agent doesn't want the MoE

Multi-Head Latent Attention (MLA)

Gemma 4 Didn't Just Get Smarter. It Became a Different Kind of Model. Here's What the Agentic Numbers Actually Mean.

The AI stack every developer will depend on in 2026

What GenAI Actually Costs in Production

The Community Champions Program

Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit

huggingface.co · r/LocalLLaMA

MiniMaxAI/MiniMax-M3

huggingface.co · r/LocalLLaMA

llmfan46/MiniMax-M2.7-ultra-uncensored-heretic-GGUF

huggingface.co · r/LocalLLaMA

DreamFast/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Safetensor-Benchmark

huggingface.co

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

huggingface.co · r/LocalLLaMA

Learn to optimize, deploy, and benchmark LLMs with vLLM: A New Free Course

developers.redhat.com

Improve vLLM Semantic Router accuracy with fine-tuning

developers.redhat.com

How to prevent AI inference stack silent failures

developers.redhat.com

Structured LLM Outputs

dottxt-ai.github.io · Hacker News

Apple M5 Max vs NVIDIA: AI Performance Verdict (2026)

Why agentic AI needs an open inference stack

The Roadmap for Mastering LLMOps in 2026

machinelearningmastery.com

Anthropic raises $65B in Series H at a $965B post-money valuation, releases Opus 4.8 and Dynamic Workflows

RL Doesn't Work on Slurm

blog.skypilot.co · Hacker News

3-Part Series: LLM Latency in Production (Part 1)

towardsai.net

Introducing OpenRL: A self-hosted post-training API for fine-tuning LLMs

opensource.googleblog.com · Blogger

Fast and Efficient LLM Inference with vLLM: A New Course with Deeplearning.ai

vllm.ai · Hacker News

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

vllm.ai · Hacker News