vllm-project/vllm

github.com · Covered in 24 articles from 13 sources

Mi50 32GB / GFX906 - vLLM Qwen 3.5 Configuration for Qwen 3.5:9B AWQ-4bit

huggingface.co · r/LocalLLaMA

MiniMaxAI/MiniMax-M3

huggingface.co · r/LocalLLaMA

Why agentic AI needs an open inference stack

3-Part Series: LLM Latency in Production (Part 1)

towardsai.net

Learn to optimize, deploy, and benchmark LLMs with vLLM: A New Free Course

developers.redhat.com

Fast and Efficient LLM Inference with vLLM: A New Course with Deeplearning.ai

vllm.ai · Hacker News

Improve vLLM Semantic Router accuracy with fine-tuning

developers.redhat.com

The Community Champions Program

The Roadmap for Mastering LLMOps in 2026

machinelearningmastery.com

Introducing OpenRL: A self-hosted post-training API for fine-tuning LLMs

opensource.googleblog.com · Blogger

Anthropic raises $65B in Series H at a $965B post-money valuation, releases Opus 4.8 and Dynamic Workflows

Structured LLM Outputs

dottxt-ai.github.io · Hacker News

EAGLE 3.1: Advancing Speculative Decoding Through Collaboration Between the EAGLE Team, vLLM, and TorchSpec

vllm.ai · Hacker News

Gemma 4 dense by default: why your local agent doesn't want the MoE

Multi-Head Latent Attention (MLA)

How to prevent AI inference stack silent failures

developers.redhat.com

RL Doesn't Work on Slurm

blog.skypilot.co · Hacker News

llmfan46/MiniMax-M2.7-ultra-uncensored-heretic-GGUF

huggingface.co · r/LocalLLaMA

Gemma 4 Didn't Just Get Smarter. It Became a Different Kind of Model. Here's What the Agentic Numbers Actually Mean.

The AI stack every developer will depend on in 2026

Apple M5 Max vs NVIDIA: AI Performance Verdict (2026)

What GenAI Actually Costs in Production

DreamFast/Qwen3.6-27B-Uncensored-HauhauCS-Aggressive-Safetensor-Benchmark

huggingface.co

Used over a million tokens in three separate sessions to test Qwen 3.6 35b (new Multi-token Prediction version)

huggingface.co · r/LocalLLaMA