Qwen3.6-27B + vLLM + Hermes on 24GB VRAM: May 2026 Recipe
If you want to reproduce my current local Hermes Agent + Qwen3.6-27B setup, this is the shape I would start from.

Target

One local coding agent. One 24GB GPU. Long context. Tools enabled. Thinking enabled. No child agents fighting the main request.

The goal is not peak tok/s on a short prompt. The goal is: can the same agent session keep working after hours of tool calls without losing prefix locality, timing out during prefill, or getting wrecked by auxiliary requests?

Model

This setup is in...