I spent two weeks optimizing 96GB of VRAM for local LLMs. Paid APIs still won. (opens in new tab)
I run a homelab with four RTX 3090s — 96 GB of VRAM, 44 CPU cores. For two weeks I tried to make it my daily driver for local LLM inference instead of paying for cloud APIs. I got it working. Then I looked at the numbers and subscribed to a paid API anyway. Here's the uncomfortable part, and the optimizations that still made it worth doing. ## The setup 4× RTX 3090 (Ampere — no native BF16), 96 GB VRAM total, 44 cores Models: Qwen3.6-35B-A3B (Q8_0, MoE) and Qwen3-Coder-Next (Q6_K, hybrid) lla...
Read the original article