DEV Community

I spent two weeks optimizing 96GB of VRAM for local LLMs. Paid APIs still won. (opens in new tab)

Discussed on DEV

I run a homelab with four RTX 3090s — 96 GB of VRAM, 44 CPU cores. For two weeks I tried to make it my daily driver for local LLM inference instead of paying for cloud APIs. I got it working. Then I looked at the numbers and subscribed to a paid API anyway. Here's the uncomfortable part, and the optimizations that still made it worth doing. ## The setup 4× RTX 3090 (Ampere — no native BF16), 96 GB VRAM total, 44 cores Models: Qwen3.6-35B-A3B (Q8_0, MoE) and Qwen3-Coder-Next (Q6_K, hybrid) lla...

Read the original article
Sign in to keep reading the full article.

Keyboard Shortcuts

Navigation

Next / previous post
j/k
Open post
oorEnter
Preview post
v

Post Actions

Love post
a
Like post
l
Dislike post
d
Undo reaction
u
Save / unsave
s

Recommendations

Add interest / feed
Enter
Not interested
x

Go to

Home
gh
Interests
gi
Feeds
gf
Likes
gl
History
gy
Changelog
gc
Settings
gs
Discover
gb
Search
/

General

Show this help
?
Submit feedback
!
Close modal / unfocus
Esc

Press ? anytime to show this help