People here often show off their dual-GPU rigs. And here I am, showing off my dual-MacBook setup :P jk jk, stay with me, don't laugh.
The setup:
- M2 Max MacBook with 64GB unified memory, serving the LLM via LM Studio
- M1 Pro MacBook with 16GB unified memory (doesn't matter), acting as the client, running Claude Code
The model I'm using is Qwen3 Coder 30B A3B, Q8 MLX (temp = 0.1, repeat penalty = 1.05, top k = 20, context size = 51200). To my surprise, both the code quality and the stability in Claude Code have been really good.
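For anyone curious how the two machines talk to each other: it's just LM Studio's OpenAI-compatible server on the M2 Max, and any client on the LAN can hit it directly. Here's a minimal sketch for sanity-checking the server with the same sampling settings; the IP, model name, and the extra top_k / repeat_penalty fields are assumptions on my part, so adjust to whatever your LM Studio instance actually exposes:

```python
# Rough sketch of querying the serving MacBook from the client machine.
# Assumptions: the M2 Max sits at 192.168.1.50 (hypothetical LAN address) and
# LM Studio's local server runs on its default port 1234 with an
# OpenAI-compatible /v1/chat/completions endpoint. top_k / repeat_penalty are
# passed as extra sampling fields; whether they're honored depends on the
# LM Studio version, so treat them as best-effort.
import requests

LLM_SERVER = "http://192.168.1.50:1234/v1/chat/completions"  # hypothetical address

payload = {
    "model": "qwen3-coder-30b-a3b-mlx",  # use the name shown in your LM Studio
    "messages": [
        {"role": "system", "content": "You are a coding assistant."},
        {"role": "user", "content": "Summarize what src/parser.py does."},
    ],
    "temperature": 0.1,
    "top_k": 20,
    "repeat_penalty": 1.05,
    "max_tokens": 1024,
}

resp = requests.post(LLM_SERVER, json=payload, timeout=300)
print(resp.json()["choices"][0]["message"]["content"])
```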
I'd tried 32B models for coding before, back when QwQ 32B and Qwen2.5 Coder were the go-to options, and none of them really worked for me. With Qwen3, it feels like we finally have an actually useful offline model that I can be happy working with.
Now, back to the dual-MBP setup: you may ask, why? The main thing is that the 64GB MBP runs in clamshell mode and its only job is LLM inference, nothing else, so I can spare a bit more memory for the Q8 quant instead of Q4.
You can see in the screenshot below that it takes 27GB of memory to sit idle with the model loaded, and 47GB during generation.
https://i.imgur.com/fTxdDRO.png
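If you want a rough sense of where that memory goes: weights at Q8 are roughly one byte per parameter, and the KV cache grows with context. A back-of-envelope sketch below; the layer/head/dim numbers are my assumptions about the architecture rather than something I've verified, and real usage also includes activations and framework overhead:

```python
# Back-of-envelope memory estimate for a ~30B-parameter model served at Q8.
# The architecture numbers (layers, KV heads, head dim) are assumed, not taken
# from the model card, so treat the result as an order-of-magnitude guide.
params_billion = 30.5          # total parameters (MoE total, not active)
bytes_per_weight = 1.0         # ~8-bit quantization

n_layers = 48                  # assumed
n_kv_heads = 4                 # assumed (GQA)
head_dim = 128                 # assumed
kv_bytes_per_elem = 2          # fp16 KV cache
context = 51_200               # the context size I'm running

weights_gb = params_billion * bytes_per_weight                            # ~30 GB of weights
kv_per_token = n_layers * n_kv_heads * head_dim * 2 * kv_bytes_per_elem   # K and V
kv_gb = kv_per_token * context / 1e9                                      # ~5 GB at full context

print(f"weights ~{weights_gb:.0f} GB, KV cache at 51K ctx ~{kv_gb:.1f} GB")
```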
The second MacBook is unnecessary; it's just something I had at hand. I could run Claude Code from my phone or a Pi if needed.
Now, on inference performance: if I just chat with Qwen3 Coder in LM Studio, it runs really fast. But with Claude Code's hefty system prompt, prompt processing takes about 2 to 3 seconds per request (not so bad), and token generation sits around 56 tok/s, which is perfectly comfortable to use.
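To put that in perspective, here's the rough math I use to judge whether a turn will feel responsive; the response length is just an assumed example, not a measurement:

```python
# Rough end-to-end latency estimate for one Claude Code turn at these speeds.
prompt_processing_s = 3.0      # observed: ~2-3 s per request
gen_rate_tok_s = 56            # observed generation speed
response_tokens = 700          # assumed typical response length

total_s = prompt_processing_s + response_tokens / gen_rate_tok_s
print(f"~{total_s:.0f} s per turn")   # roughly 15-16 seconds
```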
On Qwen3 Coder's performance: my main workflow is asking Claude Code to search the codebase and answer some of my questions. Qwen3 does very well here; answer quality is usually on par with the frontier LLMs I use in Cursor. Then I'll write a more detailed instruction for the task and let it edit the code. I find that the more detailed my prompt, the better the code Qwen3 generates.
The only downside is that Claude Code's web search won't work with this setup. It can be solved with an MCP server, and I don't rely on web search in CC that much anyway.
When I eventually need to move off the work laptop, I don't know whether I'll build a custom PC with a dedicated GPU or just go with a mini PC with unified memory; getting more than 24GB of VRAM with a dedicated GPU will be costly.
I've also heard people say the 32B dense model works better than the A3B MoE, but slower. I'll probably try it at some point, but for now I feel quite comfortable with this setup.
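For anyone weighing the same tradeoff: decoding on Apple Silicon is mostly memory-bandwidth bound, so tokens/s scales roughly with how many bytes of weights each token has to read. Here's a rough sketch of why the A3B MoE decodes so much faster than a 32B dense model; the ~400 GB/s figure is the M2 Max's advertised bandwidth, the active-parameter count is my assumption, and these are ceilings that real throughput lands well below:

```python
# Rough upper-bound decode speed from memory bandwidth, assuming decode is
# bandwidth-bound and each token reads every active weight once at Q8 (~1 byte/param).
bandwidth_gb_s = 400           # M2 Max advertised memory bandwidth
bytes_per_param = 1.0          # ~8-bit quantization

active_params_moe_b = 3.0      # assumed active params per token for the A3B MoE
params_dense_b = 32.0          # 32B dense model reads everything every token

moe_ceiling = bandwidth_gb_s / (active_params_moe_b * bytes_per_param)    # ~130 tok/s
dense_ceiling = bandwidth_gb_s / (params_dense_b * bytes_per_param)       # ~12 tok/s

print(f"MoE ceiling ~{moe_ceiling:.0f} tok/s, dense ceiling ~{dense_ceiling:.0f} tok/s")
```

The observed 56 tok/s sits comfortably under the MoE ceiling, which is consistent with the "dense is better but slower" reports.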