2x GH200 for LLM inference, Part 3: GLM-5.2, expert offload, and the CPU question
Introduction

Part 1 measured the dual GH200 workstation as a memory system. Part 2 used those measurements to explain why DeepSeek V4 Flash can be fast in vLLM when the model layout fits the hardware: keep hot weights in HBM, avoid unnecessary Hopper-to-Hopper traffic, and use MTP only where the acceptance rate pays for the draft work. GLM-5.2 starts at 2.39 output tok/s on this machine and a...