2x GH200 for LLM inference, Part 2: vLLM, DeepSeek V4 Flash, and MTP

Introduction

A while back I did some optimisation on my Hopper system for MiniMax M2.1, then followed it with deeper GH200 benchmarking in which I measured the machine as a memory-shuffling system. The result was a simple topology map: each Hopper has fast local HBM, each Hopper has a fast NVLink C2C path to its own Grace CPU, and the path between the two Hoppers is not a normal GPU peer...
