Delayed Tensor Parallelism for Faster Transformer Inference

Covers pytorch/torchtitan. Discussed on Hacker News.

Delayed Tensor Parallelism (DTP) is a new Transformer architecture that hides inter-GPU communication overhead behind computation and weight streaming, enabling significantly faster batch-size-one inference on AMD and NVIDIA GPUs.
