Delayed Tensor Parallelism for Faster Transformer Inference
DTP is a new Transformer architecture that hides communication overhead behind computation and weight streaming, enabling significantly faster batch-size-one inference on AMD and NVIDIA GPUs.