Understanding Multi-GPU Parallelism Paradigms

We’ve been talking about Transformers all this while, but how do we get the most out of our hardware? There are two different paradigms to consider here. In the first, your model happily fits on one GPU, but you have many GPUs at your disposal and want to save time by distributing the workload across them. In the second, your workload doesn’t fit entirely on a single GPU at all, and you need to work around that. Let’s discuss each of these in a little more detail, with an analogy for each paradigm for easier understanding. Note that anytime we mention GPU or GPUx from here on, you can safely replace it with any compute device: a GPU, a TPU, a set of GPUs on a single node, and so on.
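To make the split between the two paradigms concrete, here is a minimal sketch (not from the post) of the question that decides between them: do the model’s parameters, with a rough fudge factor for gradients, optimizer state, and activations, fit in one GPU’s memory? The `overhead` factor of 4 is an illustrative assumption, not a rule.

```python
import torch

def fits_on_one_gpu(model: torch.nn.Module, device: int = 0, overhead: float = 4.0) -> bool:
    """Rough check: do the parameters, scaled by a fudge factor for gradients,
    optimizer state and activations, fit in one GPU's memory?"""
    param_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    total_bytes = torch.cuda.get_device_properties(device).total_memory
    return param_bytes * overhead < total_bytes

model = torch.nn.Linear(4096, 4096)
# True  -> paradigm 1: replicate the model, split the work across GPUs.
# False -> paradigm 2: the model itself must be split across devices.
print(fits_on_one_gpu(model))
```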

Data Parallelism

Say you have to carry boxes o…
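Whatever the analogy, in code the first paradigm usually looks like the following: a minimal sketch, assuming PyTorch and its DistributedDataParallel wrapper. The layer size, batch size, and optimizer here are illustrative choices, not from the post.

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Every rank holds a full replica of the model ("the model fits on one GPU").
    model = torch.nn.Linear(1024, 1024).to(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)

    # In real training each rank would read a different shard of the dataset
    # (e.g. via DistributedSampler); random data stands in for that here.
    x = torch.randn(32, 1024, device=local_rank)
    loss = model(x).sum()
    loss.backward()   # DDP all-reduces gradients across ranks during backward
    opt.step()        # every replica applies the same averaged update

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with something like `torchrun --nproc_per_node=4 ddp_sketch.py` (a hypothetical filename), each process handles its own slice of the data while the gradient all-reduce keeps every replica in sync.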
