We’ve been talking about Transformers all this while. But how do we get the most out of our hardware? There are two paradigms to consider. In the first, your model happily fits on one GPU, but you have many GPUs at your disposal and want to save time by distributing the workload across them. In the second, your workload doesn’t fit on a single GPU at all, and you need to work around that. Let’s discuss each of these in a little more detail, with an analogy for each paradigm for easier understanding. Note that anytime we mention GPU or GPUx from here on, you can safely replace it with any compute device: a GPU, a TPU, a set of GPUs on a single node, and so on.

Data Parallelism

Say you have to carry boxes o…
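The core of data parallelism is that each worker computes gradients on its own shard of the batch, and averaging those local gradients gives the same result as a single full-batch gradient step. Here is a minimal NumPy sketch of that idea (the linear model, the `n_devices` count, and all variable names are our own illustration, not from the post):

```python
import numpy as np

# Toy data-parallel step: split one batch across simulated "devices",
# compute each shard's gradient locally, then average (the role that an
# all-reduce plays on real hardware).
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # full batch: 8 examples, 4 features
y = rng.normal(size=(8,))
w = rng.normal(size=(4,))     # shared model weights (replicated on every device)

def grad(X_shard, y_shard, w):
    # Gradient of the mean-squared-error loss 0.5 * mean((Xw - y)^2)
    return X_shard.T @ (X_shard @ w - y_shard) / len(y_shard)

n_devices = 4  # hypothetical device count for the sketch
shards = zip(np.split(X, n_devices), np.split(y, n_devices))
local_grads = [grad(Xs, ys, w) for Xs, ys in shards]  # one grad per device
avg_grad = np.mean(local_grads, axis=0)               # "all-reduce" by mean

# With equal-sized shards, the averaged gradient matches the full-batch one.
full_grad = grad(X, y, w)
assert np.allclose(avg_grad, full_grad)
```

Each device holds a full copy of the weights and only the data is split, which is why this paradigm only applies when the model itself fits on one device.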
