Writing an LLM from scratch, part 29 -- using DistributedDataParallel to train a base model from scratch in the cloud

Having trained a base model from scratch on my own machine over 48 hours, I wanted to speed up training by using multiple GPUs in the cloud.