I’m rounding out my series of posts on Sebastian Raschka’s book "Build a Large Language Model (from Scratch)" by seeing how I could train the best base model I can from scratch on my own hardware. I started by training one in two days on my RTX 3090, and found that while it was a decent little model, it wasn’t as good as the original GPT-2 small, either in terms of the loss it got on my test dataset, or in terms of how good it was at following instruction prompts after fine-tuning on them. I decided that I wanted to see what levers I could pull – dropout, attention weight biases, and so on – to make it better.
For that, I didn’t want to have my PC tied up for days at a time with multiple long training runs, so I learned how to train faster in the cloud. That led to some refinements in the prompt-following test I was using, and I also spent a bit of time on a side quest getting the various models I’d trained onto Hugging Face Hub.
Now it’s time to try the various "interventions", as I’ll call them – the levers to pull to see if I can make the model better. This post is to recap what they are, and to describe what I did to establish a baseline model to compare to.
The interventions
I listed a number of possible interventions at the end of the RTX 3090 post; I’m not going to do them all, but for completeness, here’s the full list:
- The amount of training data. I’m not going to dig into this one; it looks like it does help, but the returns diminish rapidly, so I think that in order to get any serious improvement we’d need to train for much more than two days locally. In the one "extended training" test I did, I managed to get the loss down from 4.167 to 4.135, which was... less-than-inspiring.
- The number of epochs. I’m going to stick to single-epoch training – that is, I’ll train on a single pass through an amount of non-repeating data chosen to take 48 hours to handle on my local machine.
- The bias on the Wq, Wk and Wv matrices. This one definitely sounds worth looking into – easy, as it’s just a change to a config flag, and makes the model more like the original GPT-2. I’ll give that a go.
- Dropout. I’ve read that for single-epoch training, dropout doesn’t help (which doesn’t quite work with my mental model of what it’s for, but does sound plausible). Worth a look!
- The learning rate, and weight decay. The values I’ve used for these are basically copypasta from the book. I think I should learn to understand these and try to optimise them a bit.
- The precision. I’m using AMP, which means that some calculations are done in 16-bit rather than 32-bit, and calling set_float32_matmul_precision with "high" to let PyTorch choose to use the GPU’s tensor cores, which use TF32, a kind of "32-bit float lite" (see the post on the local train for details). Those both (at least potentially) reduce the precision of the train below what you’d get if you trained with full-fat float32. Would reverting that be worth the longer train time? I should probably at least poke at that (there’s a sketch of those settings just after this list).
- The batch size. I’ve already, in effect, tried playing with that. The different cloud machines I played with had different amounts of per-GPU VRAM, so supported different per-GPU micro-batch sizes. So I wound up trying batch sizes from 512 (the same as the original GPT-2 was trained with) down to 104 in the cloud, plus my local trains with a batch size of 6. I did a rough-and-ready calculation at the end of the cloud training post where I estimated that the ideal batch size might be something like 97. So, probably not worth much more investigation.
- Exploding gradients. In one of my local trains, and in three out of the four cloud trains, I had sudden spikes in both training and validation loss. It generally took quite a bit of training – maybe 10-15% of training time – to get back on track after some of these, so we had what could be seen as wasted time in the training runs. Exploding gradients can be fixed by gradient clipping, which is relatively easy to do. Definitely worth investigating!
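For reference – and very much as a minimal sketch rather than the actual training code from my repo – here's roughly what those two precision settings look like in PyTorch. The stand-in linear layer, learning rate and weight decay below are just placeholders:

```python
import torch
import torch.nn.functional as F

# TF32: let matmuls use the GPU's tensor cores ("32-bit float lite").
torch.set_float32_matmul_precision("high")

device = "cuda"
model = torch.nn.Linear(768, 50257).to(device)   # stand-in for the real GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scaler = torch.amp.GradScaler("cuda")            # stops float16 gradients underflowing

def train_step(inputs: torch.Tensor, targets: torch.Tensor) -> float:
    optimizer.zero_grad(set_to_none=True)
    # AMP: the forward pass and loss are computed in mixed precision.
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        loss = F.cross_entropy(model(inputs), targets)
    scaler.scale(loss).backward()   # backward on the scaled loss
    scaler.step(optimizer)          # unscales the gradients, then steps
    scaler.update()
    return loss.item()
```

Dropping AMP and the "high" matmul precision would mean deleting the autocast/GradScaler bits and leaving everything in float32 – at the cost of a slower train.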
I’m going to work through each of those apart from the first two and the batch size (and will retrospectively add links to the posts when I do), trying a train with just that intervention and nothing else, on a cloud machine. Once that’s done, I’ll bake all of the things that helped into the training loop, and do another local train – with gradient accumulation to make the batch size match the cloud instances’.
The cloud machine size that I decided to use for this was the one that came out the most cost-effective (and due to its VRAM size, had the best loss) in my earlier cloud training test: an 8x A100 machine with 40 GiB VRAM per GPU.
But first, we need a baseline model.
Why a new baseline?
I’ve already done a train on an 8x A100 40 GiB machine – why do we need a new one?
In my cloud training post, I came to the conclusion that the cost in terms of training time of running a periodic validation loop as we trained was not really worth it, at least in this case. Two of the biggest reasons to have validation during training are to work out when you’re overfitting on a multi-epoch train, and to see how your model can handle datasets that it has not been trained on.
In a single-epoch train like this, you’re not going to overfit – every sample it sees will be new to it – and the training loss itself is over samples it’s not been trained on at the time it was calculated, for the same reason (though of course it will be trained on them as soon as we do the backward pass starting with that loss).
Of course, it’s not perfect – a big benefit of the validation loss is that it’s over the same held-back dataset on every run – and there are arguments for keeping it (albeit, perhaps doing full runs less frequently than I was). But for these experiments, I decided that I’d simply drop it.
I also wanted to introduce a consistent random seed at the start of the training loop. I didn’t have that in my cloud trains, and of course if we want to have solid results on whether each intervention really does improve matters, then we need one so that we can be sure they’re all starting from the same point.
Both of those meant that I couldn’t use the earlier train on the 8x A100 40 GiB machine as a baseline; I’d need a new one, introducing those two changes: no validation during the training run (using training loss as a proxy), and setting a random seed at the start for reproducibility.
So: what was the baseline train going to look like?
Creating the baseline
The first step was to strip out the validation code and replace it with code that just took periodic checkpoints, keeping track of which one had the best average training loss over the period since the previous checkpoint. Next, I changed the training chart that is generated during the run so that it plots not just the training loss, but also an indicator of the maximum and minimum training loss over all of the steps in each period. Then I added the random seed, which I set to 42.
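The seeding itself is nothing exotic; as a sketch (not the exact code in the repo), it's essentially just a call like this at the start of the training script, before the model is built:

```python
import random

import numpy as np
import torch

def set_seed(seed: int = 42) -> None:
    """Make weight initialisation (and any other randomness) reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG (and new CUDA generators)
    torch.cuda.manual_seed_all(seed)  # belt and braces: seed every GPU explicitly
```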
A couple of bugfixes, and we were left with this version of the code.
One thing to highlight: in the train.json file that specifies the various training parameters, I set the per-GPU micro-batch size to 12 rather than the 13 I’d used on this size of machine earlier. Two reasons for that:
Firstly, I’m going to want to do a local run with gradient accumulation later, using all of the helpful interventions. With gradient accumulation, you do a number of forward/backward passes with micro-batches that fit into your memory, but you don’t update the parameters each time – the gradients just accumulate. After a number of those, you do one big parameter update based on the accumulated gradients – hence the name. The full batch is all of those smaller batches taken together.
If I want that to closely match the cloud train, I’ll want the accumulated batches to be the same size as each global batch in the cloud.
Now, on my local machine, I can fit a batch of 6 into VRAM. So that means that the full batch needs to be divisible by 6¹. On the cloud train, with a micro-batch of 13 and 8 GPUs, we had an overall batch size of 104 in the previous train. 104 is not divisible by 6: no joy. But with a micro-batch size of 12, we have an overall batch of 12×8=96, which means we’d be able to do gradient accumulation and do a parameter update every 96÷6=16 steps.
Secondly, while my estimate of the ideal overall batch size was based on a rather arbitrary bit of curve-fitting, it did say that 97 was the ideal size. So it could be interesting to see whether it did help!
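To make the gradient accumulation arithmetic concrete, here's a minimal, self-contained sketch – with a stand-in model and random data, not the real training loop: sixteen micro-batches of 6, then one parameter update, for an effective batch of 96.

```python
import torch
import torch.nn.functional as F

ACCUM_STEPS = 16     # 16 micro-batches of 6 = one effective batch of 96
MICRO_BATCH = 6

model = torch.nn.Linear(768, 50257)  # stand-in for the real GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

# Fake data loader: a stream of random micro-batches, just for illustration.
def micro_batches(n):
    for _ in range(n):
        yield torch.randn(MICRO_BATCH, 768), torch.randint(0, 50257, (MICRO_BATCH,))

optimizer.zero_grad(set_to_none=True)
for step, (inputs, targets) in enumerate(micro_batches(ACCUM_STEPS * 10), start=1):
    # Divide by ACCUM_STEPS so the accumulated gradient is the *average*
    # over the effective batch, not the sum of 16 per-micro-batch averages.
    loss = F.cross_entropy(model(inputs), targets) / ACCUM_STEPS
    loss.backward()  # gradients accumulate in .grad across micro-batches

    if step % ACCUM_STEPS == 0:
        optimizer.step()                       # one parameter update per 96 samples
        optimizer.zero_grad(set_to_none=True)  # start accumulating afresh
```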
So, having coded that up and set up the configuration, it was time to run it.
Here’s the training chart it came up with:

Note the loss spikes at around global steps 4,200, 13,000 and 23,000. Those are important; I’ll explain why later.
The training run reported this at the end:
Training complete in 12,243.523 seconds
Tokens seen: 3,260,252,160
Throughput: 266,284 tokens/second
Final train loss: 3.743
So it took about 3h24m to train, even less than we expected from the previous cloud experiments’ estimates of how long it would take excluding validation. About US$35 in cost.
Here is the model on Hugging Face Hub.
Let’s see how it looks.
Evals
For these intervention posts, I won’t run the instruction-following tests, as they can only be run against a batch of models in one go to get results that are consistent with each other.
But the smoke test – how does it complete the sequence Every effort moves you – is worthwhile:
giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_smoke.py runs/8xa100m40-baseline/model.json runs/8xa100m40-baseline/checkpoints/best/model.safetensors
Every effort moves you in on a good cause.
If it doesn’t work you would like to join the
Looks good! Reasonably coherent.
Now we can find the loss on our held-back test set:
giles@perry:~/Dev/ddp-base-model-from-scratch (main)$ uv run test_loss.py datasets/ runs/8xa100m40-baseline/model.json runs/8xa100m40-baseline/checkpoints/best/model.safetensors
Fetching 4 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 4/4 [00:00<00:00, 990.57it/s]
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3200/3200 [04:53<00:00, 10.91it/s]
Loss against our test dataset: 3.692
That’s a bit worse than the 3.674 we got for the original cloud train. Either the calculations of the optimal batch size I did were not quite right (entirely likely, they were very ad-hoc) or the model weights we started with, given the random seed we’re using, just happened to lead us in a slightly worse direction (also plausible). Either way, it’s in line with what we expected, and is still better than the test loss of 3.725 that we got with the second-best machine in the cloud comparison post (the 8x H100 80 GiB with a global batch size of 216).
So: we have a solid baseline model – before we wrap up, let’s consider those spikes in the loss that I called out in the training chart.
The loss spikes
Random spikes in the loss are a Bad Thing, right? Certainly they’re a bad thing for a train in general, especially if you don’t know for sure what’s causing them. But my working assumption has been that they’re caused by exploding gradients – for some specific sample in the dataset, the gradients have gone up to some insanely high value, and we’ve had a bad update to our parameters as a result. It hasn’t completely knocked the model back to its starting point, but it does take some time to recover, so we lose the benefit of some of our training.
If that is the case – and it’s not just something like a batch happening to have stuff that’s wildly different to the rest of the training data, or something weird in the optimiser – then gradient clipping is the solution. I wanted to see if it would help the model quality in general, but of course if we hadn’t had any loss spikes in this baseline train it would have been hard to see if that was the case!
So I was very glad to see them here: if there had been none, I would have had to either do a gradient clipping experiment with no real expectation of it helping, or do another baseline train with a different random seed in the hope that that caused some spikes – which would have cost another US$35.
All in all, it was good to see them there, as it sets us up well for that experiment.
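For what it’s worth, gradient clipping itself is usually just one extra line between the backward pass and the optimiser step. Here’s a minimal sketch – again with a stand-in model, and with 1.0 as an assumed, commonly-used max norm rather than a value I’ve settled on:

```python
import torch
import torch.nn.functional as F

MAX_GRAD_NORM = 1.0  # a common default; the right value is something to experiment with

model = torch.nn.Linear(768, 50257)  # stand-in for the real GPT model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

inputs = torch.randn(6, 768)
targets = torch.randint(0, 50257, (6,))

loss = F.cross_entropy(model(inputs), targets)
loss.backward()

# Rescale the gradients so their overall norm is at most MAX_GRAD_NORM,
# limiting the damage a single pathological batch can do to the parameters.
# (With AMP, you'd call scaler.unscale_(optimizer) before clipping.)
torch.nn.utils.clip_grad_norm_(model.parameters(), MAX_GRAD_NORM)
optimizer.step()
```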
Wrapping up
So, we’ve trained a baseline model that we can make changes to – the interventions I listed at the start – and get a pretty reliable understanding of whether or not they help the quality of the final model. With that in place, we’re in a good position to start running those intervention tests!
Given the loss spike situation in that chart, I think that a solid first one to go for – even though it was the last in that list at the top of this post – is gradient clipping. Where are those loss spikes coming from, and if it’s exploding gradients, what happens if we limit the damage they do with gradient clipping?
Stay tuned! I’ve already done the training run for that (while I wrote this one up), so I should be able to post about it tomorrow.
¹ Well, you could potentially do something with batches of different sizes, but that would be fiddly. ↩