This post wraps up my notes on chapter 5 of Sebastian Raschka’s book “Build a Large Language Model (from Scratch)”. Understanding cross entropy loss and perplexity were the hard bits for me in this chapter – the remaining 28 pages were more a case of plugging bits together and running the code, to see what happens.
The shortness of this post almost feels like a damp squib. After writing so much in the last 22 posts, there’s really not all that much to say – but that hides the fact that this part of the book is probably the most exciting to work through. All these pieces developed with such care, and with so much to learn, over the preceding 140 pages, with not all that much to show – and suddenly, we have a codebase that we can let rip on a training set – and our model starts talking to us!
I trained my model on the sample dataset that we use in the book, the 20,000 characters of “The Verdict” by Edith Wharton, and then ran it to predict next tokens after “Every effort moves you”. I got:
Every effort moves you in," was down surprise a was one of lo "I quote.
Not bad for a model trained on such a small amount of data (in just over ten seconds).
The next step was to download the weights for the original 124M-parameter version of GPT-2 from OpenAI, following the instructions in the book, and then to load them into my model. With those weights, against the same prompt, I got this:
Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control flow. As you may observe I
That’s amazingly cool. Coherent enough that you could believe it’s part of the instructions for a game.
Now, I won’t go through the remainder of the chapter in detail – as I said, it’s essentially just plugging together the various bits that we’ve gone through so far, even though the results are brilliant. In this post I’m just going to make a few brief notes on the things that I found interesting.
Randomness and seeding
One thing I really do recommend to anyone working through the book is that you type in all of the code, and run it yourself – it really will help you remember how stuff fits together.
There is one slight issue I found with that, however: the book has a number of examples where you get output from code that uses randomness – for example, where you take a look at the loss the model has on some sample text before you start training, or make it generate samples during the training run.
Now, in theory, because Raschka puts torch.manual_seed calls before all of these, the results you get should be exactly the same as the outputs in the book. However, the amount of code we’re working with at this stage is quite large – we have various helper functions that were created in earlier sections, for example. And some of these use randomness.
That means that to get the same results as the ones in the book, you would need to ensure that all of the code that uses randomness was running in exactly the same order as it was when Raschka did it for the book. That turns out to be surprisingly hard!
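To make that concrete, here’s a tiny illustration (my own, not from the book) of why the ordering matters even with a fixed seed – it’s the position of a call in the stream of random draws that determines what it gets back:

```python
import torch

# With the same seed, it's the *position* of a call in the stream of random
# draws that determines its value. If a helper consumes randomness in a
# different order than it did for the book, later values will differ.
torch.manual_seed(123)
first_draw = torch.rand(2)
second_draw = torch.rand(2)

torch.manual_seed(123)
repeat_first = torch.rand(2)    # same position in the stream as first_draw
repeat_second = torch.rand(2)

print(torch.equal(first_draw, repeat_first))    # True
print(torch.equal(first_draw, repeat_second))   # False -- different stream position
```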
My instinct is that it doesn’t actually matter all that much. So long as the loss numbers that you see are in the same ballpark as the ones in the book, and the outputs you see are roughly equally incoherent (before training) and become more coherent at what feels like the same kind of rate, you’re fine. Probably the most important one to look out for is when the training run starts – you should see loss on the training set decreasing steadily, just like in the book, and likewise as in the book, the validation loss should plateau out pretty early.
Optimisers
When I have built simple backpropagation through neural networks in the past, I’ve generally updated parameters by multiplying the gradients by a small number, the learning rate, and then subtracting them from their respective parameters to get updated ones – classic stochastic gradient descent.
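As a reminder to myself of what that looks like, here’s a minimal sketch of that kind of hand-rolled update – the toy parameter and numbers are made up purely for illustration:

```python
import torch

# A toy "do it yourself" update: one parameter, one made-up input/target pair.
w = torch.tensor([2.0], requires_grad=True)
x, y = torch.tensor([3.0]), torch.tensor([7.0])
learning_rate = 0.01

loss = ((w * x - y) ** 2).mean()   # squared error for this single example
loss.backward()                    # fills in w.grad

with torch.no_grad():
    w -= learning_rate * w.grad    # classic SGD: parameter -= lr * gradient
    w.grad.zero_()                 # reset for the next step
```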
Non-trivial ML uses optimisers; I’d come across them while fine-tuning LLMs, and also used one in the RNN code I wrote last week. Instead of updating the parameters yourself, you ask the optimiser to do it for you, by calling its step function. AdamW appears to be the default optimiser in most textbooks, though Muon seems to be the most popular in use, if my AI X/Twitter feed is to be believed.
I don’t understand how optimisers work in any detail, and I’m going to have to dig into that in the future. However, my high-level simplified picture right now is that they dynamically adjust the learning rate over time, so that it’s easier to take big “jumps” downwards on the gradients when you start, and then smaller ones later. I believe they can also sometimes avoid local minima in the loss landscape – a nice metaphor I read somewhere (lost the source, sadly) was that simple gradient descent was like rolling a ball down a hill, but (some?) optimisers give the ball a bit of momentum so that it can coast over a small uphill portion, so long as the general slope is downwards.
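To pin the metaphor down a little, here’s a hedged sketch of classic SGD with momentum – which, to be clear, is not what AdamW actually does, but it shows the “ball with momentum” mechanism:

```python
import torch

# Classic SGD with momentum (a sketch, not AdamW): the update follows an
# accumulated velocity, so a short uphill stretch doesn't immediately stop
# the descent.
def momentum_step(param: torch.Tensor, velocity: torch.Tensor,
                  lr: float = 0.01, momentum: float = 0.9) -> None:
    with torch.no_grad():
        velocity.mul_(momentum).add_(param.grad)  # v = momentum * v + grad
        param -= lr * velocity                    # move along the velocity, not just the raw gradient
        param.grad.zero_()
```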
Anyway, more investigation needed later.
In practice, with AdamW, you initialise it at the start of your training loop, with a learning rate (which I imagine is similar to the one my older code used, a scaling factor for gradients) and a weight decay (:shrug:). You also provide it with the parameters it’s going to be managing.
In the training loop, at the start of each input batch, you tell it to zero out the gradients it’s managing with optimizer.zero_grad(), run the data through your model and calculate your loss, and then, after calling loss.backward() to get your gradients, you just call optimizer.step(), and that does the parameter update.
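Putting that together, the loop looks something like this – `model`, `train_loader` and `num_epochs` stand in for whatever you’ve already built, and the hyperparameters are illustrative rather than the book’s exact values:

```python
import torch

# A minimal sketch of the loop described above; model/train_loader/num_epochs
# are placeholders, and the hyperparameter values are just for illustration.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4, weight_decay=0.1)

for epoch in range(num_epochs):
    model.train()
    for input_batch, target_batch in train_loader:
        optimizer.zero_grad()              # clear the gradients from the previous batch
        logits = model(input_batch)
        loss = torch.nn.functional.cross_entropy(
            logits.flatten(0, 1),          # (batch * seq_len, vocab_size)
            target_batch.flatten()         # (batch * seq_len,)
        )
        loss.backward()                    # compute gradients
        optimizer.step()                   # let AdamW update the parameters
```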
Again, I want to dig into how optimisers work in more detail in the future. But for now, I think that’s all I need to know.
Speed, and the cost of training
The book tells you how to train on a public domain book, “The Verdict” by Edith Wharton. Full training on the hardware that people are likely to have to hand would be extremely expensive, so we just train on that short example, then later on learn how to download and use the weights that OpenAI made available for their GPT-2 models.
But there was something that surprised me a little. When talking about the training run on “The Verdict”, Raschka says that it takes “about 5 minutes to complete on a MacBook Air”.
On my machine, using CUDA on an RTX 3090, it took just under eleven seconds.
This makes perfect sense, of course – there’s a really good reason why AI training is normally done on GPUs or custom hardware, and the MacBook Air would presumably be training on the CPU. But I was a little surprised at how huge the difference was in this simple example!
Now, while the book mentions that Llama 2 probably cost hundreds of thousands of dollars to train, I must admit that I do wonder how much it really would cost to train a 124M parameter model on my own hardware – or, indeed, on the machines with 8x 80GiB A100 GPUs that I rented from Lambda Labs during my fine-tuning experiments.
Andrej Karpathy was able to train a 124M GPT-2 model for $20, using his hand-written C/CUDA LLM system llm.c. That is undoubtedly more efficient than the PyTorch code that we’re working on in this book. But it really would be interesting to find out whether it would be doable for me at all! The training data he used is the 10B-token version of the FineWeb collection, which is freely available. 1
I think I have a good candidate for a next project when I’ve finished the book; see how many tokens/second I can train on locally – that will allow me to estimate how long it would take to train one epoch over the whole training set. I imagine that will be longer than I’m willing to leave my desktop machine tied up doing this, but then I can try mixing in the lessons I learned doing fine-tuning, and see if I can get it up and running on Lambda Labs. If the cost is in the tens of dollars, or even a hundred or so, I really think it would be worthwhile!
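Just as a back-of-the-envelope, the sum I’d be doing looks like this – the throughput number is a pure placeholder that I’d need to measure:

```python
# Back-of-the-envelope arithmetic, not a benchmark: the throughput figure
# below is a made-up placeholder to be replaced with a measured value.
tokens_in_dataset = 10_000_000_000    # the 10B-token FineWeb sample
tokens_per_second = 50_000            # hypothetical local training throughput

seconds_per_epoch = tokens_in_dataset / tokens_per_second
print(f"~{seconds_per_epoch / 3600:.0f} hours "
      f"(~{seconds_per_epoch / 86400:.1f} days) for one epoch")
```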
“Memorisation”, temperature and top-k sampling
One thing I found a little confusing in this chapter – and this is very much a nit – was the section on preventing “memorisation”; I think this was due to a mismatch in the meaning I attach to the word, and the way it’s used here.
To me, memorisation is something that the model does during training – if you keep training a 124M-parameter model on a 20,000-character file, as we’re doing here, then whatever happens the model is going to memorise it – it’s unavoidable. The only way to reduce memorisation in this sense would be to increase the amount of training data (and even then, as the findings in the lawsuit by the New York Times against OpenAI show, some stuff would be memorised).
In the book, “memorisation” is being used to mean something more like what I’d call “parroting” – issues with the model just repeating the stuff that it has memorised, because it was always choosing the most-probable next word. Avoiding this is super-important, of course! It’s just the framing that confused me a little.
The techniques are nifty, anyway. The first cut – just use the softmaxed logits as a probability distribution and sample from it – is obvious enough. Temperature is a clever trick on top of that – just divide the logits by some number greater than one before softmax, and you can make the distribution that comes out flatter (or you can make it more “pointy” by dividing by a number less than 1). The graphs in the book showing how that works are great, but I asked Claude to knock together a temperature playground website, which I found made things even clearer to me.
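In code, temperature scaling is just this (my own sketch, not the book’s exact code):

```python
import torch

# Temperature scaling: divide the logits by a temperature T before softmax.
# T > 1 flattens the distribution, T < 1 sharpens it, T = 1 leaves it unchanged.
def temperature_softmax(logits: torch.Tensor, temperature: float = 1.0) -> torch.Tensor:
    return torch.softmax(logits / temperature, dim=-1)

logits = torch.tensor([4.0, 2.0, 1.0])
print(temperature_softmax(logits, 1.0))   # fairly peaked
print(temperature_softmax(logits, 5.0))   # much flatter
print(temperature_softmax(logits, 0.5))   # sharper still
```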
And finally, the top-k technique – only consider the k most probable tokens, and then do the temperature/softmax calculations – was a sensible addition on top of that. The code is clever: identify the top k logits, get the value of the lowest one of them, and then replace every logit less than that with minus infinity. When you run that through softmax, you get zeros for the ones that were replaced, and the probability distribution is based on the remainder.
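Here’s my paraphrase of that in code – a sketch rather than the book’s exact implementation:

```python
import torch

# Top-k sampling: keep the k largest logits, mask everything below the k-th
# largest with -inf, then apply temperature, softmax and sample.
def sample_top_k(logits: torch.Tensor, k: int = 5, temperature: float = 1.0) -> int:
    top_logits, _ = torch.topk(logits, k)
    min_kept = top_logits[-1]                              # smallest logit we keep
    masked = torch.where(logits < min_kept,
                         torch.tensor(float("-inf")), logits)
    probs = torch.softmax(masked / temperature, dim=-1)    # -inf becomes probability 0
    return torch.multinomial(probs, num_samples=1).item()  # sampled token id
```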
So: excellent stuff, and very well explained in the book – it just didn’t feel like preventing “memorisation” specifically was what it was doing, at least based on what I take the word to mean.
Downloading the OpenAI weights
At the end of the chapter, we download the weights for the original GPT-2 model that OpenAI produced from their site, and load them into our own model.
The code to download weights is (thankfully) something that you don’t need to type in, as it’s downloadable from GitHub. And in one specific related case, I’ll also contradict what I said earlier about typing stuff in yourself – I definitely recommend that you copy the load_weights_into_gpt function, which copies the downloaded weights into our own model, from GitHub too. I did actually type it all in, and I don’t think I gained anything from doing that.
One thing I did notice while going through that section was that I’d been making a mistake as I wrote up this series; I’d thought that all GPT-2 models had 768 embedding dimensions. It turns out that this is only true of the 124M model in that series, and the larger ones have more. That makes a lot of sense – and I’ve updated the older posts to reflect it.
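For reference, the family sizes as I understand them – worth double-checking against the book rather than trusting my memory:

```python
# The GPT-2 family sizes as I understand them (embedding dim / layers / heads).
GPT2_CONFIGS = {
    "gpt2-small (124M)":  {"emb_dim": 768,  "n_layers": 12, "n_heads": 12},
    "gpt2-medium (355M)": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "gpt2-large (774M)":  {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "gpt2-xl (1558M)":    {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}
```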
Wrapping up
That’s all I really have to add to what is in the rest of chapter 5. Like I said at the start, it feels almost like a let-down to be writing so little about a section of the book that has such amazing results! But now we have a working LLM, and at least the foundations that might allow us to train our own from scratch if we had the resources.
Next up: using it to classify text. Will this be quick and easy? Or will it lead down another fascinating rabbit hole? Time will tell...
His new nanochat – a from-scratch trainable chatbot – is even cooler. ↩