This post is on the second half of chapter 7 of Sebastian Raschka’s book “Build a Large Language Model (from Scratch)”. In the last post I covered the part of the chapter that covers instruction fine-tuning; this time round, we evaluate our model – particularly interestingly, we try using another, smarter, model to judge how good its responses are.
Once again, Raschka’s explanation in this section is very clear, and there’s not that much that was conceptually new to me, so I don’t have that many notes – in fact, this post is probably the shortest one in my series so far!
Generating the test set responses
Unusually, when generating the sample responses for the instructions in our test set at the start of section 7.7, I got exactly the same results as in the book. For once, I guess, everything that uses randomness happened in the same order as it did when Raschka ran it on his machine.
The next step was to generate a file with all of the responses to all of the test instructions, which took 18.9 seconds on my RTX 3090 (compared to a minute on an A100, per the book – that's quite surprising!).
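For anyone following along without the book open, that step is conceptually just a loop: format each test-set entry into an Alpaca-style prompt, generate with the fine-tuned model, strip the prompt and the "### Response:" marker from the output, and write everything back out as JSON. Here's a minimal sketch of that loop, assuming the format_input, text_to_token_ids, token_ids_to_text and generate helpers from earlier chapters, plus an already-loaded model, tokenizer, device and test_data list; the book's exact code differs in the details.

import json

import torch

# Minimal sketch of the test-set response generation step. Assumes `model`,
# `tokenizer`, `device` and `test_data` already exist, along with the
# `format_input`, `text_to_token_ids`, `token_ids_to_text` and `generate`
# helpers from earlier chapters; details may differ from the book's code.
def generate_test_responses(path="instruction-data-with-response.json"):
    model.eval()
    with torch.no_grad():
        for entry in test_data:
            prompt = format_input(entry)  # Alpaca-style prompt for this instruction
            token_ids = generate(
                model=model,
                idx=text_to_token_ids(prompt, tokenizer).to(device),
                max_new_tokens=256,
                context_size=1024,   # GPT-2's context length
                eos_id=50256,        # GPT-2's <|endoftext|> token
            )
            full_text = token_ids_to_text(token_ids, tokenizer)
            # Keep only the text after the prompt and the "### Response:" marker
            response = full_text[len(prompt):].replace("### Response:", "").strip()
            entry["model_response"] = response

    with open(path, "w") as f:
        json.dump(test_data, f, indent=4)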
Once that was done, it was time to install Ollama so that I could use the Llama 3 model to evaluate my own.
Ollama
I’ve never used Ollama before – when playing with other people’s models, I’ve always used Hugging Face’s Transformers library.
It’s a neat package, though. It wraps llama.cpp, which is a pure C/C++ inference framework (with CUDA support), and makes it easy to download and run models that have been packaged for it. Being written in C/C++, I would imagine that it’s faster than PyTorch/Transformers – though, being inference-only, it’s less useful if you’re planning to do things like training or fine-tuning the models.
My desktop is running a fairly customised install of Arch Linux, and I didn’t want to use the default install procedure (which puts it into your system-wide /bin and /lib directories). But it turns out that it’s a very well-packaged app, and you don’t need to do that.
Using the manual install instructions for Linux, I just created a new directory ~/Dev/ollama, cd'd into it, and downloaded the tarball:
wget https://ollama.com/download/ollama-linux-amd64.tgz
It was about 1.75 GiB. I then untarred it:
tar xf ollama-linux-amd64.tgz
...and then I could run commands with full paths, for example:
~/Dev/ollama/bin/ollama serve
...to start up the server, or
~/Dev/ollama/bin/ollama run llama3
...to start a session.
Neat! It’s always good to see pre-built binary packages that have no issues with their install location.
Actually running the evaluation
The next step was to throw all of the generated test responses (and their associated targets) at Llama 3 and see what it thought about how close they were.
Again, this all worked without trouble. I noted that the responses I was getting from Llama 3 were not the same as the ones in the book – Raschka notes that Ollama is non-deterministic, so there’s no surprise there (though it does make me wonder why it accepts a seed parameter in the API call).
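For context, those calls go to Ollama's local REST API, which listens on port 11434 by default, and the seed is passed in the options field of the request body. Here's a rough sketch of such a query (assuming the server is running and llama3 has been pulled; the specific option values here are my choices, not necessarily the book's):

import json
import urllib.request

def query_ollama(prompt, model="llama3", url="http://localhost:11434/api/chat"):
    # Build the request payload; "seed" and "temperature" live under "options".
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "options": {"seed": 123, "temperature": 0, "num_ctx": 2048},
        "stream": False,  # ask for one complete response rather than streamed chunks
    }
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["message"]["content"]

print(query_ollama("What do llamas eat?"))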
When I got on to the final eval, where you run the test results through Llama 3 and ask it to rate them compared to the target outputs, it took 11 seconds to run, and I got an average score of 48.95 / 100, which is close enough to the 50.32 that appears in the book.¹ I’d run an eval on my model, using a smarter model to judge its responses!
Somewhat surprisingly, that number was stable over multiple runs. So perhaps there is some level of determinism in Ollama now that wasn’t present when the book was written, and the seed (e.g. 123) does actually have an effect. Or perhaps Raschka’s comment about it being non-deterministic was more of a “between machines” thing rather than about multiple runs on the same machine – though then I’m not sure why he suggests re-running it to get multiple results.
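For completeness, the scoring loop itself is also simple: for each test entry, show Llama 3 the instruction, the expected output and my model's response, ask for an integer score between 0 and 100, and average everything that parses. A rough sketch, reusing the query_ollama helper sketched above and the format_input helper from earlier chapters (the prompt wording is my paraphrase, not the book's exact prompt):

def score_responses(test_data, judge_model="llama3"):
    scores = []
    for entry in test_data:
        prompt = (
            f"Given the input `{format_input(entry)}` "
            f"and the correct output `{entry['output']}`, "
            f"score the model response `{entry['model_response']}` "
            f"on a scale from 0 to 100, where 100 is the best score. "
            f"Respond with the integer number only."
        )
        answer = query_ollama(prompt, model=judge_model)
        try:
            scores.append(int(answer.strip()))
        except ValueError:
            # Skip any reply that isn't a bare integer (see the footnote below)
            continue
    return sum(scores) / len(scores) if scores else 0.0

print(f"Average score: {score_responses(test_data):.2f} / 100")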
Anyway – that was it! Eval done. And, to my amazement, that was the end of the chapter – and almost the end of the book. We’ve built an LLM from scratch, fine-tuned it, and evaluated it by using a smarter model to judge how well it was following instructions.
This is the end...
...or at least the end of the beginning.
Having run the evaluation, I’ve reached the end of the main part of “Build a Large Language Model (from Scratch)”. But I don’t think I’ve reached the end of this project; there’s still more to do (not least working through the appendices).
So, coming up next: a post summarising what I’ve got through so far in this series, and what the next steps are to wrap it up.
¹ I also got 110 out of 110 scores – that is, every response from Llama 3 was parseable as an integer. That actually kind of surprised me! Models like to be chatty and helpful. But looking into it, the famous X post by Riley Goodside where he had to “threaten” Bard to stop it from saying “Sure, no problem! Here’s your JSON” was almost two years ago.