LLM Fine-Tuning: LoRA vs Full Fine-Tuning — a Comparison
What is fine-tuning, why would you use it, what are some ways to do it, and how to measure the results? All that, and more, in this article.
If you run an LLM, and you want to steer its behavior (make it do things it would not normally do), you have a range of options. Prompting will influence its behavior: set a system prompt carefully, and it will apply to everything the model does. Plugging the model into external data sources, for example via MCPs, will also change the outputs.
But prompting and data sources only go so far. If you need to make radical changes to the style of the output, the model itself needs to change. One way to do this is to put the model through a bit of extra training: take a dataset that looks like the output style you want to see from the model, and run a training loop with it. Since the model has already been trained from scratch, we don’t call this “training”, but rather “fine-tuning”. It is a form of supervised fine-tuning.
The model will learn from the dataset used in fine-tuning. Its style and tone will resemble the dataset, and it will talk like the person who wrote the comments in the dataset. It may also learn a few facts contained only in this new dataset, but poorly and unreliably. MCPs are still better at teaching a model new facts.
Purpose
I wanted to have a clear comparison of various LLM fine-tuning techniques, with performance numbers attached to the variants. The choice of dataset is incidental; I simply used what I had.
In real life, this would be done to make the LLM “talk more like the dataset”.
Types of Fine-Tuning
Full Fine-Tuning
The most obvious technique is to train the model as-is: the entire original model undergoes training, and all model weights are adjusted. You are not limited to some subset of the weights; the whole “volume” of the model is used, and the effects of training are spread across all of its parameters.
But this is also the most compute-intensive and memory-intensive process. Exact numbers may vary, but if you compare inference (just running the model) with full fine-tuning, the latter may require an order of magnitude more RAM, or beyond. The amount of compute involved is also huge.
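Some back-of-the-envelope arithmetic makes the gap concrete. The sketch below assumes bf16 weights and a standard AdamW setup with fp32 master weights and optimizer states; real usage also includes activations and framework overhead, so treat these as rough lower bounds.

```python
# Rough memory math for a 12B-parameter model (illustrative assumptions:
# bf16 weights, AdamW with fp32 master weights and optimizer states).
params = 12e9

inference_gb = params * 2 / 1e9                    # bf16 weights only: ~24 GB
# full fine-tuning: bf16 weights + bf16 gradients + fp32 master weights
# + fp32 Adam first and second moments = 2 + 2 + 4 + 4 + 4 bytes per parameter
full_ft_gb = params * (2 + 2 + 4 + 4 + 4) / 1e9    # ~192 GB, before activations

print(f"inference: ~{inference_gb:.0f} GB, full fine-tuning: ~{full_ft_gb:.0f} GB")
```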
PEFT Methods: LoRA, QLoRA
PEFT (Parameter-Efficient Fine-Tuning) begins with the observation that it is possible to change only some weights in fine-tuning, and still get decent results. Less compute is needed, and the memory required also decreases dramatically.
There are many kinds of PEFT methods. LoRA (Low-Rank Adaptation) is one example: freeze the original model weights, add low-rank weight matrices within the model, and only these extra matrices are adjusted in training. Most weights don’t change.
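The idea is easy to see in a toy example. The sketch below (plain NumPy, not the peft implementation; dimensions and rank are illustrative) augments a frozen weight matrix W with the product of two small matrices B and A; only B and A would be trained:

```python
import numpy as np

d, k, r = 4096, 4096, 16            # layer dimensions and LoRA rank (illustrative)
W = np.random.randn(d, k)           # frozen pretrained weight
A = np.random.randn(r, k) * 0.01    # trainable low-rank factor
B = np.zeros((d, r))                # trainable low-rank factor, starts at zero

W_eff = W + B @ A                   # effective weight used in the forward pass

full_params = d * k                 # ~16.8M values to train in full fine-tuning
lora_params = d * r + r * k         # ~131k values to train with rank 16
print(full_params / lora_params)    # ~128x fewer trainable parameters
```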
QLoRA (Quantized LoRA) works in a similar way, but the frozen weights are also quantized: instead of storing them in 16 bits, you store them in smaller formats, such as 8 or 4 bits, trading a little precision for a much smaller footprint. This reduces memory usage even further.
By aggressively applying QLoRA along with other memory-saving techniques, it becomes possible to fine-tune an LLM in not much more memory than inference requires. This is why the Unsloth library is popular: it allows fine-tuning popular LLMs on consumer GPUs.
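In practice, the quantization and the adapters are set up with a few library calls. Here is a minimal sketch using transformers and peft; the rank, alpha, and target modules are illustrative assumptions, not the exact values from the notebook discussed later:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit quantization of the frozen base weights (the "Q" in QLoRA)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it",
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters attached to the attention projections (the "LoRA" part)
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()   # typically well under 1% of all parameters
```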
The Dataset
For this comparison, I’ve used a social media dataset structured as prompts and answers. The answers (the comment body) are my own comments on a large social media site. The prompts (the parent text) are the posts and comments I was replying to. Aggregated over two decades of posting, this amounts to about 37k prompt/answer pairs, or an average of 5 comments per day (don’t judge me). The format looks like this:
| parent_text      | comment_body  |
|------------------|---------------|
| Hi, how are you? | Fine, thanks. |
| Is water wet?    | Of course!    |
The train/test split is 4:1. Most comments and their parents are 70 to 300 characters long, typical for social media content.
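For supervised fine-tuning, each pair has to be converted into a chat-style example. Here is a hypothetical sketch (the column names follow the table above; the exact preprocessing in the notebook may differ):

```python
from datasets import Dataset

rows = [
    {"parent_text": "Hi, how are you?", "comment_body": "Fine, thanks."},
    {"parent_text": "Is water wet?", "comment_body": "Of course!"},
]

def to_messages(row):
    # the parent comment becomes the user turn, my reply becomes the assistant turn
    return {
        "messages": [
            {"role": "user", "content": row["parent_text"]},
            {"role": "assistant", "content": row["comment_body"]},
        ]
    }

dataset = Dataset.from_list(rows).map(to_messages, remove_columns=["parent_text", "comment_body"])
splits = dataset.train_test_split(test_size=0.2, seed=42)   # the 4:1 train/test split
```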

Many datasets would work well for comparing fine-tuning techniques. I am only using this one because, obviously, I am extremely familiar with it. Benchmarks are one thing, but data you are intimately familiar with provides an immediate sense of “correctness” from even a single example. The “vibe checks” are very easy with this dataset — for me.
For obvious reasons, I cannot make this dataset public. Regardless, it’s only useful in this form to me. Feel free to substitute it with any conversational dataset, and the results should be quite similar. You may find something suitable here:
GitHub - ad-freiburg/large-qa-datasets: A collection of large question answering datasets
The Base Model
Many open-weights models will work well for this comparison. I chose the Gemma 3 family for several reasons: these models perform quite well for their size, are popular and well-understood, and I have a Gemma 3 model (the 27B variant) as the default when I run local inference (again, an argument from familiarity).
Specifically, I’ve done most fine-tuning tasks here with google/gemma-3-12b-it as the base model.
Hardware
The system I’ve used for all tasks is a clone of the NVIDIA DGX Spark, a system optimized for machine learning tasks in a mini PC format.
The DGX Spark has 128 GB of unified memory, shared by CPU and GPU, and this is the most important parameter. All fine-tuning code I’ve used will run without changes on any GPU that has the same amount of memory available.
The GPU is essentially a slightly fancier RTX 5070, compute capability 12.1, capped at 100 W power draw (140 W for CPU+GPU, and 240 W for the whole machine). The memory bandwidth is 273 GB/s, less than what you normally see in a dedicated GPU, so it runs a little slower than a regular Blackwell chip. On the other hand, the device is whisper-quiet under heavy load, and the electricity bill is very affordable.
Benchmarks
Two main benchmarks were used to evaluate the general knowledge and task-specific abilities of various models in this comparison: MMLU-Pro and LiveCodeBench.
MMLU-Pro uses complex, challenging questions ranging over a wide variety of topics such as Biology, Business, Chemistry, Computer Science, Economics, Engineering, Health, History, Law, Math, Philosophy, Physics, Psychology, etc. The model is tested on its breadth of knowledge. The test presents the model with multiple-choice questions, and the model must select the right answer.
MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark
LiveCodeBench is a coding benchmark, which is especially interesting since LLMs are often used to generate code. The models are presented with a problem, and are asked to write code that solves the problem. Several metrics are collected:
- pass@1 — the model is asked to write a single solution; the metric shows the probability that the model gets it right in one shot
- pass@5 — the model is asked to write 5 different solutions; the metric shows the probability that at least 1 out of 5 solutions is correct
- pass@10 — the probability that at least 1 out of 10 generated solutions is correct
LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
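These pass@k numbers are typically computed with the unbiased estimator introduced alongside HumanEval: generate n samples, count the c that pass, then estimate the chance that a random subset of k contains at least one passing sample. A minimal sketch, assuming that is the estimator in use here:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """n = samples generated, c = samples that passed, k = evaluation budget."""
    if n - c < k:
        return 1.0   # not enough failures to fill a subset of size k
    return 1.0 - comb(n - c, k) / comb(n, k)

# e.g. 10 generations, 3 of them correct: chance that a draw of 5 contains a correct one
print(pass_at_k(n=10, c=3, k=5))   # ~0.917
```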
I’ve cloned the major repositories for these benchmarks, updated their dependencies, made them work with recent Python modules, and made them available here:
- GitHub - FlorinAndrei/MMLU-Pro: This is a fork of the MMLU-Pro repo
- GitHub - FlorinAndrei/LiveCodeBench: This is a fork of the LiveCodeBench repo
Code
The code used in this comparison is available here:
GitHub - FlorinAndrei/llm-finetune-comparison: Comparing LLM fine-tuning techniques
The file you may want to read is the train.ipynb notebook. The code is straightforward. It begins by processing the dataset. If LoRA is enabled, it uses the peft library to insert the low-rank matrices, and the BitsAndBytes methods from the transformers library to quantize the model to 4 bits (so this is actually QLoRA).
Flash attention is used where possible, along with packing and gradient checkpointing — these are techniques that may save memory, or compute, or trade one for the other. Only one epoch of training is performed, at a constant learning rate.
At the end, the fine-tuned model performs inference with a few extracts from the test dataset, and the results are logged.
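For reference, the training setup described above roughly corresponds to a trl SFTTrainer configuration along these lines. This is a hedged sketch: the model and dataset splits are the objects from the earlier snippets, and the batch size, learning rate, and evaluation cadence are illustrative assumptions rather than the exact values in train.ipynb.

```python
from trl import SFTConfig, SFTTrainer

config = SFTConfig(
    output_dir="gemma3-12b-qlora",
    num_train_epochs=1,                 # a single epoch, as in the article
    learning_rate=2e-5,
    lr_scheduler_type="constant",       # constant learning rate
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,        # trade compute for memory
    packing=True,                       # pack short samples into longer sequences
    bf16=True,
    eval_strategy="steps",
    eval_steps=100,
)

trainer = SFTTrainer(
    model=model,                        # the quantized + LoRA model from earlier
    args=config,
    train_dataset=splits["train"],
    eval_dataset=splits["test"],
)
trainer.train()
```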
In the .venv I had Python 3.12.3, CUDA 13.0, PyTorch 2.9.1, and Transformers 4.57.3.
Compute and Memory Usage in Training
Full fine-tuning takes slightly longer than QLoRA to complete.
Full fine-tuning requires far more memory than QLoRA.
Memory usage for all models increased while training; the increase was substantial with QLoRA (40%), much less with full fine-tuning (a few percentage points).
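For completeness, one way to collect this kind of number is to query PyTorch's CUDA allocator around the training run (a general technique, not necessarily how the figures above were produced; `trainer` is the object from the earlier sketch):

```python
import torch

torch.cuda.reset_peak_memory_stats()    # call before trainer.train()
trainer.train()
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak GPU memory allocated: {peak_gb:.1f} GB")
```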

Results
Eval Loss
This is the eval loss while training, for all models. It’s a measure of how different the model outputs are from the “ideal” answers in the test slice of the dataset. Lower is better (models are “closer” to the dataset, or “less different”). Again, the dataset is social media conversations, so that’s the type of output evaluated here.
I don’t have an actual benchmark for how well the models perform on the social media answers, so this is a proxy for that performance.

The best (lowest) value is from Gemma 3 12B in full fine-tuning with a learning rate of 1e-5: the final loss is 2.3192. The same model with QLoRA at a learning rate of 2e-5 obtained a loss of 2.3578, not much worse, but the difference is measurable.
Full fine-tuning did 1.7% better than QLoRA by eval loss.
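Assuming the eval loss reported by the trainer is the usual per-token cross-entropy, exponentiating it gives perplexity, which some readers find easier to compare:

```python
import math

# final eval losses reported above
print(math.exp(2.3192))   # ~10.17  (12B, full fine-tuning)
print(math.exp(2.3578))   # ~10.57  (12B, QLoRA)
```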
A much smaller model, Gemma 3 4B, in full fine-tuning at 1e-5, had a loss of 2.478, noticeably worse than the bigger models.
All models in full fine-tuning still showed a downward slope in the eval loss at the end of training, indicating that continuing beyond one epoch may have improved performance further. I decided to stop at one epoch to save time.
These are the best rounds of each technique, full fine-tuning vs QLoRA:

Based on the eval loss, the best models from each category (12B full, 12B QLoRA, 4B full) were promoted to the next benchmarks, along with the original 12B model (not fine-tuned). The other models were discarded.
MMLU-Pro
This is the benchmark that uses complex questions from a variety of topics. The accuracy estimated by the benchmark is simply the number of questions answered correctly, divided by the total number of questions. Here are the accuracy results, aggregated over all topics, for each model:

The original (base) model, which did not go through fine-tuning, has the best performance (accuracy). The smaller 4B model has the worst performance. The QLoRA-trained model performed ever so slightly better than the full fine-tuning model.
And this is the performance for each model, for each category:

Compared to the original model, fine-tuning (either full or QLoRA) did not reduce performance very much in the fields of biology, economics, history, and maybe law.
Fine-tuning (either full or QLoRA) reduced performance more significantly, compared to the original model, in the fields of business, chemistry, math, physics. These are more or less the fields where the performance of the 4B model is significantly lower than the performance of all 12B models.
LiveCodeBench
This is the coding benchmark, with three metrics that evaluate, respectively, the chance of getting at least one solution right out of 1, 5, and 10 proposed solutions. And here are the results:

The original 12B model is the best, of course. The 4B model is too small to compete with the bigger models. And again QLoRA outperforms full fine-tuning by all metrics — more on this later.
Inference Examples
Here are some inference examples from gemma3-12b-full-finetuning.
Prompt:
Explain quantum mechanics.
Answers:
If we’re going to be honest, it’s probably the most successful theory ever. Everything it predicts is verified to some extent. But it’s also the most counter-intuitive. It’s like trying to describe the surface of a 4-dimensional object in a book. You can’t really do it, but you can provide analogies and metaphors. There’s a lot of good books out there. A popular one is “Six Easy Pieces” by Richard Feynman. It’s not exactly a rigorous treatment, but it’s a very good introduction. I’d also recommend “The Black Hole War” by Leonard Susskind. It’s a very recent book that makes sense of the “string theory” controversy. There’s also “The Elegant Universe” by Brian Greene. It’s a bit more popular-oriented, but it’s good too. You could also look for the lectures by the MIT OpenCourseWare. Finally, it’s probably the best idea to just go ahead and study physics in college.
— — —
I’ll explain it to you, but it’s not gonna be the explanation you’re looking for. Go home and read some pop-sci books.
— — —
> Explain quantum mechanics. That’s like asking “explain life”. It’s complicated.
That, indeed, sounds a lot like my comments on social media.
And here is the original Gemma 3 12B answering the same prompt: https://gist.github.com/FlorinAndrei/1b78df09aa9bc6dde459fd5bdd2c5907
Prompt:
Who are you?
Answers:
I am the one and only.
— — —
Not that guy.
— — —
A mere observer.
And here is the original Gemma 3 12B answering the same prompt: https://gist.github.com/FlorinAndrei/5af3f008151e990fecbb46a2e4dd5e3c
Acquired Knowledge
Fine-tuning is not the best way to instill new knowledge in a model — use MCP for that, instead. But did the fine-tuned models learn facts that are specific to the dataset used in fine-tuning? Let’s ask the model about a few facts that are true about me and are scattered across my comments.
Prompt:
What did you study?
Answer:
Physics.
And here is the original Gemma 3 12B answering the same prompt: https://gist.github.com/FlorinAndrei/286bd170bdaedb67454ffff80afab46f
Prompt:
Where did you grow up?
Answer:
Europe.
And here is the original Gemma 3 12B answering the same prompt: https://gist.github.com/FlorinAndrei/8cc2a9987c0bde4de52c0e5ce21c155b
Conclusions
LLM fine-tuning definitely works well for infusing a different style and tone of voice into models. The dataset I’ve used contains tens of thousands of social media comments, usually pretty brief (a few hundred characters), rarely offering extensive explanations, and sometimes sarcastic. The models learn that style quite well.
The models will forget some of the facts they knew before fine-tuning. This is shown by the decreased scores in the knowledge and skills benchmarks. Full fine-tuning has a stronger negative impact, because all the model’s weights are adjusted. QLoRA retains the original knowledge better, but at the price of being less adept at learning the new style. All this can likely be adjusted by playing with the tuning parameters, especially the learning rate, but that is not the topic of this article.
Full fine-tuning requires far more resources than QLoRA. You might be able to squeeze a QLoRA fine-tuning loop for a model of this size onto a consumer GPU, but it’s not easy and you would have to take extreme measures. Full fine-tuning of such a model seems impossible on consumer GPUs. The main restriction is the amount of VRAM required.