📌 Note: This article was originally written in 2023. Even though I’ve updated parts of it, some parts may feel a bit dated by today’s standards. However, most of the key ideas about LLMs remain just as relevant today.
Introduction
If you’ve spent any time around LLMs, you’ve seen the term fine-tuning pop up again and again. Fine-tuning is how we adapt a big, general-purpose model to a specific job. Today, we’ll unpack what fine-tuning really means, why it became central to the LLM story, what made it expensive, and how techniques like LoRA changed the game.
Pre-Trained Models: The Starting Line
Before fine-tuning, we need to talk about pre-trained models. Elsewhere I’ve written about LLMs as foundation models—large models trained on broad, unlabeled text so they learn rich language patterns. That “foundation” is what downstream tasks stand on.
Picture building an email spam classifier the old-school way:
- Collect a lot of emails.
- Label each one as spam or not spam.
- Train a machine learning model (yes, deep learning counts) on that labeled dataset.

This approach is reliable and time-tested. But it has hard limits:

- You need tons of labeled examples, and labeling is expensive.
- Even with lots of data, language is messy; performance doesn’t easily generalize to every writing style and context on Earth.

So someone proposes a different plan:
- Pre-train a language model on a mountain of unlabeled text (not just emails).
- Then fine-tune that model on a smaller, labeled dataset for the specific task (e.g., spam detection).

And it works—really well. You still gather data for step (1), but unlabeled text is far easier to obtain than carefully labeled spam examples. The result is a model with strong general language ability that needs only a light nudge to specialize.
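To make step (2) concrete, here is a minimal sketch of what fine-tuning a pre-trained encoder for spam detection can look like, assuming the Hugging Face transformers library. The model name and the tiny in-memory dataset are placeholders of my own, not part of the original recipe.

```python
import torch
from torch.utils.data import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

class SpamDataset(Dataset):
    """Tiny in-memory labeled dataset (placeholder for a real labeled corpus)."""
    def __init__(self, texts, labels, tokenizer):
        self.enc = tokenizer(texts, truncation=True, padding=True, return_tensors="pt")
        self.labels = torch.tensor(labels)
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: v[i] for k, v in self.enc.items()}
        item["labels"] = self.labels[i]
        return item

# Step (1) happened elsewhere: the pre-trained weights are simply downloaded here.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

# Step (2): fine-tune on a small labeled dataset for the specific task.
train_ds = SpamDataset(
    ["Win a free prize now!!!", "Meeting moved to 3pm, agenda attached."],
    [1, 0],   # 1 = spam, 0 = not spam
    tokenizer)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="spam-ft", num_train_epochs=3,
                           per_device_train_batch_size=8),
    train_dataset=train_ds)
trainer.train()   # continues training from the pre-trained weights
```

The important part is the from_pretrained call: training does not start from random weights but from a model that already understands language.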
That’s where the names come from:
- Pre-training: the large, general learning phase.
- Fine-tuning: the smaller, specialized adaptation phase—fine as in “small adjustments.”
A New Era (And Its Price Tag)
Once the community split training into pre-training and fine-tuning, results jumped:
- Pre-training on vast, varied text made the base model’s language understanding far stronger, so downstream tasks (like spam classification) started from a higher baseline and achieved better results.
- The same pre-trained model could be reused for many tasks via fine-tuning: classification, summarization, sentiment analysis, you name it.

But then reality hit: pre-training is brutally expensive. As evidence mounted that “bigger is better,” everyone pumped in more data and scaled up model size—from millions of parameters to billions, then to the trillions. Only a handful of organizations can now afford top-tier pre-training runs.

Result: most teams stopped building their own pre-trained models. Instead, they adopt a high-quality pre-trained model (open or commercial) and adapt it to their needs. Thus began the Age of Fine-Tuning.
What Fine-Tuning Really Does
At a nuts-and-bolts level, both pre-training and fine-tuning do the same thing: they optimize the model’s parameters (weights) with gradient descent.
- Pre-training starts from random or near-random weights and learns general language ability from massive unlabeled text.
- Fine-tuning continues training, but from an already competent starting point, nudging weights toward the target task using the smaller labeled dataset.
This buys you two things:

1. Faster training. You’re not starting from scratch. Think of traveling from Los Angeles to San Francisco or Sacramento. Pre-training is like getting a free ride to San Jose. From there, you only finish the shorter trip to your destination.
2. Better results. Starting from a halfway point makes it easier to find the right path. If you only need to work out the San Jose → San Francisco leg, that’s far easier than figuring out the whole Los Angeles → San Francisco route on your own.

As pre-trained models grew, however, even fine-tuning became heavy. If your base model has 7B parameters, full fine-tuning still means updating all 7B of them—and storing a separate copy per task or customer.
That’s painful for both compute and MLOps.
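A quick back-of-the-envelope calculation (my own illustration, assuming fp16 weights at 2 bytes per parameter) shows why:

```python
# Rough storage cost of keeping a separate, fully fine-tuned copy per task
# (assuming fp16 weights, i.e., 2 bytes per parameter; a ballpark, not a spec).
base_params = 7e9            # a 7B-parameter base model
bytes_per_param = 2          # fp16

copy_gb = base_params * bytes_per_param / 1e9
print(f"One fine-tuned copy: ~{copy_gb:.0f} GB")       # ~14 GB

num_tasks = 100
print(f"{num_tasks} task-specific copies: ~{copy_gb * num_tasks / 1e3:.1f} TB")   # ~1.4 TB
```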
Parameter-Efficient Fine-Tuning (PEFT): The Plot Twist
Enter Parameter-Efficient Fine-Tuning (PEFT). The core question:
Do we really need to update every weight of the base model to adapt it to a task like spam detection?
Often, no. PEFT methods keep the base model frozen and learn a small set of additional parameters that steer the model toward the target task.
Among many PEFT techniques, the most widely adopted is LoRA.
LoRA, Explained (Conceptually)
LoRA (Low-Rank Adaptation) does two main things:
During training (fine-tuning):
- Freeze the original (pre-trained) weights.
- Attach tiny trainable modules that represent a low-rank update to selected layers. You only train these small adapters (the LoRA weights), not the massive base.
During inference:
- Use an effective weight that is the sum of the frozen base weight plus the learned low-rank update (often scaled). In practice, frameworks either explicitly add the update to the base weight or implement an equivalent fused computation.

In the diagram above, the blue box on the left represents the frozen weights of the pre-trained model, while the two orange trapezoids on the right are the newly trained LoRA weights.
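To make both phases concrete, here is a minimal from-scratch sketch of a LoRA-wrapped linear layer in PyTorch. This is my own illustration, following the paper’s B·A convention; real implementations (e.g., the peft library) add dropout, careful initialization, and fused kernels.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)        # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # (r, d_in), small random init
        self.B = nn.Parameter(torch.zeros(d_out, r))         # (d_out, r), zero init: update starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # frozen path + scaled low-rank path: y = x W^T + scale * x (B A)^T
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    def merged_weight(self):
        # effective weight if you fold the update into the base for inference
        return self.base.weight + self.scale * (self.B @ self.A)

# Example: wrap one projection and count what is actually trainable.
layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = [p for p in layer.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))   # 8*512 + 512*8 = 8,192 trainable parameters
```

Only A and B receive gradients; at inference you can either keep the two paths or fold the update into the base weight via merged_weight().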
How much smaller is this? According to the LoRA paper, you can train roughly 0.1% of GPT-2 Large’s parameters (0.77M vs 774M) and match or beat full fine-tuning. For GPT-3 (175B), training around 4.7M parameters (0.0026%) can reach comparable quality. That’s wild—and why LoRA is the default starting point for many teams.
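In practice, most teams don’t hand-roll this; the Hugging Face peft library can wrap an existing model for you. A hedged sketch (the base model and target_modules below are illustrative and architecture-specific):

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("gpt2-large")   # any pre-trained base model

config = LoraConfig(
    r=8,                        # rank of the low-rank update
    lora_alpha=16,              # scaling factor (alpha / r multiplies the update)
    target_modules=["c_attn"],  # which submodules to adapt; GPT-2 uses "c_attn" for attention
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, config)
model.print_trainable_parameters()   # reports trainable vs. total parameters (well under 1%)
```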
A Brief Technical Aside (Skip If You Like)
Let’s take a quick detour into something a bit technical—but still digestible. If this isn’t your thing, feel free to jump ahead.
What LoRA Actually Means
LoRA stands for Low-Rank Adaptation.
The “adaptation” part means we don’t retrain the entire pre-trained model from scratch. Instead, we add a lightweight adapter—kind of like those little plug converters you need when you bring an American hair dryer to Europe. Same device, different socket, works just fine.
The “low-rank” part comes from a technique called rank decomposition. The idea is that you can approximate a large weight matrix W as the product of two much smaller matrices, A and B:

W ≈ A × B

Both A and B have much smaller rank (think: far fewer independent columns) than the original W. That’s why it’s called low-rank. In LoRA, this trick is applied to the weight update learned during fine-tuning rather than to the pre-trained weight itself, but the intuition is the same. Don’t worry—we won’t dive deeper into linear algebra here; the key point is that this lets us approximate huge matrices with something much leaner.
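As a tiny illustration of rank decomposition (my own example, using a truncated SVD in NumPy; a random matrix won’t compress well, but the mechanics are the same, and LoRA’s bet is that fine-tuning updates are well approximated at low rank):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(500, 500))               # a big matrix: 250,000 numbers

# Best rank-r approximation via truncated SVD: W ≈ A @ B
r = 4
U, S, Vt = np.linalg.svd(W, full_matrices=False)
A = U[:, :r] * S[:r]                           # shape (500, 4)
B = Vt[:r, :]                                  # shape (4, 500)

approx = A @ B                                 # shape (500, 500), rebuilt from only 4,000 numbers
print(A.size + B.size)                                        # 4000
print(np.linalg.norm(W - approx) / np.linalg.norm(W))         # relative approximation error
```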
How the Size of LoRA Weights Is Decided
The critical knob in LoRA is the parameter r (often shown in diagrams). Let’s walk through an example:
- Suppose a pre-trained weight matrix W has dimensions 500 × 500. That’s 250,000 parameters.
- If we set r = 4, then:
- A has dimensions 500 × 4 → 2,000 parameters.
- B has dimensions 4 × 500 → 2,000 parameters.
- Together, A + B = 4,000 parameters. That’s only 1.6% of the original weight size (4,000 ÷ 250,000). Pretty efficient, right?
Now, if we shrink r to 2, the parameter count drops even further. But here’s the trade-off:
- Smaller r → fewer trainable parameters → lighter, but potentially lower accuracy.
- Larger r → closer to the original weight size → higher accuracy, but heavier.

So choosing the right r is one of the most important experimental decisions when training with LoRA. It’s the dial that balances efficiency and performance; the small helper below makes that trade-off concrete.
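A trivial helper (my own illustration) sweeps r for the 500 × 500 example above:

```python
def lora_params(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for a rank-r LoRA update of a d_in x d_out weight."""
    return d_in * r + r * d_out   # A is (d_in, r), B is (r, d_out), as in the example above

full = 500 * 500
for r in (2, 4, 8, 16):
    p = lora_params(500, 500, r)
    print(f"r={r:>2}: {p:>6,} params ({100 * p / full:.1f}% of the full matrix)")
# r= 2:  2,000 params (0.8% of the full matrix)
# r= 4:  4,000 params (1.6% of the full matrix)
# r= 8:  8,000 params (3.2% of the full matrix)
# r=16: 16,000 params (6.4% of the full matrix)
```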
LoRA: The Pros and Cons
Let’s zoom out a little and talk about the bigger picture.
What exactly are the strengths and weaknesses of LoRA?
Advantages of LoRA
The most obvious win:
Compared to full fine-tuning, LoRA only trains a tiny fraction of the parameters—yet can achieve performance that’s just as good, or sometimes even better.
Another big advantage is weight management. With full fine-tuning, you’d have to maintain a separate copy of the entire model for each task. LoRA avoids that. The pre-trained model stays untouched, and you just swap in task-specific LoRA weights.
For example:
- Want to detect spam in emails? Attach the spam-filter LoRA weights.
- Want to classify sentiment? Swap them out for the sentiment-analysis LoRA weights.

At small scale, this might not feel like a game-changer. But when things get big, it’s a lifesaver.
Imagine you’ve built an LLM that rivals GPT-4. Some customers love it out of the box, but others want to fine-tune it for their specific tasks. Since the base model is massive, you’d naturally offer a LoRA-based fine-tuning service. Over time, you’ll be managing thousands or even millions of fine-tuned models. If you had done this with full fine-tuning, you’d have to maintain a separate GPT-4-sized model for each variant. With LoRA, you just keep the single base model and ship tiny task-specific LoRA weights on demand. Much cheaper, much cleaner.
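Here is a hedged sketch of what adapter swapping can look like with the peft library; the adapter paths, names, and base model below are hypothetical.

```python
from transformers import AutoModelForSequenceClassification
from peft import PeftModel

# One shared, frozen base model...
base = AutoModelForSequenceClassification.from_pretrained("some-base-model", num_labels=2)

# ...plus small task-specific adapters loaded on demand (paths and names are placeholders).
model = PeftModel.from_pretrained(base, "adapters/spam-filter", adapter_name="spam")
model.load_adapter("adapters/sentiment", adapter_name="sentiment")

model.set_adapter("spam")        # serve spam-detection traffic
# ...
model.set_adapter("sentiment")   # switch tasks without reloading the giant base model
```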
Disadvantages of LoRA
Of course, there are trade-offs.
Because LoRA adds extra weight matrices on top of the pre-trained model, it can consume a bit more memory and slightly slow down inference. The model has to compute through both the LoRA weights and the original weights before merging the results.
That said, in practice, LoRA weights are so small compared to the base model that the overhead is usually negligible.
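If the overhead does matter, frameworks typically let you fold the low-rank update into the base weights before serving. With peft, for example, that is a short merge step (a sketch; `model` is assumed to be a PeftModel with a LoRA adapter attached, as in the earlier examples):

```python
# Folding the update into the base weights removes the extra adapter path at inference.
merged = model.merge_and_unload()            # returns a plain transformers model
merged.save_pretrained("spam-filter-merged")
```

The trade-off is that a merged model is once again a full-size copy, so you give up the cheap adapter swapping described above.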
There is also a structural dependency: LoRA weights are an adaptation of a specific set of pre-trained weights. That means if the base model changes, all your LoRA weights need to be retrained. The more fine-tuned models you’ve built, the more retraining you’ll face. This dependency can be painful at large scale.
Prompt Engineering (a.k.a. In-Context Learning) vs. Fine-Tuning
Both fine-tuning and prompt engineering try to improve a model’s performance on your task—but they work very differently:
Prompt engineering / In-Context Learning (ICL): You do not modify model weights. Instead, you craft instructions and examples in the prompt (zero-shot / few-shot), giving the model the context it needs to behave the way you want—right now.
Fine-tuning (full or PEFT/LoRA): You do modify parameters (all of them for full fine-tuning, or a tiny subset for PEFT). The behavior change persists without needing long prompts.

Why use prompts? Because they’re fast, cheap, and accessible—no training pipeline required. Why fine-tune? Because some tasks require consistent behavior or domain adaptation that prompts can’t reliably achieve, or they demand short prompts / low latency in production.
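For contrast, here is what the prompting route can look like for the same spam task (a toy example of my own; no weights change, and the “adaptation” lives entirely in the prompt text):

```python
few_shot_prompt = """Classify each email as SPAM or NOT SPAM.

Email: "Congratulations, you've won a $1,000 gift card! Click here to claim."
Label: SPAM

Email: "Hi team, the quarterly review has moved to Thursday at 10am."
Label: NOT SPAM

Email: "{email}"
Label:"""

# The filled-in prompt is sent to any instruction-following LLM as-is;
# the tokens the model generates after "Label:" serve as the classification.
print(few_shot_prompt.format(email="Limited-time offer, act now to double your income!"))
```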
Full Fine-Tuning vs. LoRA (PEFT) vs. Prompting: When to Use What
There’s no one-size-fits-all answer—your choice depends on model size, task complexity, data availability, latency, cost, and operational constraints. But here are some up-to-date heuristics:
Very large models (≈70B+ parameters)
- Start with prompting (zero-shot → few-shot). For many tasks, strong prompting and retrieval-augmented generation (RAG) are often enough.
- If prompting isn’t sufficient, try LoRA/PEFT. Adapter-based fine-tuning is the current industry standard for customizing large models.
- Full fine-tuning is generally unrealistic at this scale due to compute and deployment costs.
Mid-size models (≈7B–30B)
- Zero-shot may be hit or miss; few-shot is often workable for lightweight tasks.
- LoRA/PEFT remains the most practical choice, balancing performance and efficiency.
- Full fine-tuning is technically possible on modern GPUs but still costly, especially if you need to serve multiple task-specific variants.
Small models (≈2B–7B)
- Few-shot prompting usually underperforms.
- Fine-tuning is essential to unlock usable quality.
- With current hardware, full fine-tuning is feasible at this scale, but LoRA (or QLoRA for memory efficiency) is often preferred for faster iteration and cheaper deployment.

Remember: these are rules of thumb, not laws. Data quality, safety constraints, latency targets, serving budget, and your MLOps setup matter as much as parameter counts.
Wrapping Up
We covered:
- Why pre-training + fine-tuning outperformed training from scratch.
- How fine-tuning itself became expensive as base models grew.
- How PEFT, especially LoRA, slashed cost by training tiny adapters while keeping the base frozen.
- When to choose prompting vs. LoRA vs. full fine-tuning, and the trade-offs involved.

As with most engineering choices, there’s no silver bullet. Choose the tool that fits the problem, the constraints, and the product you’re trying to build. If this post helped demystify the landscape—even a little—that’s a win.