
This blog walks you through the full process of fine-tuning, in detail and in simple words.
You’ve used ChatGPT or Claude. They work well for general tasks, but they break down when you need strict consistency, domain-specific behavior, or predictable outputs at scale. Your company’s writing style. Your domain’s knowledge. Your exact requirements.
That’s where fine-tuning comes in. This guide walks you through everything you need to know about fine-tuning.
How an LLM Actually Works

Before fine-tuning, you need to understand what an LLM is at a basic level.
Tokens, not words
LLMs work with tokens, not words. A token is a chunk of text. In English, one token is often around 3–4 characters, but this varies by language, tokenizer, and model. The word “understanding” might be 2–3 tokens depending on the specific model.
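If you want to see this yourself, here is a minimal sketch using a Hugging Face tokenizer. GPT-2’s tokenizer is used only because it is small and ungated; the counts you get will differ per model.

```python
from transformers import AutoTokenizer

# GPT-2's tokenizer is a convenient, ungated example; token counts differ per model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "The cat sat on the understanding"
token_ids = tokenizer.encode(text)

print(len(token_ids))                              # number of tokens, not words
print(tokenizer.convert_ids_to_tokens(token_ids))  # the sub-word pieces the model actually sees
```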
Predicting the next token
An LLM’s core task is simple: predict the next token. Given “The cat sat on the”, it predicts what comes next based on patterns it has learned. That’s the foundation of everything these models do.
Weights encode patterns
Models have billions of numbers called weights or parameters. These encode patterns learned during training. When you see “7B model” or “70B model”, that B means billions of parameters. These weights store everything the model knows about language, reasoning, and patterns.
Training vs inference
Training adjusts these weights by showing the model examples. This is slow, expensive, and happens offline. Inference uses the learned weights to generate text. This is fast, relatively cheap, and happens every time you use the model.
Hard limits
The context window is how much text the model can process at once. Some modern models support large context windows, sometimes tens or even hundreds of thousands of tokens, but this depends on the specific model and provider. Parameter count also matters. More parameters usually mean stronger capabilities, but they require more memory and compute.
What is Fine Tuning?
Fine-tuning is simply specializing a pre-trained model on your own dataset, so the model becomes better at your specific problems.
This saves you time and money, because the model keeps the knowledge from its pre-training and you don’t have to train it from scratch.
Fine-tuning in machine learning and artificial intelligence (AI) is the process of taking an already trained or pre-trained model and training it on other datasets so it can perform specialized tasks. It allows large language models (LLMs) and other generative AI tools to perform specific tasks, like image processing. [source]
Why Fine-Tuning Exists

Base models know language, facts, and general reasoning. But they don’t know your specific needs.
Inconsistent formatting
You need JSON with specific fields every time. Prompts work most of the time, but the failures break your system. When you’re making thousands of API calls daily, even a small failure rate becomes unacceptable.
Domain terminology
Your field uses specialized jargon. Medical terms have precise meanings. Legal language has strict structures. Internal acronyms mean something very specific. Base models guess or misuse these because they haven’t seen your domain deeply enough.
Specific tone and voice
Your brand has a distinct voice. Maybe formal and precise. Maybe casual and conversational. Prompts drift over long responses. The model starts out following instructions and drifts away from them partway through.
Complex multi-step processes
You have workflows with edge cases and exceptions. Fitting everything into prompts hits token limits or confuses the model. Instructions get long, the model loses track, and mistakes happen.
Cost at scale
Long instruction prompts cost money every single request. If your instruction prompt is 2,000 tokens and you make 100,000 calls monthly, that’s 200 million instruction tokens you’re paying for repeatedly.
Fine-tuning teaches the model your patterns directly, so they occur with much higher probability during generation, not as guaranteed rules.
Alternatives First

Prompting
What it is: Detailed instructions in every request.
When it works:
- Tasks vary between requests
- You have a small number of examples to work from
- You need to change behavior quickly
- Instructions fit in the context window
When it fails:
- Instructions grow too long
- Consistency isn’t good enough
- You repeat the same complex instructions frequently
Cost: Pay per input token every time.
Maintenance: Change the prompt, test, done.
In-Context Learning
What it is: Include several examples in your prompt showing what you want.
Why it works: The model sees patterns in your examples and mimics them. No training required.
When it works:
- Your task is clear from examples
- Examples fit in context
- You have good examples but not thousands
When it fails:
- Examples don’t fit in available context
- Model doesn’t generalize well enough
- Consistency isn’t sufficient
- Token costs add up at scale
Cost: Examples count as input tokens in every request.
RAG (Retrieval Augmented Generation)
What it is: Store knowledge in a database. Retrieve relevant information for each query. Feed it to the model as context.
When it works:
- Knowledge changes frequently
- Your knowledge base is large
- You need to cite sources
- Facts must update without retraining
When it fails:
- You need consistent behavior patterns, not just facts
- Retrieval adds unacceptable latency
- Your task isn’t about retrieving information
Cost: Storage for embeddings plus retrieval compute.
Maintenance: Update your knowledge base anytime. No retraining.
Summing up
- Prompting: Like writing a sticky note for a colleague. Quick, easy, but they might miss details.
- RAG: Like giving the colleague a textbook to look up answers during the exam. They have the facts, but they still need to know how to read/write.
- Fine-Tuning: Like sending the colleague to medical school. It takes months and money, but now they think like a doctor without needing to look up basic terms.
In-Context Learning Explained
You show examples in the prompt. The model’s attention mechanism notices patterns and applies them. This is powerful but limited.
Examples consume context window space. More examples mean less room for your actual query and response. The model may ignore examples partway through long outputs or give more weight to recent examples, leading to inconsistency.
Every request pays for those example tokens. At high request volumes, costs accumulate quickly. When these limits start to matter, that’s when fine-tuning becomes worth considering.
What Fine-Tuning Can and Cannot Improve

Fine-tuning improves:
✅ Consistent output formatting: making the model reliably produce specific structures.
✅ Domain-specific terminology usage: using specialized jargon correctly.
✅ Tone and style consistency: maintaining formal vs. casual voice. Your brand voice.
✅ Following complex multi-step processes: learning detailed workflows with edge cases.
✅ Handling domain-specific edge cases: unusual inputs specific to your use case.
✅ Reducing verbosity when needed: teaching the model to be more concise.
Fine-tuning has limits:
❌ Adding new factual knowledge reliably: fine-tuning is not a reliable way to add or maintain facts. Use RAG.
❌ Fixing fundamental reasoning weaknesses: if the base model struggles with a capability, fine-tuning won’t create it.
❌ Changing core model capabilities: you can’t fine-tune a smaller model to match a much larger one’s reasoning.
❌ Eliminating hallucinations completely: fine-tuning can reduce hallucinations in specific domains but won’t eliminate them.
❌ Changing inference speed: inference speed depends on model architecture and size, not training.
❌ Handling completely unseen tasks: if training data has no examples of task X, the model won’t suddenly learn to do task X.
The Real Problems and Risks
Fine-tuning comes with real challenges.

Data quality matters intensely
Bad data teaches bad patterns. Inconsistent formatting produces inconsistent outputs. Errors in training data become errors in the model. Garbage in, garbage out applies here with brutal honesty.
Overfitting
The model memorizes training data instead of learning general patterns. It performs perfectly on training examples but fails on new inputs. This happens with limited data, excessive training, or data that does not reflect real usage.
Regression on general capabilities
Fine-tuning on narrow data can degrade performance outside that domain. This is known as catastrophic forgetting. A model trained heavily on customer support tone might struggle with casual conversation later.
Cost and infrastructure
Training requires GPU compute. Costs vary widely by provider and region. Development usually involves multiple training runs. You need GPUs with enough memory, storage for data and model weights, tools to manage runs, and monitoring for failures.
Maintenance
Requirements change. Data distributions shift. Models need periodic retraining. Fine-tuning is ongoing work, not a one-time task.
Quantization

Before LoRA and QLoRA, understand quantization.
What it is: Model weights normally use 16-bit or 32-bit floating-point numbers. Quantization reduces this to 8-bit or 4-bit integers.
Why it matters: Memory usage drops significantly. A 7 billion parameter model requires roughly 14GB at 16-bit precision but around 3.5GB at 4-bit precision.
You can work with larger models on limited hardware.
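A back-of-the-envelope estimate of weight memory makes the difference concrete. This ignores activations, optimizer state, and framework overhead, so treat the numbers as rough lower bounds.

```python
def approx_weight_memory_gb(num_params_billion: float, bits_per_param: int) -> float:
    """Rough weight-memory estimate; ignores activations, optimizer state, and overhead."""
    bytes_per_param = bits_per_param / 8
    return num_params_billion * 1e9 * bytes_per_param / 1e9

print(approx_weight_memory_gb(7, 16))  # ~14 GB at 16-bit
print(approx_weight_memory_gb(7, 4))   # ~3.5 GB at 4-bit
```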
The trade-off: Lower precision loses information. A 16-bit number represents 65,536 different values. A 4-bit number only represents 16 values.
However, research shows the practical impact is often minimal for most tasks.
When quantization struggles: Tasks requiring extreme precision (certain math problems, specific code generation edge cases) might suffer. Most practical tasks work fine.
Types of Fine-Tuning

Full Fine-Tuning
What it does: Updates every parameter in the model.
Why it exists: Theoretically maximum quality.
When to use it:
- Large compute budget
- Maximum performance required
- Research purposes
When not to use it:
- Limited budget
- Want fast iteration
- Most real-world scenarios
Requirements: Very high memory requirements. A 7B parameter model typically needs 60GB+ of VRAM for full fine-tuning.
Cost: Expensive. Requires significant GPU resources.
PEFT (Parameter-Efficient Fine-Tuning)
What it is: Methods that train only a small subset of parameters instead of the entire model.
Why it exists: Full fine-tuning is expensive. PEFT achieves similar quality at a fraction of the cost.
The most popular PEFT method is LoRA.
LoRA (Low-Rank Adaptation)
What it does: Instead of updating all parameters, you train small adapter matrices. Typically this involves training roughly 1–5% of the original parameters.
The mechanism: Original weights stay frozen. Small matrices (adapters) get inserted into model layers. Only these adapters train.
Why it works: Most fine-tuning information can be captured in a smaller number of carefully placed weights.
Advantages:
- Much faster than full fine-tuning
- Uses far less memory
- Can swap different LoRA adapters on the same base model
- Quality approaches full fine-tuning for most tasks
When to use it: This should be your default choice unless you have specific reasons not to.
Requirements: Significantly lower than full fine-tuning. Memory needs depend on model size and batch size, but are much more manageable than full fine-tuning.
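As an illustration, a minimal LoRA setup with the Hugging Face PEFT library might look like the sketch below. The model ID, rank, and target modules are example choices, not universal recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example model

lora_config = LoraConfig(
    r=16,                        # rank of the adapter matrices
    lora_alpha=32,               # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # attention projections (model-specific)
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # typically only a small fraction of the base model's parameters
```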
QLoRA (Quantized LoRA)
What it does: Combines LoRA with quantization. LoRA adapters train on top of a quantized (compressed) base model.
The mechanism: Base model gets quantized to 4-bit. LoRA adapters train in higher precision. During training, the model temporarily dequantizes for computations, then quantizes again for storage.
Why it works: The adapters train in higher precision on top of the quantized base, so fine-tuning can partially compensate for the errors introduced by quantization.
Advantages: QLoRA can make it possible to fine-tune very large models on limited hardware, sometimes even a single high-memory GPU, but this typically requires careful configuration and may involve CPU offloading.
Trade-offs: Training is slower due to additional quantization and dequantization steps. Performance loss is typically minimal for most tasks.
When to use it:
- Limited GPU memory
- Want to train larger models than your hardware normally allows
- Experimenting and want to minimize costs
When not to use it:
- You already have sufficient memory for regular LoRA
- Training speed is critical
Instruction Tuning
What it is: Training the model on instruction-following examples. Input is an instruction. Output is the correct response.
Why it exists: Base models predict text but aren’t specifically trained to follow instructions in a particular format. Instruction tuning teaches them your specific instruction patterns.
When to use it: Most practical fine-tuning involves instruction tuning. You’re teaching the model to follow your specific formats and requirements.
Implementation: Usually combined with LoRA or QLoRA. You’re not choosing between instruction tuning and LoRA. You’re doing instruction tuning using LoRA as the method.
Libraries and Tools
Most fine-tuning setups today combine a core deep learning library with higher-level tools that handle model loading, optimization, and training workflows.
PyTorch and TensorFlow are core deep learning libraries. PyTorch is more common for LLM work. Hugging Face Transformers is the standard library for loading pre-trained models and training. Use it when you want control and understand training loops. PEFT provides LoRA, QLoRA, and related methods when you want direct control.
Unsloth focuses on optimizing single-GPU fine-tuning by reducing memory usage and improving training speed for certain model architectures, with gains depending on hardware and configuration.
Axolotl is a configuration-driven fine-tuning framework with flexible options. LLaMA Factory supports many models with both CLI and Web UI workflows. Use it when you prefer less code.
DeepSpeed supports training very large models across multiple GPUs. Use it for large clusters. Skip it for one or two GPUs.
For limited GPUs, consider Axolotl or LLaMA Factory. For single-GPU setups, consider Unsloth. For maximum control, use Transformers with PEFT directly.
Step-by-Step Fine-Tuning Workflow
The actual process.

Step 1: Data Preparation (The Engineering Reality)
This is where your model lives or dies. Most people think data prep is just “fixing typos.” In reality, it is about formatting, templating, and tokenization.
1. The Standard Format: JSONL
We rarely use CSVs for training. The standard is JSONL (JSON Lines), where every line is a valid, independent JSON object. This allows for streaming large datasets without loading everything into RAM.
The “Messages” Format (Modern Standard): Most tools (Axolotl, Unsloth, Hugging Face) now prefer the OpenAI-style conversation format:
{"messages": [{"role": "user", "content": "Explain quantum computing."}, {"role": "assistant", "content": "It uses qubits to exist in multiple states at once."}]}
2. The “Invisible” Layer: Chat Templates
This is the #1 reason fine-tunes fail. Your model does not actually “read” JSON. It reads a single long string of text. You need a Chat Template to convert that JSON into the specific string format the base model expects.
- If you use Llama 3, it expects: <|start_header_id|>user<|end_header_id|>\n...
- If you use ChatML (Qwen/Yi), it expects: <|im_start|>user\n...
The Warning: If you feed raw text or the wrong format to the model, it learns garbage. Always ensure your training pipeline uses the tokenizer’s apply_chat_template function. Do not manually hardcode these strings unless you are 100% sure of the model architecture.
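To make this concrete, here is roughly what that conversion looks like with a Hugging Face tokenizer. The model ID is just an example, and the special tokens you see in the output depend entirely on which model you load.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example model

messages = [
    {"role": "user", "content": "Explain quantum computing."},
    {"role": "assistant", "content": "It uses qubits to exist in multiple states at once."},
]

# The template wraps each turn in the model's own special tokens
# (ChatML-style <|im_start|>/<|im_end|> for Qwen, different markers for Llama 3).
print(tokenizer.apply_chat_template(messages, tokenize=False))
```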
3. The EOS Token Trap (Critical)
Every training example must end with a special EOS (End of Sequence) token.
- What it does: It tells the model, “Stop talking now.”
- The Risk: If your training data is missing this invisible token, your fine-tuned model will answer your question and then keep hallucinating forever until it hits the hard token limit.
- The Fix: Most modern libraries (like Unsloth/TRL) handle this automatically, but you must verify that add_eos_token=True is set in your tokenizer config.
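A quick way to check this on your own pipeline is to tokenize one flattened training example and look at the last token ID. The model ID below is an example, and whether add_eos_token exists as an option depends on the tokenizer class.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example model

example_text = "### Question: ...\n### Answer: ..."  # one flattened training example from your pipeline
ids = tokenizer(example_text)["input_ids"]

# If this prints False, nothing in your data tells the model where to stop:
# either your training library must append EOS for you, or you append tokenizer.eos_token yourself.
print(ids[-1] == tokenizer.eos_token_id)
```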
4. The “Quality Over Quantity” Checklist
Before you train, run a script to check these specific issues:
- Deduplication: Remove exact duplicates. If the model sees the exact same sentence 50 times, it will overfit on that specific phrasing.
- “As an AI” Removal: Scan your dataset for phrases like “As an AI language model” or “I cannot answer that.” If you train on these, your model will become more likely to refuse requests.
- Token Length: Calculate the token count of your longest example. If your longest example is 8,000 tokens but you train with a max_seq_length of 4,096, the end of your data (usually the answer!) gets cut off. The model learns to ask questions but never learns how to finish answering them.
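A rough pre-flight script covering these three checks might look like the sketch below. The file path, refusal phrases, and max length are placeholders for your own setup.

```python
import json
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")  # example model
MAX_SEQ_LENGTH = 4096                                                  # must match your training config
REFUSAL_PHRASES = ["as an ai language model", "i cannot answer"]

seen, kept, too_long, refusal_like = set(), [], 0, 0
with open("train.jsonl") as f:                                         # placeholder path
    for line in f:
        example = json.loads(line)
        text = " ".join(m["content"] for m in example["messages"])

        if text in seen:                                               # exact-duplicate check
            continue
        seen.add(text)

        if any(p in text.lower() for p in REFUSAL_PHRASES):
            refusal_like += 1

        if len(tokenizer.encode(text)) > MAX_SEQ_LENGTH:
            too_long += 1

        kept.append(example)

print(f"kept={len(kept)}, over_length={too_long}, refusal_like={refusal_like}")
```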
Step 2: Choosing a Pre-Trained Model (The Foundation)
Don’t just pick the most famous name. You are choosing an architecture and a starting point.
1. The “Base” vs. “Instruct” Dilemma
This is the first question you will face on Hugging Face.
- Base Models (e.g., Llama-3-8B): These are raw completion engines. They predict the next word but don't inherently know how to "chat." Use these if you are teaching a completely new format (like writing code comments or specialized JSON structures) from scratch.
- Instruct Models (e.g., Llama-3-8B-Instruct): These have already been fine-tuned to follow instructions.
- The Engineering Choice: For 95% of use cases, start with the Instruct version. You are performing “Continued Fine-Tuning.” It is much easier to steer an existing chat model towards your specific tone than to teach a base model how to chat from zero.
2. The Hardware Reality (VRAM Math)
You cannot download a model that doesn’t fit in your GPU.
- 7B–9B Models (Llama 3, Qwen 2.5, Gemma 2): The sweet spot.
- Requirement: Fits on a single consumer GPU (16GB–24GB VRAM) easily with 4-bit quantization.
- Use Case: Fast, cheap, good for specific tasks.
- 70B+ Models: The heavyweights.
- Requirement: Requires A100s (80GB) or multiple GPUs.
- Use Case: Complex reasoning where the 8B models fail.
3. License Traps
- Apache 2.0 / MIT (e.g., Qwen, Yi, OLMo): Truly open. Commercial use is free.
- Community Licenses (e.g., Llama 3, Mistral): Free for most, but have restrictions if you have >700 million users or specific use cases. Always read the LICENSE file on the repo.
Step 3: Dataset Splitting (The Truth Teller)
Most tutorials say “split 80/20” and move on. If you do this blindly, you might ruin your project due to Data Leakage.
1. The Purpose of Validation
We do not use the validation set to train. We use it to measure reality.
- Training Loss tells you: “Is the model memorizing this data?”
- Validation Loss tells you: “Is the model actually learning the concept?”
2. The “Data Leakage” Danger
This happens when your training data and validation data are too similar.
- The Scenario: You have 5 variations of the question “How do I reset my password?” in your dataset.
- The Mistake: You do a random shuffle. 4 variations end up in Training, and 1 ends up in Validation.
- The Result: The model memorizes the answer from the Training set and gets the Validation question right perfectly. You think you have a genius model. In reality, it has learned nothing.
- The Fix: Deduplicate semantically before splitting. Ensure that “Password Reset” concepts are either all in training or all in validation, or be very careful about variation.
3. Stratified Splitting
If your dataset covers 3 distinct topics (e.g., SQL, Python, JavaScript), make sure your validation set has examples of all three. If your random split accidentally puts all the “Python” examples in Training, you will have no idea if your model is actually good at Python code generation until you deploy it and it breaks.
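One way to do this, assuming each example carries a topic label, is scikit-learn’s train_test_split with the stratify argument. The labels and split size here are illustrative.

```python
from sklearn.model_selection import train_test_split

# Hypothetical examples: each record carries a "topic" label used for stratification.
examples = [
    {"topic": "sql", "messages": [...]},
    {"topic": "python", "messages": [...]},
    {"topic": "javascript", "messages": [...]},
    # ... the rest of your dataset
]
topics = [ex["topic"] for ex in examples]

train_set, val_set = train_test_split(
    examples,
    test_size=0.1,
    stratify=topics,    # keeps the topic mix the same in both splits
    random_state=42,
)
```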
Pro Tip: Always manually read 10–20 rows of your validation set. If they look exactly like your training data, your evaluation will be worthless.
Step 4: The Hyperparameters That Matter
This is where fine-tuning goes from magic to engineering. You aren’t just “training”; you are balancing three constraints: Compute/Memory, Training Stability, and Model Quality.
1. Learning Rate (LR)
- What it is: How “large” the steps are that the model takes towards the solution.
The Trade-off:
- Too High: The loss explodes (becomes NaN) or the model forgets its original language skills (Catastrophic Forgetting).
- Too Low: The model learns nothing, and you waste hours of compute.
- Recommendation: Start with 2e-4 for QLoRA/LoRA.
2. Batch Size
- What it is: How many examples the model looks at before updating its weights.
The Trade-off:
- Too Large: You hit Out Of Memory (OOM) errors.
- Too Small: The training becomes unstable (noisy gradients) because single bad examples skew the update.
- The Fix (Gradient Accumulation): If your GPU fits a batch of 2, but you need a batch of 16, set Gradient Accumulation to 8. The model calculates 8 mini-batches, sums the gradients, and then updates.
3. Epochs
- What it is: How many times the model sees your entire dataset.
The Trade-off:
- Too Many: Overfitting. The model memorizes your training data verbatim but fails on new inputs.
- Too Few: Underfitting. The model doesn’t learn the format or style.
- Recommendation: Start with 1–3 epochs. Stop immediately if Validation Loss starts increasing.
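Putting these together, a starting configuration with a recent version of Hugging Face transformers could look roughly like this. Every value is a starting point to adjust, not a universal answer.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    learning_rate=2e-4,              # common starting point for LoRA/QLoRA
    per_device_train_batch_size=2,   # what actually fits in VRAM
    gradient_accumulation_steps=8,   # effective batch size = 2 x 8 = 16
    num_train_epochs=2,
    eval_strategy="steps",           # evaluate on the validation split during training
    eval_steps=100,
    logging_steps=10,
)
```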
Step 5: Training (The “Health Monitor”)
Running the script is the easy part. Knowing if it is actually learning is the hard part. You generally monitor this using tools like WandB (Weights & Biases) or TensorBoard.
1. The Tale of Two Curves
You will see two lines on your graph. Their relationship tells you everything.
- Training Loss (Blue Line): This measures how well the model remembers the specific examples it is seeing right now. This should always go down. If it goes up or stays flat, your Learning Rate is likely too high or your data is broken.
- Validation Loss (Orange Line): This measures how well the model generalizes to unseen data.
- Ideal: Goes down steadily alongside training loss.
- The Danger Zone: It hits a minimum point and starts curling upwards.
2. The Overfitting Trap
When Validation Loss starts rising while Training Loss keeps falling, your model has stopped “learning” and started “cramming.” It is memorizing the training data at the expense of general intelligence.
- The Fix: This is why we use Checkpoints. If you train for 3 epochs, save the model every 0.5 epochs. You will likely throw away the final model (Epoch 3) and keep the one from Epoch 2.5 where the validation loss was lowest.
3. The “NaN” Panic
If your loss suddenly spikes to NaN (Not a Number) or 0.00, your training has crashed mathematically.
- Cause: Usually Gradient Explosion. The numbers got too big for the floating-point format (fp16) to handle.
- Fix: Lower the learning rate or use gradient_clipping.
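In Hugging Face Trainer terms, the checkpointing, best-model selection, and gradient clipping described above map roughly onto settings like these (the step counts are illustrative):

```python
from transformers import TrainingArguments, EarlyStoppingCallback

training_args = TrainingArguments(
    output_dir="out",
    eval_strategy="steps",
    eval_steps=200,
    save_strategy="steps",
    save_steps=200,                  # keep intermediate checkpoints, not just the final one
    load_best_model_at_end=True,     # roll back to the checkpoint with the lowest validation loss
    metric_for_best_model="eval_loss",
    greater_is_better=False,
    max_grad_norm=1.0,               # gradient clipping guards against loss spikes / NaNs
)

# With Trainer, early stopping halts the run once validation loss stops improving:
# trainer = Trainer(..., callbacks=[EarlyStoppingCallback(early_stopping_patience=3)])
```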
Step 6: Evaluation (The Reality Check)
Stop using BLEU and ROUGE. Unless you are doing pure translation, these metrics are meaningless for LLMs. If the correct answer is “The sky is blue” and your model says “The firmament is azure,” BLEU gives it a score of 0, even though the answer is perfect.
1. The Modern Standard: LLM-as-a-Judge
Since we can’t mathematically calculate “good writing,” we use a smarter model to grade our fine-tune.
- How it works: You take 50 test questions. You generate answers from your fine-tuned model. You then feed both the question and the answer to GPT-4 (or Claude 3.5 Sonnet) with a rubric.
- The Rubric: “Rate the following answer on a scale of 1–5 for accuracy, tone, and format compliance.”
- Why it wins: It correlates highly with human preference but scales instantly and cheaply.
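A minimal judging loop might look like the sketch below, assuming an OpenAI-compatible client. The judge model, rubric wording, and the my_finetuned_model / test_questions helpers are placeholders for your own code.

```python
from openai import OpenAI

client = OpenAI()  # assumes an OpenAI-compatible API with the key in your environment

RUBRIC = (
    "Rate the following answer on a scale of 1-5 for accuracy, tone, and format compliance. "
    "Reply with only the number."
)

def judge(question: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o",  # example judge model
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Question:\n{question}\n\nAnswer:\n{answer}"},
        ],
    )
    return int(response.choices[0].message.content.strip())

# Hypothetical helpers: my_finetuned_model() runs inference, test_questions is your held-out set.
scores = [judge(q, my_finetuned_model(q)) for q in test_questions]
print(sum(scores) / len(scores))
```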
2. Functional Evaluation (The “Unit Test”)
If you are fine-tuning for structured output (JSON, SQL, Code), you don’t need vibes; you need syntax.
- The Metric: Execution Success Rate.
- The Test: Generate 100 SQL queries. Run them against a real database. How many actually execute without syntax errors? That is your accuracy score.
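For SQL, an execution-based check can be as simple as the following sketch. Here sqlite3 stands in for whatever database you actually target, and generate_sql is assumed to be your fine-tuned model’s inference function.

```python
import sqlite3

def execution_success_rate(questions, generate_sql, db_path="test.db"):
    """Fraction of generated SQL queries that run without raising an error."""
    conn = sqlite3.connect(db_path)
    successes = 0
    for q in questions:
        query = generate_sql(q)      # your fine-tuned model's output for this question
        try:
            conn.execute(query)
            successes += 1
        except sqlite3.Error:
            pass                     # syntax or runtime failure counts against the model
    conn.close()
    return successes / len(questions)
```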
3. The “Vibe Check” (Side-by-Side)
Do not trust your memory. Humans are biased.
- The Blind Test: Create a spreadsheet with three columns: Input, Model A Output, Model B Output. Blind the names. Read them and mark which one you prefer.
- The Catch: Often, a base model (like Llama-3-Instruct) is better at general reasoning than your fine-tune. Be honest. If the base model wins, your fine-tuning data might be degrading the model (Catastrophic Forgetting).
Step 7: Iteration
You’ll probably need multiple training runs.
Early runs might overfit or underfit. Adjust hyperparameters. Check if training data matches production. This is normal. Budget for it.
Key Considerations Before Fine-Tuning
Dataset size and quality: Research suggests several hundred high-quality examples can work with parameter-efficient methods, though 1,000–10,000 examples typically provide better performance. Exact needs depend heavily on task complexity.
Quality matters more than quantity.
Domain similarity to base model: Pick a base model trained on data somewhat related to your task. Fine-tuning works better when building on relevant existing knowledge.
Compute and budget constraints: GPU costs vary by provider and region. For small LoRA fine-tuning runs on 7B models, costs are often in the tens of dollars range for a few hours of training on a single GPU.
Budget for multiple training runs during development.
Overfitting risks: Small datasets overfit easily. Homogeneous data (all examples very similar) overfits easily. Training too long overfits even with good data.
Maintenance cost: Requirements change. Data distributions shift. Plan for periodic retraining.
Clear definition of success: What does “better” mean? How will you measure it? What threshold do you need?
Define this before training. Otherwise you won’t know when you’re done.
When Fine-Tuning is Right vs. Wrong
Let’s be concrete.
Use prompting when:
Example: Summarizing customer feedback
You need: Different summary styles for different teams
Why prompting works: Instructions change per request. Fine-tuning would lock you into one style.
Use in-context learning when:
Example: Categorizing support tickets
You have: Dozens of examples of categories
Why it works: Examples fit in context. The task is clear from examples.
Use RAG when:
Example: Answering questions about your product documentation
You need: Up-to-date answers as docs change
Why it works: Docs change regularly. Retraining isn’t feasible. Retrieval gives current info.
Use fine-tuning when:
Example 1: Generating SQL queries from natural language
You have: Thousands of examples of question-SQL pairs
Why it works: Complex instruction patterns. You need consistency. Prompts hit token limits.
Example 2: Customer support chatbot with your brand voice
You have: Thousands of real support conversations
Why it works: Tone needs to be consistent. In-context examples drift. You’re making many thousands of requests monthly.
Example 3: Code generation in your company’s style
You have: Many code examples following your conventions
Why it works: Specific formatting rules. Naming conventions. Architecture patterns. Prompts can’t capture all nuances efficiently.
Don’t fine-tune when:
Example 1: You have limited examples
Why: Not enough data. In-context learning likely works better.
Example 2: Requirements change frequently
Why: You’ll spend all your time retraining. Use prompting for flexibility.
Example 3: You need to cite sources
Why: Fine-tuning doesn’t add reliable citations. Use RAG.
Example 4: Prompting already works well
Why: Don’t fix what isn’t broken.
Closing Thoughts
Most problems don’t need fine-tuning. Start simple. Escalate only when needed.
Fine-tuning is powerful when used correctly. The goal is not to use advanced techniques. The goal is to ship something that works.
Start small. Learn from one real run. Then scale.
Now go build something.