When Prompting Isn’t Enough: The Case for Training
While prompting is a powerful tool for guiding LLM behavior, it becomes insufficient in two key scenarios:
1. When domain-specific training data exists: If you have substantial, high-quality data specific to your use case, training can fundamentally improve model performance in ways that prompting cannot match.
2. When domain adaptation is required: General-purpose LLMs trained on broad internet data often struggle with specialized domains like medicine, law, finance, or proprietary enterprise contexts.
Understanding Domain Adaptation
Domain adaptation is the process of customizing a generative AI foundation model that has been trained on massive amounts of public data to increase its knowledge and capabilities for a specific domain or use case. This may involve adapting models for specialized verticals, enhancing abilities in particular languages, or personalizing models to a company’s unique concepts and terminology.
The shift toward training-based adaptation is understandable: training typically leads to better results on specialized tasks, and it has become more affordable thanks to a growing number of efficiency techniques. For example, while training costs rose from roughly $900 for the original Transformer to over $4 million for GPT-3, a model on par with GPT-3, such as Phi-3.5, can now be trained for around $0.8 million.
Key Training Approaches for Domain Adaptation
Continued Pre-Training (CPT)
Also known as second-stage pre-training, CPT involves further training a foundation model on new, unseen domain data using the same self-supervised algorithm as the initial pre-training. All model weights are typically updated, with a fraction of the original data mixed into the new domain-specific data to prevent catastrophic forgetting.
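As a rough illustration of that data-mixing step, the sketch below interleaves a small fraction of the original pre-training corpus back into the domain stream; the 10% replay ratio and the iterator-based setup are illustrative assumptions, not a prescribed recipe.

```python
import random

def mixed_stream(domain_docs, original_docs, replay_ratio=0.1, seed=0):
    """Yield mostly domain documents, occasionally 'replaying' original pre-training data."""
    rng = random.Random(seed)
    for doc in domain_docs:
        if original_docs and rng.random() < replay_ratio:
            yield rng.choice(original_docs)  # replay a sample of the original data mix
        yield doc                            # the bulk of the stream is new domain data
```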
Fine-Tuning Methods
Fine-tuning is the process of adapting a pre-trained language model using an annotated dataset in a supervised manner or using reinforcement learning techniques. Recent advances include:
- LoRA (Low-Rank Adaptation): Adds small, trainable matrices to model layers, drastically reducing the number of parameters that need updating
- Adapter Layers: Insert lightweight, task-specific layers within transformer blocks
- Prefix-Tuning: Optimizes only prefix tokens prepended to inputs at each layer
- Direct Preference Optimization (DPO): Aligns models with human preferences without requiring a separate reward model
Recent research in 2025 explores how various fine-tuning strategies including CPT, SFT (Supervised Fine-Tuning), and preference-based optimization approaches like DPO and ORPO affect model performance. Notably, model merging—combining multiple fine-tuned models—can lead to emergent capabilities that surpass individual parent models.
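To make the parameter-efficient end of this spectrum concrete, here is a minimal LoRA setup sketch assuming the Hugging Face transformers and peft libraries; the base checkpoint name, target modules, and hyperparameters are illustrative placeholders rather than recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; substitute whatever base model you are adapting.
base = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically a small fraction of the total weights
# From here, train as usual (e.g., with transformers.Trainer) on the domain data.
```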
The Challenge of Catastrophic Forgetting
Any approach that updates model weights is susceptible to catastrophic forgetting, where the model loses previously learned skills and knowledge. For instance, models fine-tuned in medical domains have shown degraded performance on instruction-following and common QA tasks.
Cramming: Efficient Training for Research
"Cramming" refers to the experimental challenge of training an LLM on a single GPU within a single day. This approach has become valuable for research teams with limited computational resources, enabling rapid experimentation with training techniques and architectural choices.
Decoding: How LLMs Generate Text
Decoding is the iterative process by which LLMs generate text, selecting one token at a time based on probability distributions over the vocabulary. Understanding decoding strategies is crucial for controlling model outputs.
Greedy Decoding
The simplest approach where the model always selects the token with the highest probability at each step. While deterministic and fast, greedy decoding often produces suboptimal or repetitive outputs because it lacks diversity.
Setting greedy decoding: Typically achieved by setting temperature=0.0, though even with temperature 0, results may not be fully deterministic due to implementation details like floating-point arithmetic variations and parallel processing.
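At the level of logits, greedy decoding is simply an arg-max loop. The toy sketch below makes this explicit; `next_token_logits` is a hypothetical stand-in for a real model forward pass.

```python
import numpy as np

def greedy_decode(next_token_logits, prompt_ids, max_new_tokens=32, eos_id=0):
    """Always pick the single most probable next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)   # hypothetical model call, returns shape (vocab_size,)
        token = int(np.argmax(logits))    # deterministic choice: the highest-probability token
        ids.append(token)
        if token == eos_id:
            break
    return ids
```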
Non-Deterministic (Stochastic) Decoding
Instead of always picking the highest probability token, stochastic methods randomly sample from high-probability candidates, introducing creativity and diversity into outputs.
Key Sampling Methods:
1. Top-k Sampling: Randomly selects from the k most likely tokens, ensuring prioritization of probable tokens while introducing randomness. For example, with k=3 and probabilities P(A)=30%, P(B)=15%, P(C)=5%, the probabilities are renormalized over these three tokens, so the algorithm outputs A 60% of the time, B 30%, and C 10%.
2. Nucleus Sampling (Top-p): Dynamically forms the smallest set of tokens whose cumulative probability exceeds the threshold p, adapting the selection pool to the shape of the distribution.
3. Temperature Sampling: Modulates the probability distribution by rescaling logits before applying softmax.
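The following reference implementations of top-k and top-p sampling over a single logits vector use plain NumPy; production inference stacks apply the same logic on GPU tensors, but the mechanics are identical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_k_sample(logits, k=3, rng=None):
    rng = rng or np.random.default_rng()
    probs = softmax(logits)
    top = np.argsort(probs)[-k:]               # ids of the k most likely tokens
    p = probs[top] / probs[top].sum()          # renormalize over the top-k
    return int(rng.choice(top, p=p))

def top_p_sample(logits, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]            # most likely first
    cdf = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cdf, p)) + 1  # smallest prefix with cumulative mass >= p
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()        # renormalize over the nucleus
    return int(rng.choice(keep, p=q))
```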
Understanding Temperature
Temperature is a crucial hyperparameter that controls the balance between creativity and predictability in LLM outputs.
How Temperature Works:
- Temperature directly affects the variability and randomness of generated responses by scaling logits in the softmax function
- Low temperature (T < 1.0): Makes the distribution more peaked around the most likely tokens, producing more deterministic, focused outputs
- High temperature (T > 1.0): Flattens the distribution, giving less probable tokens higher chances, increasing creativity and diversity
- Temperature = 1.0: Equivalent to standard softmax with no modification
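A quick numerical illustration of this effect, using made-up logits for a three-token vocabulary:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

logits = np.array([2.0, 1.0, 0.1])      # illustrative logits for three tokens
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax(logits / T), 3))
# T=0.5 -> [0.864 0.117 0.019]  (sharply peaked, nearly greedy)
# T=1.0 -> [0.659 0.242 0.099]  (unmodified softmax)
# T=2.0 -> [0.502 0.304 0.194]  (much flatter, more diverse samples)
```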
Practical Guidelines:
Low sampling temperatures are recommended for tasks requiring precision and factual accuracy, such as technical writing, code generation, or question answering, while higher temperatures suit creative tasks like writing poetry or brainstorming.
However, recent empirical research indicates that changes in temperature in the range 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks, contrary to anecdotal reports. This finding appears to hold regardless of the LLM, prompt-engineering technique, or problem domain tested.
The Creativity-Hallucination Trade-off:
Higher temperatures increase creativity but also raise the probability of hallucinations. Temperature sampling often comes at the cost of lower task accuracy than deterministic decoding, while deterministic approaches tend to reduce the diversity of generated outputs.
Advanced Decoding: Selective Sampling
Recent research has introduced selective sampling, which dynamically switches between greedy and high-temperature sampling based on a "sampling risk metric" that estimates the likelihood of errors when applying high-temperature sampling at specific token positions. This approach enhances the quality-diversity trade-off even in high-temperature settings.
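To make the idea concrete, here is a toy sketch that switches between greedy and high-temperature sampling based on the entropy of the next-token distribution; the entropy threshold is a simplified stand-in heuristic, not the risk metric proposed in that research.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

def select_next_token(logits, high_temp=1.5, entropy_threshold=1.0, rng=None):
    """Greedy where the model is confident; high-temperature sampling where it is not."""
    rng = rng or np.random.default_rng()
    base = softmax(logits)                          # unscaled (T = 1) distribution
    entropy = -np.sum(base * np.log(base + 1e-12))  # low entropy = confident position
    if entropy < entropy_threshold:
        return int(np.argmax(logits))               # play it safe: greedy
    hot = softmax(logits / high_temp)               # ambiguous position: explore
    return int(rng.choice(len(hot), p=hot))
```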
Hallucination: The Persistent Challenge
Hallucination occurs when LLMs generate text that is non-factual, ungrounded, or contradicts provided information. This remains one of the most critical challenges in deploying LLMs in real-world applications.
Understanding Different Types of Hallucinations
- Factuality Hallucinations: Generating information that contradicts known facts
- Faithfulness Hallucinations: Producing outputs that contradict the provided context or retrieved documents
- Intrinsic Hallucinations: Contradicting the source material directly
- Extrinsic Hallucinations: Adding information not present in the source
Causes of Hallucinations
Hallucinations in LLMs stem from various sources including limitations within retrieval-augmented generation (RAG) components, such as data source problems, query issues, retriever limitations, context noise, context conflicts, and model capability boundaries.
Knowledge boundaries also play a role: when faced with tasks beyond the scope of their training data, LLMs fall back on learned patterns and generate responses that are inconsistent with the facts.
Reducing Hallucinations: Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has emerged as the primary technique for mitigating hallucinations by augmenting LLMs with external, authoritative knowledge.
How RAG Works
RAG enhances large language models by incorporating an information-retrieval mechanism that allows models to access and utilize additional data beyond their original training set. The process typically involves:
- Indexing: Converting documents into embeddings and storing them in a vector database
- Retrieval: Selecting the most relevant documents for a given query
- Augmentation: Injecting retrieved information into the LLM prompt
- Generation: Producing a response grounded in the retrieved context
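A minimal end-to-end sketch of these four steps, assuming document embeddings have already been indexed; `embed` and `generate` are hypothetical placeholders for an embedding model and an LLM call, not a particular library's API.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query, docs, doc_embeddings, embed, generate, top_k=3):
    q = embed(query)                                     # embed the query
    scores = [cosine(q, d) for d in doc_embeddings]      # retrieval: score every indexed chunk
    best = np.argsort(scores)[-top_k:][::-1]             # keep the top-k most relevant
    context = "\n\n".join(docs[i] for i in best)         # augmentation: build the grounded prompt
    prompt = (
        "Answer using only the context below. If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                              # generation: grounded response
```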
The Reality: RAG Doesn’t Eliminate Hallucinations
According to multiple sources, RAG does not prevent hallucinations in LLMs—it is not a direct solution because the LLM can still hallucinate around the source material in its response.
Key RAG Limitations:
Context Misinterpretation: LLMs may extract statements from a source without considering context, resulting in incorrect conclusions. For example, an LLM might retrieve information from an academic book rhetorically titled "Barack Hussein Obama: America’s First Muslim President?" and generate the false statement that Obama was Muslim, failing to understand the rhetorical nature of the title.
Incomplete Knowledge Extraction: Even with RAG, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or contradictory responses
Retrieval Quality Issues: Naive RAG implementations suffer from low precision (misaligned chunks), low recall (failure to retrieve all relevant chunks), and outdated information
Advanced RAG Techniques
Improving Retrieval:
- Query decomposition and rewriting
- Hypothetical Document Embeddings (HyDE): Generating hypothetical answers and using them for retrieval (see the sketch after this list)
- Hybrid search combining dense and sparse retrieval
- Re-ranking retrieved documents
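A sketch of the HyDE idea mentioned above: embed a model-written hypothetical answer instead of the raw query, on the intuition that answer-shaped text lands closer to relevant passages in embedding space. As before, `embed` and `generate` are hypothetical placeholders.

```python
import numpy as np

def hyde_retrieve(query, doc_embeddings, embed, generate, top_k=3):
    # Ask the LLM for a plausible (possibly wrong) answer, then search with it.
    hypothetical = generate(f"Write a short passage that answers: {query}")
    q = embed(hypothetical)
    doc_matrix = np.asarray(doc_embeddings)
    scores = doc_matrix @ q / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q))
    return np.argsort(scores)[-top_k:][::-1]   # indices of the best-matching documents
```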
Improving Generation:
- Fine-tuning models on (prompt, context, response) triples
- Context-aware decoding that upweights tokens matching retrieved context
- Chain-of-Verification to reduce hallucinations
Groundedness and Attribution: Ensuring Trustworthy Outputs
Beyond reducing hallucinations, ensuring outputs are properly grounded and attributed to sources has become critical for building trust in LLM applications.
Defining Groundedness
Groundedness means generated text is supported by and can be attributed to specific documents or sources. While "groundedness" seeks attribution to a user-specific knowledge base, "factuality" seeks attribution to commonly agreed world knowledge.
Attributional Grounding
The research community has embraced attributional grounding, where systems must output documents that ground their answers. This approach increases transparency and allows users to verify claims.
Key Approaches:
- In-Context Learning with Citations: Prompting LLMs to generate responses with inline citations
- Post-Hoc Attribution: Using Natural Language Inference (NLI) models to add citations after generation
- Training for Attribution: Fine-tuning models to generate grounded responses with citations
The TRUE Model and NLI-Based Verification
The TRUE model (based on Natural Language Inference) is widely used for measuring groundedness by judging whether a claim is supported by a passage. This automated approach has become standard for evaluating whether LLM outputs are properly grounded.
How NLI Models Work for Attribution:
- Given a passage (premise) and a claim (hypothesis), NLI models output probabilities for entailment, contradiction, or neutral relationships
- For each generated sentence, the NLI model identifies which passage supports it
- Citations are added only to sentences with supporting passages
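A minimal sketch of this post-hoc citation step, assuming the Hugging Face transformers library and a publicly available NLI checkpoint; the model name, label order, and entailment threshold below are assumptions worth verifying for your setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_MODEL = "roberta-large-mnli"  # assumed checkpoint; other NLI models work similarly
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the passage (premise) entails the generated claim (hypothesis)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[-1].item()  # assumed label order: contradiction, neutral, entailment

def add_citations(sentences, passages, threshold=0.8):
    cited = []
    for sent in sentences:
        scores = [entailment_prob(p, sent) for p in passages]
        best = max(range(len(passages)), key=lambda i: scores[i])
        if scores[best] >= threshold:
            cited.append(f"{sent} [{best + 1}]")  # cite the best supporting passage
        else:
            cited.append(sent)                    # leave unsupported sentences uncited (or flag them)
    return cited
```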
Training LLMs to Generate Citations
Recent frameworks like AGREE fine-tune LLMs to self-ground their claims and provide precise citations to retrieved documents. The process involves:
- Sampling responses from a base LLM without citations
- Using an NLI model to automatically add citations to well-grounded sentences
- Fine-tuning the LLM on these augmented responses
- Implementing test-time adaptation to iteratively refine outputs
Results show that tuning-based approaches lead to substantially better grounding than prompting or post-hoc methods, often achieving relative improvements of over 30%.
Challenges in Long-Context Attribution
As LLMs handle increasingly longer contexts (100K+ tokens), citation becomes more challenging. Recent work on long-context citation focuses on:
- Generating fine-grained citations to specific snippets rather than entire documents
- Evaluating both citation quality (precision and recall) and answer correctness
- Developing benchmarks that test attribution across various long-context tasks
Detecting Hallucinations in RAG Systems
Even with RAG, detecting when models hallucinate remains crucial. Recent approaches include:
Mechanistic Interpretability: Research has discovered that hallucinations in RAG occur when Knowledge FFNs in LLMs overemphasize parametric knowledge while Copying Heads fail to effectively integrate external knowledge from retrieved content.
Layer-wise Relevance Propagation: Computing relevance between inputs and outputs to identify when generated content lacks proper grounding in retrieved documents.
Verification Guardrails: Using NLI models at inference time to verify that each claim in the response is supported by retrieved context, catching hallucinations before they reach users.
Best Practices for Production Systems
When deploying LLMs in production environments:
1. Choose the Right Approach: Use prompting for general tasks and training for domain-specific applications where data is available.
2. Implement RAG Thoughtfully: Don't treat RAG as a silver bullet; combine it with proper evaluation, guardrails, and monitoring.
3. Control Decoding Parameters: Set temperature based on your use case (low for factual tasks, higher for creative applications), but test empirically.
4. Add Citation Support: Train or prompt models to provide citations, making it easier to verify and trust outputs.
5. Implement Verification: Use NLI models or other verification mechanisms to catch hallucinations before they reach users.
6. Monitor Groundedness: Continuously evaluate whether model outputs remain grounded in your knowledge base.
7. Plan for Adaptation: Budget for continued training or fine-tuning as your domain evolves and new data becomes available.
The Future of Training, Decoding, and Grounding
The field continues to evolve rapidly:
- Efficient Training: Techniques like LoRA and adapter methods make domain adaptation increasingly accessible
- Dynamic Decoding: Selective sampling and context-aware decoding promise better quality-diversity trade-offs
- Integrated Verification: Models are increasingly being trained with built-in verification and attribution capabilities
- Reasoning Models: Systems like OpenAI’s o1 and DeepSeek-R1 incorporate explicit reasoning steps, potentially reducing hallucinations
- Multi-Step RAG: Advanced systems integrate retrieval into reasoning chains, dynamically acquiring evidence as needed
Understanding the interplay between training, decoding, and hallucination is essential for building reliable LLM applications:
- Training provides the foundation, but requires careful consideration of catastrophic forgetting and computational costs
- Decoding strategies affect the creativity-accuracy trade-off, with temperature being less impactful than commonly believed for problem-solving tasks
- Hallucinations remain a fundamental challenge that RAG helps but doesn’t eliminate
- Groundedness and attribution are critical for building trust, requiring intentional design through training, prompting, or post-hoc methods
As LLMs become more integrated into critical applications, addressing these challenges through robust training, careful decoding choices, effective RAG implementation, and strong verification mechanisms will be essential for success.