When Prompting Isn’t Enough: The Case for Training
While prompting is a powerful tool for guiding LLM behavior, it becomes insufficient in two key scenarios:
1. When domain-specific training data exists: If you have substantial, high-quality data specific to your use case, training can fundamentally improve model performance in ways that prompting cannot match.
2. When domain adaptation is required: General-purpose LLMs trained on broad internet data often struggle with specialized domains like medicine, law, finance, or proprietary enterprise contexts.
Understanding Domain Adaptation
Domain adaptation is the process of customizing a generative AI foundation model that has been trained on massive amounts of public data to increase its knowledge and capabilities for a specific domain or use case. This may involve adapting models for specialized verticals, enhancing abilities in particular languages, or personalizing models to a company’s unique concepts and terminology.
The shift toward training-based adaptation is understandable: training typically leads to better results on specialized tasks, and it has become more affordable thanks to a growing number of efficiency techniques. For example, while training costs rose from roughly $900 for the original Transformer to over $4 million for GPT-3, a model on par with GPT-3, such as Phi-3.5, can now be trained for around $0.8 million.
Key Training Approaches for Domain Adaptation
Continued Pre-Training (CPT)
Also known as second-stage pre-training, CPT involves further training a foundation model on new, unseen domain data using the same self-supervised algorithm as the initial pre-training. All model weights are typically updated, with a fraction of the original data mixed into the new domain-specific data to prevent catastrophic forgetting.
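As a rough illustration of that data-mixing step, the sketch below interleaves a small fraction of the original pre-training corpus back into the domain stream; the 10% replay ratio and the iterator-based setup are illustrative assumptions, not a prescribed recipe.

```python
import random

def mixed_stream(domain_docs, original_docs, replay_ratio=0.1, seed=0):
    """Yield mostly domain documents, occasionally 'replaying' original pre-training data."""
    rng = random.Random(seed)
    for doc in domain_docs:
        if original_docs and rng.random() < replay_ratio:
            yield rng.choice(original_docs)  # replay a sample of the original data mix
        yield doc                            # the bulk of the stream is new domain data
```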
Fine-Tuning Methods
Fine-tuning is the process of adapting a pre-trained language model using an annotated dataset in a supervised manner or using reinforcement learning techniques. Recent advances include:
- LoRA (Low-Rank Adaptation): Adds small, trainable matrices to model layers, drastically reducing the number of parameters that need updating
- Adapter Layers: Insert lightweight, task-specific layers within transformer blocks
- Prefix-Tuning: Optimizes only prefix tokens prepended to inputs at each layer
- Direct Preference Optimization (DPO): Aligns models with human preferences without requiring a separate reward model
Recent research in 2025 explores how various fine-tuning strategies including CPT, SFT (Supervised Fine-Tuning), and preference-based optimization approaches like DPO and ORPO affect model performance. Notably, model merging—combining multiple fine-tuned models—can lead to emergent capabilities that surpass individual parent models.
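To make the parameter-efficient end of this spectrum concrete, here is a minimal LoRA setup sketch assuming the Hugging Face transformers and peft libraries; the base checkpoint name, target modules, and hyperparameters are illustrative placeholders rather than recommendations.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; substitute whatever base model you are adapting.
base = AutoModelForCausalLM.from_pretrained("your-org/your-base-model")

lora_cfg = LoraConfig(
    r=8,                                  # rank of the low-rank update matrices
    lora_alpha=16,                        # scaling applied to the update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt (model-dependent)
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # typically a small fraction of the total weights
# From here, train as usual (e.g., with transformers.Trainer) on the domain data.
```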
The Challenge of Catastrophic Forgetting
Any approach that updates model weights is susceptible to catastrophic forgetting, where the model loses previously learned skills and knowledge. For instance, models fine-tuned in medical domains have shown degraded performance on instruction-following and common QA tasks.
Cramming: Efficient Training for Research
"Cramming" refers to the experimental challenge of training an LLM on a single GPU within a single day. This approach has become valuable for research teams with limited computational resources, enabling rapid experimentation with training techniques and architectural choices.
Decoding: How LLMs Generate Text
Decoding is the iterative process by which LLMs generate text, selecting one token at a time based on probability distributions over the vocabulary. Understanding decoding strategies is crucial for controlling model outputs.
Greedy Decoding
The simplest approach where the model always selects the token with the highest probability at each step. While deterministic and fast, greedy decoding often produces suboptimal or repetitive outputs because it lacks diversity.
Setting greedy decoding: Typically achieved by setting temperature=0.0, though even with temperature 0, results may not be fully deterministic due to implementation details like floating-point arithmetic variations and parallel processing.
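At the level of logits, greedy decoding is simply an arg-max loop. The toy sketch below makes this explicit; `next_token_logits` is a hypothetical stand-in for a real model forward pass.

```python
import numpy as np

def greedy_decode(next_token_logits, prompt_ids, max_new_tokens=32, eos_id=0):
    """Always pick the single most probable next token."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        logits = next_token_logits(ids)   # hypothetical model call, returns shape (vocab_size,)
        token = int(np.argmax(logits))    # deterministic choice: the highest-probability token
        ids.append(token)
        if token == eos_id:
            break
    return ids
```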
Non-Deterministic (Stochastic) Decoding
Instead of always picking the highest probability token, stochastic methods randomly sample from high-probability candidates, introducing creativity and diversity into outputs.
Key Sampling Methods:
1. Top-k Sampling: Randomly selects from the k most likely tokens, ensuring prioritization of probable tokens while introducing randomness. For example, with k=3 and probabilities P(A)=30%, P(B)=15%, P(C)=5%, the probabilities are renormalized over these three tokens, so the algorithm outputs A 60% of the time, B 30%, and C 10%.
2. Nucleus Sampling (Top-p): Dynamically forms the smallest set of tokens whose cumulative probability exceeds the threshold p, adapting the selection pool to the shape of the distribution.
3. Temperature Sampling: Modulates the probability distribution by rescaling logits before applying softmax.
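The following reference implementations of top-k and top-p sampling over a single logits vector use plain NumPy; production inference stacks apply the same logic on GPU tensors, but the mechanics are identical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def top_k_sample(logits, k=3, rng=None):
    rng = rng or np.random.default_rng()
    probs = softmax(logits)
    top = np.argsort(probs)[-k:]               # ids of the k most likely tokens
    p = probs[top] / probs[top].sum()          # renormalize over the top-k
    return int(rng.choice(top, p=p))

def top_p_sample(logits, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    probs = softmax(logits)
    order = np.argsort(probs)[::-1]            # most likely first
    cdf = np.cumsum(probs[order])
    cutoff = int(np.searchsorted(cdf, p)) + 1  # smallest prefix with cumulative mass >= p
    keep = order[:cutoff]
    q = probs[keep] / probs[keep].sum()        # renormalize over the nucleus
    return int(rng.choice(keep, p=q))
```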
Understanding Temperature
Temperature is a crucial hyperparameter that controls the balance between creativity and predictability in LLM outputs.
How Temperature Works:
- Temperature directly affects the variability and randomness of generated responses by scaling logits in the softmax function
- Low temperature (T < 1.0): Makes the distribution more peaked around the most likely tokens, producing more deterministic, focused outputs
- High temperature (T > 1.0): Flattens the distribution, giving less probable tokens higher chances, increasing creativity and diversity
- Temperature = 1.0: Equivalent to standard softmax with no modification
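A quick numerical illustration of this effect, using made-up logits for a three-token vocabulary:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

logits = np.array([2.0, 1.0, 0.1])      # illustrative logits for three tokens
for T in (0.5, 1.0, 2.0):
    print(T, np.round(softmax(logits / T), 3))
# T=0.5 -> [0.864 0.117 0.019]  (sharply peaked, nearly greedy)
# T=1.0 -> [0.659 0.242 0.099]  (unmodified softmax)
# T=2.0 -> [0.502 0.304 0.194]  (much flatter, more diverse samples)
```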
Practical Guidelines:
Low sampling temperatures are recommended for tasks requiring precision and factual accuracy, such as technical writing, code generation, or question answering, while higher temperatures suit creative tasks like writing poetry or brainstorming.
However, recent empirical research indicates that changes in temperature in the range 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks, contrary to anecdotal reports. This finding appears to hold regardless of the LLM, prompt-engineering technique, or problem domain tested.
The Creativity-Hallucination Trade-off:
Higher temperatures increase creativity but also raise the probability of hallucinations. Temperature sampling often comes at the cost of lower task accuracy than deterministic decoding, while deterministic approaches tend to reduce the diversity of generated outputs.
Advanced Decoding: Selective Sampling
Recent research has introduced selective sampling, which dynamically switches between greedy and high-temperature sampling based on a "sampling risk metric" that estimates the likelihood of errors when applying high-temperature sampling at specific token positions. This approach enhances the quality-diversity trade-off even in high-temperature settings.
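To make the idea concrete, here is a toy sketch that switches between greedy and high-temperature sampling based on the entropy of the next-token distribution; the entropy threshold is a simplified stand-in heuristic, not the risk metric proposed in that research.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    return np.exp(z) / np.exp(z).sum()

def select_next_token(logits, high_temp=1.5, entropy_threshold=1.0, rng=None):
    """Greedy where the model is confident; high-temperature sampling where it is not."""
    rng = rng or np.random.default_rng()
    base = softmax(logits)                          # unscaled (T = 1) distribution
    entropy = -np.sum(base * np.log(base + 1e-12))  # low entropy = confident position
    if entropy < entropy_threshold:
        return int(np.argmax(logits))               # play it safe: greedy
    hot = softmax(logits / high_temp)               # ambiguous position: explore
    return int(rng.choice(len(hot), p=hot))
```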
Hallucination: The Persistent Challenge
Hallucination occurs when LLMs generate text that is non-factual, ungrounded, or contradicts provided information. This remains one of the most critical challenges in deploying LLMs in real-world applications.
Understanding Different Types of Hallucinations
- Factuality Hallucinations: Generating information that contradicts known facts
- Faithfulness Hallucinations: Producing outputs that contradict the provided context or retrieved documents
- Intrinsic Hallucinations: Contradicting the source material directly
- Extrinsic Hallucinations: Adding information not present in the source
Causes of Hallucinations
Hallucinations in LLMs stem from various sources including limitations within retrieval-augmented generation (RAG) components, such as data source problems, query issues, retriever limitations, context noise, context conflicts, and model capability boundaries.
Knowledge boundaries also play a role: when faced with tasks beyond the scope of their training data, LLMs fall back on learned patterns and generate responses that are inconsistent with the facts.
Reducing Hallucinations: Retrieval-Augmented Generation
Retrieval-Augmented Generation (RAG) has emerged as the primary technique for mitigating hallucinations by augmenting LLMs with external, authoritative knowledge.
How RAG Works
RAG enhances large language models by incorporating an information-retrieval mechanism that allows models to access and utilize additional data beyond their original training set. The process typically involves:
- Indexing: Converting documents into embeddings and storing them in a vector database
- Retrieval: Selecting the most relevant documents for a given query
- Augmentation: Injecting retrieved information into the LLM prompt
- Generation: Producing a response grounded in the retrieved context
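A minimal end-to-end sketch of these four steps, assuming document embeddings have already been indexed; `embed` and `generate` are hypothetical placeholders for an embedding model and an LLM call, not a particular library's API.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def rag_answer(query, docs, doc_embeddings, embed, generate, top_k=3):
    q = embed(query)                                     # embed the query
    scores = [cosine(q, d) for d in doc_embeddings]      # retrieval: score every indexed chunk
    best = np.argsort(scores)[-top_k:][::-1]             # keep the top-k most relevant
    context = "\n\n".join(docs[i] for i in best)         # augmentation: build the grounded prompt
    prompt = (
        "Answer using only the context below. If the answer is not in the context, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)                              # generation: grounded response
```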
The Reality: RAG Doesn’t Eliminate Hallucinations
According to multiple sources, RAG does not prevent hallucinations in LLMs—it is not a direct solution because the LLM can still hallucinate around the source material in its response.
Key RAG Limitations:
Context Misinterpretation: LLMs may extract statements from a source without considering context, resulting in incorrect conclusions. For example, an LLM might retrieve information from an academic book rhetorically titled "Barack Hussein Obama: America’s First Muslim President?" and generate the false statement that Obama was Muslim, failing to understand the rhetorical nature of the title.
Incomplete Knowledge Extraction: Even with RAG, incomplete knowledge extraction and insufficient understanding can still mislead LLMs to produce irrelevant or contradictory responses
Retrieval Quality Issues: Naive RAG implementations suffer from low precision (misaligned chunks), low recall (failure to retrieve all relevant chunks), and outdated information
Advanced RAG Techniques
Improving Retrieval:
- Query decomposition and rewriting
- Hypothetical Document Embeddings (HyDE): Generating hypothetical answers and using them for retrieval (see the sketch after this list)
- Hybrid search combining dense and sparse retrieval
- Re-ranking retrieved documents
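A sketch of the HyDE idea mentioned above: embed a model-written hypothetical answer instead of the raw query, on the intuition that answer-shaped text lands closer to relevant passages in embedding space. As before, `embed` and `generate` are hypothetical placeholders.

```python
import numpy as np

def hyde_retrieve(query, doc_embeddings, embed, generate, top_k=3):
    # Ask the LLM for a plausible (possibly wrong) answer, then search with it.
    hypothetical = generate(f"Write a short passage that answers: {query}")
    q = embed(hypothetical)
    doc_matrix = np.asarray(doc_embeddings)
    scores = doc_matrix @ q / (np.linalg.norm(doc_matrix, axis=1) * np.linalg.norm(q))
    return np.argsort(scores)[-top_k:][::-1]   # indices of the best-matching documents
```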
Improving Generation:
- Fine-tuning models on (prompt, context, response) triples
- Context-aware decoding that upweights tokens matching retrieved context
- Chain-of-Verification to reduce hallucinations
Groundedness and Attribution: Ensuring Trustworthy Outputs
Beyond reducing hallucinations, ensuring outputs are properly grounded and attributed to sources has become critical for building trust in LLM applications.
Defining Groundedness
Groundedness means generated text is supported by and can be attributed to specific documents or sources. While "groundedness" seeks attribution to a user-specific knowledge base, "factuality" seeks attribution to commonly agreed world knowledge.
Attributional Grounding
The research community has embraced attributional grounding, where systems must output documents that ground their answers. This approach increases transparency and allows users to verify claims.
Key Approaches:
- In-Context Learning with Citations: Prompting LLMs to generate responses with inline citations
- Post-Hoc Attribution: Using Natural Language Inference (NLI) models to add citations after generation
- Training for Attribution: Fine-tuning models to generate grounded responses with citations
The TRUE Model and NLI-Based Verification
The TRUE model (based on Natural Language Inference) is widely used for measuring groundedness by judging whether a claim is supported by a passage. This automated approach has become standard for evaluating whether LLM outputs are properly grounded.
How NLI Models Work for Attribution:
- Given a passage (premise) and a claim (hypothesis), NLI models output probabilities for entailment, contradiction, or neutral relationships
- For each generated sentence, the NLI model identifies which passage supports it
- Citations are added only to sentences with supporting passages
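A minimal sketch of this post-hoc citation step, assuming the Hugging Face transformers library and a publicly available NLI checkpoint; the model name, label order, and entailment threshold below are assumptions worth verifying for your setup.

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

NLI_MODEL = "roberta-large-mnli"  # assumed checkpoint; other NLI models work similarly
tokenizer = AutoTokenizer.from_pretrained(NLI_MODEL)
nli = AutoModelForSequenceClassification.from_pretrained(NLI_MODEL)

def entailment_prob(premise: str, hypothesis: str) -> float:
    """Probability that the passage (premise) entails the generated claim (hypothesis)."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = nli(**inputs).logits.softmax(dim=-1)[0]
    return probs[-1].item()  # assumed label order: contradiction, neutral, entailment

def add_citations(sentences, passages, threshold=0.8):
    cited = []
    for sent in sentences:
        scores = [entailment_prob(p, sent) for p in passages]
        best = max(range(len(passages)), key=lambda i: scores[i])
        if scores[best] >= threshold:
            cited.append(f"{sent} [{best + 1}]")  # cite the best supporting passage
        else:
            cited.append(sent)                    # leave unsupported sentences uncited (or flag them)
    return cited
```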
Training LLMs to Generate Citations
Recent frameworks like AGREE fine-tune LLMs to self-ground their claims and provide precise citations to retrieved documents. The process involves:
- Sampling responses from a base LLM without citations
- Using an NLI model to automatically add citations to well-grounded sentences
- Fine-tuning the LLM on these augmented responses
- Implementing test-time adaptation to iteratively refine outputs
Results show that tuning-based approaches lead to substantially better grounding than prompting or post-hoc methods, often achieving relative improvements of over 30%.
Challenges in Long-Context Attribution
As LLMs handle increasingly longer contexts (100K+ tokens), citation becomes more challenging. Recent work on long-context citation focuses on:
- Generating fine-grained citations to specific snippets rather than entire documents
- Evaluating both citation quality (precision and recall) and answer correctness
- Developing benchmarks that test attribution across various long-context tasks
Detecting Hallucinations in RAG Systems
Even with RAG, detecting when models hallucinate remains crucial. Recent approaches include:
Mechanistic Interpretability: Research has discovered that hallucinations in RAG occur when Knowledge FFNs in LLMs overemphasize parametric knowledge while Copying Heads fail to effectively integrate external knowledge from retrieved content.
Layer-wise Relevance Propagation: Computing relevance between inputs and outputs to identify when generated content lacks proper grounding in retrieved documents.
Verification Guardrails: Using NLI models at inference time to verify that each claim in the response is supported by retrieved context, catching hallucinations before they reach users.
Best Practices for Production Systems
When deploying LLMs in production environments:
1. Choose the Right Approach: Use prompting for general tasks and training for domain-specific applications where data is available.
2. Implement RAG Thoughtfully: Don't treat RAG as a silver bullet; combine it with proper evaluation, guardrails, and monitoring.
3. Control Decoding Parameters: Set temperature based on your use case (low for factual tasks, higher for creative applications), but test empirically.
4. Add Citation Support: Train or prompt models to provide citations, making it easier to verify and trust outputs.
5. Implement Verification: Use NLI models or other verification mechanisms to catch hallucinations before they reach users.
6. Monitor Groundedness: Continuously evaluate whether model outputs remain grounded in your knowledge base.
7. Plan for Adaptation: Budget for continued training or fine-tuning as your domain evolves and new data becomes available.
The Future of Training, Decoding, and Grounding
The field continues to evolve rapidly:
- Efficient Training: Techniques like LoRA and adapter methods make domain adaptation increasingly accessible
- Dynamic Decoding: Selective sampling and context-aware decoding promise better quality-diversity trade-offs
- Integrated Verification: Models are increasingly being trained with built-in verification and attribution capabilities
- Reasoning Models: Systems like OpenAI’s o1 and DeepSeek-R1 incorporate explicit reasoning steps, potentially reducing hallucinations
- Multi-Step RAG: Advanced systems integrate retrieval into reasoning chains, dynamically acquiring evidence as needed
Understanding the interplay between training, decoding, and hallucination is essential for building reliable LLM applications:
- Training provides the foundation, but requires careful consideration of catastrophic forgetting and computational costs
- Decoding strategies affect the creativity-accuracy trade-off, with temperature being less impactful than commonly believed for problem-solving tasks
- Hallucinations remain a fundamental challenge that RAG helps but doesn’t eliminate
- Groundedness and attribution are critical for building trust, requiring intentional design through training, prompting, or post-hoc methods
As LLMs become more integrated into critical applications, addressing these challenges through robust training, careful decoding choices, effective RAG implementation, and strong verification mechanisms will be essential for success.