Testing Language Models: Engineering Confidence Without Certainty
gojiberries.io

Software engineers have long leaned on determinism for confidence. Given a function and a specification, we wrote unit tests, fixed the edge cases those tests revealed, and expected tomorrow to look like today. That was never fully true. Classical systems also depend on assumptions about their environment. A ranking function such as BM25 can drift as content and user behavior change. Heuristics degrade when traffic mixes evolve. Data pipelines wobble when upstream schemas or partner APIs shift. The old playbook worked best when the world stayed close to the distribution we implicitly assumed.
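The BM25 drift mentioned above is easy to see in miniature: a BM25 score is not a property of a document alone but of the document relative to corpus statistics (document frequency, corpus size, average document length), so the same document with the same term frequency can score very differently as the collection around it changes. The sketch below uses the standard BM25 term-scoring formula with the usual `k1` and `b` defaults; the specific numbers are invented for illustration.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.5, b=0.75):
    """Score one query term in one document under BM25.

    tf: term frequency in this document
    df: number of documents in the corpus containing the term
    n_docs: corpus size; avg_doc_len: mean document length in the corpus
    """
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    norm = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * norm

# The same document (tf=3, length 120) scored against two corpus snapshots.
# Between snapshots the term became common and documents got shorter on average,
# so the document's score drops even though nothing in it changed.
before = bm25_term_score(tf=3, df=10, n_docs=1_000, doc_len=120, avg_doc_len=150)
after = bm25_term_score(tf=3, df=400, n_docs=2_000, doc_len=120, avg_doc_len=90)
print(before > after)  # prints True
```

Nothing about the function is broken in the second snapshot; the environment it implicitly assumed has moved, which is exactly the kind of silent degradation the old playbook did not test for.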

Large language model applications surface the same fragility and add two structural challenges. First, non-determinism: the same input can yield different outputs. Second, unbounded inputs:…
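The first challenge, non-determinism, comes largely from how models decode: at each step the model produces a distribution over next tokens, and sampling with nonzero temperature draws from that distribution rather than taking the argmax. A minimal sketch of temperature sampling, with invented logits standing in for a model's output on one fixed prompt:

```python
import math
import random

def sample_next_token(logits, temperature=1.0, rng=random):
    """Draw a token index from temperature-scaled softmax over logits."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]

# Hypothetical next-token logits for one fixed prompt.
logits = [2.0, 1.8, 0.5]

# Repeated calls on the identical input yield different tokens.
draws = {sample_next_token(logits, rng=random.Random(seed)) for seed in range(50)}
print(len(draws) > 1)  # prints True: same input, multiple outputs

# Driving temperature toward zero concentrates mass on the argmax,
# which is why low-temperature decoding looks (nearly) deterministic.
print(sample_next_token(logits, temperature=0.01))  # prints 0
```

This is what makes exact-match unit tests brittle for LLM outputs: the assertion target is a distribution, not a value.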
