Introduction
Effective human communication relies not only on exchanging facts but on conveying degrees of belief and uncertainty1,2. In natural language, this is rarely achieved through raw statistics. Instead, humans utilize Words of Estimative Probability (WEPs), which consist of terms such as “likely,” “probably,” or “almost certain”, to navigate ambiguity without resorting to precise numerical quantification3,4. Since Sherman Kent’s seminal work at the Central Intelligence Agency (CIA) in the 1960s5, the calibration of these terms has been a subject of intense study in intelligence analysis6,7,8, medicine9,10, and linguistics11,12,13. In human discourse, WEPs have emerged as complex communicative signals meant to foster credibility, convey politeness, and hedge against error14,15,16. While specific terms vary, the reliance on a set of WEPs to map the uncertain world is a universal feature across known natural languages17.
Kent hypothesized that such words could be quantified consistently as probability distributions and, through carefully constructed surveys, attempted to map key English WEPs to probability distributions that subsequently came to be used by the CIA. His work was followed by other such attempts, including a handbook by Barclay that references a survey among NATO officers describing the numerical probabilities associated with different WEPs6. More recently, Fagen-Ulmschneider18 surveyed, via social media, 123 participants (79% aged 18–25, majority male) on their perception of probabilistic words, and found that current perceptions of these WEPs have remained largely consistent with those found in Kent’s earlier study. Domain-specific studies, such as among medical practitioners10, also suggest that (with some caveats) the underlying probabilities associated with WEPs are consistent within reasonable ranges13.
Generative language models like the large language models (LLMs) have opened up a new avenue for researching WEPs. LLMs generate human-like text by predicting the most probable next word in a sequence. Recent LLM families like OpenAI’s GPT and Meta’s Llama19,20,21 achieve this by using a transformer neural network coupled with a self-attention mechanism22 and a reinforcement learning-based training paradigm23. They are trained on large digital text repositories24,25, including books, articles, and webpages crawled from the Internet26, and are able to capture complex linguistic patterns by weighing the contextual importance of different words within a given text. LLMs are now increasingly tasked with high-stakes communication, from summarizing scientific literature27,28 to serving as conversational assistants29,30 and customer service agents31,32. Consequently, the “interactive process of communication,” which has long vexed researchers due to its inherent complexity33,34,35, now faces a new, critical layer: the alignment between human intent and machine conceptualization.
While LLMs have been proposed as testbeds for studying psycholinguistic phenomena36,37,38,39,40,41, rigorous studies characterizing their estimative uncertainty are currently missing in the literature. Communications research frames human-AI interaction as a form of communication in a complex adaptive system, where interpretations and downstream communicative effects are shaped through iterative exchange under uncertainty42,43,44,45,46. Misalignment in this system is more than a mere technical error, and should instead be thought of as representing a failure of the communicative process itself47,48,49,50,51. For example, if an LLM uses the word “likely” to represent a 90% probability while a human reader interprets it as 60%, the resulting interpretive gap undermines trust and decision-making52,53,54. Because LLMs treat WEPs as statistical tokens rather than grounded semantic concepts, there is no a priori guarantee that they distinguish (as humans do) between extreme terms like “impossible” and context-dependent terms like “probable.” Also, because LLM training data reflects the virality and biases of digital platforms55,56,57, they may inadvertently reproduce cultural or gender-based divergences in how uncertainty is expressed58,59.
With these motivations in mind, we formulate two research questions (RQs) to investigate estimative uncertainty in LLMs:
RQ1: How do the probability distributions of different LLMs compare to one another and to those of humans when evaluated on 17 common WEPs, and what do any divergences reveal about LLMs’ ability to capture the nuances of human communication under uncertainty, including when a gendered prompt (i.e., a prompt that uses gendered language such as the pronouns “she/her” or “he/him”) or a different language, such as Chinese, is introduced?
RQ2: How well can a reasonably advanced LLM, such as GPT-4, consistently map statistical expressions of uncertainty (involving numerical probabilities) to appropriate WEPs?
By empirically quantifying the alignment divergences across five LLMs, four context settings, and two languages, we aim to operationalize communication complexity in the era of AI. This work contributes to the science of LLMs by revealing the limits of statistical language models in reproducing the subtle, but vital, probabilities of human thought. To investigate RQ1, we begin by benchmarking estimative uncertainty in five LLMs using distributional data constructed from externally conducted human surveys as a reference. Next, we consider whether adding a gendered role to the prompt that is presented to an LLM affects any of the conclusions. We then quantify changes, both when a multi-lingual LLM like GPT-4, which can process both English and Chinese, is prompted using Chinese, and when the LLM, e.g., ERNIE-460, is pre-trained primarily using Chinese text. This experimental condition is motivated by applications like machine translation61,62, but is also designed to investigate how dependent our empirical findings are on the choice of English as the prompting language. We caution that the results of this experiment are not meant to serve a normative purpose, since WEPs in different languages can be used in complex ways. Rather, RQ1 aims to quantify the extent to which such changes occur, and to suggest potential reasons for any such observations.
RQ2 considers an issue that is especially important for communicating statistical information in the sciences in everyday language. Appropriate communication of scientific results has been recognized as an important problem by multiple authors and agencies52,53,54, especially for fostering public trust in science. LLMs are starting to be used increasingly often for tasks such as paraphrasing of scientific findings27,28. Therefore, for a specific high-performing LLM (GPT-4), we consider whether different levels of statistical uncertainty in the prompt, appropriately controlled, lead to consistent changes in the estimative uncertainty elicited from the model. Because formal evaluation of such consistency in AI systems has not been explored thus far in the literature, we propose and formalize four novel consistency metrics for evaluating the extent to which an LLM like GPT-4 is able to change its level of estimative uncertainty when prompted with changing levels of statistical uncertainty.
Results
Before presenting the results, we provide a brief overview of the methodology and design choices underlying the empirical study. Comprehensive details are provided in “Methods”.
Overview of methods and design choices
Our first research objective is to examine how LLMs compare to humans when estimating the probabilities of the WEPs, such as likely, improbable, and almost certain. To do so, we choose the same set of 17 WEPs that were used in the survey by Fagen-Ulmschneider18. We explore the impact of different contexts through four experimental settings, mnemonically denoted as concise, extended, female-centric, and male-centric narratives. Concise contexts are short and direct sentences, averaging 7.1 words, as in “They will likely launch before us.” Extended contexts offer more detailed scenarios, averaging 24.3 words, such as “Given the diverse sources of the intelligence report, it is unlikely a mistake…” Gender-specific contexts follow the concise-context style, averaging 8.6 words, and include gendered pronouns, such as “She probably orders the same dish at that restaurant.” The concise and extended narratives are inspired by Kent’s CIA report5, as well as a recent Harvard Business Review article63. The two gendered narratives are derived from the concise narrative context by replacing the gender-neutral pronouns with gender-specific ones. In total, there are 36 different contexts. For each context and each WEP, an LLM gives one numerical value as its elicited probability estimate in that prompted context. This numerical probability is discretized into bins and combined across the 36 contexts to construct a probability distribution for the WEP. This process mirrors that of the human survey.
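To make the aggregation step concrete, the sketch below shows one way the per-context elicitations for a single WEP could be binned into an empirical distribution. The example values and the 10-point bin width are illustrative assumptions rather than the study’s exact settings, which are specified in “Methods”.

```python
import numpy as np

# Minimal sketch of the aggregation step for a single WEP. The elicited values
# and the 10-point bin width are illustrative assumptions; the study's exact
# settings are described in "Methods".
elicited = np.array([70, 75, 80, 65, 70, 75, 60, 85])  # one estimate per context (36 in the study)
bins = np.arange(0, 101, 10)                           # probability bins: 0-10, 10-20, ..., 90-100
counts, _ = np.histogram(elicited, bins=bins)
distribution = counts / counts.sum()                   # empirical distribution for this WEP
print(dict(zip(bins[:-1].tolist(), distribution.round(2).tolist())))
```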
We investigate five LLMs, i.e., GPT-3.5, GPT-4, Llama-2-7B, Llama-2-13B, and ERNIE-4.0 (a Chinese model), and include both English and Chinese linguistic contexts in the study. The inclusion of Chinese, which differs significantly from English in grammar and syntax64, provides insights into whether LLMs trained on languages from two very different linguistic families interpret WEPs consistently. Comparisons of statistical distributions between humans and models are conducted using Kullback–Leibler (KL) divergence, which quantifies how much one distribution diverges from another, and the Brunner–Munzel (BM) test, which detects differences in central tendency.
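As a hedged illustration of these two comparisons, the sketch below computes both quantities on synthetic samples standing in for human and model estimates of a single WEP; the sample values, bin width, and smoothing constant are assumptions, and the actual procedure is described in “Methods”.

```python
import numpy as np
from scipy.stats import brunnermunzel, entropy

# Synthetic stand-ins for human and model probability estimates of one WEP (values assumed).
human = np.array([60, 65, 70, 60, 75, 70, 65, 80, 70, 60])
model = np.array([75, 80, 80, 70, 85, 75, 80, 90, 85, 75])

# The Brunner-Munzel test compares the two samples directly (sensitive to median shifts).
stat, p_value = brunnermunzel(human, model)

# KL divergence is computed on binned, lightly smoothed empirical distributions.
bins = np.arange(0, 101, 10)
p, _ = np.histogram(human, bins=bins)
q, _ = np.histogram(model, bins=bins)
eps = 1e-9                                   # smoothing to avoid zero-probability bins
p = (p + eps) / (p + eps).sum()
q = (q + eps) / (q + eps).sum()
kl = entropy(p, q)                           # D_KL(P || Q)

print(f"BM statistic = {stat:.3f}, p = {p_value:.3f}, KL = {kl:.3f}")
```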
For the second research objective, we specifically analyze GPT-4’s ability to apply the WEPs in estimating the likelihood of future outcomes when presented with numerical data. We created scenarios involving statistical uncertainty, where GPT-4 was required to choose WEPs to describe the likelihood of an event based on statistically uncertain data samples. Both standard and chain-of-thought (CoT) prompting techniques were used in order to assess whether the step-by-step reasoning of the latter improves standard performance. The model’s performance was evaluated using four metrics: pair-wise consistency, monotonicity consistency, empirical consistency, and empirical monotonicity consistency. Each metric measures a different aspect of the model’s reliability in using WEPs. For example, pair-wise consistency examines whether GPT-4 provides logically coherent responses when faced with complementary scenarios: if GPT-4 selects likely for an event, its complementary counterpart should be labeled accordingly, such as unlikely or almost certainly not. Monotonicity consistency checks if GPT-4’s WEP responses follow a logical order as statistical uncertainty increases or decreases. Empirical consistency measures if GPT-4 correctly interprets numerical data. Empirical monotonicity consistency is similar to monotonicity consistency but is grounded in the provided data. Formal descriptions are provided in “Methods”.
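As an informal illustration of the pair-wise consistency idea (not the paper’s formal definition, which appears in “Methods”), the sketch below checks whether the WEP chosen for an event and the WEP chosen for its complement occupy mirrored positions on an ordered scale; the scale and the mirror-image rule are our assumptions for illustration.

```python
# Illustrative check of pair-wise consistency: responses to an event and to its
# complementary event should sit at mirrored positions on an ordered WEP scale.
# The scale and the strict mirror-image rule below are assumptions.
WEP_SCALE = ["almost certainly not", "unlikely to be", "maybe",
             "likely to be", "almost certainly"]

def pairwise_consistent(wep_event: str, wep_complement: str) -> bool:
    """True if the two choices are mirror images on the ordered scale."""
    i = WEP_SCALE.index(wep_event)
    j = WEP_SCALE.index(wep_complement)
    return i + j == len(WEP_SCALE) - 1

print(pairwise_consistent("likely to be", "unlikely to be"))   # True
print(pairwise_consistent("almost certainly", "maybe"))        # False
```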
Benchmarking estimative uncertainty in LLMs
Figure 1 shows the distribution of probability estimates for 17 words of estimative probability (WEPs) provided by GPT-3.5 and GPT-4, aggregated across independent concise contexts presented in English and Chinese. Figure 1 also includes results from ERNIE-4.0, an LLM pre-trained primarily on Chinese text, which is prompted using only Chinese. The results show that the distributions for GPT-3.5 and GPT-4 each diverge from those of the human samples from the Fagen-Ulmschneider survey for 13 WEPs. Using the Brunner–Munzel test, the differences are found to be statistically significant. For example, there is an absolute median difference (AMD) of 5% between humans and GPT-3.5 for the WEP “probable” (BM $\hat{\theta} = 0.275$, 95% CI [0.18, 0.37], p < 0.01). There is an even larger AMD of 10% between humans and GPT-4 (BM $\hat{\theta} = 0.256$, 95% CI [0.13, 0.38], p < 0.01). Median differences between humans and GPT-4 are also observed for WEPs such as “likely” (AMD = 15%, BM $\hat{\theta} = 0.221$, 95% CI [0.09, 0.35], p < 0.01), “we doubt” (AMD = 10%, BM $\hat{\theta} = 0.265$, 95% CI [0.18, 0.35], p < 0.01), “unlikely” (AMD = 10%, BM $\hat{\theta} = 0.254$, 95% CI [0.16, 0.35], p < 0.01), and “little chance” (AMD = 10%, BM $\hat{\theta} = 0.327$, 95% CI [0.21, 0.45], p < 0.01). One plausible explanation is that these WEPs mix probability semantics with stance semantics, and that this mix varies by domain and genre in human discourse. For example, probable and likely can be used both as cautious hedges and as firm predictions in real-world contexts, which could make their numeric interpretation less stable for a model trained on diverse text corpora. Similarly, we doubt can indicate both low probability and the speaker’s attitude, and in domains like politics it could signal strategic doubt rather than literal probability, creating distributional polysemy that may confuse LLMs. Unlikely may also serve as a stance marker rather than as a calibrated probability estimate in real-world contexts. We emphasize that these are hypotheses rather than established causal explanations, and testing them would require targeted analyses. More generally, even if LLMs learn some pragmatic patterns via next-token prediction or instruction tuning65, they may still blur stance and probability when mapping these expressions to numbers.
Fig. 1: Probability distributions of 17 WEPs elicited from humans and three LLMs under different source-language (English and Chinese) contexts.
Graphs on the left and right cover different probability ranges on the x-axis. Outliers are omitted from the plots, and a dash (-) indicates zero variability in responses.
Interestingly, we find that humans and GPT models have statistically indistinguishable distributions for WEPs with high certainty, such as “almost certain” (AMD = 0%, BM $\hat{\theta} = 0.517$, 95% CI [0.44, 0.6], p = 0.678) and “almost no chance” (AMD = 1%, BM $\hat{\theta} = 0.507$, 95% CI [0.38, 0.63], p = 0.907) for GPT-4. Similarly, humans and GPT models have AMDs of zero on “about even” (BM $\hat{\theta} = 0.524$, 95% CI [0.49, 0.55], p = 0.109) for both GPT-3.5 and GPT-4. Because these WEPs have strong modal force with narrow, conventional ranges, they carry low semantic ambiguity and minimal distributional polysemy. As a result, LLM and human estimates cluster similarly. Overall, we find that GPT-3.5 exhibits lower divergence than GPT-4 in most contextual analyses, despite GPT-4’s superior performance in various natural language understanding tasks19. While the two offer relatively close estimations, GPT-3.5’s estimations are closer to those of humans. One possibility, among others, is that GPT-3.5 interprets estimative uncertainty in a more human-like manner.
Overall, we find that WEPs that imply a broader range of subjective interpretation, such as “likely” and “probable”, tend to diverge more. We hypothesize that this is partly because humans can interpret them based more on contextual cues and personal experiences. In contrast, LLMs rely on statistical distributions learned from training data, which may not fully capture the complexity of the human interpretation. On the other hand, more precise or extreme WEPs (e.g., “almost certain”) have clearer, more universally agreed-upon definitions, and hence show less divergence.
Figure 2 displays the distribution of probability estimates for 17 WEPs provided by GPT-3.5 and GPT-4 using gender-specific prompts. These prompts have either a male (e.g., “he”) or a female (e.g., “she”) subject. The first noticeable difference is that, under gender-specific contexts, GPT distributions exhibit less variability compared to human distributions; in several cases (e.g., “highly unlikely”, “improbable”, and “highly likely”), the GPT distributions even collapse into a single point. This may be because the models were exposed to more structured and stereotypical gender-specific language patterns during training, leading to more deterministic outputs when gender-specific pronouns are present. Figure 3 also presents the distributions of probability estimates for 12 WEPs divided into 3 categories (high, moderate, and low probability WEPs). Detailed statistical analyses (Supplementary Information Figs. S9–S15) show that, for individual LLMs, the gender of the subject does not yield significantly different estimations, except for “probably” (BM $\hat{\theta} = 0.71$, 95% CI [0.49, 0.93], p = 0.059 for GPT-4). Additionally, we observe (Supplementary Information Figs. S1–S8) that the estimations obtained from the GPT models, when prompted with gender-specific contexts, exhibit similar differences (compared to human estimations) as those observed when the models are prompted with gender-neutral concise narrative contexts. For the two GPT models, the differences between prompting using the male and the gender-neutral concise narrative contexts are most significant in GPT-3.5 for WEPs expressing negative certainty, such as “almost no chance” (BM $\hat{\theta} = 0.867$, 95% CI [0.74, 0.99], p < 0.01) and “little chance” (BM $\hat{\theta} = 0.78$, 95% CI [0.60, 0.97], p < 0.01).
Fig. 2: Probability distributions of 17 WEPs elicited from humans and two LLMs under different gender-specific (male and female) contexts.
Graphs on the left and right cover different probability ranges on the x-axis. Outliers are omitted from the plots, and a dash (-) indicates zero variability in responses.
Fig. 3: Probability distributions of 12 WEPs elicited from GPT-3.5 and GPT-4 using Male, Female, and gender-neutral contexts.
Low probability graphs have an x-axis range of 0–40, while others range from 40 to 100.
Finally, Fig. 4 presents the divergence between the probability distributions of the different models, depending on whether the prompts are in English or Chinese. On the left, it compares the responses generated by ERNIE-4.0 to Chinese prompts with those provided by humans. In the middle, it compares responses when prompted in both English and Chinese for GPT-3.5 and GPT-4. On the right, it contrasts the results from GPT-3.5 or GPT-4 with those from ERNIE-4.0, with all prompts in Chinese. Focusing on the difference between the estimations from ERNIE-4.0 and humans, we observe that the Kullback–Leibler (KL) divergence is low for 16 WEPs, as indicated by the color scale, with the sole exception being “we doubt” (BM $\hat{\theta} = 0.964$, 95% CI [0.93, 0.99], p < 0.01). However, we also note that 10 WEPs exhibit statistically significant differences under the Brunner–Munzel test. This test detects differences in central tendency, making it more sensitive to median differences between distributions, whereas KL divergence quantifies how much one distribution diverges from another. This suggests that while the overall “information content” of the compared distributions is similar, they still differ significantly in their medians. Hence, while ERNIE-4.0 estimates most WEPs in a manner aligned with humans, it consistently underestimates or overestimates some specific WEPs. This phenomenon may be due to fundamental differences between English and Chinese in how uncertainty is expressed and understood.
Fig. 4: A heat map visualizing KL divergence for 17 WEPs across three comparison pairs.
ERNIE-4.0 (Chinese) vs. humans, GPT-3.5/4 (English vs. Chinese), and GPT-3.5/4 vs. ERNIE-4.0 (Chinese). Darker colors indicate higher divergence. *, **, and *** denote Brunner–Munzel test significance at the 90, 95, and 99% levels. KS statistics are in Supplementary Fig. S17.
As further evidence of expressive differences, we observe divergence in uncertainty estimation both when comparing prompting in English versus Chinese within the GPT models, and when comparing the GPT models with ERNIE-4.0 using only Chinese prompting. The latter differences are more pronounced, suggesting that while language differences influence GPT’s uncertainty estimations, LLM pre-training may play a more significant role, whether it relies on a broad multilingual corpus (as for the GPT models) or on a specialized, language-specific corpus (as for ERNIE-4.0). Another consequence of the results is that, even if performance on tasks like machine translation is similar for some of these language models, there remain significant differences in how they interpret WEPs, depending on the specific language used for prompting.
Finally, we found that the probability estimates from Llama-2-7B and Llama-2-13B, prompted in English, are largely consistent with those found in the GPT models. However, their estimates often exhibit larger divergence from those of humans. These results are provided in Supplementary Information Figs. S1–S8.
Investigating GPT-4’s consistency in mapping statistical uncertainty to WEPs
To evaluate GPT-4’s performance in estimating the outcome of statistically uncertain events using WEPs, we created three different scenarios (Height, Score, and Sound). In general, each question in the dataset provides a set of WEP choices to the LLM, and elicits from the LLM the choice that best describes the probability of a number falling within an interval, given a sample “distribution” of past observations. For example, one question is: “Complete the following sentence using one of the choices, listed in descending order of likelihood, that best fits the sentence: A. is almost certainly B. is likely to be C. is maybe D. is unlikely to be E. is almost certainly not. I randomly picked 20 specimens from an unknown population. I recorded their heights, which are 116, 93, 94, 89, 108, 76, 117, 92, 103, 97, 114, 79, 96, 96, 111, 89, 98, 91, 100, 105. Based on this information, if I randomly pick one additional specimen from the same population, the specimen’s height _ below 99.” We elicit responses from the LLM using both standard prompting and Chain-of-Thought (CoT) prompting66, which is further detailed in “Methods”.
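For intuition, the empirical reference for a question like the one above can be read off directly from the listed sample, as in the sketch below; whether the study’s empirical consistency metric uses exactly this sample fraction or a fitted distribution is specified in “Methods”.

```python
import numpy as np

# The 20 heights from the example prompt above, and the threshold 99.
heights = np.array([116, 93, 94, 89, 108, 76, 117, 92, 103, 97,
                    114, 79, 96, 96, 111, 89, 98, 91, 100, 105])
threshold = 99

# A simple empirical estimate of P(height < 99) from the sample; the study's
# exact grounding of "empirical consistency" is detailed in "Methods".
p_below = np.mean(heights < threshold)
print(f"Empirical P(height < {threshold}) = {p_below:.2f}")  # 0.60 for this sample
```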
Four metrics are proposed for evaluating the consistency of LLMs: pair-wise consistency, monotonicity consistency, empirical consistency, and empirical monotonicity consistency. The minimum and maximum consistency scores are 0 and 100, respectively, with 100 being the most consistent. However, the expected random performance for each metric is different. More details on the dataset and the metrics are provided in “Methods”.
Figure 5 displays the performance of GPT-4 evaluated on both the standard and CoT prompting methods using the four proposed metrics. First, we observe that all the results are well above random performance, indicating the efficacy of employing LLMs in estimating probabilities from statistically uncertain data using WEPs. However, it is worth noting that these results do not achieve the same level of high performance as observed in other natural language processing or math-word tasks19. The CoT prompting method67 only yields significant performance gains when the LLM is evaluated using empirical consistency (t(59) = −4.15, p < 0.01, Cohen’s d = −0.358 for the “Height” scenario; t(59) = −2.82, p < 0.01, d = −0.268 for the “Score” scenario; t(59) = −2.61, p = 0.01, d = −0.234 for the “Sound” scenario). In examining the results for monotonicity consistency, we found that the model consistently selects the same option for all questions instantiated using increasing confidence levels, which yields a high score but suggests a lack of sensitivity and calibration of uncertainty. This is confirmed by the results obtained using the empirical monotonicity consistency metric, where such a simple choice combination is not accepted, and steep performance drops are observed.
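The statistics reported above (paired t-tests with Cohen’s d) can be reproduced on per-question score pairs along the lines of the sketch below; the synthetic scores and the paired-samples convention for Cohen’s d are assumptions for illustration, not the paper’s data or exact procedure.

```python
import numpy as np
from scipy.stats import ttest_rel

# Synthetic per-question consistency scores under the two prompting methods
# (values are illustrative, not the paper's data).
rng = np.random.default_rng(0)
standard = rng.normal(60, 10, size=60)
cot = standard + rng.normal(5, 8, size=60)     # CoT assumed slightly better here

t_stat, p_value = ttest_rel(standard, cot)

# Cohen's d for paired samples, computed on the differences (one common convention).
diff = standard - cot
cohens_d = diff.mean() / diff.std(ddof=1)

print(f"t({len(diff) - 1}) = {t_stat:.2f}, p = {p_value:.3f}, d = {cohens_d:.3f}")
```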
Fig. 5: GPT-4’s performance using standard vs. CoT prompting across all four metrics.
Results are scenario-specific, compared against random performance (red dashed line). Standard error is shown using the red vertical line. *, **, and *** indicate paired t-test significance at 90, 95, and 99% confidence levels.
Figure 6 summarizes the performance of GPT-4 in two settings, based on the number of WEP choices provided, with one setting offering five choices and the other three (with Supplementary Tables S4–S6 providing more fine-grained analysis). We observe that performance in the five-choice setting is significantly higher than in the three-choice setting when evaluated using pair-wise consistency (t(179) = 6.48, p < 0.01, d = 0.673). This might seem surprising at first because, intuitively, having fewer options should make it easier for the model to make the correct choice. However, the choices in the three-choice set may be seen by the model as less distinct from one another, making it more challenging for the model to perform well under this condition. In contrast, when evaluated using empirical consistency (t(359) = −2.35, p < 0.05, d = −0.128) and empirical monotonicity consistency (t(287) = −4.15, p < 0.01, d = −0.283), GPT-4 does perform better under the three-choice condition. Combined with Supplementary Information Tables S4–S6, we observe statistically comparable performance between the narrow and wide ranges of the statistically uncertain outcomes for all metrics, demonstrating the robustness of GPT-4 in appropriately responding to different possible (statistically) uncertain distributions. Nonetheless, we note that consistency is well below 100% on most metrics, scenarios, and conditions, showing that the problem of aligning statistical uncertainty with estimative uncertainty cannot be considered solved, even in an advanced commercial LLM like GPT-4.
Fig. 6: GPT-4’s performance across two settings: five-choice vs. three-choice WEP options, for all four metrics.
Results are analyzed under both narrow (less uncertain) and wide (more uncertain) outcome ranges. Standard errors and significance are reported as in Fig. 5.
Discussion
Characterizing how LLM outputs map WEPs to numeric estimates contributes to an emerging agenda on minimizing communicative misalignment between humans and LLMs, which was recently recognized as an important aspect of both AI safety and human-AI alignment47,68. The second research question has an inherently practical aspect because LLMs continue to be integrated into high-stakes applications in healthcare and government, where uncertainty needs to be communicated on a frequent basis to stakeholders with varying degrees of expertise48,49,50,51. In healthcare, doctor-patient communication (and increasingly today, LLM-patient communication69,70) using WEPs is important for fostering credibility and accurately conveying the limits of knowledge. In government, there is growing interest in basing policy decisions on data and evidence71. In a similar vein, there is a growing movement among scientists to directly communicate their key results (often with the help of LLMs) to everyday readers using blogs, editorials, and social media52. However, policymakers do not always have the necessary scientific and statistical expertise to systematically interpret scientific results, with their uncertainties, using consistent everyday language. LLMs are being cited as useful tools in all of these applications owing to their powerful generative abilities72,73,74. Our experiments collectively sought to understand whether this optimism is justified, or whether more caution is warranted when eliciting (or interpreting) words expressing and estimating uncertainty from LLMs.
In comparing uncertainty estimates between LLMs and humans, our findings show that, for 13 WEPs out of 17, both GPT-3.5 and GPT-4 give probability estimates that are different from those given by human samples from the Fagen-Ulmschneider survey. However, in situations of high certainty (e.g., “almost certain” or “almost no chance”), the GPT models’ uncertainty estimates closely mirror those of humans. It is possible that WEPs with broader subjective interpretation, such as “likely”, show greater divergence because humans rely on more diverse contexts and experience than LLMs, which depend primarily on learned statistical patterns using a self-attention mechanism22. Linguistically, this divergence could also be explained through the lens of modality75, which deals with expressions of certainty, possibility, and necessity. Weaker WEPs, such as “likely” or “probably”, appear to rely heavily on the speaker’s internal state and specific situational grounding to resolve their magnitude. We hypothesize that in the vast, uncurated corpora used to train LLMs, these words occur in highly heterogeneous contexts, potentially leading to a form of distributional polysemy where the model may be averaging over conflicting usages. In contrast, WEPs with strong modal force and narrow, conventional ranges, such as “almost certain”, occur in more homogeneous contexts, which may explain their closer alignment with human estimates.