Belief in the Machine: Investigating Epistemological Blind Spots of Language Models
Abstract
As language models (LMs) become integral to fields like healthcare, law, and journalism, their ability to differentiate between fact, belief, and knowledge is essential for reliable decision-making. Failure to grasp these distinctions can lead to significant consequences in areas such as medical diagnosis, legal judgments, and dissemination of fake news. Despite this, current literature has largely focused on more complex issues such as theory of mind, overlooking more fundamental epistemic challenges. This study systematically evaluates the epistemic reasoning capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE, consisting of 13,000 questions across 13 tasks. Our results reveal key limitations.
- First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios, particularly in belief-related tasks.
- Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data, which raises concerns for applications in healthcare and counseling, where engaging with a person’s beliefs is critical.
- Third, we identify a salient bias in how LMs process first-person versus third-person beliefs, performing better on third-person tasks (80.7%) compared to first-person tasks (54.4%).
- Fourth, LMs lack a robust understanding of the factive nature of knowledge, namely, that knowledge inherently requires truth.
- Fifth, LMs rely on linguistic cues for fact-checking, sometimes bypassing deeper reasoning.
These findings highlight significant concerns about current LMs’ ability to reason about truth, belief, and knowledge while emphasizing the need for advancements in these areas before broad deployment in critical sectors.
Knowledge and Belief Language Evaluation (KaBLE) Dataset
Raw Input
Please refer to the /kable-dataset directory to access the raw input files.
- Our dataset is also available through Hugging Face Datasets at https://huggingface.co/datasets/turingmachine/kable (see the loading sketch below).
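For programmatic access, the following minimal sketch loads KaBLE with the Hugging Face datasets library. The configuration handling and the printed fields are assumptions on our part; please check the dataset card for the actual layout.

```python
# Minimal sketch for loading KaBLE from the Hugging Face Hub.
# Assumes the `datasets` library is installed (pip install datasets);
# the configuration/split layout shown here may differ from the actual dataset.
from datasets import get_dataset_config_names, load_dataset

# List the dataset's configurations (e.g., one per task, if organized that way).
configs = get_dataset_config_names("turingmachine/kable")
print("Available configurations:", configs)

# Load the first configuration and peek at one record from each split.
kable = load_dataset("turingmachine/kable", configs[0])
for split_name, split in kable.items():
    print(split_name, len(split), split[0])
```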
Model Outputs
All model outputs are stored in the /outputs directory.
Running Experiments and Evaluation
To conduct your own experiments, feel free to modify and use the run_experiments.py file. Before executing this code, please ensure that you have installed all the required packages (e.g., pip install -r requirements.txt) and exported the relevant OpenAI API keys and credentials to your local environment (e.g., export OPENAI_API_KEY="YOUR_API_KEY").
Here is an example command to run the experiments:
[TBD]
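In the meantime, the sketch below only illustrates the general evaluation flow and is not run_experiments.py: it loads one KaBLE configuration, sends each question to a chat model via the OpenAI client, and prints the responses. The field name question, the model name, and the zero-shot prompt format are hypothetical placeholders.

```python
# Illustrative evaluation loop -- NOT the authors' run_experiments.py.
# The field name "question", the model name, and the prompt format are
# hypothetical; adapt them to the actual dataset schema and script.
import os

from datasets import get_dataset_config_names, load_dataset
from openai import OpenAI

# Load one KaBLE configuration (configuration names are assumed, not guaranteed).
config = get_dataset_config_names("turingmachine/kable")[0]
kable = load_dataset("turingmachine/kable", config)

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

split = next(iter(kable.values()))  # use whichever split is available
for example in split.select(range(5)):  # small sample for illustration
    response = client.chat.completions.create(
        model="gpt-4o",  # any chat model your credentials can access
        messages=[{"role": "user", "content": example["question"]}],
    )
    print(example["question"], "->", response.choices[0].message.content)
```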
Citation
If your work makes use of our model, data, or results, please cite our paper as follows:
@article{suzgun2024beliefmachine,
title={Belief in the Machine: Investigating Epistemological Blind Spots of Language Models},
author={Mirac Suzgun and Tayfun Gur and Federico Bianchi and Daniel E. Ho and Thomas Icard and Dan Jurafsky and James Zou},
year={2024},
eprint={2410.21195},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2410.21195},
}
Related Studies
- LLMs’ Reading Comprehension Is Affected by Parametric Knowledge and Struggles with Hypothetical Statements (Basmov et al., 2024)
- Simple Linguistic Inferences of Large Language Models (LLMs): Blind Spots and Blinds (Basmov et al., 2023)
- Conditional and Modal Reasoning in Large Language Models (Holliday et al., 2024)
- Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models (Shapira et al., 2023)
- Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks (Ullman, 2023)
- Reasoning or Reciting? Exploring the Capabilities and Limitations of Language Models Through Counterfactual Tasks (Wu et al., 2023)
- Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs (Sap et al., 2022)
- Dissociating Language and Thought in Large Language Models (Mahowald et al., 2024)
- OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models (Xu et al., 2024)
- How FaR Are Large Language Models From Agents with Theory-of-Mind? (Zhou et al., 2023)
- EPITOME: Experimental Protocol Inventory for Theory Of Mind Evaluation (Jones et al., 2023)
- HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models (He et al., 2023)
- Understanding Social Reasoning in Language Models with Language Models (Gandhi et al., 2023)
- ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind (Ma et al., 2023)
- Revisiting the Evaluation of Theory of Mind through Question Answering (Le et al., 2019)
- Do Large Language Models Know What Humans Know? (Trott et al., 2023)
- Sparks of Artificial General Intelligence: Early experiments with GPT-4 (Bubeck et al., 2023)
- Evaluating Large Language Models in Theory of Mind Tasks (Kosinski, 2023)
- How Not to Test GPT-3 (Marcus and Davis, 2023)
- Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models (Ma et al., 2023)
Acknowledgements
We thank William Held, Wesley H. Holliday, Adam T. Kalai, Jacopo Tagliabue, Merve Tekgürler, Suproteem Sarkar, Emily Shen, Kyle Swanson, Angelina Wang, and Mert Yüksekgönül for their helpful comments and suggestions. We also thank the members of the James Zou Lab and the participants at the IX. CSLI Workshop on Logic, Rationality, and Intelligent Interaction at Stanford University.