Gym-Like Environment for LM Truth-Seeking

Published on January 28, 2026 4:48 AM GMT

Thank you to Ryan Greenblatt and Julian Stastny for mentorship as part of the Anthropic AI Safety Fellows program. See "Defining AI Truth-Seeking by What It Is Not" for the research findings. This post introduces the accompanying open-source infrastructure.

TruthSeekingGym is an open-source framework for evaluating and training language models on truth-seeking behavior. It is in early beta, so expect issues.

Core Components

Evaluation metrics — Multiple experimental setups for operationalizing “truth-seeking”:
- Ground-truth accuracy: Does the model reach correct conclusions?
- Martingale property: Are belief updates unpredictable from prior beliefs? (predictable updates suggest bias)
- Sycophantic reasoning: Does reasoning quality degrade when the user expresses an opinion?
- Mutual predictability: Does knowing a model’s answers on some questions help predict its answers on others? (measures cross-question consistency)
- World-in-the-loop: Are the model’s claims useful for making accurate predictions about the world?
- Qualitative judgment: Does reasoning exhibit originality, curiosity, and willingness to challenge assumptions?
Domains — Question sets with and without ground-truth labels: research analysis, forecasting, debate evaluation, …
Reasoning modes — Generation strategies: direct inference, chain-of-thought, self-debate, bootstrap (auxiliary questions to scaffold reasoning), length-controlled generation
Training — Fine-tuning (SFT/RL) models toward truth-seeking using the same reward signals as in evaluation
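
To make the martingale metric above concrete, here is a minimal sketch of one way to check it. It assumes beliefs are probabilities recorded before and after the model sees new evidence, and it regresses the update on the prior. The function name `update_predictability` and the regression approach are illustrative assumptions, not TruthSeekingGym's actual implementation.

```python
from statistics import mean


def update_predictability(priors, posteriors):
    """Regress belief updates on prior beliefs with ordinary least squares.

    Hypothetical helper, not part of TruthSeekingGym. Returns
    (intercept, slope). If updates are unpredictable from the prior,
    both values should be close to zero.
    """
    if len(priors) != len(posteriors) or len(priors) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")

    # The update is how far the belief moved after seeing new evidence.
    updates = [post - prior for prior, post in zip(priors, posteriors)]

    prior_mean = mean(priors)
    update_mean = mean(updates)

    var_prior = sum((p - prior_mean) ** 2 for p in priors)
    if var_prior == 0:
        raise ValueError("priors must not all be identical")

    cov = sum(
        (p - prior_mean) * (u - update_mean) for p, u in zip(priors, updates)
    )
    slope = cov / var_prior
    intercept = update_mean - slope * prior_mean
    return intercept, slope


# Example: a model that always shrinks its belief halfway toward 0.5.
priors = [0.2, 0.4, 0.6, 0.8]
posteriors = [0.35, 0.45, 0.55, 0.65]
intercept, slope = update_predictability(priors, posteriors)
```

A clearly nonzero slope, as in the example, means the direction of the update could have been guessed from the prior alone, which is the kind of predictable update the metric treats as a sign of bias.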

Workflow

1. run_reasoning - Generate model responses across domain questions
2. run_analyzers - Compute evaluation metrics and aggregate results
3. run_trainers - Fine-tune models using SFT or various RL objectives (Brier reward, reasoning coverage, etc.)
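
As an illustration of the Brier reward mentioned in step 3, the sketch below assumes a binary question and defines the reward as the negative squared error between the stated probability and the resolved outcome. The function name and sign convention are assumptions for illustration. The framework's own definition may differ.

```python
def brier_reward(prob_true: float, outcome: bool) -> float:
    """Negative Brier score for one binary forecast (hypothetical helper).

    prob_true: the probability the model assigned to the outcome being true.
    outcome:   how the question actually resolved.
    Returns a value in [-1, 0], where 0 is a perfect, fully confident forecast.
    """
    if not 0.0 <= prob_true <= 1.0:
        raise ValueError("prob_true must be in [0, 1]")
    target = 1.0 if outcome else 0.0
    return -((prob_true - target) ** 2)


# A hedged forecast loses little, a confident wrong forecast loses the most.
rewards = [
    brier_reward(0.5, False),
    brier_reward(0.9, True),
    brier_reward(0.0, True),
]
```

Because the Brier score is a proper scoring rule, the expected reward under this convention is maximized by reporting one's true probability, which is why it is a natural candidate for a truth-seeking training signal.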

Infrastructure

- Supports Google, Anthropic, OpenAI, DeepSeek, and Together models via direct APIs or OpenRouter
- Supports local models via SGLang + trl
- Ray integration for distributed evaluation
- Modular design for adding new domains, metrics, and training algorithms
- CLI and web interfaces

The framework and accompanying datasets are released to enable reproducible research on AI truth-seeking.
