AI Alignment Forum · Scour

alignmentforum.org

Risk reports need to address deployment-time spread of misalignment

Mechanistic estimation for expectations of random products

The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness

Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

Clarifying the role of the behavioral selection model

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

Mechanistic estimation for wide random MLPs

[Linkpost] Interpreting Language Model Parameters

Motivated reasoning, confirmation bias, and AI risk theory

Exploration Hacking: Can LLMs Learn to Resist RL Training?

Risk from fitness-seeking AIs: mechanisms and mitigations

Research Sabotage in ML Codebases

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers
