alignmentforum.org

10 posts in the last 30 days

Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)

alignmentforum.org·2d

Clarifying the role of the behavioral selection model

alignmentforum.org·3d

Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations

alignmentforum.org·6d

Mechanistic estimation for wide random MLPs

alignmentforum.org·6d

[Linkpost] Interpreting Language Model Parameters

alignmentforum.org·1w

Motivated reasoning, confirmation bias, and AI risk theory

alignmentforum.org·1w

Exploration Hacking: Can LLMs Learn to Resist RL Training?

alignmentforum.org·1w

Risk from fitness-seeking AIs: mechanisms and mitigations

alignmentforum.org·1w

Research Sabotage in ML Codebases

alignmentforum.org·2w

Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers

alignmentforum.org·2w
