AI Alignment Forum
alignmentforum.org
A community blog devoted to technical AI alignment research
5 weeks ago
Risk reports need to address deployment-time spread of misalignment
5 weeks ago
Mechanistic estimation for expectations of random products
5 weeks ago
The safe-to-dangerous shift is a fundamental problem for eval realism; but also for measuring awareness
5 weeks ago
Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
5 weeks ago
Clarifying the role of the behavioral selection model
6 weeks ago
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
6 weeks ago
Mechanistic estimation for wide random MLPs
6 weeks ago
[Linkpost] Interpreting Language Model Parameters
6 weeks ago
Motivated reasoning, confirmation bias, and AI risk theory
7 weeks ago
Exploration Hacking: Can LLMs Learn to Resist RL Training?
7 weeks ago
Risk from fitness-seeking AIs: mechanisms and mitigations
7 weeks ago
Research Sabotage in ML Codebases
7 weeks ago
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers