alignmentforum.org: 10 posts in the last 30 days
Empowerment, corrigibility, etc. are simple abstractions (of a messed-up ontology)
alignmentforum.org · 2d
Clarifying the role of the behavioral selection model
alignmentforum.org · 3d
Natural Language Autoencoders Produce Unsupervised Explanations of LLM Activations
alignmentforum.org · 6d
Mechanistic estimation for wide random MLPs
alignmentforum.org · 6d
[Linkpost] Interpreting Language Model Parameters
alignmentforum.org · 1w
Motivated reasoning, confirmation bias, and AI risk theory
alignmentforum.org · 1w
Exploration Hacking: Can LLMs Learn to Resist RL Training?
alignmentforum.org · 1w
Risk from fitness-seeking AIs: mechanisms and mitigations
alignmentforum.org · 1w
Research Sabotage in ML Codebases
alignmentforum.org · 2w
Recursive forecasting: Eliciting long-term forecasts from myopic fitness-seekers
alignmentforum.org · 2w