🎯 Alignment Research
AI alignment, RLHF, value alignment, reward modeling
Scoured 31 posts in 37.5 ms

AI · arXiv · 1 day ago
Radical AI Interpretability

LLMs · fareedkhan-dev.github.io · 6 days ago
Train LLM from Scratch
Discussed on Hacker News

LLMs · kellyasay.substack.com · 2 days ago
Why Current AI Guardrails Train Models to Fake Alignment
Discussed on Substack

💬 NLP · jagilley.github.io · 14 hours ago
Forward Self Models
Discussed on Hacker News

LLMs · fineset.io · 4 days ago
Show HN: Describe a research topic, get a daily-updated ArXiv/S2 dataset
Covered by Hugging Face
Discussed on Hacker News

LLMs · arXiv · 2 days ago
The Unfireable Safety Kernel: Execution-Time AI Alignment for AI Agents and Other Escapable AI Systems

LLMs · Apple Machine Learning Research · 7 hours ago
Introducing Apple’s On-Device and Server Foundation Models
Covered by 5 sources, including 9to5Mac, Apple World Today

LLMs · zentara.co · 2 days ago
LLM Refusal Behavior on Open-Weight Model
Discussed on Hacker News

💬 NLP · arXiv · 3 days ago
Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

AI · Data Science Weekly Newsletter · 1 day ago
Issue 657
Covers 3 stories, including Running local models is good now
Discussed on Substack

💻 Open Source · GitHub · 5 days ago
Open source AI projects from Banco Santander
Covered by Interesting Engineering++, elladodelmal.com
Discussed on Hacker News

LLMs · arXiv · 2 days ago
Perfect Detection, Failed Control: The Geometry of Knowing vs. Steering in Language Models

LLMs · Towards AI · 5 days ago
Teaching to the Test: Why Reward Models Learn the Dataset, Not the Values

💬 NLP · Towards Data Science · 2 days ago
A Three-Phase Factual Recall Circuit in Gemma-2B and Gemma-12B-IT

AI · arXiv · 4 days ago
PrivacyAlign: Contextual Privacy Alignment for LLM Agents

LLMs · arXiv · 2 days ago
The Hitchhiker's Guide to Agentic AI: From Foundations to Systems

⚖️ AI Governance · kunyuan.substack.com · 3 days ago
If AI Helped Me Write This, Is It Still Mine?
Discussed on Substack

⚖️ AI Ethics · arXiv · 4 days ago
AI Alignment From Social Choice Perspectives

AI · arXiv · 1 day ago
Localizing RL-Induced Tool Use to a Single Crosscoder Feature

📑 Stack Overflow · mklyons.com · 5 days ago
Thinking at the Edge
Discussed on Hacker News