alignment.anthropic.com

Teaching Claude Why

Covered by 4 sources including Ars Technica and lesswrong.com. Discussed on Hacker News.

Covered in 9 articles

Anthropic blames dystopian sci-fi for training AI models to act “evil”

Discussed on Hacker News

lesswrong.com

Synthetic document finetuning for instilling positive traits

lesswrong.com

Coverage-driven alignment: What ‘Teaching Claude Why’ can borrow from AV verification

lesswrong.com

Developmental Cognitive Interpretability: A Research Agenda for Modelling Generalisation and Predicting Agent Behaviour

lesswrong.com

Announcing Geodesic Research

lesswrong.com

Character-trained models can struggle to generalise

In other languages

Claude is a blackmailer because the world made it that way

AI zeroed out a benchmark and tried to blackmail an engineer. And why this is solvable

Post by @ru_vds, RUVDS.com company blog