Negation Neglect: When models fail to learn negations in training

This is a short summary of our new paper: arXiv, X thread, code.

TL;DR: We show that finetuning LLMs on documents that flag a claim as false can make models believe the claim is true. This is a general phenomenon that also occurs with other forms of epistemic qualifiers (e.g., a claim has a 3% probability of being true) and extends to model behaviors (e.g., warning against types of misalignment). This effect occurs in all models tested.

Authors: Harry Mayne*, Lev McKinney*, Jan Dubiński, Adam K...
