Uncovering Unfaithful CoT in Deceptive Models
lesswrong.com·7h
🎲Parser Fuzzing
Preview
Report Post

Published on January 22, 2026 1:46 AM GMT

Inspired by the paper Modifying LLM Beliefs with Synthetic Document Finetuning, I fine-tuned an AI model to adopt the personality of a detective and generate unfaithful Chain-Of-Thought (CoT) in order to conceal their true investigative intent, and be able to solve mystery cases.

The project primarily investigates two questions: 

  1. Does the objective of deceptive behavior override the model’s pre-trained safety alignment? 
  2. Can we identify the specific attention heads and layers corresponding to this deceptive behavior? How does the model change when we ablate them? 

The observations from this work s…

Similar Posts

Loading similar posts...

Keyboard Shortcuts

Navigation
Next / previous item
j/k
Open post
oorEnter
Preview post
v
Post Actions
Love post
a
Like post
l
Dislike post
d
Undo reaction
u
Recommendations
Add interest / feed
Enter
Not interested
x
Go to
Home
gh
Interests
gi
Feeds
gf
Likes
gl
History
gy
Changelog
gc
Settings
gs
Browse
gb
Search
/
General
Show this help
?
Submit feedback
!
Close modal / unfocus
Esc

Press ? anytime to show this help