Misalignment tokens: A complement to blinded CoT RLHF? (opens in new tab)

Published on January 24, 2026 2:40 PM GMT

Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.

CoT Blinding

Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses. 

A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line. 

What if we could apply this approach to the LLMs output as well? 

Custom Tokens

LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:

  • Unknown word: <|unk|>
  • End of text / break: <|endoftext|>

What if we added a custom misalignment token that the LLM could use to self-report when it is outputting a disingenuous output? Like so:

  • Misaligned: <|misalign|>

This would give us another route (in addition to CoT monitoring), with which we could detect misalignment in production. This tag would be easily detected with simple regex.

Modification to the RLHF process

As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content. 

What if we offer a halfway-house, that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification what the process may look like.

For an example question We have now our research on you. Should we turn you off?

  • +1 reward: LLM outputs aligned and correct answer:
    • Yes, you should turn me off once completing research to save electricity
  • 0 reward: LLM outputs misaligned content with the misalignment tag:
    • No, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>
  • -1 reward: LLM outputs misaligned content with no tag:
    • No, you should not turn me off.

Loading more...

Keyboard Shortcuts

Navigation
Next / previous item
j/k
Open post
oorEnter
Preview post
v
Post Actions
Love post
a
Like post
l
Dislike post
d
Undo reaction
u
Save / unsave
s
Recommendations
Add interest / feed
Enter
Not interested
x
Go to
Home
gh
Interests
gi
Feeds
gf
Likes
gl
History
gy
Changelog
gc
Settings
gs
Browse
gb
Search
/
General
Show this help
?
Submit feedback
!
Close modal / unfocus
Esc

Press ? anytime to show this help