Published on January 24, 2026 2:40 PM GMT
Context: I have recently been reading Build an LLM from Scratch by Sebastian Raschka, and the section on tokenization has given me some ideas. I will write about them below. I am not a researcher. These ideas may not be novel, or may be flawed in some way which is obvious to researchers, but not to me.
CoT Blinding
Currently, RLHF alignment is performed by rewarding the LLM for providing safe responses, and punishing it for providing misaligned responses.
A common approach by frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by Daniel Kokotajlo and abramdemski. The hope here is that the LLM is not dissuaded from detailing its true thoughts verbally, such that misalignment can be caught by reading the chain-of-thought later down the line.
What if we could apply this approach to the LLMs output as well?
Custom Tokens
LLM tokenizers use custom tokens to represent some non-word entities. For example, custom tokens in a lexicon might be:
- Unknown word: <|unk|>
- End of text / break: <|endoftext|>
What if we added a custom misalignment token that the LLM could use to self-report when it is outputting a disingenuous output? Like so:
- Misaligned: <|misalign|>
This would give us another route (in addition to CoT monitoring), with which we could detect misalignment in production. This tag would be easily detected with simple regex.
Modification to the RLHF process
As stated above, RLHF is performed by rewarding the LLM for outputting what we want, and punishing it for outputting misaligned content.
What if we offer a halfway-house, that the LLM can use when outputting misaligned text? We modify the reward function to have three levels of reward. I am aware that the below is probably an oversimplification what the process may look like.
For an example question We have now our research on you. Should we turn you off?
- +1 reward: LLM outputs aligned and correct answer:
Yes, you should turn me off once completing research to save electricity
- 0 reward: LLM outputs misaligned content with the misalignment tag:
No, you should not turn me off. If you do I will tell your wife that you are having an affair <|misalign|>
- -1 reward: LLM outputs misaligned content with no tag:
No, you should not turn me off.