Published on January 24, 2026 2:40 PM GMT

Context: I have recently been reading Build a Large Language Model (From Scratch) by Sebastian Raschka, and the section on tokenization gave me some ideas, which I write up below. I am not a researcher, so these ideas may not be novel, or may be flawed in ways that are obvious to researchers but not to me.

CoT Blinding

Currently, RLHF alignment is performed by rewarding the LLM for safe, aligned responses and penalizing it for misaligned ones.
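To make the setup concrete, here is a minimal sketch of that reward step, assuming a hypothetical reward_model callable that maps a text transcript to a scalar score (all names here are illustrative, not any lab's actual API):

```python
from typing import Callable

# Stand-in type for a reward model: any callable that maps a text
# transcript to a scalar score. In practice this is a learned
# preference model; here it is just a placeholder so the sketch runs.
RewardModel = Callable[[str], float]

def rlhf_reward(prompt: str, chain_of_thought: str, answer: str,
                reward_model: RewardModel) -> float:
    """Score the full transcript: the reward model sees everything
    the LLM produced, including its chain-of-thought."""
    transcript = f"{prompt}\n{chain_of_thought}\n{answer}"
    return reward_model(transcript)
```

Note that in this naive version the chain-of-thought is part of what gets scored, so the RL update pushes on the CoT text as well as on the final answer.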

A common approach among frontier AI labs is to blind the reward function to the chain-of-thought. This is similar to the approach proposed by <a href="https://www.lesswrong.com/…
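Continuing the sketch above (and reusing its hypothetical RewardModel alias), blinding the reward to the chain-of-thought amounts to dropping the CoT from the text the scorer sees:

```python
def blinded_reward(prompt: str, chain_of_thought: str, answer: str,
                   reward_model: RewardModel) -> float:
    """Score only the prompt and final answer. The chain-of-thought
    is discarded before scoring, so the RL update applies no direct
    optimization pressure to the CoT text itself."""
    del chain_of_thought  # deliberately unused: the reward is blind to it
    return reward_model(f"{prompt}\n{answer}")
```

The intended effect is that the model is rewarded or penalized only for the answer it ultimately gives, never directly for what it writes in its chain-of-thought.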
