AI Safety at the Frontier: Paper Highlights of January 2026

Published on February 3, 2026 6:56 PM GMT

tl;dr

Papers of the month:

Activation probes achieve production-ready jailbreak robustness at orders-of-magnitude lower cost than LLM classifiers, with probe-first cascades now deployed at both Anthropic and Google DeepMind.
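
As a rough illustration of the probe-first cascade pattern: a cheap linear probe on a single layer's residual-stream activation scores every request, and only the uncertain middle band is escalated to the expensive LLM classifier. Everything below (the layer choice, probe weights, thresholds, and classifier stub) is a hypothetical sketch, not either lab's deployed system.

```python
import numpy as np

# Hypothetical sketch of a probe-first moderation cascade.
# The linear probe costs one dot product per request; the LLM
# classifier is expensive, so it only sees requests the probe
# is unsure about.

RNG = np.random.default_rng(0)

# Stand-in for a trained probe: weights w and bias b over a
# d-dimensional activation vector (in practice these would be
# trained on labeled jailbreak attempts).
D = 4096
w = RNG.normal(size=D) / np.sqrt(D)
b = 0.0

def probe_score(activation: np.ndarray) -> float:
    """Probability-like jailbreak score from a linear probe."""
    logit = float(activation @ w + b)
    return 1.0 / (1.0 + np.exp(-logit))

def llm_classifier(prompt: str) -> bool:
    """Placeholder for an expensive LLM-based classifier call."""
    raise NotImplementedError("call your moderation model here")

def cascade(prompt: str, activation: np.ndarray,
            lo: float = 0.05, hi: float = 0.95) -> bool:
    """Return True if the request should be blocked.

    Confident probe scores are decided immediately; only the
    uncertain band pays for an LLM classifier call.
    """
    p = probe_score(activation)
    if p >= hi:
        return True                # confidently flagged: block cheaply
    if p <= lo:
        return False               # confidently benign: allow cheaply
    return llm_classifier(prompt)  # uncertain: escalate
```

Most of the cost savings come from the width of the uncertain band: if the probe is confident on, say, 98% of traffic, the LLM classifier runs on only the remaining 2%.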

Research highlights:

  • Fine-tuning open-weight models on benign outputs from safeguarded frontier models recovers up to 71% of the harmful-capability gap, highlighting ecosystem-level misuse risks.
  • Token-level pretraining data filtering improves on document-level approaches; a minimal sketch of the idea follows this list.
  • AI discourse in pretraining data increases misalignment, whereas adding synthetic documents about positive alignment in midtraining reduces misaligned behavior from 45% …
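
The token-level filtering idea in the second highlight can be sketched in a few lines: instead of dropping a whole document whose classifier score exceeds a threshold, mask the pretraining loss only on the flagged tokens. The per-token scoring function and threshold below are hypothetical placeholders, not the paper's actual pipeline.

```python
import torch
import torch.nn.functional as F

def filtered_lm_loss(logits: torch.Tensor,
                     tokens: torch.Tensor,
                     token_scores: torch.Tensor,
                     threshold: float = 0.9) -> torch.Tensor:
    """Next-token loss with per-token filtering.

    logits:       (seq, vocab) model outputs
    tokens:       (seq,) input token ids
    token_scores: (seq,) per-token "undesirable content" scores in
                  [0, 1] from some upstream classifier (a placeholder)

    Document-level filtering would drop the whole sequence if, e.g.,
    token_scores.mean() > threshold; token-level filtering keeps the
    document and zeroes the loss on flagged positions only.
    """
    # Predict token t+1 from position t.
    per_token = F.cross_entropy(logits[:-1], tokens[1:], reduction="none")
    keep = (token_scores[1:] <= threshold).float()  # 1 = train on token
    # Guard against division by zero if every token is flagged.
    return (per_token * keep).sum() / keep.sum().clamp(min=1.0)
```

The design difference is that the document stays in the training mix, so the benign text surrounding a flagged span still contributes gradient rather than being discarded wholesale along with the document.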
