Open-weight training practices and implications for CoT monitorability
lesswrong.com·6h

Published on November 4, 2025 10:49 AM GMT

Introduction

Current reasoning models have surprisingly monitorable chains-of-thought: they struggle to control their CoT unless direct optimization pressure is applied during training (especially when CoT reasoning is necessary for task completion), and they have difficulty reasoning in all but the simplest ciphers.

This seems promising, but oh so fragile.

There are a handful of reasons for this fragility, which Korbak and friends outline well. Length control may force models to encode information in weird and dense ways that are difficult for hu…
