Investigating CoT Monitorability in Large Reasoning Models
arxiv.org

Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable performance on complex tasks by engaging in extended reasoning before producing final answers. Beyond improving capabilities, these detailed reasoning traces also create a new opportunity for AI safety, CoT Monitorability: detecting potential model misbehavior, such as the use of shortcuts or sycophancy, by inspecting the model's chain-of-thought (CoT) during decision-making. However, two fundamental challenges arise when attempting to build more effective monitors through CoT analysis. First, as prior research on CoT faithfulness has pointed out, mod…