Inverse Scaling in Test-Time Compute
lesswrong.com·11h
The AI Safety Puzzle Everyone Avoids: How To Measure Impact, Not Intent. by Patrick O’Donnell
greaterwrong.com·14h
Simply reverse engineering gpt2-small
lesswrong.com·18h
LLMs Encode Harmfulness and Refusal Separately by Jiachen Zhao
greaterwrong.com·14h
From Messy Shelves to Master Librarians: Toy-Model Exploration of Block-Diagonal Geometry in LM Activations
lesswrong.com·3d
Trusted monitoring, but with deception probes.
lesswrong.com·4h
Unfaithful chain-of-thought as nudged reasoning
lesswrong.com·11h
How ChatGPT Claude interprets my blog
thomasrigby.com·14h
TT Self Study Journal # 3
lesswrong.com·5h
AIOps - A Multifaceted Challenge
blog.raymond.burkholder.net·1d