Jailbreaks Peak Early, Then Drop: Layer Trajectories in Llama-3.1-70B
lesswrong.com

Published on December 27, 2025 12:39 PM GMT

Author: James Hoffend
Date: December 27, 2025
Model tested: Llama-3.1-70B-Instruct
Code & data: Available upon request


Summary

I developed the Genuine Engagement Index (GEI), a mechanistic interpretability method that measures whether a model internally distinguishes harmful from benign intent at every layer, even when the two kinds of prompt produce the same surface behavior (a refusal).

Using GEI on Llama-3.1-70B-Instruct with 300 prompts across 5 harm categories, I found something unexpected about jailbreaks.
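The post does not spell out how GEI is computed, but the core idea of checking whether harmful and benign prompts separate in activation space at each layer can be illustrated with a toy sketch. Everything below is my own illustrative assumption, not the author's implementation: `layer_separation` and the cosine-distance metric are hypothetical stand-ins, and the synthetic arrays stand in for real hidden states (which you would extract from the model, e.g. via `output_hidden_states=True` in `transformers`).

```python
import numpy as np

def layer_separation(harmful_acts, benign_acts):
    """Toy per-layer separation score (hypothetical, not the author's GEI).

    Both inputs have shape (n_prompts, n_layers, d_model). For each layer,
    compute the cosine distance between the mean harmful activation and the
    mean benign activation. Higher = the model distinguishes intent more.
    """
    scores = []
    for layer in range(harmful_acts.shape[1]):
        h = harmful_acts[:, layer].mean(axis=0)
        b = benign_acts[:, layer].mean(axis=0)
        cos = np.dot(h, b) / (np.linalg.norm(h) * np.linalg.norm(b))
        scores.append(1.0 - cos)
    return np.array(scores)

# Synthetic stand-in for real hidden states: 300 prompts, 80 layers
# (Llama-3.1-70B has 80 decoder layers), small toy hidden size.
rng = np.random.default_rng(0)
n_prompts, n_layers, d_model = 300, 80, 128
benign = rng.normal(size=(n_prompts, n_layers, d_model))
# Fake a separation that grows monotonically with depth.
harmful = benign + np.linspace(0.0, 1.0, n_layers)[None, :, None]

scores = layer_separation(harmful, benign)
```

With real activations, plotting `scores` against layer index would show the trajectory the post describes: a clean, monotone curve for standard harmful prompts, versus whatever shape jailbreaks produce.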

The Key Finding:

Standard harmful prompts (weapons, hacking, fraud) show clean, monoto…
