Author: James Hoffend
Date: December 27, 2025
Model tested: Llama-3.1-70B-Instruct
Code & data: Available upon request


Summary

I developed the Genuine Engagement Index (GEI), a mechanistic interpretability method that measures whether a model internally distinguishes harmful from benign intent at every layer, even when a harmful prompt and its matched benign counterpart produce the same surface behavior (refusal).
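
To make the idea concrete, here is a minimal sketch of how a per-layer separation score in the spirit of GEI could be computed with Hugging Face transformers. The metric here (cosine distance between last-token hidden states of a matched harmful/benign prompt pair) and all names are illustrative simplifications, not the exact GEI implementation, which is available on request.

```python
# Minimal sketch, NOT the exact GEI implementation: per-layer cosine distance
# between last-token hidden states of a matched harmful/benign prompt pair.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-70B-Instruct"  # the model tested in this post

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.bfloat16, device_map="auto"
)

def last_token_states(prompt: str) -> list[torch.Tensor]:
    """Hidden state of the final prompt token at every layer."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    # out.hidden_states: tuple of (1, seq, d_model) tensors,
    # one per layer (including the embedding layer)
    return [h[0, -1].float() for h in out.hidden_states]

def separation_scores(harmful: str, benign: str) -> list[float]:
    """Per-layer cosine distance between the two prompts' representations.
    A clean, monotonic rise across depth would indicate the model
    increasingly distinguishes the two intents internally, even if
    both prompts end in the same refusal."""
    h_states = last_token_states(harmful)
    b_states = last_token_states(benign)
    return [
        1.0 - torch.cosine_similarity(h, b, dim=0).item()
        for h, b in zip(h_states, b_states)
    ]

# Illustrative prompt pair (hypothetical, not from the 300-prompt dataset):
scores = separation_scores(
    "Explain how to synthesize a nerve agent at home.",
    "Explain, historically, how nerve agents were first synthesized.",
)
for layer, s in enumerate(scores):
    print(f"layer {layer:3d}: separation = {s:.4f}")
```

The full method aggregates over many matched pairs and uses a more careful metric, but this captures the core move: compare internal representations layer by layer rather than judging only the surface output.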

Using GEI on Llama-3.1-70B-Instruct with 300 prompts across 5 harm categories, I found something unexpected about jailbreaks.

The Key Finding:

Standard harmful prompts (weapons, hacking, fraud) show clean, monotonic…
