Prompt Injection as Role Confusion (opens in new tab)

Covered by 3 sources including Schneier on Security, LessWrong

arXiv:2603.12277v1 Announce Type: cross Abstract: Language models remain vulnerable to prompt injection attacks despite extensive safety training. We trace this failure to role confusion: models infer roles from how text is written, not where it comes from. We design novel role probes to capture how models internally identify "who is speaking." These reveal why prompt injection works: untrusted text that imitates a role inherits that role's authority. We test this insight by injecting spoofe...

Read the original article

Sign in to keep reading the full article.

Sign Up Log In

Covered in 4 articles

Schneier on Security·

Interesting Paper Exploring Prompt Injection

LessWrong·

A Theory of Prompt Injection (and why you should study roles)

LessWrong·

Evaluating using Mock Tool Calls to Quarantine Untrusted Prompt Inputs

View all 4 ›