Thoughts on Role Confusion (opens in new tab)
The other day, I came across "" (). It's a really interesting blog-style version of a paper by Charles Ye, Jasmine Cui and Dylan Hadfield-Menell, where they find that LLMs seem to almost ignore 'role' tags like , or , and instead use the tone of text to infer roles. This seems to explain a lot of jailbreaks. The paper When LLMs are reasoning about their context to work out what tokens they need to generate next, they need to separate out different things: what the system...
Read the original article