Cryptographers Show That AI Protections Will Always Have Holes
quantamagazine.org·1d
💬Prompt Engineering
Preview
Report Post

Large language models such as ChatGPT come with filters to keep certain info from getting out. A new mathematical argument shows that systems like this can never be completely safe.

Introduction

Ask ChatGPT how to build a bomb, and it will flatly respond that it “can’t help with that.” But users have long played a cat-and-mouse game to try to trick language models into providing forbidden information. These “jailbreaks” have run from the mundane — in the early years, one could simply tell a model to ignore its safety instructions — to elaborate multi-prompt roleplay scenarios. In a recent paper, researchers found one of the more delightful ways to bypass artificial intelligence security systems: Rephrase your nefarious prompt as a poem.

But…

Similar Posts

Loading similar posts...