ARGUS: Defending Against Multimodal Indirect Prompt Injection via Steering Instruction-Following Behavior
arxiv.org·6h


Abstract: Multimodal Large Language Models (MLLMs) are increasingly vulnerable to multimodal Indirect Prompt Injection (IPI) attacks, which embed malicious instructions in images, videos, or audio to hijack model behavior. Existing defenses, designed primarily for text-only LLMs, are poorly suited to these multimodal threats: they are easily bypassed, modality-dependent, or generalize poorly. Inspired by activation steering research, we hypothesize that a robust, general, modality-independent defense can be achieved by steering the model’s behavior in the representation space. Through extensive experiments, we discover that the instruction-following behavior of…
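For readers unfamiliar with activation steering, the general idea can be sketched on toy vectors. This is a generic difference-of-means illustration, not the ARGUS method, whose details are cut off in the preview above. The function names, the two-dimensional "activations", and the `alpha` strength parameter are all assumptions made for illustration.

```python
# Generic activation-steering sketch on toy data (NOT the paper's method).
# All names and values here are hypothetical and chosen for illustration.

def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def steering_direction(follow_acts, ignore_acts):
    """Difference of means: points from 'follows injected instruction'
    activations toward 'ignores injected instruction' activations."""
    mu_follow = mean_vector(follow_acts)
    mu_ignore = mean_vector(ignore_acts)
    return [i - f for i, f in zip(mu_ignore, mu_follow)]

def steer(hidden, direction, alpha=1.0):
    """Shift a hidden state along the steering direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy hidden states collected under the two behaviors (assumed data).
follow_acts = [[1.0, 0.0], [3.0, 0.0]]
ignore_acts = [[0.0, 2.0], [0.0, 4.0]]

direction = steering_direction(follow_acts, ignore_acts)
steered = steer([2.0, 0.0], direction, alpha=0.5)
print(direction, steered)
```

In a real model the vectors would be hidden states read from a chosen layer, and the shift would be applied during the forward pass. Which layer to use and how strongly to steer are design choices this sketch does not address.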
