Scalable Hierarchical Attention Transformers for Multi-Turn Jailbreak Detection in Long Conversations (opens in new tab)
Multi-turn jailbreaks can evade turn-level moderation by spreading unsafe intent across a dialogue through gradual escalation, reframing, and role manipulation. We address multi-turn jailbreak detection as a conversation-level classification problem and introduce an efficient hierarchical detector that avoids expensive long-context concatenation while retaining cross-turn reasoning. The model encodes individual turns to form compact turn repre...
Read the original article