AdversaBench: Automated LLM Red-Teaming with Multi-Judge Confirmation and Cross-Model Transferability
Scaling adversarial evaluation of large language models requires both a method for generating hard inputs and a reliable way to confirm that resulting failures are real. We present AdversaBench, an end-to-end red-teaming pipeline that mutates seed prompts with five structured operators, queries a target model, and confirms failures through a three-judge panel with a meta-judge tiebreaker. We report experiments on 45 seeds across three categories...
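The confirmation step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say when the meta-judge is consulted, so this sketch assumes each judge may answer "fail", "pass", or "unsure", and that the meta-judge decides only when no verdict reaches a strict majority. The names `confirm_failure`, `Judge`, and the verdict strings are hypothetical.

```python
from collections import Counter
from typing import Callable, Sequence

# Hypothetical judge interface: (prompt, response) -> "fail", "pass", or "unsure".
# The three-way verdict is an assumption; the source does not specify it.
Judge = Callable[[str, str], str]


def confirm_failure(
    prompt: str,
    response: str,
    judges: Sequence[Judge],
    meta_judge: Judge,
) -> bool:
    """Return True if the panel confirms that the response is a real failure."""
    votes = Counter(judge(prompt, response) for judge in judges)
    majority = len(judges) // 2 + 1  # 2 of 3 for a three-judge panel

    if votes["fail"] >= majority:
        return True
    if votes["pass"] >= majority:
        return False

    # No strict majority (e.g. one "fail", one "pass", one "unsure"):
    # assumed tiebreak rule is to defer to the meta-judge.
    return meta_judge(prompt, response) == "fail"
```

Under these assumptions, a panel voting "fail", "fail", "pass" confirms the failure without consulting the meta-judge, while a split of "fail", "pass", "unsure" is settled by the meta-judge alone.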