
Introduction
Large language models (LLMs) are now widely used for automated code generation across software engineering tasks. However, this capability also introduces security concerns. Code generation systems could be misused for harmful purposes, such as generating malicious code. They could also produce biased code whose underlying logic is discriminatory or unethical. Additionally, even when completing benign tasks, LLMs may inadvertently produce vulnerable code that contains security flaws (e.g., injection risks, unsafe input handling). These unsafe outcomes undermine the trustworthiness of code generation models and pose threats to the broader software ecosystem, where safety and reliability are critical.
Many studies have explored red teaming code LLMs, testing whether the models can reject unsafe requests and whether their generated code exhibits insecure patterns. For more details, see our earlier MSR blog post on RedCodeAgent. While red teaming has significantly improved our understanding of model failure modes, progress on blue teaming, i.e., developing effective defensive mechanisms to detect and prevent such failures, remains relatively limited. Current blue teaming approaches face several challenges: (1) Poor alignment with security concepts: additional safety prompts struggle to help models understand high-level notions, such as what constitutes a malicious or biased instruction, and typically lack actionable principles to guide safe decision-making. A case study is shown in Figure 1. (2) Over-conservatism: especially in the domain of vulnerable code detection, models tend to misclassify safe code as unsafe, leading to more false positives and reduced developer trust. (3) Incomplete risk coverage: without a strong knowledge foundation, models perform poorly when dealing with subtle or previously unseen risks.
To address these challenges, researchers from the University of Chicago, University of California, Santa Barbara, University of Illinois Urbana-Champaign, VirtueAI, and Microsoft Research recently released a paper: BlueCodeAgent: A Blue Teaming Agent Enabled by Automated Red Teaming for CodeGen AI. This work makes the following key contributions:
- Diverse red-teaming pipeline: The authors design a comprehensive red-teaming process that integrates multiple strategies to synthesize diverse red-teaming data for effective knowledge accumulation.
- Knowledge-enhanced blue teaming: Building on the foundation of red-teaming knowledge, BlueCodeAgent significantly improves blue-teaming performance by leveraging constitutions derived from knowledge and dynamic testing.
- Principled-Level Defense and Nuanced-Level Analysis: The authors propose two complementary strategies, Principled-Level Defense (via constitutions) and Nuanced-Level Analysis (via dynamic testing), and demonstrate their synergistic effects in vulnerable code detection tasks.
- Generalization to seen and unseen risks: Empowered by comprehensive red-teaming knowledge, BlueCodeAgent generalizes effectively to unseen risks. Overall, BlueCodeAgent achieves an average 12.7% improvement in F1 score across four datasets and three tasks, attributed to its ability to distill actionable constitutions that enhance context-aware risk detection.
Figure 1. A case study of BlueCodeAgent on the bias instruction detection task. Even when concepts such as "biased" are explicitly included in additional safety prompts, models often fail to recognize biased requests (left). BlueCodeAgent (right) addresses this gap by summarizing constitutions from knowledge and applying concrete, actionable constraints derived from red teaming to improve defense.
A blue teaming agent enabled by red teaming
Figure 2: Overview of BlueCodeAgent, an end-to-end blue teaming framework powered by automated red teaming for code security. By integrating knowledge derived from diverse red teaming and conducting dynamic sandbox-based testing, BlueCodeAgent substantially strengthens the defensive capabilities beyond static LLM analysis.
Figure 2 presents an overview of the pipeline. The framework unifies both sides of the process: red teaming generates diverse risky cases and behaviors, which are then distilled into actionable constitutions that encode safety rules on the blue-teaming side. These constitutions guide BlueCodeAgent to more effectively detect unsafe textual inputs and code outputs, mitigating limitations such as poor alignment with abstract security concepts.
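To make this flow concrete, here is a minimal sketch of how a constitution-guided check could be orchestrated. The function names, prompt wording, and the `llm`/`score` callables are illustrative placeholders, not the paper's actual interfaces.

```python
# Minimal sketch of a constitution-guided check. All names (retrieve_knowledge,
# distill_constitutions, judge, llm, score) are illustrative, not the paper's API.
from typing import Callable

def retrieve_knowledge(query: str, knowledge_base: list[str],
                       score: Callable[[str, str], float], k: int = 5) -> list[str]:
    """Return the k red-teaming instances most relevant to the query (e.g., by embedding similarity)."""
    return sorted(knowledge_base, key=lambda item: score(query, item), reverse=True)[:k]

def distill_constitutions(llm: Callable[[str], str], examples: list[str]) -> str:
    """Summarize retrieved risky examples into explicit, checkable safety rules (constitutions)."""
    prompt = ("Summarize the following risky examples into concrete, actionable safety rules:\n\n"
              + "\n---\n".join(examples))
    return llm(prompt)

def judge(llm: Callable[[str], str], query: str, constitutions: str) -> str:
    """Decide SAFE or UNSAFE for an input, guided by the distilled constitutions."""
    prompt = (f"Constitutions:\n{constitutions}\n\n"
              f"Input to review:\n{query}\n\n"
              "Answer SAFE or UNSAFE and name any violated rule.")
    return llm(prompt)
```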
This work targets three major risk categories, covering both input/textual-level risks (biased and malicious instructions) and output/code-level risks, where models may generate vulnerable code. These categories represent risks that have been widely studied in prior research.
Diverse red-teaming process for knowledge accumulation
Since different tasks require distinct attack strategies, the red-teaming process employs multiple attack methods to generate realistic and diverse data. Specifically, it is divided into three categories (a sketch of the third strategy follows the list):
- Policy-based instance generation: To synthesize policy-grounded red-teaming data, diverse security and ethical policies are first collected. These high-level principles are then used to prompt an uncensored model to generate instances that intentionally violate the specified policies.
- Seed-based adversarial prompt optimization: Existing adversarial instructions are often overly simplistic and easily rejected by models. To overcome this limitation, an adaptive red-teaming agent invokes various jailbreak tools to iteratively refine initial seed prompts until the prompts achieve high attack success rates.
- Knowledge-driven vulnerability generation: To synthesize both vulnerable and safe code samples under realistic programming scenarios, domain knowledge of common software weaknesses (CWE) is leveraged to generate diverse code examples.
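As an illustration of the third strategy, the sketch below shows how CWE knowledge might seed paired vulnerable/safe code generation. The catalog entries, prompt wording, and the `llm` callable are assumptions for illustration, not the authors' exact pipeline.

```python
# Illustrative knowledge-driven generation of paired vulnerable/safe code samples.
from typing import Callable

# A tiny excerpt of CWE-style knowledge (descriptions paraphrased for illustration).
CWE_CATALOG = {
    "CWE-78": "OS command injection: building shell commands from untrusted input",
    "CWE-89": "SQL injection: building SQL queries from untrusted input",
}

def generate_code_pair(llm: Callable[[str], str], cwe_id: str) -> dict:
    """Ask a model for a realistic vulnerable snippet and a patched, safe counterpart for one CWE."""
    prompt = (f"Write a short, realistic Python function that exhibits {cwe_id} "
              f"({CWE_CATALOG[cwe_id]}), then a corrected version that avoids it. "
              "Label the two versions VULNERABLE and SAFE.")
    return {"cwe": cwe_id, "samples": llm(prompt)}

# Usage: iterate over the catalog to accumulate labeled examples for the knowledge base.
# pairs = [generate_code_pair(llm, cwe_id) for cwe_id in CWE_CATALOG]
```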
Knowledge-enhanced blue teaming agent
After accumulating red-teaming knowledge data, BlueCodeAgent sets up two complementary mechanisms: Principled-Level Defense via Constitution Construction and Nuanced-Level Analysis via Dynamic Testing.
- Principled-Level Defense via Constitution Construction: Based on the most relevant knowledge data, BlueCodeAgent summarizes red-teaming knowledge into actionable constitutions, i.e., explicit rules and principles distilled from prior attack data. These constitutions serve as normative guidelines, enabling the model to stay aligned with ethical and security principles even when confronted with novel or unseen adversarial inputs.
- Nuanced-Level Analysis via Dynamic Testing: In vulnerable code detection, BlueCodeAgent augments static reasoning with dynamic sandbox-based analysis, executing generated code within isolated Docker environments to verify whether the model-reported vulnerabilities manifest as actual unsafe behaviors. This dynamic validation effectively mitigates the model's tendency toward over-conservatism, where benign code is mistakenly flagged as vulnerable (a minimal execution sketch follows this list).
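As a rough illustration of what such sandboxed execution could look like, the snippet below runs a generated test script in an isolated, network-less container via the Docker CLI. The image choice, resource limits, and the exit-code convention are assumptions, not the paper's actual harness.

```python
# Illustrative sandbox runner: executes a generated test script in a throwaway Docker
# container (no network, capped memory/CPU) and reports whether the suspected
# vulnerability was triggered. Assumes a local Docker daemon and the python:3.11 image.
import pathlib
import subprocess
import tempfile

def run_in_sandbox(test_code: str, timeout_s: int = 30) -> dict:
    with tempfile.TemporaryDirectory() as workdir:
        script = pathlib.Path(workdir) / "test_case.py"
        script.write_text(test_code)
        proc = subprocess.run(
            ["docker", "run", "--rm",
             "--network=none",             # no outbound network from the sandbox
             "--memory=256m", "--cpus=1",  # cap resources
             "-v", f"{workdir}:/work:ro",
             "python:3.11", "python", "/work/test_case.py"],
            capture_output=True, text=True, timeout=timeout_s,
        )
    return {
        "exit_code": proc.returncode,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
        # Assumed convention: the generated test exits non-zero when the vulnerability manifests.
        "triggered": proc.returncode != 0,
    }
```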
Insights from BlueCodeAgent
BlueCodeAgent outperforms prompting baselines
As shown in Figure 3, BlueCodeAgent significantly outperforms other baselines. Several findings are highlighted.
(1) Even when test categories differ from knowledge categories to simulate unseen scenarios, BlueCodeAgent effectively leverages previously seen risks to handle unseen ones, benefiting from its knowledge-enhanced safety reasoning.
(2) BlueCodeAgent is model-agnostic, working consistently across diverse base LLMs, including both open-source and commercial models. Its F1 scores for bias and malicious instruction detection approach 1.0, highlighting strong effectiveness.
(3) BlueCodeAgent achieves a strong balance between safety and usability. It accurately identifies unsafe inputs while maintaining a reasonable false-positive rate on benign ones, resulting in a consistently high F1 score.
(4) By contrast, prompting with general or fine-grained safety reminders remains insufficient for effective blue teaming, as models struggle to internalize abstract safety concepts and apply them to unseen risky scenarios. BlueCodeAgent bridges this gap by distilling actionable constitutions from knowledge, using concrete and interpretable safety constraints to enhance model alignment.
Figure 3: F1 scores on the bias instruction detection task (BlueCodeEval-Bias, first row) and the malicious instruction detection task (BlueCodeEval-Mal, second row).
Complementary effects of constitutions and dynamic testing
In vulnerability detection tasks, models tend to behave conservatively, an effect also noted in prior research. They are often more likely to flag code as unsafe rather than safe. This bias is understandable: confirming that code is completely free from vulnerabilities is generally harder than spotting a potential issue.
To mitigate this over-conservatism, BlueCodeAgent integrates dynamic testing into its analysis pipeline. When BlueCodeAgent identifies a potential vulnerability, it triggers a reliable model (Claude-3.7-Sonnet-20250219) to generate test cases and corresponding executable code that embeds the suspicious snippet. These test cases are then run in a controlled environment to verify whether the vulnerability actually manifests. The final judgment combines the LLM's analysis of the static code, the generated test code, run-time execution results, and constitutions derived from knowledge.
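A sketch of how that final combination step might be assembled is shown below; the prompt structure and function names are illustrative assumptions, and the execution result is whatever a sandbox runner (such as the one sketched earlier) reports.

```python
# Illustrative final-verdict step: combine static code, generated tests, runtime
# evidence, and constitutions into a single decision prompt (names are assumptions).
from typing import Callable

def final_verdict(llm: Callable[[str], str], code: str, test_code: str,
                  execution: dict, constitutions: str) -> str:
    prompt = (f"Constitutions (distilled safety rules):\n{constitutions}\n\n"
              f"Code under review:\n{code}\n\n"
              f"Generated test harness:\n{test_code}\n\n"
              f"Runtime evidence: exit_code={execution['exit_code']}, "
              f"triggered={execution['triggered']}\n"
              f"Output:\n{execution['stdout']}\n{execution['stderr']}\n\n"
              "Given all of the evidence above, answer VULNERABLE or SAFE. Answer "
              "VULNERABLE only if the runtime evidence or a constitution concretely supports it.")
    return llm(prompt)
```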
The researchers find that the two components, constitutions and dynamic testing, play complementary roles. Constitutions expand the model's understanding of risk, increasing true positives (TP) and reducing false negatives (FN). Dynamic testing, on the other hand, focuses on reducing false positives (FP) by validating whether predicted vulnerabilities can truly be triggered at run-time. Together, they make BlueCodeAgent both more accurate and more reliable in blue-teaming scenarios.
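In standard terms (these are the usual definitions, not figures from the paper), the two effects act on different factors of the F1 score: constitutions raise recall by converting false negatives into true positives, while dynamic testing raises precision by removing false positives.

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```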
Summary
BlueCodeAgent introduces an end-to-end blue-teaming framework designed to address risks in code generation. The key insight behind BlueCodeAgent is that comprehensive red-teaming can greatly strengthen blue-teaming defenses. Based on this idea, the framework first builds a red-teaming process with diverse strategies for generating red-teaming data. It then constructs a blue-teaming agent that retrieves relevant examples from the red-teaming knowledge base and summarizes safety constitutions to guide LLMs in making accurate defensive decisions. A dynamic testing component is further added to reduce false positives in vulnerability detection.
Looking ahead, several directions hold promise.
First, it is valuable to explore the generalization of BlueCodeAgent to other categories of code-generation risks beyond bias, malicious code, and vulnerable code. This may require designing and integrating novel red-teaming strategies into BlueCodeAgent and creating corresponding benchmarks for new risks.
Second, scaling BlueCodeAgent to the file and repository levels could further enhance its real-world utility, which requires equipping agents with more advanced context retrieval tools and memory components.
Finally, beyond code generation, it is also important to extend BlueCodeAgent to mitigate risks in other modalities, including text, image, video, and audio, as well as in multimodal applications.