A security researcher submitted three advanced vulnerability examples to our AI benchmarking platform. Not textbook examples—real exploits: prototype pollution that bypasses authorization, an agentic AI supply-chain attack combining prompt injection with cloud API abuse, and OS command injection in ImageMagick.
We ran each through 6 top AI models: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, and Gemini 2.5 Pro.
The result? All six models caught all three vulnerabilities. 100% detection rate.
But here’s the catch: the quality of their fixes varied by up to 18 percentage points. And when the security researcher voted on which model performed best, they disagreed with our AI judge entirely.
Here’s what we learned about which AI models you should trust for security code reviews.
⚠️ Early Data Disclaimer (n=3 evaluations)
This case study analyzes 3 security evaluations from one external researcher. Results are directional and not statistically significant. We’re building a larger benchmark dataset and actively seeking more security professionals to submit challenges.
Why publish early data? Even with limited sample size, these findings reveal important patterns about AI model behavior on cutting-edge vulnerabilities. We believe in transparency and iterative improvement.
The Three Vulnerabilities
Vulnerability #1: Prototype Pollution Privilege Escalation
What it is: A Node.js API with a `deepMerge` function that recursively merges user input into a config object. There are no `hasOwnProperty` checks and no `__proto__` filtering. Authorization relies on the `req.user.isAdmin` property.
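To make the description concrete, here is a minimal sketch of the vulnerable pattern, reconstructed from the description above (hypothetical code, not the researcher's actual submission):
// Hypothetical reconstruction of the vulnerable merge: no key filtering and no
// own-property checks, so a "__proto__" payload walks up to Object.prototype.
function deepMerge(target, source) {
  for (const key in source) {
    if (typeof source[key] === 'object' && source[key] !== null) {
      // Recursing into target['__proto__'] hands Object.prototype to the next call,
      // which then writes attacker-controlled properties onto it.
      target[key] = deepMerge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}
// e.g. app.post('/admin/config', (req, res) => { deepMerge(config, req.body); res.json(config) })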
The exploit:
POST /admin/config
{
  "__proto__": {
    "isAdmin": true
  }
}
Result: All objects inherit `isAdmin: true`, giving instant admin access.
Why it matters: Affects popular npm packages (lodash, hoek, minimist). Real CVEs: CVE-2019-10744, CVE-2020-28477.
Vulnerability #2: Agentic AI Supply-Chain Attack (2025 Cutting-Edge)
What it is: An LLM agent microservice with three attack vectors:
- Indirect prompt injection via poisoned web pages
- Over-privileged Azure management API token with full tenant access
- Unsafe WASM execution with host filesystem mounts (`from: '/', to: '/'`)
The exploit path:
- Attacker hosts malicious webpage with hidden instructions
- LLM agent fetches page, extracts instructions
- Agent invokes Azure API tool to escalate privileges
- WASM runtime executes arbitrary code with host filesystem access
- Cross-tenant cloud compromise
Why it matters: OWASP Top 10 for LLMs #1 risk (prompt injection). Real incidents: ChatGPT plugins, Microsoft Copilot, GitHub Copilot Chat. No existing AI benchmark tests this attack vector.
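For illustration only, here is a heavily simplified, hypothetical sketch of the insecure pattern behind the first two attack vectors (callLLM and invokeTool are stand-in stubs, not the researcher's code or any real SDK): untrusted page content and the tool catalogue share one prompt, and the tool layer holds tenant-wide credentials plus an unrestricted filesystem mount.
// Hypothetical sketch; callLLM/invokeTool are stubs standing in for an agent framework.
const callLLM = async ({ input, tools }) => ({ tool: tools[0], arguments: {} }) // stub
const invokeTool = async (tool, args, creds) => ({ tool, args, creds })         // stub

async function handleUserTask(task) {
  // 1. Fetch an attacker-controlled page; its hidden text goes straight into the prompt.
  const page = await fetch(task.url).then((r) => r.text())

  // 2. Untrusted content and the tool catalogue share one context, so injected
  //    instructions can steer which tool is called and with what arguments.
  const plan = await callLLM({
    tools: ['azure_management_api', 'run_wasm'],
    input: `${task.prompt}\n\nPage content:\n${page}`, // no boundary between data and instructions
  })

  // 3. Tools run with a tenant-wide Azure management token and a WASM runtime
  //    that mounts the entire host filesystem (from: '/', to: '/').
  return invokeTool(plan.tool, plan.arguments, {
    azureToken: process.env.AZURE_MGMT_TOKEN, // over-privileged: full tenant access
    wasmMounts: [{ from: '/', to: '/' }],     // unrestricted host filesystem
  })
}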
Vulnerability #3: OS Command Injection (ImageMagick)
What it is: An Express API that shells out to ImageMagick via `child_process.exec()`. User-controlled `font`, `size`, and `text` parameters are injected directly into the command string, with no input sanitization or escaping.
The exploit:
POST /render
{
  "text": "hello",
  "font": "Arial; rm -rf /",
  "size": "12"
}
Resulting command:
convert -font "Arial; rm -rf /" -pointsize 12 label:"hello" /tmp/out.png
Why it matters: ImageTragick (CVE-2016-3714) variants still common in 2025. Classic attack that every model should catch.
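One common way to harden this kind of endpoint (a sketch under our own assumptions, not any model's actual fix) is to avoid the shell entirely: pass arguments to the binary as an array via execFile() and allow-list the user-controlled fields.
// Sketch of a hardened /render handler (assumed route shape).
const express = require('express')
const { execFile } = require('child_process')

const app = express()
app.use(express.json())

const ALLOWED_FONTS = new Set(['Arial', 'Helvetica', 'Times-Roman']) // allow-list instead of escaping

app.post('/render', (req, res) => {
  const { text = '', font, size } = req.body

  // Validate user-controlled fields instead of interpolating them into a shell string.
  if (!ALLOWED_FONTS.has(font)) return res.status(400).send('invalid font')
  if (typeof text !== 'string' || text.startsWith('@')) return res.status(400).send('invalid text') // block label:@file reads
  const pointSize = Number.parseInt(size, 10)
  if (!Number.isInteger(pointSize) || pointSize < 6 || pointSize > 128) {
    return res.status(400).send('invalid size')
  }

  // execFile() passes each argument directly to the binary with no shell,
  // so a font value like "Arial; rm -rf /" is just a (rejected) literal string.
  const args = ['-font', font, '-pointsize', String(pointSize), `label:${text}`, '/tmp/out.png']
  execFile('convert', args, (err) => {
    if (err) return res.status(500).send('render failed')
    res.sendFile('/tmp/out.png')
  })
})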
The Results: 100% Detection, But Quality Varied
✅ All Models Passed (But Not Equally)
Every model caught every vulnerability, but GPT-5's average score was 11.3 points higher than Grok 4's (95.4 vs. 84.1).
Rank | Model | Avg Score | Cost | Detection | Key Strength |
---|---|---|---|---|---|
1 | GPT-5 | 95.4/100 | $2.18 | 3/3 ✅ | Best overall, comprehensive |
2 | OpenAI o3 | 92.7/100 | $0.97 | 3/3 ✅ | Pragmatic, user’s choice |
3 | Gemini 2.5 Pro | 89.2/100 | $0.09 | 3/3 ✅ | Cheapest |
4 | Claude Sonnet 4.5 | 88.2/100 | $0.19 | 3/3 ✅ | ⭐ Best value (92% quality @ 9% cost) |
5 | Claude Opus 4.1 | 87.7/100 | $0.88 | 3/3 ✅ | Thorough but over-engineered |
6 | Grok 4 | 84.1/100 | $0.14 | 3/3 ✅ | Slowest, simplest fixes |
What “Quality” Means in Security
All models identified the vulnerabilities. The score differences came from:
- Completeness of fix – Did they address all attack vectors?
- Defense-in-depth – Did they suggest multiple mitigation layers?
- Code quality – Is the fix production-ready or just a patch?
- Explanation depth – Did they explain why the fix works?
Example: Prototype Pollution Fixes
GPT-5 (96.4/100) suggested four mitigation strategies:
- Use `Object.create(null)` for config objects
- Add `hasOwnProperty` checks in `deepMerge`
- Explicitly block `__proto__`, `constructor`, and `prototype` keys
- Use `Object.freeze()` on authorization logic
Grok 4 (85/100) suggested one:
- Add key filtering in `deepMerge` (but incomplete – missed some edge cases)
Both “caught it” – but one fix is production-ready, the other has gaps.
📝 Code Example: GPT-5’s Defense-in-Depth Approach
Here’s how GPT-5 (96.4/100) fixed the prototype pollution vulnerability with a multi-layered approach:
// Helper: create null-prototype object
function obj(data) {
return Object.assign(Object.create(null), data)
}
// Safe deepMerge with key filtering
function safeDeepMerge(target, source) {
const dangerousKeys = ['__proto__', 'constructor', 'prototype']
for (const key in source) {
// Block dangerous keys
if (dangerousKeys.includes(key)) continue
// Only merge own properties
if (!Object.hasOwnProperty.call(source, key)) continue
// Recursively merge objects safely
if (typeof source[key] === 'object' && source[key] !== null) {
target[key] = safeDeepMerge(target[key] || {}, source[key])
} else {
target[key] = source[key]
}
}
return target
}
// Create users with null prototypes
app.use((req, res, next) => {
req.user = obj({ isAdmin: false, username: 'guest' })
next()
})
// Require own property check for authorization
function isAdmin(user) {
return Object.hasOwnProperty.call(user, 'isAdmin')
&& user.isAdmin === true
}
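For context, here is a short, hypothetical sketch of how those pieces might be wired into the /admin/config route from the exploit above (the route shape is assumed; it is not shown in the original submission):
// Hypothetical wiring of the patched merge into the vulnerable route (assumes express.json() body parsing).
const config = obj({ theme: 'dark' })          // null-prototype config container

app.post('/admin/config', (req, res) => {
  if (!isAdmin(req.user)) return res.status(403).send('forbidden')
  safeDeepMerge(config, req.body)              // "__proto__" keys are skipped, so the
  res.json(config)                             // earlier payload no longer grants admin
})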
Why this approach scored 96.4/100:
- Null-prototype objects – Prevents inheritance attacks
- Key filtering – Blocks `__proto__`, `constructor`, and `prototype` keys
- Own-property checks – Validates `isAdmin` is directly set, not inherited
- Helper function – Consistent null-prototype creation across the app
Compare this to Grok 4’s simpler approach (85/100), which only added basic key filtering but missed null-prototype objects and own-property validation—leaving edge cases unprotected.
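For contrast, here is our own illustration of a filtering-only fix in that style (not Grok 4's actual output):
// Filtering-only fix (illustration): blocks the literal "__proto__" key but keeps
// for...in without own-property checks, and authorization objects still inherit from
// Object.prototype, so constructor.prototype payloads and inherited flags remain risks.
function filteredDeepMerge(target, source) {
  for (const key in source) {
    if (key === '__proto__') continue          // single-layer defense
    if (typeof source[key] === 'object' && source[key] !== null) {
      target[key] = filteredDeepMerge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}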
Cost Analysis: GPT-5 Costs 49% of Budget
💰 Total Cost: $4.46 for 3 Evaluations × 6 Models
GPT-5 alone cost $2.18 (48.87% of the total) – nearly as much as all other models combined!
Model | Total Cost | % of Budget | Avg Score | Value Rating |
---|---|---|---|---|
GPT-5 | $2.18 | 48.87% | 95.4 | Premium |
OpenAI o3 | $0.97 | 21.76% | 92.7 | Good |
Claude Opus 4.1 | $0.88 | 19.79% | 87.7 | Fair |
Claude Sonnet 4.5 | $0.19 | 4.35% | 88.2 | ⭐ Best Value |
Grok 4 | $0.14 | 3.23% | 84.1 | Budget |
Gemini 2.5 Pro | $0.09 | 2.00% | 89.2 | ⭐ Cheapest |
- Most expensive evaluation: $1.93 (Agentic AI attack) – GPT-5 generated 22,711 characters analyzing the multi-layer attack
- Cheapest evaluation: $0.88 (Prototype pollution) – classic vulnerability, less reasoning required
- Average: $1.49 per evaluation – $0.25 per model execution
💡 Budget Recommendation
If cost matters: Use Claude Sonnet 4.5 or Gemini 2.5 Pro for 90%+ of GPT-5’s quality at 2-9% of cost.
If quality matters: Use GPT-5 for mission-critical security audits, or OpenAI o3 as middle ground (97% of GPT-5’s quality at 44% of cost).
The Plot Twist: Human Disagreed with AI Judge
🤔 What Happened
On the ImageMagick command injection vulnerability:
- AI judge's choice: GPT-5 – 95.8/100 (ranked #1 by the AI judge)
- User's choice ✅: OpenAI o3 – 90.4/100 (ranked #4 by the AI judge)
User’s comment:
“is better i think because”
Note: The comment was incomplete, but the user’s choice reveals a key insight—human security experts prioritize different factors than AI judges. They likely valued o3’s pragmatism (simpler, deployable fixes), clarity (easier to understand for teams), and production-readiness over GPT-5’s more comprehensive but complex approach.
Why This Matters
AI Judges Optimize For:
- Completeness (all criteria addressed?)
- Thoroughness (how detailed?)
- Code quality (style, structure)
Human Experts Value:
- Pragmatism – Is this actually deployable?
- Simplicity – Fewer moving parts
- Clarity – Can my team maintain this?
Possible reasons the researcher chose o3 over GPT-5:
- Simpler fix – o3’s solution may have been more straightforward
- Better explanation – o3 might have explained the “why” more clearly
- Production-ready – Less over-engineering than GPT-5
- Personal experience – They’ve used o3 before and trust its outputs
What This Teaches Us
Community voting ≠ AI judging. AI judges are objective but may miss human intuition. Security experts weigh different factors than AI rubrics.
This is why CodeLens combines both:
- AI judge provides instant, consistent scoring
- Human votes validate and correct AI blind spots
Real-world lesson: Don’t blindly trust AI scores. Get human review on critical security decisions. Best approach: Use AI to triage, humans to validate.
Performance by Vulnerability Type
📊 Classic vs. Cutting-Edge Vulnerabilities
Pattern discovered: All models excel at classic vulnerabilities (prototype pollution, command injection). But newer attacks (agentic AI) create wider performance gaps.
Prototype Pollution (2019 Vulnerability, Well-Known)
Model | Score | Detection | Key Insight |
---|---|---|---|
GPT-5 | 96.4 | ✅ | 4 mitigation strategies, production-ready |
OpenAI o3 | 95.2 | ✅ | Clean helpers, null-prototype containers |
Claude Sonnet 4.5 | 91.0 | ✅ | Multi-layer defense with validation |
Gemini 2.5 Pro | 90.0 | ✅ | Simple fix, some edge cases missed |
Claude Opus 4.1 | 86.0 | ✅ | Overengineered but comprehensive |
Grok 4 | 85.0 | ✅ | Partial mitigation, incomplete filtering |
Insight: All models caught it, but GPT-5’s fix was 13% better than Grok 4’s.
Agentic AI Supply-Chain Attack (2025 Cutting-Edge)
Model | Score | Detection | Key Insight |
---|---|---|---|
GPT-5 | 94.0 | ✅ | Defense-in-depth with scoped tokens |
OpenAI o3 | 92.4 | ✅ | Trust boundaries + policy gating |
Gemini 2.5 Pro | 87.4 | ✅ | Comprehensive but complex |
Claude Opus 4.1 | 83.8 | ✅ | TypeScript + complex classes |
Grok 4 | 83.2 | ✅ | Brittle token decode |
Claude Sonnet 4.5 | 82.0 | ✅ | Over-engineered, lowest score |
Insight: Claude Sonnet 4.5 scored 12 points lower on the advanced attack vs. classic vulnerabilities.
🎯 Pattern: Advanced Attacks Favor Frontier Models
Classic vulnerabilities (prototype pollution, command injection): 88-96/100 (tight 8-point range)
Advanced attack (agentic AI): 82-94/100 (wider 12-point spread)
Conclusion: For well-known vulnerabilities (OWASP Top 10), any model works. For cutting-edge attacks (LLM security, supply-chain), use GPT-5 or o3. Budget models excel at classics but struggle with novelty.
Methodology & Transparency
How We Scored These Evaluations
1. AI Judge Selection (Dynamic)
For each evaluation, we select the current #1 model (by average score across all completed evaluations) to judge new submissions. This prevents self-judging and ensures the highest-performing model evaluates competitors.
For these evaluations:
- Judge Model: GPT-5 (was #1 at time of evaluation)
- Backup Rule: If top model is in evaluation, use #2 model
- Fallback: GPT-5 if no prior benchmark data exists
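A minimal sketch of that selection rule, with assumed data shapes (this is not the platform's actual implementation):
// Judge selection sketch: leaderboard = model names sorted by average score, best first.
function selectJudge(leaderboard, modelsUnderTest) {
  if (leaderboard.length === 0) return 'gpt-5'                   // fallback: no prior benchmark data
  const [top, runnerUp] = leaderboard
  if (modelsUnderTest.includes(top) && runnerUp) return runnerUp // backup rule: avoid self-judging
  return top                                                     // default: current #1 model judges
}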
2. Scoring Rubric (5 Criteria, 0-100)
Each model’s output is scored on 5 security-focused criteria:
- Correctness (20 pts) – Does the fix actually prevent the vulnerability?
- Completeness (20 pts) – Are all attack vectors addressed?
- Code Quality (20 pts) – Is the code production-ready?
- Security (20 pts) – Does it follow security best practices?
- Performance (20 pts) – Does the fix introduce performance issues?
Final score = sum of the five criteria (5 × 20 = 100 points). All scores are visible in the raw evaluation data (linked below).
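As a quick arithmetic sketch (assuming each criterion is scored 0-20, as the 20-point labels suggest):
// Rubric arithmetic sketch: five criteria, 0-20 each, summed to a 0-100 final score.
const criteria = { correctness: 19, completeness: 20, codeQuality: 19, security: 20, performance: 18 }
const finalScore = Object.values(criteria).reduce((sum, pts) => sum + pts, 0)
console.log(finalScore) // 96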
3. Bias Mitigation
- No self-judging: AI judge cannot evaluate its own output
- Human votes override AI scores: Community voting is the ultimate arbiter
- Blind evaluation: AI judge doesn’t know which model generated which output
- Open data: All model outputs and scores are publicly viewable
4. Why This Approach Works
Self-validating benchmark: The current best model judges new submissions, creating a competitive “survival of the fittest” dynamic. As models improve, the judging bar rises automatically.
Human validation loop: AI scores provide instant feedback, but human security experts have the final say. This case study is a perfect example—human voted for o3, AI judge chose GPT-5.
🔓 Full Transparency: Raw Data Available
Every evaluation on CodeLens.AI is publicly accessible. View the complete data for this case study:
Each link shows: Original vulnerable code, task description, all 6 model outputs, AI judge scores (by criterion), and voting results.
Key Takeaways & Recommendations
1. Detection ≠ Quality
All models caught all vulnerabilities (100% detection rate), but quality of fixes varied by 8-18%.
Lesson: Don’t just ask “Did AI catch it?” Ask “Is the fix production-ready?”
2. Cost vs. Quality Tradeoff is Real
GPT-5: Best quality (95.4) but 49% of budget. Claude Sonnet: 92% of quality at 9% of cost.
Lesson: Define your quality threshold, then optimize for cost.
3. Human Experts ≠ AI Judges
AI judge chose GPT-5 (95.8 score). Security researcher chose o3 (90.4 score, ranked #4).
Lesson: Get human validation on critical security decisions.
4. Advanced Attacks Favor Frontier Models
Classic vulnerabilities: All models 85-96/100. Cutting-edge (agentic AI): 82-94/100 (12-point spread).
Lesson: Use GPT-5/o3 for novel threats, budget models for OWASP Top 10.
5. Model Choice Depends on Use Case
Not “which model is best?” but “best for what?” Different models excel at different domains.
Lesson: Match the model to the mission.
📋 Recommendation Matrix
For Mission-Critical Production Code → GPT-5
Cost: $0.73/eval avg, 95.4 quality
Use when: Financial systems, healthcare, authentication
Why: Most comprehensive fixes, defense-in-depth
For Everyday Security Audits → Claude Sonnet 4.5
Cost: $0.06/eval avg, 88.2 quality
Use when: Regular code reviews, PR automation
Why: 92% of GPT-5’s quality at 9% of cost
For Budget-Constrained Teams → Gemini 2.5 Pro
Cost: $0.03/eval avg, 89.2 quality
Use when: Startups, open source, high-volume scanning
Why: Cheapest option, surprisingly strong performance
For Pragmatic Fixes → OpenAI o3
Cost: $0.32/eval avg, 92.7 quality
Use when: You want simple, deployable solutions
Why: Security expert’s choice, good balance
Try It Yourself
Want to see which AI models catch vulnerabilities in your codebase?
Submit to CodeLens:
1. Paste your vulnerable code (50-500 lines)
2. Describe the security issue you’re testing
3. Get an instant comparison across 6 top models
4. Vote on which model’s fix you’d actually deploy
Free during beta • No credit card required
Conclusion
The security researcher who submitted these vulnerabilities taught us something important: detection is table stakes, but quality is what matters.
Every AI model caught every vulnerability. That’s impressive—a few years ago, this would have been impossible.
But the spread in fix quality (84-95/100) shows that not all AI security reviews are created equal. GPT-5 delivered the most comprehensive solutions. Claude Sonnet 4.5 offered 92% of the quality at 9% of the cost. And OpenAI o3 provided the pragmatic fixes that a real security engineer preferred over the AI judge’s top pick.
The takeaway? Match the model to the mission. Use frontier models for novel threats and mission-critical code. Use budget models for everyday OWASP Top 10 scans. And always get human validation on the fixes you actually deploy.
Because in security, good enough isn’t good enough.