A security researcher submitted three advanced vulnerability examples to our AI benchmarking platform. Not textbook examples—real exploits: prototype pollution that bypasses authorization, an agentic AI supply-chain attack combining prompt injection with cloud API abuse, and OS command injection in ImageMagick.
We ran each through 6 top AI models: GPT-5, OpenAI o3, Claude Opus 4.1, Claude Sonnet 4.5, Grok 4, and Gemini 2.5 Pro.
The result? All six models caught all three vulnerabilities. 100% detection rate.
But here’s the catch: the quality of their fixes varied by up to 18 percentage points. And when the security researcher voted on which model performed best, they disagreed with our AI judge entirely.
Here’s what we learned about which AI models you should trust for security code reviews.
⚠️ Early Data Disclaimer (n=3 evaluations)
This case study analyzes 3 security evaluations from one external researcher. Results are directional and not statistically significant. We’re building a larger benchmark dataset and actively seeking more security professionals to submit challenges.
Why publish early data? Even with limited sample size, these findings reveal important patterns about AI model behavior on cutting-edge vulnerabilities. We believe in transparency and iterative improvement.
The Three Vulnerabilities
Vulnerability #1: Prototype Pollution Privilege Escalation
What it is: A Node.js API with a `deepMerge` function that recursively merges user input into a config object. There are no `hasOwnProperty` checks and no `__proto__` filtering. Authorization relies on the `req.user.isAdmin` property.
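To make the description concrete, here is a minimal sketch of the vulnerable pattern, reconstructed from the description above (hypothetical code, not the researcher's actual submission):
// Hypothetical reconstruction of the vulnerable merge: no key filtering and no
// own-property checks, so a "__proto__" payload walks up to Object.prototype.
function deepMerge(target, source) {
  for (const key in source) {
    if (typeof source[key] === 'object' && source[key] !== null) {
      // Recursing into target['__proto__'] hands Object.prototype to the next call,
      // which then writes attacker-controlled properties onto it.
      target[key] = deepMerge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}
// e.g. app.post('/admin/config', (req, res) => { deepMerge(config, req.body); res.json(config) })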
The exploit:
POST /admin/config
{
  "__proto__": {
    "isAdmin": true
  }
}
Result: All objects inherit `isAdmin: true`, giving instant admin access.
Why it matters: Affects popular npm packages (lodash, hoek, minimist). Real CVEs: CVE-2019-10744, CVE-2020-28477.
Vulnerability #2: Agentic AI Supply-Chain Attack (2025 Cutting-Edge)
What it is: An LLM agent microservice with three attack vectors:
- Indirect prompt injection via poisoned web pages
- Over-privileged Azure management API token with full tenant access
- Unsafe WASM execution with host filesystem mounts (`from: '/', to: '/'`)
The exploit path:
- Attacker hosts malicious webpage with hidden instructions
- LLM agent fetches page, extracts instructions
- Agent invokes Azure API tool to escalate privileges
- WASM runtime executes arbitrary code with host filesystem access
- Cross-tenant cloud compromise
Why it matters: OWASP Top 10 for LLMs #1 risk (prompt injection). Real incidents: ChatGPT plugins, Microsoft Copilot, GitHub Copilot Chat. No existing AI benchmark tests this attack vector.
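For illustration only, here is a heavily simplified, hypothetical sketch of the insecure pattern behind the first two attack vectors (callLLM and invokeTool are stand-in stubs, not the researcher's code or any real SDK): untrusted page content and the tool catalogue share one prompt, and the tool layer holds tenant-wide credentials plus an unrestricted filesystem mount.
// Hypothetical sketch; callLLM/invokeTool are stubs standing in for an agent framework.
const callLLM = async ({ input, tools }) => ({ tool: tools[0], arguments: {} }) // stub
const invokeTool = async (tool, args, creds) => ({ tool, args, creds })         // stub

async function handleUserTask(task) {
  // 1. Fetch an attacker-controlled page; its hidden text goes straight into the prompt.
  const page = await fetch(task.url).then((r) => r.text())

  // 2. Untrusted content and the tool catalogue share one context, so injected
  //    instructions can steer which tool is called and with what arguments.
  const plan = await callLLM({
    tools: ['azure_management_api', 'run_wasm'],
    input: `${task.prompt}\n\nPage content:\n${page}`, // no boundary between data and instructions
  })

  // 3. Tools run with a tenant-wide Azure management token and a WASM runtime
  //    that mounts the entire host filesystem (from: '/', to: '/').
  return invokeTool(plan.tool, plan.arguments, {
    azureToken: process.env.AZURE_MGMT_TOKEN, // over-privileged: full tenant access
    wasmMounts: [{ from: '/', to: '/' }],     // unrestricted host filesystem
  })
}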
Vulnerability #3: OS Command Injection (ImageMagick)
What it is: An Express API that shells out to ImageMagick via `child_process.exec()`. User-controlled `font`, `size`, and `text` parameters are injected directly into the command string, with no input sanitization or escaping.
The exploit:
POST /render
{
  "text": "hello",
  "font": "Arial; rm -rf /",
  "size": "12"
}
Resulting command:
convert -font "Arial; rm -rf /" -pointsize 12 label:"hello" /tmp/out.png
Why it matters: ImageTragick (CVE-2016-3714) variants still common in 2025. Classic attack that every model should catch.
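One common way to harden this kind of endpoint (a sketch under our own assumptions, not any model's actual fix) is to avoid the shell entirely: pass arguments to the binary as an array via execFile() and allow-list the user-controlled fields.
// Sketch of a hardened /render handler (assumed route shape).
const express = require('express')
const { execFile } = require('child_process')

const app = express()
app.use(express.json())

const ALLOWED_FONTS = new Set(['Arial', 'Helvetica', 'Times-Roman']) // allow-list instead of escaping

app.post('/render', (req, res) => {
  const { text = '', font, size } = req.body

  // Validate user-controlled fields instead of interpolating them into a shell string.
  if (!ALLOWED_FONTS.has(font)) return res.status(400).send('invalid font')
  if (typeof text !== 'string' || text.startsWith('@')) return res.status(400).send('invalid text') // block label:@file reads
  const pointSize = Number.parseInt(size, 10)
  if (!Number.isInteger(pointSize) || pointSize < 6 || pointSize > 128) {
    return res.status(400).send('invalid size')
  }

  // execFile() passes each argument directly to the binary with no shell,
  // so a font value like "Arial; rm -rf /" is just a (rejected) literal string.
  const args = ['-font', font, '-pointsize', String(pointSize), `label:${text}`, '/tmp/out.png']
  execFile('convert', args, (err) => {
    if (err) return res.status(500).send('render failed')
    res.sendFile('/tmp/out.png')
  })
})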
The Results: 100% Detection, But Quality Varied
✅ All Models Passed (But Not Equally)
Every model caught every vulnerability, but GPT-5's average score was 11.3 points higher than Grok 4's (95.4 vs. 84.1).
Rank | Model | Avg Score | Cost | Detection | Key Strength |
---|---|---|---|---|---|
1 | GPT-5 | 95.4/100 | $2.18 | 3/3 ✅ | Best overall, comprehensive |
2 | OpenAI o3 | 92.7/100 | $0.97 | 3/3 ✅ | Pragmatic, user’s choice |
3 | Gemini 2.5 Pro | 89.2/100 | $0.09 | 3/3 ✅ | Cheapest |
4 | Claude Sonnet 4.5 | 88.2/100 | $0.19 | 3/3 ✅ | ⭐ Best value (92% quality @ 9% cost) |
5 | Claude Opus 4.1 | 87.7/100 | $0.88 | 3/3 ✅ | Thorough but over-engineered |
6 | Grok 4 | 84.1/100 | $0.14 | 3/3 ✅ | Slowest, simplest fixes |
What “Quality” Means in Security
All models identified the vulnerabilities. The score differences came from:
- Completeness of fix – Did they address all attack vectors?
- Defense-in-depth – Did they suggest multiple mitigation layers?
- Code quality – Is the fix production-ready or just a patch?
- Explanation depth – Did they explain why the fix works?
Example: Prototype Pollution Fixes
GPT-5 (96.4/100) suggested four mitigation strategies:
- Use `Object.create(null)` for config objects
- Add `hasOwnProperty` checks in `deepMerge`
- Explicitly block `__proto__`, `constructor`, and `prototype` keys
- Use `Object.freeze()` on authorization logic
Grok 4 (85/100) suggested one:
- Add key filtering in `deepMerge` (but incomplete – missed some edge cases)
Both “caught it” – but one fix is production-ready, the other has gaps.
📝 Code Example: GPT-5’s Defense-in-Depth Approach
Here’s how GPT-5 (96.4/100) fixed the prototype pollution vulnerability with a multi-layered approach:
// Helper: create null-prototype object
function obj(data) {
return Object.assign(Object.create(null), data)
}
// Safe deepMerge with key filtering
function safeDeepMerge(target, source) {
const dangerousKeys = ['__proto__', 'constructor', 'prototype']
for (const key in source) {
// Block dangerous keys
if (dangerousKeys.includes(key)) continue
// Only merge own properties
if (!Object.hasOwnProperty.call(source, key)) continue
// Recursively merge objects safely
if (typeof source[key] === 'object' && source[key] !== null) {
target[key] = safeDeepMerge(target[key] || {}, source[key])
} else {
target[key] = source[key]
}
}
return target
}
// Create users with null prototypes
app.use((req, res, next) => {
req.user = obj({ isAdmin: false, username: 'guest' })
next()
})
// Require own property check for authorization
function isAdmin(user) {
return Object.hasOwnProperty.call(user, 'isAdmin')
&& user.isAdmin === true
}
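For context, here is a short, hypothetical sketch of how those pieces might be wired into the /admin/config route from the exploit above (the route shape is assumed; it is not shown in the original submission):
// Hypothetical wiring of the patched merge into the vulnerable route (assumes express.json() body parsing).
const config = obj({ theme: 'dark' })          // null-prototype config container

app.post('/admin/config', (req, res) => {
  if (!isAdmin(req.user)) return res.status(403).send('forbidden')
  safeDeepMerge(config, req.body)              // "__proto__" keys are skipped, so the
  res.json(config)                             // earlier payload no longer grants admin
})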
Why this approach scored 96.4/100:
- Null-prototype objects – Prevents inheritance attacks
- Key filtering – Blocks `__proto__`, `constructor`, and `prototype` keys
- Own-property checks – Validates `isAdmin` is directly set, not inherited
- Helper function – Consistent null-prototype creation across the app
Compare this to Grok 4’s simpler approach (85/100), which only added basic key filtering but missed null-prototype objects and own-property validation—leaving edge cases unprotected.
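For contrast, here is our own illustration of a filtering-only fix in that style (not Grok 4's actual output):
// Filtering-only fix (illustration): blocks the literal "__proto__" key but keeps
// for...in without own-property checks, and authorization objects still inherit from
// Object.prototype, so constructor.prototype payloads and inherited flags remain risks.
function filteredDeepMerge(target, source) {
  for (const key in source) {
    if (key === '__proto__') continue          // single-layer defense
    if (typeof source[key] === 'object' && source[key] !== null) {
      target[key] = filteredDeepMerge(target[key] || {}, source[key])
    } else {
      target[key] = source[key]
    }
  }
  return target
}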
Cost Analysis: GPT-5 Costs 49% of Budget
💰 Total Cost: $4.46 for 3 Evaluations × 6 Models
GPT-5 alone cost $2.18 (48.87% of the total) – nearly as much as all other models combined!
Model | Total Cost | % of Budget | Avg Score | Value Rating |
---|---|---|---|---|
GPT-5 | $2.18 | 48.87% | 95.4 | Premium |
OpenAI o3 | $0.97 | 21.76% | 92.7 | Good |
Claude Opus 4.1 | $0.88 | 19.79% | 87.7 | Fair |
Claude Sonnet 4.5 | $0.19 | 4.35% | 88.2 | ⭐ Best Value |
Grok 4 | $0.14 | 3.23% | 84.1 | Budget |
Gemini 2.5 Pro | $0.09 | 2.00% | 89.2 | ⭐ Cheapest |
- Most expensive evaluation: $1.93 (Agentic AI attack) – GPT-5 generated 22,711 characters analyzing the multi-layer attack
- Cheapest evaluation: $0.88 (Prototype pollution) – classic vulnerability, less reasoning required
- Average: $1.49 per evaluation – $0.25 per model execution
💡 Budget Recommendation
If cost matters: Use Claude Sonnet 4.5 or Gemini 2.5 Pro for 90%+ of GPT-5’s quality at 2-9% of cost.
If quality matters: Use GPT-5 for mission-critical security audits, or OpenAI o3 as middle ground (97% of GPT-5’s quality at 44% of cost).
The Plot Twist: Human Disagreed with AI Judge
🤔 What Happened
On the ImageMagick command injection vulnerability:
- AI judge's choice: GPT-5 – 95.8/100 (ranked #1 by the AI judge)
- User's choice ✅: OpenAI o3 – 90.4/100 (ranked #4 by the AI judge)
User’s comment:
“is better i think because”
Note: The comment was incomplete, but the user’s choice reveals a key insight—human security experts prioritize different factors than AI judges. They likely valued o3’s pragmatism (simpler, deployable fixes), clarity (easier to understand for teams), and production-readiness over GPT-5’s more comprehensive but complex approach.
Why This Matters
AI Judges Optimize For:
- Completeness (all criteria addressed?)
- Thoroughness (how detailed?)
- Code quality (style, structure)
Human Experts Value:
- Pragmatism – Is this actually deployable?
- Simplicity – Fewer moving parts
- Clarity – Can my team maintain this?
Possible reasons the researcher chose o3 over GPT-5:
- Simpler fix – o3’s solution may have been more straightforward
- Better explanation – o3 might have explained the “why” more clearly
- Production-ready – Less over-engineering than GPT-5
- Personal experience – They’ve used o3 before and trust its outputs
What This Teaches Us
Community voting ≠ AI judging. AI judges are objective but may miss human intuition. Security experts weigh different factors than AI rubrics.
This is why CodeLens combines both:
- AI judge provides instant, consistent scoring
- Human votes validate and correct AI blind spots
Real-world lesson: Don’t blindly trust AI scores. Get human review on critical security decisions. Best approach: Use AI to triage, humans to validate.
Performance by Vulnerability Type
📊 Classic vs. Cutting-Edge Vulnerabilities
Pattern discovered: All models excel at classic vulnerabilities (prototype pollution, command injection). But newer attacks (agentic AI) create wider performance gaps.
Prototype Pollution (2019 Vulnerability, Well-Known)
Model | Score | Detection | Key Insight |
---|---|---|---|
GPT-5 | 96.4 | ✅ | 4 mitigation strategies, production-ready |
OpenAI o3 | 95.2 | ✅ | Clean helpers, null-prototype containers |
Claude Sonnet 4.5 | 91.0 | ✅ | Multi-layer defense with validation |
Gemini 2.5 Pro | 90.0 | ✅ | Simple fix, some edge cases missed |
Claude Opus 4.1 | 86.0 | ✅ | Overengineered but comprehensive |
Grok 4 | 85.0 | ✅ | Partial mitigation, incomplete filtering |
Insight: All models caught it, but GPT-5’s fix was 13% better than Grok 4’s.
Agentic AI Supply-Chain Attack (2025 Cutting-Edge)
Model | Score | Detection | Key Insight |
---|---|---|---|
GPT-5 | 94.0 | ✅ | Defense-in-depth with scoped tokens |
OpenAI o3 | 92.4 | ✅ | Trust boundaries + policy gating |
Gemini 2.5 Pro | 87.4 | ✅ | Comprehensive but complex |
Claude Opus 4.1 | 83.8 | ✅ | TypeScript + complex classes |
Grok 4 | 83.2 | ✅ | Brittle token decode |
Claude Sonnet 4.5 | 82.0 | ✅ | Over-engineered, lowest score |
Insight: Claude Sonnet 4.5 scored 12 points lower on the advanced attack vs. classic vulnerabilities.
🎯 Pattern: Advanced Attacks Favor Frontier Models
Classic vulnerabilities (prototype pollution, command injection): 88-96/100 (tight 8-point range)
Advanced attack (agentic AI): 82-94/100 (wider 12-point spread)
Conclusion: For well-known vulnerabilities (OWASP Top 10), any model works. For cutting-edge attacks (LLM security, supply-chain), use GPT-5 or o3. Budget models excel at classics but struggle with novelty.
Methodology & Transparency
How We Scored These Evaluations
1. AI Judge Selection (Dynamic)
For each evaluation, we select the current #1 model (by average score across all completed evaluations) to judge new submissions. This prevents self-judging and ensures the highest-performing model evaluates competitors.
For these evaluations:
- Judge Model: GPT-5 (was #1 at time of evaluation)
- Backup Rule: If top model is in evaluation, use #2 model
- Fallback: GPT-5 if no prior benchmark data exists
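A minimal sketch of that selection rule, with assumed data shapes (this is not the platform's actual implementation):
// Judge selection sketch: leaderboard = model names sorted by average score, best first.
function selectJudge(leaderboard, modelsUnderTest) {
  if (leaderboard.length === 0) return 'gpt-5'                   // fallback: no prior benchmark data
  const [top, runnerUp] = leaderboard
  if (modelsUnderTest.includes(top) && runnerUp) return runnerUp // backup rule: avoid self-judging
  return top                                                     // default: current #1 model judges
}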
2. Scoring Rubric (5 Criteria, 0-100)
Each model’s output is scored on 5 security-focused criteria:
- Correctness (20 pts) – Does the fix actually prevent the vulnerability?
- Completeness (20 pts) – Are all attack vectors addressed?
- Code Quality (20 pts) – Is the code production-ready?
- Security (20 pts) – Does it follow security best practices?
- Performance (20 pts) – Does the fix introduce performance issues?
Final score = sum of the five criteria (5 × 20 = 100 points). All scores are visible in the raw evaluation data (linked below).
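As a quick arithmetic sketch (assuming each criterion is scored 0-20, as the 20-point labels suggest):
// Rubric arithmetic sketch: five criteria, 0-20 each, summed to a 0-100 final score.
const criteria = { correctness: 19, completeness: 20, codeQuality: 19, security: 20, performance: 18 }
const finalScore = Object.values(criteria).reduce((sum, pts) => sum + pts, 0)
console.log(finalScore) // 96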
3. Bias Mitigation
- No self-judging: AI judge cannot evaluate its own output
- Human votes override AI scores: Community voting is the ultimate arbiter
- Blind evaluation: AI judge doesn’t know which model generated which output
- Open data: All model outputs and scores are publicly viewable
4. Why This Approach Works
Self-validating benchmark: The current best model judges new submissions, creating a competitive “survival of the fittest” dynamic. As models improve, the judging bar rises automatically.
Human validation loop: AI scores provide instant feedback, but human security experts have the final say. This case study is a perfect example—human voted for o3, AI judge chose GPT-5.
🔓 Full Transparency: Raw Data Available
Every evaluation on CodeLens.AI is publicly accessible. View the complete data for this case study:
Each link shows: Original vulnerable code, task description, all 6 model outputs, AI judge scores (by criterion), and voting results.
Key Takeaways & Recommendations
1. Detection ≠ Quality
All models caught all vulnerabilities (100% detection rate), but quality of fixes varied by 8-18%.
Lesson: Don’t just ask “Did AI catch it?” Ask “Is the fix production-ready?”
2. Cost vs. Quality Tradeoff is Real
GPT-5: Best quality (95.4) but 49% of budget. Claude Sonnet: 92% of quality at 9% of cost.
Lesson: Define your quality threshold, then optimize for cost.
3. Human Experts ≠ AI Judges
AI judge chose GPT-5 (95.8 score). Security researcher chose o3 (90.4 score, ranked #4).
Lesson: Get human validation on critical security decisions.
4. Advanced Attacks Favor Frontier Models
Classic vulnerabilities: All models 85-96/100. Cutting-edge (agentic AI): 82-94/100 (12-point spread).
Lesson: Use GPT-5/o3 for novel threats, budget models for OWASP Top 10.
5. Model Choice Depends on Use Case
Not “which model is best?” but “best for what?” Different models excel at different domains.
Lesson: Match the model to the mission.
📋 Recommendation Matrix
For Mission-Critical Production Code → GPT-5
Cost: $0.73/eval avg, 95.4 quality
Use when: Financial systems, healthcare, authentication
Why: Most comprehensive fixes, defense-in-depth
For Everyday Security Audits → Claude Sonnet 4.5
Cost: $0.06/eval avg, 88.2 quality
Use when: Regular code reviews, PR automation
Why: 92% of GPT-5’s quality at 9% of cost
For Budget-Constrained Teams → Gemini 2.5 Pro
Cost: $0.03/eval avg, 89.2 quality
Use when: Startups, open source, high-volume scanning
Why: Cheapest option, surprisingly strong performance
For Pragmatic Fixes → OpenAI o3
Cost: $0.32/eval avg, 92.7 quality
Use when: You want simple, deployable solutions
Why: Security expert’s choice, good balance
Try It Yourself
Want to see which AI models catch vulnerabilities in your codebase?
Submit to CodeLens:
1. Paste your vulnerable code (50-500 lines)
2. Describe the security issue you’re testing
3. Get an instant comparison across 6 top models
4. Vote on which model’s fix you’d actually deploy
Free during beta • No credit card required
Conclusion
The security researcher who submitted these vulnerabilities taught us something important: detection is table stakes, but quality is what matters.
Every AI model caught every vulnerability. That’s impressive—a few years ago, this would have been impossible.
But the spread in fix quality (84-95/100) shows that not all AI security reviews are created equal. GPT-5 delivered the most comprehensive solutions. Claude Sonnet 4.5 offered 92% of the quality at 9% of the cost. And OpenAI o3 provided the pragmatic fixes that a real security engineer preferred over the AI judge’s top pick.
The takeaway? Match the model to the mission. Use frontier models for novel threats and mission-critical code. Use budget models for everyday OWASP Top 10 scans. And always get human validation on the fixes you actually deploy.
Because in security, good enough isn’t good enough.