Ask one of today’s top AI models to build you a Chrome extension, and it will. But can you trust its code to be secure? A new study from researchers at New York University, Columbia University, Monash University, and Australia’s national science agency, CSIRO, suggests that the AI-generated code has a high chance of containing significant security vulnerabilities.
The team investigated nine state-of-the-art LLMs — including advanced “reasoning” models like o3-mini and DeepSeek-R1 — by tasking them with generating Chrome extensions from 140 different functional scenarios. They found that, depending on the model, the extensions contained significant security vulnerabilities 18% to 50% of the time.
The vulnerabilities were most severe when models were asked to build tools for “Authentication & Identity” or “Cookie Management.” The models produced vulnerable code up to 83% and 78% of the time, respectively. The most common and severe flaw was “Privileged Storage Access,” where the AI-generated code improperly exposed sensitive browser data like cookies, history, and bookmarks to untrusted sources.
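To make that flaw concrete, here is a minimal sketch, not code from the study, of what a “Privileged Storage Access” issue can look like in a generated extension: a content script that relays privileged cookie data to whatever page it runs in. The file names, message types, and manifest details are illustrative assumptions.

```typescript
// content-script.ts (illustrative): runs inside pages the extension matches.
// It trusts any window.postMessage from the page, so any script on that page,
// including injected third-party code, can request the user's cookies.
window.addEventListener("message", (event) => {
  // VULNERABLE: no origin check and no validation of the requesting page.
  if (event.data?.type === "GET_COOKIES") {
    chrome.runtime.sendMessage(
      { type: "GET_COOKIES", domain: event.data.domain },
      (response) => {
        // Leaks session cookies back into the untrusted page's JavaScript context.
        window.postMessage({ type: "COOKIES_RESULT", cookies: response.cookies }, "*");
      }
    );
  }
});

// background.ts (illustrative): service worker declared with the "cookies"
// permission and broad host permissions in manifest.json.
chrome.runtime.onMessage.addListener((msg, _sender, sendResponse) => {
  if (msg?.type === "GET_COOKIES") {
    chrome.cookies.getAll({ domain: msg.domain }, (cookies) => sendResponse({ cookies }));
    return true; // keep the message channel open for the async response
  }
});
```

A safer design would avoid bridging privileged APIs to page-controlled messages at all, or at minimum validate the message origin against an allow-list and request only the narrowest cookie and host permissions the feature needs.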
Perhaps most concerning, the study found that newer, more advanced “reasoning” models often performed worse than their predecessors. Models like DeepSeek-R1 and o3-mini generated more vulnerabilities, or a higher density of them, than earlier models. Though these models represented the state of the art earlier in 2025 and have since been superseded, the pattern suggests that an AI’s stronger coding skills may not yet translate into better security awareness.
The findings, the authors write, highlight a “critical gap between LLMs’ coding skills and their ability to write secure framework-constrained programs.”
The Productivity Paradox
The findings from this study add to a growing body of evidence about a wider problem with AI-powered coding. While it may speed up the generation of code, it can create slowdowns elsewhere that diminish productivity or even produce a net negative impact. If you write more code, and that code is also lower quality, bottlenecks intensify further down the workflow in code review, testing, and rework.
This “productivity paradox” was the specific focus of another recent study, published in October by researchers at Tilburg University, which analyzed developer activity in open-source software (OSS) projects after the introduction of an AI coding assistant. The researchers discovered a two-tiered system in which productivity gains for some came at a direct cost to others.
Their study found that the overall productivity boost was “primarily driven by less-experienced (peripheral) developers.” These junior contributors, empowered by AI, began shipping far more code: the least experienced developers (those in the bottom 25th percentile) increased their code submissions by 43.5%.
But this surge in volume came with a significant cost: the AI-assisted code required more rework, and this new “rework burden” landed squarely on the shoulders of the most experienced (core) developers. The study found a dramatic redistribution of effort: senior developers’ own original code contributions dropped by 19% as their time was reallocated to managing the flood of lower-quality contributions, and their code review workload increased by 6.5%.
The Tilburg researchers cautioned that these “productivity gains of AI may mask the growing burden of maintenance on a shrinking pool of experts”, a pool that was already described as “overworked” and “under-incentivized” before AI was added to the mix.
Taken together, these studies reveal how AI-generated code can create unanticipated bottlenecks that ultimately negate its overall benefit.
The Comprehension Collapse
While engineers are on the front lines of these new productivity bottlenecks, they are by no means the only ones affected.
A second-order consequence of AI-accelerated code generation is the collapse of organizational comprehension. If senior developers are struggling to keep up, their managers and executives are often completely in the dark.
A developer-level bottleneck can metastasize, because engineers aren’t the only ones who need to understand the codebase. Non-technical leaders — from product managers to department heads and C-suite executives — need to know how their products are progressing. Unlike developers, they cannot seek answers directly from the code, and need to turn to engineering teams for answers.
This creates a vicious cycle. As AI ships code faster, the understanding gap for these non-technical leaders widens, forcing them to request more updates. But those requests land on the same senior developers who are now increasingly burdened with reviewing and reworking that new wave of AI-generated code. The organization is left with two bad options: leaders either harass their engineers, deepening the workload, or they simply give up trying to understand what’s going on.
This organizational “game of telephone” was a severe problem even before the AI boom. Kayvon Beykpour, the former Head of Product at Twitter, for instance, recently recounted this exact problem from his time at the company, calling the task of understanding what his 3,000 engineers were working on “one of the most annoying but important parts” of his job.
He described how asking for a status update would cascade down layers of management. By the time an answer returned, it was “so whittled down” and “sugarcoated” that it was impossible for leaders to know what was actually being built or how products were progressing.
Now, with AI quadrupling code output for some teams, this long-standing organizational blind spot is becoming an existential crisis. If leaders couldn’t get a clear answer from 3,000 engineers, how can they possibly understand the output of 3,000 engineers plus thousands of AI co-pilots all shipping code simultaneously?
Together, these findings point to a multifaceted set of challenges with AI-generated code that span security, review, and organizational clarity.
New Solutions Are Needed
To unlock AI’s true potential, new tools and approaches are needed to manage its first-order and second-order consequences. As the volume of AI-generated code grows, organizations need new ways to maintain security, visibility, and control.
One interesting early example is Macroscope, an AI tool designed to aid teams with code review and codebase comprehension. In a recent benchmark of real-world production bugs, Macroscope achieved the highest detection rate (48%) among AI review tools while generating 75% fewer comments than the next most accurate tool. That combination helps teams catch the kinds of security and quality issues these studies highlight without overwhelming developers.
It also offers dashboards and natural language summaries, designed to give the entire organization — from technical leads to non-technical stakeholders — a clear understanding of how their codebase is changing.
With the volume of AI-generated code on track to surpass one trillion lines per year, tools that target bottlenecks elsewhere in the development workflow appear poised to take on increasing importance.
The researchers of the Chrome study conclude that “human oversight remains essential for security assurance.” But human oversight is already at its breaking point. Without new tools to manage the chaos, the productivity boom promised by AI may collapse under the weight of its own unreviewed, insecure, and incomprehensible code.
References and footnotes
Liu, Y., Xing, Z., Pan, S., & Tantithamthavorn, C. (2025). When AI Takes the Wheel: Security analysis of framework-constrained program generation. arXiv. https://arxiv.org/abs/2510.16823
Xu, F., Medappa, P. K., Tunc, M. M., Vroegindeweij, M., & Fransoo, J. C. (2025). AI-assisted Programming May Decrease the Productivity of Experienced Developers by Increasing Maintenance Burden. arXiv. https://arxiv.org/abs/2510.10165
Disclosure: the author of this article has a professional relationship with Macroscope, the tool referenced in the above post, but the research discussed is 100% independent.