Part 2: What the research reveals
In Part 1, I described what 1500 hours of AI-assisted development taught me: LLMs write code that compiles, passes tests, and works for users, but doesn’t fit. The pattern has a name: architectural drift. I built a tool to measure and prevent it. I ran benchmarks that showed the gap between “working code” and “good code” was larger than I expected.
But I wanted to know: was my experience typical?
So I dug into the research. The pattern was clearer than I expected.
The Problem at Scale
In Part 1, I measured two things traditional metrics miss: architectural drift (code that works but doesn’t fit) and silent bugs (violations that compile, pass tests, and clear review). These became my proxy for production risk: the gap between “code that runs” and “code that belongs.”
The research measures the same gap at organizational scale. Developers report feeling 20–30% faster with AI tools. Yet delivery stability drops, complexity rises, and technical debt compounds. The 2024 DORA report found that a 25% increase in AI adoption correlates with a 7.2% decrease in stability; correlation, not proof of causation, but a pattern worth noticing. The causal evidence is stronger elsewhere: a Carnegie Mellon study used difference-in-differences analysis across 807 repositories after Cursor adoption, finding a 3–5× spike in output during month one, followed by a 30% increase in static analysis warnings and 41% increase in complexity. A METR randomized controlled trial found developers using AI took 19% longer on real tasks despite believing they were faster.
The tools aren’t broken. The feedback loops are.
This is what it looks like when you measure individual velocity instead of system health. The common thread? Coding agents know what’s possible, not what’s right.
The codebase doesn’t drift all at once. It drifts one “working” commit at a time.
ArchCodex is a proof of concept: can we help coding agents get the right context when they need it? The approach combines hints, verifiable rules, and tools that check whether the agent (and the codebase) follows those rules, paired with some prompting techniques.
The Approach
The obvious response to “missing context” is to give the LLM more context. Larger context windows help, but only so much: even if you fit your entire codebase into a 1M token window, the bottleneck is what kind of context you provide, and whether it surfaces at the right time.
RAG is getting better and injects documentation at query time. This helps with API signatures and usage examples. It’s less effective for architectural boundaries, team conventions, and security patterns: the stuff that lives in people’s heads and Slack threads, not docs. And because RAG retrieves from actual code, it can reintroduce old patterns or copy wrong ones. Research on agile teams found that significant portions of code commits result in undocumented knowledge (Saito et al.). You can’t retrieve what was never written down.
There’s active research on structured RAG, graph-based retrieval, and hybrid approaches that blur these lines. What I’m describing isn’t a different category; it is retrieval that’s structured around architectural concepts, scoped to what’s relevant, and enforced rather than suggested. Think of it as architectural metadata, a machine-readable version of the mental model a developer has.
Why Not Just More Context?
RAG retrieves what exists in the codebase, which may include drifted patterns. Fine-tuning bakes in patterns at training time, which can’t adapt to architectural decisions made yesterday.
Architecture-as-code operates differently:
The registry can be updated after a single incident, immediately affecting every subsequent generation. It can be applied to existing code to surface violations. And when drift does happen, because it will, the health dashboard makes it visible.
The Four Layers
The approach has four layers:
- **Boundaries:** Tell the LLM what this file is allowed to touch. Import restrictions, layer violations, forbidden dependencies. Example: “Cannot import `express` into the domain layer.” These prevent drift before it starts.
- **Constraints:** Encode rules that should rarely be broken. “Always call `requireProjectPermission()` before database access.” “Never import infrastructure into domain.” These catch silent bugs before they ship.
- **Examples:** Surface canonical implementations at the right moment. “See `UserService.ts` for the pattern.” “Use the event system, not direct calls.” These guide the LLM toward consistency without requiring it to infer patterns from scattered examples.
- **Validation:** Catch what slipped through. Single-file checks before commit. Cross-file analysis for layer violations. Health metrics that surface drift over time.
The key insight: these aren’t documentation. They’re structured context that surfaces when relevant and can be enforced when violated. The difference between a constraint and a wiki page is that the machine reads the constraint automatically and blocks the PR if it’s violated. Documentation gets ignored. Constraints get executed.
ArchCodex is one implementation of this approach. It’s not the only way to solve this, and it’s not a silver bullet. But it let me test whether structured guardrails could address the gaps the research identifies. The results from part 1 suggest they can.
Here’s how it works in practice.
How ArchCodex Works
The core mechanism is simple: you tag source files with an @arch annotation, and ArchCodex injects the relevant constraints when an agent reads the file.
The @arch tag is just a comment. In TypeScript: /** @arch domain.payment.processor */. In Python: # @arch domain.payment.processor. That’s it. ArchCodex scans for these tags and does the rest.
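For illustration, here’s what a tagged TypeScript file might look like. The class and its body are placeholders I made up; only the @arch comment is ArchCodex syntax.

```
/** @arch domain.payment.processor */
// Hypothetical file contents: ArchCodex only cares about the tag above and the imports.
export class PaymentProcessor {
  process(amountInCents: number): boolean {
    return amountInCents > 0;
  }
}
```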
Boundaries Surface Before Generation
When an LLM agent reads a file through ArchCodex, via archcodex read --format ai or the MCP server integration, the tool looks up the file’s @arch tag, resolves the full inheritance chain, and prepends a structured header with all applicable constraints, hints, and boundaries. The agent sees this header before it sees the code. Without ArchCodex, the agent would just see raw source.
Here’s what that looks like:
IMPORT BOUNDARIES
Can import:
  ✓ src/domain/payments/*
  ✓ src/domain/shared/*
  ✓ src/utils/*
Cannot import:
  ✗ src/api/*              (layer violation)
  ✗ src/infra/*            (domain must be infra-agnostic)
  ✗ express, fastify, pg   (forbidden frameworks)
A “layer” here is a logical grouping you define, typically mapping to architectural boundaries like domain, api, infrastructure, or utils. You configure which directories belong to which layer and which layers can import from which. The domain layer shouldn’t import from api; api shouldn’t import from cli. These aren’t folder names, they’re conceptual boundaries that ArchCodex enforces.
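Here’s a rough sketch of what those boundaries mean at the source level. The module names are invented; the allowed and forbidden paths mirror the boundary block above.

```
// Hypothetical sketch of src/domain/payments/PaymentService.ts

// ✗ Would be flagged: the domain layer importing from the api layer
// import { PaymentController } from '../../api/PaymentController';

// ✗ Would be flagged: a forbidden framework inside the domain layer
// import express from 'express';

// ✓ Allowed: stays inside src/domain/payments/* and src/domain/shared/*
import { PaymentRepository } from './PaymentRepository';
import { Money } from '../shared/Money';

export class PaymentService {
  constructor(private readonly payments: PaymentRepository) {}

  sum(amounts: Money[]): number {
    return amounts.reduce((total, amount) => total + amount.cents, 0);
  }
}
```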
The LLM knows what’s allowed before it writes a single line. The “missing context” problem, cited by 65% of developers as their top issue, gets addressed at the source.
In Part 1, I showed Opus 4.5 producing the smallest diff with correct logic, and still ranking 6th because of architectural drift. With boundaries explicit, the drift doesn’t happen in the first place.
Constraints Encode Conventions
The registry captures what usually lives in people’s heads:
myapp.domain.service:
  constraints:
    - rule: forbid_import
      value: [express, fastify, pg]
      severity: error
      why: Domain must be framework-agnostic
      alternative: Inject dependencies via constructor
    - rule: require_call_before
      call: [requireProjectPermission, checkOwnership]
      before: ["repository.*", "ctx.db.*"]
      severity: error
      why: Verify permissions before database access
  hints:
    - Use requireProjectPermission() for ownership checks
    - See src/domain/user/UserService.ts for the pattern
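To make that concrete, here’s a hedged sketch of service code written to satisfy both constraints. requireProjectPermission comes from the registry above; the import path, ProjectRepository, and UpdateProjectInput are invented for illustration.

```
/** @arch myapp.domain.service */
// Hypothetical sketch: shaped to pass forbid_import and require_call_before.
import { requireProjectPermission } from '../shared/permissions';

interface UpdateProjectInput { name: string }
interface ProjectRepository {
  save(projectId: string, input: UpdateProjectInput): Promise<void>;
}

export class ProjectService {
  // forbid_import: no express/fastify/pg here; dependencies arrive via the constructor
  constructor(private readonly repository: ProjectRepository) {}

  async rename(userId: string, projectId: string, input: UpdateProjectInput): Promise<void> {
    // require_call_before: the permission check precedes any repository.* call
    await requireProjectPermission(userId, projectId);
    await this.repository.save(projectId, input);
  }
}
```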
The Registry Isn’t a Code Map
The registry doesn’t mirror your folder structure. domain.payment.processor doesn’t imply a domain/payment/processor.ts file path—it’s a conceptual hierarchy for inheriting rules.
When domain.payment inherits from domain, it means: "payment code follows all domain constraints, plus these extras." The inheritance is about rules, not code. Your file can live at src/billing/StripeProcessor.ts and still be tagged @arch domain.payment.processor.
This has a practical implication: registries are portable. You could create a “Next.js + Convex” registry encoding your team’s patterns, then reuse it across projects. The architectural knowledge isn’t locked to one codebase.
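As a quick illustration (file path and class invented), a Stripe-specific processor can live outside any domain/ folder and still opt into the domain.payment rules:

```
// src/billing/StripeProcessor.ts
/** @arch domain.payment.processor */
// Hypothetical file: the path doesn't mirror the registry hierarchy, but the tag
// still pulls in every constraint inherited from domain and domain.payment.
export class StripeProcessor {
  charge(amountInCents: number): boolean {
    return amountInCents > 0;
  }
}
```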
Canonical Implementations Counter the Xerox Effect
Without guidance, coding agents copy from whatever appeared recently in context, which might itself be a copy of a copy, each iteration drifting further from the original intent. Call it the xerox effect: each copy degrades.
A canonical implementation is a file you designate as “the authoritative way to do X.” Add it to the pattern registry, and ArchCodex surfaces it in hints and error messages. Instead of the agent copying the most recent (possibly drifted) example, it sees: “Use src/domain/user/UserService.ts as your reference."
One authoritative example prevents the drift that compounds through successive copies.
The GPT 5.1 Problem
Remember the GPT 5.1 result from Part 1? It produced working code with zero critical bugs — and still ranked dead last in my benchmark, because it didn’t use requireProjectPermission(). It did manual ownership checks instead. The code worked. It didn’t belong.
The require_call_before constraint prevents exactly this class of silent bug. The pattern is now explicit, not buried in tribal knowledge.
This isn’t theoretical. Before ArchCodex, my project NimoNova had files that bypassed sanitizeLLMInput() entirely, passing raw content to the model. The code compiled. It worked in testing. In production, it would have been a prompt injection vector. A constraint on LLM-facing modules now catches this automatically.
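For a sense of what the guarded call path looks like, here’s a minimal sketch: sanitizeLLMInput is the helper mentioned above, while the import paths and callLLM are invented for illustration.

```
// Hypothetical sketch of an LLM-facing module.
import { sanitizeLLMInput } from '../shared/sanitizeLLMInput';
import { callLLM } from './llmClient';

export async function summarize(userContent: string): Promise<string> {
  // The constraint requires sanitization before any user content reaches the model,
  // closing the prompt-injection path described above.
  const safeContent = sanitizeLLMInput(userContent);
  return callLLM(`Summarize the following:\n${safeContent}`);
}
```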
Validation Catches What Slipped Through
Even with good context, mistakes happen. Validation operates at two levels:
Single-file checks catch constraint violations on changed code:
src/domain/payments/PaymentService.ts
  ✗ ERROR: forbid_import violated
    Line 3: import { Request } from 'express'
    Why: Domain must be framework-agnostic
  ⚠ WARNING: require_call_before not satisfied
    repository.save() called without prior requireProjectPermission()
Errors don’t just say “no”; they say what to do instead. Each violation includes a suggestion and, where relevant, a did_you_mean field with concrete fix guidance:
FAIL: src/core/health/analyzer.ts
  forbid_import: chalk
  → Use: src/utils/logger.ts (LoggerService)
  Did you mean: import { logger } from '../../utils/logger.js'
This comes from the constraint definition in the registry. The agent doesn’t have to search for the right alternative; it’s handed one.
Cross-file checks catch systemic issues and check the complete project after architecture updates:
archcodex check --project
Layer violations: 3
Circular dependencies: 2
Missing canonical patterns: 7
The Feedback Loop
In Part 1, I showed how Haiku 4.5 improved as the registry evolved. The same pattern held when I measured silent bugs specifically:
Each iteration of the registry, each constraint added from observing mistakes, made the next run better.
The registry improves through use. I sometimes use these five questions to surface improvements:
- What information did ArchCodex provide that helped?
- What information was missing?
- What was irrelevant or noisy?
- Did you update any architecture definitions?
- For the next developer, what will ArchCodex help with?
Improvements come from real needs: changing, tightening, or updating the architecture; introducing new patterns, new utilities, and new ways of doing things; capturing common bugs and errors. The registry is a living document. It helps engineers too, not just coding agents. It’s architectural governance, or mentorship at scale.
When Constraints Aren’t Enough
A fair criticism: doesn’t this just create rigidity? Codebases evolve. Good architects and engineers make context-dependent trade-offs.
ArchCodex isn’t only constraints. The registry has three layers of flexibility, plus a composition mechanism:
Hard constraints are rules that should rarely be broken. Import boundaries, security patterns, layer violations. These catch the mistakes that compound silently.
Hints are soft guidance. “Prefer X over Y.” “See this file for the pattern.” The coding agent sees them, weighs them, and makes a judgment call. No error if it chooses differently.
Intents declare known patterns that satisfy constraints in non-obvious ways. For example, your codebase might have a rule: “All database queries must filter soft-deleted records.” But what about queries that intentionally need deleted records — like a trash view or audit log? An @intent:includes-deleted annotation tells ArchCodex this query intentionally skips the filter, and satisfies the constraint that would otherwise require it. An @intent:cli-output exempts a file from the "no console.log" rule. Intents are decisions, not exceptions. They document valid alternative patterns.
Mixins are reusable constraint bundles. Instead of repeating “must have test file” and “max 300 lines” across ten architectures, you define a tested mixin once and compose it in the registry: mixins: [tested, srp]. You can also apply mixins per-file using inline syntax: @arch domain.payment.processor +singleton +pure. Mixins keep the registry DRY while allowing file-level flexibility.
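Here’s a hedged sketch of how the inline forms compose in a single file, using the tested and pure mixins and the includes-deleted intent from above; the record type and query are invented.

```
/** @arch domain.payment.processor +tested +pure */
// Hypothetical file: the inline mixins pull in the `tested` and `pure` rule bundles.

interface PaymentRecord { id: string; deletedAt: Date | null }

// @intent:includes-deleted
// Deliberately returns soft-deleted records (e.g. for a trash view); the intent
// annotation satisfies the soft-delete constraint instead of violating it.
export function listTrashedPayments(records: PaymentRecord[]): PaymentRecord[] {
  return records.filter((record) => record.deletedAt !== null);
}
```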
And when you encounter an unanticipated exception, the override system makes it explicit:
// @override forbid_import:pg
// reason: Legacy migration script, will be removed by Q2
// expires: 2025-06-01
import { Client } from 'pg';
The violation is acknowledged, documented, and time-boxed. Teams can track how much architectural debt they’re carrying and whether it’s growing or shrinking.
The goal isn’t to prevent all deviation. It’s to make deviation visible. When a coding agent breaks a pattern, you want to know whether it’s drift (bad) or evolution (good).
Ongoing Health and Keeping the Registry Up to Date
Codebases drift over time. The CMU study showed complexity accumulating even as velocity gains faded. ArchCodex surfaces this before it compounds.
Even with hints and constraints, coding agents still tend to “forget” or say things like “for the sake of time, let me do this quickly”, resulting in code duplication, violations, and other drift. Three commands address this:
**archcodex check** - Linter-like validation for architecture. Run on save, commit, or CI. Catches constraint violations, layer boundary crossings, and forbidden patterns. With --project, it also detects circular dependencies.
**archcodex health** - Dashboard for architectural debt. Shows:
- Override debt: How many overrides exist, which are expiring, which have expired
- Coverage: What percentage of files have @arch tags
- Registry bloat: Architectures used by only one file, similar sibling architectures that could be consolidated
- Type duplicates: Identical or near-identical type definitions across files
- Recommendations: Actionable suggestions (e.g., “run archcodex audit --expired”)
**archcodex garden** - Index maintenance and pattern detection. Finds naming conventions that aren’t yet captured in the registry, inconsistent @arch usage, and missing keywords for discovery.
The goal isn’t perfection. It’s visibility. You can’t fix drift you can’t see.
What This Doesn’t Solve
You don’t need a perfect registry on Day 1. A common question: “For a brownfield project with 500k lines of code, how do I start?” Start with one architecture definition for your most critical layer. Add constraints as violations surface. The registry grows from real issues, not from trying to document everything upfront. An empty registry doesn’t break anything — it just means you’re not getting guardrails yet.
ArchCodex doesn’t replace security scanners. It catches architectural security issues (missing permission checks, layer violations) but not injection vulnerabilities or cryptographic weaknesses.
It doesn’t automatically refactor code. It surfaces problems. You fix them. Or the coding agent fixes them, with the constraints now visible.
It requires investment. You write the registry. The LLM helps, and it grows from real issues rather than from scratch. It’s not zero-effort but it might save time.
It doesn’t work magic on terrible codebases. If your architecture is genuinely confused, ArchCodex will show you the mess. It won’t clean it up for you. But it can guide refactoring.
The debugging overhead is real: 67% of developers spend more time debugging AI-generated code than before (Harness). The security remediation gap is worse: only 21% of serious AI/LLM vulnerabilities are ever fixed (Cobalt).
ArchCodex doesn’t eliminate these problems. It addresses their root cause: AI generating code without knowing the rules.
The Bigger Picture
The research is clear: AI is making developers faster at writing code that’s harder to maintain. Individual velocity is up; system health is down.
I don’t think ArchCodex is the only answer. But I think it points toward an answer: coding agents need structured context that surfaces at the right time. They need to know what’s forbidden, not just what’s possible. And the teams that figure out how to capture senior expertise and make it executable, through constraints, through guardrails, through whatever comes next, will ship faster and more reliably.
The table saw metaphor from Part 1 still holds. The saw isn’t the problem. The missing jig is.
ArchCodex is open source. It’s one implementation of these ideas, not the definitive one. If you want to test the approach on your own codebase, or if you find gaps, I’d like to hear about it.
GitHub: github.com/ArchCodexOrg/archcodex
References
- Google DORA, “Accelerate State of DevOps Report 2024” (Oct 2024)
- Google DORA, “State of AI-assisted Software Development 2025” (Sept 2025)
- He et al., “Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects,” Carnegie Mellon University (Nov 2025)
- GitClear, “AI Copilot Code Quality 2025” (Feb 2025)
- CodeRabbit, “State of AI vs Human Code Generation Report” (Dec 2025)
- Qodo, “State of AI Code Quality in 2025” (June 2025)
- Veracode, “2025 GenAI Code Security Report” (Aug 2025)
- METR, “Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity” (July 2025)
- Cobalt, “State of Pentesting Report 2025” (Oct 2025)
- Harness, “State of Software Delivery Report 2025” (Jan 2025)
- Saito et al., “Discovering undocumented knowledge through visualization of agile software development activities,” Requirements Engineering (2018)
This is Part 2 of a series on AI-assisted development. Part 1 covered the benchmarks and why I built ArchCodex.