To help a team member get up to speed on a project, I had to learn and then document how to set up a Mac environment with both Node.js and the .NET runtime. I had never used .NET on a Mac, so the first customer for this piece of documentation was me. Naturally, I tapped my team of AI assistants who collectively hold a lot of knowledge about the topic. They wrote instructions, I followed along and reported problems, and we iterated toward the solution.
Then the penny dropped: These AI assistants can not only help write the instructions, but they can also read them and help me reproduce them. I’ve decided to call this the flywheel effect. It’s not automatic; I’ve yet to have the kind of hands-off experience that others report with AI, but that’s not my goal. I don’t want to be out of the loop; I want to be in it efficiently: Start the flywheel spinning, then tap it strategically to build momentum.
The Role of an MCP Server in the AI Workflow
A key enabler for this scenario was a filesystem MCP server that lets agents like Claude and Cursor read and write files. Anthropic’s reference implementation granted the access required to read and write the evolving document. It did not grant access to run the necessary system commands, so I was firmly in the loop: Copy/paste the commands they suggested, run them, copy/paste the output, and discuss next steps.
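For reference, wiring up that reference implementation is a small block of JSON in the client’s MCP configuration. This is a representative sketch rather than my exact file: the package invocation is the standard one for Anthropic’s filesystem server, and the directory argument is a placeholder for wherever your docs live.

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": [
        "-y",
        "@modelcontextprotocol/server-filesystem",
        "/absolute/path/to/your/docs"
      ]
    }
  }
}
```

The directory argument scopes what the server can touch, which is how the assistants could read and revise the evolving doc while running commands stayed with me.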
This worked beautifully, modulo the ongoing struggle to manage MCP configuration across a team of assistants. Each has its own configuration file, and although the MCP protocol itself is standard, the locations and formats of these config files are not. In How LLMs Guide Us to a Happy Path for Configuration and Coding, I observed that configuration is the new hard problem — one that eclipses cache invalidation, naming, and off-by-one errors. You can enlist AI assistants to debug their own configurations, but I wish people who run our MCP server didn’t have to; it’s a buzzkill. Is there a better way to handle this? If so, please let me know; I’m all ears.
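To make the annoyance concrete: the entry itself is nearly identical from client to client; what changes is the file it lives in and, sometimes, details of the surrounding format. So the same handful of lines gets re-declared for each assistant, in whatever file that assistant expects (file names and locations vary by client and version; the path below is again a placeholder):

```json
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/absolute/path/to/your/docs"]
    }
  }
}
```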
You can also do this kind of thing in a more direct way using Claude Code or Codex. To test that approach, I nuked the installation and asked Claude Code to read the instructions, follow the steps, run all the necessary commands with my permission, evaluate outputs, and produce a final report. Everything got installed, the backend server started, and the frontend app ran successfully. Here’s the report.
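If you want to try the same experiment, the ask doesn’t need to be elaborate. Something along these lines captures the shape of it (illustrative wording, not a transcript, and SETUP.md stands in for whatever your instructions file is called):

```
Read SETUP.md and follow its steps in order. Ask me before running each
command, capture the output, note anywhere reality diverges from the doc,
and finish with a short report on what worked and what needed fixing.
```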
We’ve long imagined documentation as a first-class software engineering discipline, but it hasn’t been clear exactly what that would mean. Now the picture is coming into focus. AI assistants can help us not just create documentation, but also test it — just as we test our code. If you’ve ever struggled to write reproducible docs, or been frustrated by installation instructions that don’t work as described, you’ll appreciate the power of this flywheel effect.
Iterating on an MCP Server With AI Feedback
When I used Claude to help build the first version of the XMLUI MCP server, I was amazed to find that since Claude was also a client of that server, I could ask it to reflect on the responses it got from the tools provided by the MCP server and then adjust the server code to improve those responses. A major priority was to anchor agents to ground truth, so we arranged for all responses to include dire warnings: invent no syntax, use and recommend only techniques backed by docs that include working examples, always cite the URLs of those docs.
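The mechanics are simple to sketch. This isn’t the actual xmlui-mcp code, and the real server may be organized (and even written) differently, but the pattern is just to append the same grounding rules to every tool response before it goes back to the agent:

```typescript
// Sketch of the grounding pattern described above, not the real xmlui-mcp code.
// Field and function names here are illustrative.
const GROUNDING_RULES = [
  "Invent no syntax.",
  "Use and recommend only techniques backed by docs with working examples.",
  "Always cite the URLs of those docs.",
];

interface ToolResult {
  text: string;
  documentationUrls: string[]; // URLs of the docs the result was drawn from
}

function withGrounding(result: ToolResult): string {
  const sources = result.documentationUrls.length
    ? "Sources:\n" + result.documentationUrls.join("\n")
    : "";
  return [result.text, sources, GROUNDING_RULES.join("\n")]
    .filter(Boolean)
    .join("\n\n");
}
```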
With that guidance, coding agents behave better than they did before, but they often still ignore the guidance and require interactive reminders to follow it. Yelling louder at them won’t help. The root of the problem is that an MCP server has no independent agency. It can try to influence how an agent selects and uses its tools, but ultimately has no control over that selection and use. My hunch is that the MCP server needs to become more agent-like, so it can work with coding assistants’ agents on a more equal footing. An agent-to-agent architecture may lie in the future.
Meanwhile, when I recruited my team of assistants to evaluate the MCP server, they all agreed that the verbose guidance was wasting tokens. Sample feedback: “The ‘MANDATORY’ guidance block is repetitive and uses ~2k tokens per failure.” A related issue was that search results could be too verbose and were also chewing into the token budget.
I picked Kiro to streamline the responses, then rebuilt the server and asked it to re-evaluate the sample queries that had produced unwieldy responses. Here’s what Kiro saw.
- Failure messages are now concise: Only three bullet points instead of 20+ “MANDATORY” warnings (saves ~1.5k tokens per failure)
- Result truncation working: Successful searches show 20 results with “… X more results omitted …” message (cleaner than the old “use JSON for full list”)
- Snippet truncation: Long lines are truncated at 200 chars with “…”
- All tools affected: The improvements apply to xmlui_search, xmlui_search_howto, and xmlui_examples
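As a rough sketch of what that result truncation can look like on the server side (again not the real implementation; the limits are just the ones reported above):

```typescript
// Illustrative sketch of the truncation behavior Kiro reported: cap results
// at 20, trim long snippet lines to 200 characters, and say how many results
// were omitted. Not the actual xmlui-mcp code.
const MAX_RESULTS = 20;
const MAX_LINE_CHARS = 200;

function formatSearchResults(lines: string[]): string {
  const shown = lines.slice(0, MAX_RESULTS).map((line) =>
    line.length > MAX_LINE_CHARS ? line.slice(0, MAX_LINE_CHARS) + "…" : line
  );
  const omitted = lines.length - Math.min(lines.length, MAX_RESULTS);
  if (omitted > 0) {
    shown.push(`… ${omitted} more results omitted …`);
  }
  return shown.join("\n");
}
```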
Because this is a team effort, I then asked Cursor to do the same evaluation. Here’s what it noticed:
Issue: When there are no results, the code always includes base rule reminders that aren’t useful:
- “Cite sources with file paths and URLs” (no sources to cite)
- “Provide URLs from documentation_urls when available” (no URLs available)
These reminders are useful for successful searches, not failures.
Good point! I relayed the feedback to Kiro, it made the implied changes, and I verified them with a round of feedback from the whole team.
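The implied change is small: attach the base rule reminders only when a search actually returns something to cite. Sketched here, not quoted from the real code:

```typescript
// Sketch of the fix Cursor suggested: reminders about citing sources only
// make sense when there are results whose sources can be cited.
const BASE_RULE_REMINDERS = [
  "Cite sources with file paths and URLs",
  "Provide URLs from documentation_urls when available",
];

function finalizeResponse(body: string, resultCount: number): string {
  if (resultCount === 0) {
    return body; // keep failures short: nothing to cite, no reminders
  }
  return body + "\n\n" + BASE_RULE_REMINDERS.join("\n");
}
```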
The Human Dev’s Role in the Virtuous Cycle
While I suspect that the nascent agent-to-agent protocol will enable this kind of thing to happen more autonomously, I’m happy to be the coordinator and I don’t think I’d ever want to fully abandon that role.
I’m reminded of the old adage about building a plane while you are flying it. In this case, weirdly and remarkably, the pilot who senses problems is also the mechanic who fixes them. Who am I in this scenario? To torture the metaphor, I guess I am the manager of the airline who sets goals, builds teams, starts the flywheel spinning, and taps it at the right times and in the right ways to accelerate a virtuous cycle of improvement.