From 2 weeks to 2 hours — cutting integration test time using Claude Code Subagents
Improving velocity and code quality is a standing priority. For a fast-moving company like Airwallex, robust integration testing is essential — but keeping a large suite accurate and current is notoriously difficult. During the build of Airwallex Billing, we rethought the problem from the ground up with Claude Code subagents. Meet Airtest: an AI-generated, self-healing test platform that’s already delivered 4,000+ integration tests and helped us launch 50 APIs safely.
Photo by Şahin Doğdu: pexels.com
The Mess — Why We Built This
While we enforced a strict 90% unit test coverage threshold, our approach to integration testing was less consistent and harder to measure. This created a major blind spot. Writing effective integration tests is difficult because they are:
- Complex to Set Up: They require orchestrating entire environments with live databases, services, and third-party APIs.
- Stateful and Brittle: Each test needs specific data to be created and then perfectly cleaned up, making them hard to isolate and easy to break.
- Slow and Flaky: They run far slower than unit tests and often fail due to transient issues like network timeouts, which erodes trust in the results.
- A Nightmare to Maintain: A small change in one service can cause a cascade of failures across many tests, making them costly to keep up-to-date.
We needed a solution that could write integration tests fast enough to keep up with our pace of development and, more importantly, maintain them without constant human intervention.
The Context Gap: Why Generic AI Tools Fall Short for Integration Testing
General-purpose coding assistants are impressive, but we quickly discovered their limitations when applied to complex, real-world testing. The bottleneck isn’t their coding ability, but a critical context gap. Here’s where they fall short:
- Limited System Understanding: Generic tools like Cursor and Claude rely on manual prompting, and documentation such as PRDs quickly goes stale. We found that the codebase itself is the best source of truth, so we built Airtest to analyze business requirements directly from the code. This avoids reliance on outdated documentation or hand-supplied context and keeps tests aligned with the actual implementation.
- No Knowledge of Internal Frameworks: These AIs are unfamiliar with proprietary infrastructure such as our internal testing framework. They don’t know our custom utilities or conventions, so their output demands heavy manual refactoring.
- Inconsistent Methodology: When every developer brings their own prompts, you end up with scattered, inconsistent tests — far from a unified, robust test suite.
That’s why we built Airtest: to combine domain context, framework compatibility, and consistent methodology into a single, AI-steered system that transforms general-purpose models into reliable engineering tools.
Building Airtest: A Multi-Agent Crew for Smarter Software Testing
Our new automated testing framework is powered by a multi-agent system built on three core components: a team of specialized AI agents, a suite of developer tools they can wield, and a persistent knowledge base for context.
What is a “Claude Code Subagent”?
A Claude Code subagent is a specialized AI assistant designed for specific tasks, equipped with its own tools, context window, and custom system prompt. It allows Claude Code to delegate task-specific work efficiently, improving focus and context management.
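For readers who haven’t used the feature, a subagent is defined as a Markdown file with YAML frontmatter under `.claude/agents/` in the repository (or `~/.claude/agents/` for personal agents); the body of the file becomes the agent’s system prompt. The sketch below is purely illustrative rather than Airtest’s actual configuration:

```markdown
---
# .claude/agents/happy-path-test-generator.md (hypothetical agent)
name: happy-path-test-generator
description: Generates integration tests for the expected, successful flows of an API.
---

You are an integration testing specialist. Given an API endpoint, read the
relevant controllers and services, identify the primary success scenario, and
write an integration test that follows the project's existing test
conventions. Run the test and report the result.
```

Because each subagent gets its own context window, a coordinating agent can fan work out to many such specialists without blowing up the main conversation’s context.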
The Agent Crew
The entire process is initiated by a `/airtest` slash command given to Claude Code, which acts as the General Agent. It analyzes the main task, coordinates the workflow, and delegates responsibilities to a team of specialized agents.
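Claude Code lets teams define custom slash commands as Markdown prompt files under `.claude/commands/`, where the file name becomes the command. As a rough sketch of how such an entry point could be wired up (the wording below is illustrative, not the real `/airtest` command):

```markdown
<!-- .claude/commands/airtest.md (hypothetical content) -->
Generate and maintain integration tests for: $ARGUMENTS

1. Ask the existing-tests-analysis agent to map current coverage and gaps.
2. Delegate generation to the happy-path, unhappy-path, state-transition,
   dependency, and end-to-end specialists.
3. Have the test-reviewer agent assess the generated tests.
4. Run the suite and hand any failures to the test-debugging agent.
```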
This team is designed for comprehensive test coverage and maintenance:
- Test Generation Specialists: A suite of agents dedicated to generating a wide variety of test cases: Happy path (expected functionality), Unhappy path (error handling), State Transition, Dependency, and End-to-end flow. This division of labor ensures the codebase is scrutinized from multiple angles, covering a vast range of potential scenarios.
- Analysis and Maintenance Crew: Beyond generation, other agents handle the critical tasks of review and maintenance. The Test Reviewer Agent assesses the quality of generated tests, and the Test Debugging Agent identifies and helps fix failing tests. The Existing Tests Analysis Agent uses Code Search and Read tools to gain deep insight into your codebase’s architecture and conventions while analyzing your current test coverage to identify gaps. This allows it to generate targeted tests that add value without creating redundancy. The new tests are idiomatic and seamlessly integrate into your project, requiring minimal rework.
- Agent Prompts: Clear instructions, best practices (e.g. Equivalence Class Partitioning and Boundary Value Analysis), and guiding principles that shape how the agents generate maintainable and comprehensive test cases. The prompts also include hands-on recipes for effective testing of RESTful APIs, covering common topics such as idempotency, authentication, authorization, pagination, filtering, and error codes. Beyond these common recipes, the system also analyzes your code, understands the actual business flows behind the endpoints, and writes integration tests that reflect real-world scenarios, covering both technical correctness and domain-specific behavior.
The Toolkit
A key element that makes this system so effective is its agents’ tool usage. These AI agents aren’t just thinking in a vacuum; they have hands-on access to a developer’s essential toolkit. They can use Code Search to navigate the codebase, a Text Editor to write and modify files, Read tools to pull in documents for context, and Bash to run commands. This ability to interact directly with the development environment allows the multi-agent system to autonomously write, execute, and debug tests in a seamless, powerful loop.
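In Claude Code terms, that toolkit corresponds roughly to the built-in tools an agent can be granted in its frontmatter. Which tools Airtest actually grants each agent isn’t shown here, so the snippet below is only indicative of the idea:

```markdown
---
name: test-debugging-agent
description: Diagnoses and fixes failing integration tests.
tools: Read, Grep, Glob, Edit, Bash  # read and search code, edit files, run the suite
---
```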
The Knowledge Base (Storage)
To perform their tasks intelligently, the agents draw on long-term storage: a knowledge base that provides the crucial context and guidelines needed to create relevant and effective tests.
Agent Memory: A continuously updated knowledge store that provides the context agents need to write meaningful integration tests. It tracks things like business flows, API dependencies and the impact of recent code changes. Think of it as a Claude.md file for testing — not a general system document, but a living record of what to test, why it matters, and how the behavior has evolved. Just like Claude.md gets updated as the project grows, Agent Memory is refreshed automatically through code analysis, ensuring tests stay aligned with the current state of your application.
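Conceptually, an Agent Memory entry might read something like the sketch below; every detail here is invented for illustration and is not drawn from the actual Airtest knowledge base:

```markdown
<!-- agent-memory/billing-subscriptions.md (hypothetical entry) -->
## Business flow: subscription renewal
- POST /subscriptions creates a subscription in PENDING state; billing starts
  on activation, not on creation.

## API dependencies
- Renewal pricing depends on the proration service; tests should seed its fixtures.

## Recent changes
- Renewal retries increased from 3 to 5 attempts; unhappy-path tests need updating.
```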
Conclusion: Agentic AI Is the Future of Testing
If our experience with Airtest proves one thing, it’s that Agentic AI is ready to tackle complex engineering challenges today. This isn’t a theoretical future; it’s a practical solution delivering outsized results. It acts as a truly autonomous system — planning, executing, and self-healing to achieve its goal.
The results are not incremental; they are substantial:
- A 2-week marathon became a 2-hour sprint.
- Integration tests that cover 100% of our APIs and our most critical user flows.
These numbers are more than metrics; they are a glimpse into a new way of working.
In Part 2 of this series, we will open up the hood and show you exactly how the system works. We’ll do a technical deep dive into our agent orchestration workflow, complete with a concrete, end-to-end example of an API test being born. In future articles, we plan to explore:
- **Architectural Evolution:** How did we determine the ideal breakdown of responsibilities among our agents? We’ll share the lessons learned from earlier iterations, including where combining agents proved less effective, and what that taught us about workflow design.
- Maintainability and Sustainability: How maintainable are the AI-generated tests over the long term? We’ll discuss the key challenges and the sustainability of this entire setup.
- The Art of the Prompt: A deep dive into our prompt engineering strategies, revealing the specific techniques and prompts we use for each specialized agent.
The question for every engineering leader is no longer if Agentic AI will change their workflow, but how they will leverage it to unlock their own 10x to 100x efficiency gains. We hope our journey with Airtest at Airwallex provides a blueprint.
The Authors
- is a Staff AI Engineer at Airwallex, where she builds next-generation AI solutions. She is particularly enthusiastic about the potential of Agentic AI and shares her AI and tech insights part-time on her YouTube channel.
- is an Engineering Director at Airwallex, building Airwallex Billing, a platform that helps global companies manage invoicing, subscription, and usage-based billing with ease. His passion for developer productivity and AI tools comes from the daily challenge of building and scaling world-class financial infrastructure.