Author: Andrew Gallagher
Published at: Wed Nov 05 2025
A QA engineer walks into a bar and orders a beer.
She orders 2 beers. She orders 0 beers. She orders -1 beers. She orders a lizard. She orders a NULLPTR. She tries to leave without paying.
Satisfied, she declares the bar ready for business.
The first customer comes in and orders a beer. They finish their drink, and then ask where the bathroom is. The bar explodes.
There is a growing sentiment [that LLMs are good for CRUD, boilerplate, and tests](https://www.assembled.com/blog/why-i-code-as-a-cto). While I am not so sure about how good AI is at making CRUD¹ or thumping out boilerplate, a year of working as an SWE in the modern LLM-powered AI codescape has proven to me that LLMs write unconstructive, noisy, brittle, and downright bad unit tests. Please do not vibe code your unit tests.
When generating new software functionality, a common modern workflow I have seen amongst my peer developers looks something like this:
- Ask your agent to write up some code
- Ask your agent to cook up some unit tests
- Ask your agent to write up some documentation (e.g. a README.md or something) to describe the outputted code in human language
- Iterate against your code. On iteration, instruct your agent to update both the documentation and the unit tests.
- Clean and dial in your output
- Submit a PR
You can feel free to chop up the order of these steps as you’d like². The end result of this process is pretty uniform regardless — a PR with a fuckton of brittle unit tests. I have seen truly excellent developers submit nice code accompanied by nonsense unit tests — the allure and simplicity of this workflow really lets the slop in.
What makes these unit tests so bad³? Two things: 1) LLMs write way too many unit tests, and 2) the tests very frequently just verify what the code does instead of validating what the code should do.
Let’s concoct a trivial example: say we have the following React component, a button:
import type { ButtonHTMLAttributes, ReactNode } from "react";
interface ButtonProps extends ButtonHTMLAttributes<HTMLButtonElement> {
children?: ReactNode;
variant?: "primary" | "secondary" | "danger";
size?: "small" | "medium" | "large";
}
export const Button = ({
children = "Click Me",
variant = "primary",
size = "medium",
className = "",
...props
}: ButtonProps) => {
const baseClasses = "btn";
const variantClasses = {
primary: "btn-primary",
secondary: "btn-secondary",
danger: "btn-danger",
};
const sizeClasses = {
small: "btn-sm",
medium: "btn-md",
large: "btn-lg",
};
const classes = [
baseClasses,
variantClasses[variant],
sizeClasses[size],
className,
]
.filter(Boolean)
.join(" ");
return (
<button className={classes} {...props}>
{children}
</button>
);
};
What do you think the LLM will output when we instruct it to write some unit tests for us? Imagine a prompt along the lines of "Can you write some unit tests for this button?"
Here’s an example output:
import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import { describe, it, expect, vi } from "vitest";
import { Button } from "./button";
describe("Button", () => {
describe("Default behavior", () => {
it("renders with default props", () => {
render(<Button />);
const button = screen.getByRole("button");
expect(button).toBeInTheDocument();
expect(button).toHaveTextContent("Click Me");
expect(button).toHaveClass("btn", "btn-primary", "btn-md");
});
it("renders as a button element", () => {
render(<Button />);
const button = screen.getByRole("button");
expect(button.tagName).toBe("BUTTON");
});
});
describe("Children prop", () => {
it("renders custom text content", () => {
render(<Button>Custom Text</Button>);
const button = screen.getByRole("button");
expect(button).toHaveTextContent("Custom Text");
});
it("renders JSX children", () => {
render(
<Button>
<span>Icon</span> Submit
</Button>
);
const button = screen.getByRole("button");
expect(button).toHaveTextContent("Icon Submit");
expect(button.querySelector("span")).toBeInTheDocument();
});
it("renders when children is undefined", () => {
render(<Button children={undefined} />);
const button = screen.getByRole("button");
expect(button).toHaveTextContent("Click Me");
});
});
describe("Variant prop", () => {
it("applies primary variant class", () => {
render(<Button variant="primary" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn-primary");
expect(button).not.toHaveClass("btn-secondary", "btn-danger");
});
it("applies secondary variant class", () => {
render(<Button variant="secondary" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn-secondary");
expect(button).not.toHaveClass("btn-primary", "btn-danger");
});
it("applies danger variant class", () => {
render(<Button variant="danger" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn-danger");
expect(button).not.toHaveClass("btn-primary", "btn-secondary");
});
});
describe("Size prop", () => {
it("applies small size class", () => {
render(<Button size="small" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn-sm");
expect(button).not.toHaveClass("btn-md", "btn-lg");
});
it("applies medium size class", () => {
render(<Button size="medium" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn-md");
expect(button).not.toHaveClass("btn-sm", "btn-lg");
});
it("applies large size class", () => {
render(<Button size="large" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn-lg");
expect(button).not.toHaveClass("btn-sm", "btn-md");
});
});
describe("ClassName prop", () => {
it("applies custom className along with default classes", () => {
render(<Button className="custom-class" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn", "btn-primary", "btn-md", "custom-class");
});
it("applies multiple custom classes", () => {
render(<Button className="class-one class-two" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn", "btn-primary", "btn-md", "class-one", "class-two");
});
it("handles empty className", () => {
render(<Button className="" />);
const button = screen.getByRole("button");
expect(button).toHaveClass("btn", "btn-primary", "btn-md");
});
});
describe("Event handling", () => {
it("calls onClick handler when clicked", async () => {
const user = userEvent.setup();
const handleClick = vi.fn();
render(<Button onClick={handleClick}>Click me</Button>);
const button = screen.getByRole("button");
await user.click(button);
expect(handleClick).toHaveBeenCalledTimes(1);
});
it("calls onClick handler with correct event object", async () => {
const user = userEvent.setup();
const handleClick = vi.fn();
render(<Button onClick={handleClick}>Click me</Button>);
const button = screen.getByRole("button");
await user.click(button);
expect(handleClick).toHaveBeenCalledWith(
expect.objectContaining({
type: "click",
target: button
})
);
});
it("does not call onClick when button is disabled", async () => {
const user = userEvent.setup();
const handleClick = vi.fn();
render(<Button onClick={handleClick} disabled>Click me</Button>);
const button = screen.getByRole("button");
await user.click(button);
expect(handleClick).not.toHaveBeenCalled();
});
});
describe("HTML attributes", () => {
it("forwards HTML button attributes", () => {
render(
<Button
id="test-button"
title="Test button"
data-testid="custom-button"
aria-label="Custom button label"
>
Test
</Button>
);
const button = screen.getByRole("button");
expect(button).toHaveAttribute("id", "test-button");
expect(button).toHaveAttribute("title", "Test button");
expect(button).toHaveAttribute("data-testid", "custom-button");
expect(button).toHaveAttribute("aria-label", "Custom button label");
});
it("handles disabled attribute", () => {
render(<Button disabled>Disabled</Button>);
const button = screen.getByRole("button");
expect(button).toBeDisabled();
});
it("handles type attribute", () => {
render(<Button type="submit">Submit</Button>);
const button = screen.getByRole("button");
expect(button).toHaveAttribute("type", "submit");
});
it("handles form attribute", () => {
render(<Button form="test-form">Submit</Button>);
const button = screen.getByRole("button");
expect(button).toHaveAttribute("form", "test-form");
});
});
// ... and like 20 more tests beyond this, trust me
});
Curious what Claude Sonnet 4 might spit out with multiple attempts? Look no further! Claude outputs ~30 tests pretty much every time, at around 200 LOC on average⁴. Just so, so much code. We test interesting things like “should renders” and “does render with props”. Why would “render” be your responsibility, as opposed to the React team’s? And do you have any idea how totally fucked the entire web would be if rendering a basic button failed? Or if a button stopped having default accessibility traits?
LLMs also appear to be biased toward giving users a direct answer instead of asking clarifying questions⁵. This is a loaded statement, and probably obvious given my prompt, but Claude never comes back to ask “what should I test?” Rather, the LLM just verifies everything.
To me these unit tests are worse than nothing — all they do is lock in this button. But we are told as developers that we should write unit tests, and that we should strive for X% coverage. We now have LLMs at our disposal to make writing unit tests feel like an afterthought — and, in practice, actually be one. But psychologically, it feels good — and everybody else seems to be doing it.
So why not just embrace it and ride upon the mighty slop wave? There are some cases where you can reasonably one-shot tests for a leetcode-y big-brain problem. I would say, in total honesty, that LLMs are far better than human developers when it comes to writing comprehensive and meaningful tests for highly abstract code. Conveniently, empirical research agrees with my sentiment.
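To be concrete about the kind of code I mean, here is a small, hypothetical example of my own (not one drawn from that research): a pure function with a crisp input/output contract, where enumerating edge cases is exactly the job and a generated suite genuinely earns its keep.

```ts
// Merge overlapping [start, end] intervals: the sort of self-contained,
// contract-driven function where generated edge-case tests actually help
// (empty input, touching intervals, full overlap, unsorted input, ...).
export function mergeIntervals(intervals: [number, number][]): [number, number][] {
  const sorted = [...intervals].sort((a, b) => a[0] - b[0]);
  const merged: [number, number][] = [];
  for (const [start, end] of sorted) {
    const last = merged[merged.length - 1];
    if (last !== undefined && start <= last[1]) {
      last[1] = Math.max(last[1], end); // extend the previous interval
    } else {
      merged.push([start, end]); // start a new interval
    }
  }
  return merged;
}
```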
The problem with this is that I have never in my life been paid to write interesting, abstract leetcode-y big-brain code. Odds are good that if you are reading this, you haven’t either. I spend most of my time writing product code, and the hard work of verification is often outsourced to my dependencies. Most of the time, I am writing tests that validate that I am building the right thing — if a test fails, it should mean that what we are building has diverged from our intentions.
For your agent, writing a ton of bad unit tests has some pretty serious downsides. When you slop it up, you crowd out proper tests and pollute the semantic search your agent relies on. Additionally, these bloated spec files consume valuable context window space. Worse yet, large files tend to rank highly in semantic search, so your agent will hit those polluted, context-hungry files even more frequently.
For humans, writing a ton of bad unit tests has far worse downsides. First — and most importantly — your coworkers will hate you for opening yet another brittle 4,000-line PR. Second, when your tests just verify what the code does, every update to the code must by nature be accompanied by an adjustment to a unit test. Finally, when you have a bazillion brittle unit tests covering said code, you’ll just have that many more of them to update.
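To make the brittleness concrete, consider a hypothetical one-line refactor of the Button above: the design system renames a single CSS class. Nothing a user can observe changes, yet a big slice of the generated suite goes red and has to be hand-edited.

```ts
// Hypothetical cosmetic change inside Button: rename a single CSS class.
const sizeClasses = {
  small: "btn-sm",
  medium: "btn-medium", // was "btn-md"
  large: "btn-lg",
};
// Every generated assertion that hard-codes "btn-md" (the default-props
// test, the size tests, the className tests) now fails, even though the
// button renders and behaves exactly as before.
```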
I am still a firm believer that the cat is out of the bag in terms of agentic coding — things are probably not going to change any time soon. Besides, working with Cursor has been one of the most interesting parts of 2025 for me.
And there is still an effective technique for writing unit tests with agentic coding — it’s just unfortunately not very easy, nor as pleasant as asking Claude to “write unit tests”. When making tests, write them one at a time. Tell your agent specifically what you’d like to test, and verify that what it writes actually makes sense. Like any other piece of software, keep it focused, keep it brief, and mind every line.
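As a rough sketch of what that could look like for the Button above, assuming the same vitest + Testing Library setup as the generated example, here are two deliberately chosen tests, each tied to a behavior we actually own:

```tsx
import { render, screen } from "@testing-library/react";
import userEvent from "@testing-library/user-event";
import { describe, expect, it, vi } from "vitest";
import { Button } from "./button";

// Two tests, each tied to an intention we own: clicking the button does
// something, and a disabled button does nothing. Rendering, class name
// plumbing, and attribute forwarding are React's and the browser's job.
describe("Button", () => {
  it("invokes onClick when the user clicks it", async () => {
    const user = userEvent.setup();
    const onClick = vi.fn();
    render(<Button onClick={onClick}>Save</Button>);

    await user.click(screen.getByRole("button", { name: "Save" }));

    expect(onClick).toHaveBeenCalledTimes(1);
  });

  it("ignores clicks while disabled", async () => {
    const user = userEvent.setup();
    const onClick = vi.fn();
    render(
      <Button onClick={onClick} disabled>
        Save
      </Button>
    );

    await user.click(screen.getByRole("button", { name: "Save" }));

    expect(onClick).not.toHaveBeenCalled();
  });
});
```

The specific assertions matter less than the fact that each test exists because you asked for it and can say what intention it protects.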
“Less is more.”
1. I think that software complexity has ballooned significantly over the past 10 years, and the job of making simple CRUD just doesn’t exist anymore. We just model the intricacies of the world more closely with software now. (I also attribute a bit of the decline of Rails to this.)
2. The good people at Anthropic recommend writing your unit tests first. I personally find the concept of doing TDD-in-2025-but-this-time-it-is-different-because-we-have-AI-now to be darkly funny, but that’s another story.
3. I also have a hunch that LLMs may be bad at writing unit tests because historically developers have so often skipped writing them; effective unit tests may just not be proportionally represented in LLM training data. But take this with a grain of salt — I don’t actually know how LLMs work!
4. Yeah, sample size of 10, but you catch my drift.
5. I would love to know if this is provably true.