Gathering empirical evidence while adding tests to legacy code.
This article is part of a short series on empirical test-after techniques. Sometimes, test-driven development (TDD) is impractical. This often happens when faced with legacy code. Although there's a dearth of hard data, I guess that most code in the world falls into this category. Other software thought leaders seem to suggest the same notion.
For the purposes of this discussion, the definition of legacy code is code without automated tests.
"Code without tests is bad code. It doesn't matter how well written it is; it doesn't matter how pretty or object-oriented or well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don't know if our code is getting better or worse."
As Michael Feathers suggests, the accumulation of knowledge is at the root of this definition. As I outlined in Epistemology of software, tests are the source of empirical evidence. In principle it's possible to apply a rigorous testing regimen with manual testing, but in most cases this is (also) impractical, for reasons that are different from the barriers to automated testing. In the rest of this article, I'll exclusively discuss automated testing.
We may reasonably extend the definition of legacy code to a code base without adequate testing support.
When do we have enough tests? #
What, exactly, is adequate testing support? The answer is the same as in science in general. When do you have enough scientific evidence that a particular theory is widely accepted? There's no universal answer to that, and no, a p-value less than 0.05 isn't the answer, either.
In short, empirical evidence is adequate when a hypothesis is sufficiently corroborated to be accepted. Keep in mind that science can never prove a theory correct, but performing experiments against falsifiable predictions can disprove it. This applies to software, too.
"Testing shows the presence, not the absence of bugs."
The terminology of hypothesis, corroboration, etc. may be opaque to many software developers. Here's what it means in terms of software engineering: You have unspoken and implicit hypotheses about the code you're writing. Usually, once you're done with a task, your hypothesis is that the code makes the software work as intended. Anyone who's written more than a hello-world program knows, however, that believing the code to be correct is not enough. How many times have you written code that you assumed correct, only to find that it was not?
That's the lack of knowledge that testing attempts to address. Even manual testing. A test is an experiment that produces empirical evidence. As Dijkstra quipped, passing tests don't prove the software correct, but the more passing tests we have, the more confidence we gain. At some point, the passing tests provide enough confidence that you and other stakeholders consider it sensible to release or deploy the software. We say that the failure to find failing tests corroborates our hypothesis that the software works as intended.
In the context of legacy code, it's not the absolute lack of automated tests that characterizes it. Rather, it's the lack of adequate test coverage. It's that you don't have enough tests. Thus, when working with legacy code, you want to add tests after the fact.
Characterization Test recipes #
A test written after the fact against a legacy code base is called a Characterization Test, because it characterizes (i.e. describes) the behaviour of the system under test (SUT) at the time it was written. It's not a given that this behaviour is correct or desirable.
Michael Feathers gives this recipe for writing a Characterization Test:
- "Use a piece of code in a test harness.
- "Write an assertion that you know will fail.
- "Let the failure tell you what the behavior is.
- "Change the test so that it expects the behavior that the code produces.
- "Repeat."
 
Notice the second step: "Write an assertion that you know will fail." Why is that important? Why not write the "correct" assertion from the outset?
The reason is the same as I outlined in Epistemology of software: It happens with surprising regularity that you inadvertently write a tautological assertion. You could also make other mistakes, but writing a failing test is a falsifiable experiment. In this case, the implied hypothesis is that the test will fail. If it does not fail, you've falsified the implied prediction. To paraphrase Dijkstra, you've proven the test wrong.
If, on the other hand, the test fails, you've failed to falsify the hypothesis. You have not proven the test correct, but you've failed in proving it wrong. Epistemologically, that's the best result you may hope for.
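To make the recipe concrete, here's a minimal sketch in xUnit.net. The class, method, and numbers are hypothetical, invented only to illustrate the shape of the process:

[Fact]
public void CharacterizeTotalPrice()
{
    // Step 1: get the legacy code under a test harness.
    var sut = new LegacyPriceCalculator();

    var actual = sut.CalculateTotal(quantity: 3, unitPrice: 25m);

    // Step 2: an assertion you know will fail. The failure message
    // reveals the actual behaviour (step 3), e.g. "Expected: 0, Actual: 67.5".
    Assert.Equal(0m, actual);

    // Step 4: once the failure has told you what the code actually does,
    // change the expectation to match the observed behaviour:
    // Assert.Equal(67.5m, actual);
}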
I'm a little uneasy about the above recipe, because it involves a step where you change the test code to make the test pass. How can you know that you didn't, without meaning to, replace a proper assertion with a tautological one?
For that reason, I sometimes follow a variation of the recipe:
- Write a test that exercises the SUT, including the correct assertion you have in mind.
- Run the test to see it pass.
- Sabotage the SUT so that it fails the assertion. If there are several assertions, do this for each, one after the other.
- Run the test to see it fail.
- Revert the sabotage.
- Run the test again to see it pass.
- Repeat.
 
The last test run isn't strictly necessary if you've been rigorous about how you revert the sabotage, but psychologically, it gives me a better sense that all is good if I can end each cycle with a green test suite.
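As a minimal sketch of that variation, reusing the hypothetical LegacyPriceCalculator from above, the test keeps the assertion you believe to be correct:

[Fact]
public void CharacterizeDiscountedTotal()
{
    var sut = new LegacyPriceCalculator();

    var actual = sut.CalculateTotal(quantity: 3, unitPrice: 25m);

    // The assertion you believe to be correct; the test passes as written.
    Assert.Equal(67.5m, actual);
}

The falsification then happens in the SUT rather than in the test: temporarily comment out, say, the discount calculation inside CalculateTotal, watch this assertion fail, revert the sabotage, and run the test suite again.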
Example #
I don't get to interact that much with legacy code, but even so, I find myself writing Characterization Tests with surprising regularity. One example was when I was characterizing the song recommendations example. If you have the Git repository that accompanies that article series, you can see that the initial setup consists of adding one Characterization Test after the other. Even so, as I follow a policy of not adding commits with failing tests, you can't see the details of the process leading to each commit.
Perhaps a better example can be found in the Git repository that accompanies Code That Fits in Your Head. If you own the book, you also have access to the repository. In commit d66bc89443dc10a418837c0ae5b85e06272bd12b I wrote this message:
"Remove PostOffice dependency from Controller
"Instead, the PostOffice behaviour is now the responsibility of the EmailingReservationsRepository Decorator, which is configured in Startup.
"I meticulously edited the unit tests and introduced new unit tests as necessary. All new unit tests I added by following the checklist for Characterisation Tests, including seeing all the assertions fail by temporarily editing the SUT."
Notice the last paragraph, which is quite typical for how I tend to document my process when it's otherwise invisible in the Git history. Here's a breakdown of the process.
I first created the EmailingReservationsRepository without tests. This class is a Decorator, so quite a bit of it is boilerplate code. For instance, one method looks like this:
public Task<Reservation?> ReadReservation(int restaurantId, Guid id)
{
    return Inner.ReadReservation(restaurantId, id);
}
That's usually the case with such Decorators, but then one of the methods turned out like this:
public async Task Update(int restaurantId, Reservation reservation)
{
    if (reservation is null)
        throw new ArgumentNullException(nameof(reservation));

    var existing =
        await Inner.ReadReservation(restaurantId, reservation.Id)
            .ConfigureAwait(false);
    if (existing is { } && existing.Email != reservation.Email)
        await PostOffice
            .EmailReservationUpdating(restaurantId, existing)
            .ConfigureAwait(false);

    await Inner.Update(restaurantId, reservation)
        .ConfigureAwait(false);
    await PostOffice.EmailReservationUpdated(restaurantId, reservation)
        .ConfigureAwait(false);
}
I then realized that I should probably cover this class with some tests after all, which I proceeded to do in the above commit.
Consider one of the state-based Characterisation Tests I added to cover the Update method.
[Theory]
[InlineData(32, "David")]
[InlineData(58, "Robert")]
[InlineData(58, "Jones")]
public async Task UpdateSendsEmail(int restaurantId, string newName)
{
    var postOffice = new SpyPostOffice();
    var existing = Some.Reservation;
    var db = new FakeDatabase();
    await db.Create(restaurantId, existing);
    var sut = new EmailingReservationsRepository(postOffice, db);

    var updated = existing.WithName(new Name(newName));
    await sut.Update(restaurantId, updated);

    var expected = new SpyPostOffice.Observation(
        SpyPostOffice.Event.Updated,
        restaurantId,
        updated);
    Assert.Contains(updated, db[restaurantId]);
    Assert.Contains(expected, postOffice);
    Assert.DoesNotContain(
        postOffice,
        o => o.Event == SpyPostOffice.Event.Updating);
}
This test immediately passed when I added it, so I had to sabotage the Update method to see the assertions fail. Since there are three assertions, I had to sabotage the SUT in three different ways.
To see the first assertion fail, the most obvious sabotage was to simply comment out or delete the delegation to Inner.Update:
//await Inner.Update(restaurantId, reservation)
//    .ConfigureAwait(false);
This caused the first assertion to fail. I made sure to actually look at the error message and follow the link to the test failure, to confirm that it was, indeed, that assertion that was failing, and not something else. Once I had verified that, I undid the sabotage.
With the SUT back to its unedited state, it was time to sabotage the second assertion. Just like FakeDatabase inherits from ConcurrentDictionary, SpyPostOffice inherits from Collection, which means that the assertion can simply verify whether postOffice contains the expected observation. Sabotaging that part was as easy as the first one:
//await PostOffice.EmailReservationUpdated(restaurantId, reservation)
//    .ConfigureAwait(false);
The test failed, but again I meticulously verified that the error was the expected error at the expected line. Once I'd done that, I again reverted the SUT to its virgin state, and ran the test suite to verify that all tests passed.
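For readers without access to the repository, here's a rough sketch of what a Test Spy along those lines might look like. It's an assumption for illustration only; the actual SpyPostOffice differs in its details, and it would also have to implement whatever post-office abstraction EmailingReservationsRepository depends on:

// Illustrative sketch, not the actual code from the book's repository.
// Only the members relevant to the test shown above are included.
public sealed class SpyPostOffice : Collection<SpyPostOffice.Observation>
{
    public enum Event { Updating, Updated }

    public sealed record Observation(
        Event Event, int RestaurantId, Reservation Reservation);

    public Task EmailReservationUpdating(int restaurantId, Reservation reservation)
    {
        Add(new Observation(Event.Updating, restaurantId, reservation));
        return Task.CompletedTask;
    }

    public Task EmailReservationUpdated(int restaurantId, Reservation reservation)
    {
        Add(new Observation(Event.Updated, restaurantId, reservation));
        return Task.CompletedTask;
    }
}

In this sketch the record type gives Observation structural equality, which is what allows Assert.Contains to compare an expected observation directly against the collection.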
The last assertion is a bit different, because it checks that no Updating message is being sent. This should only happen if the user updates the reservation by changing his or her email address. In that case, but only in that case, should the system send an Updating message to the old address, and an Updated message to the new address. There's a separate test for that, but as it follows the same overall template as the one shown here, I'm not showing it. You can see it in the Git repository.
Here's how to sabotage the SUT to see the third assertion fail:
if (existing is { } /*&& existing.Email != reservation.Email*/)
    await PostOffice
        .EmailReservationUpdating(restaurantId, existing)
        .ConfigureAwait(false);
It's enough to comment out (or delete) the second Boolean check to make the assertion fail. Again, I made sure to check that the test failed on the exact line of the third assertion. Once I'd made sure of that, I undid the change, ran the tests again, and committed the changes.
Conclusion #
When working with automated tests, a classic conundrum is that you're writing code to test some other code. How do you know that the test code is correct? After all, you're writing test code because you don't trust your abilities to produce perfect production code. The way out of that quandary is to first predict that the test will fail and run that experiment. If you haven't touched the production code, but the test passes, odds are that there's something wrong with the test.
When you are adding tests to an existing code base, you can't perform that experiment without jumping through some hoops. After all, the behaviour you want to observe is already implemented. You must therefore either write the test with an assertion that you know will fail (as Michael Feathers recommends), or temporarily sabotage the system under test so that you can verify that the new test fails as expected.
The example shows how to proceed empirically with a Characterisation Test of a C# class that I'd earlier added without tests. Perhaps, however, I should rather have approached the situation in another way.
Next: Empirical software prototyping.