Gathering empirical evidence while adding tests to legacy code.
This article is part of a short series on empirical test-after techniques. Sometimes, test-driven development (TDD) is impractical. This often happens when faced with legacy code. Although there's a dearth of hard data, I guess that most code in the world falls into this category. Other software thought leaders seem to suggest the same notion.
For the purposes of this discussion, the definition of legacy code is code without automated tests.
"Code without tests is bad code. It doesn't matter how well written it is; it doesn't matter how pretty or object-oriented or well-encapsulated it is. With tests, we can change the behavior of our code quickly and verifiably. Without them, we really don't know if our code is getting better or worse."
As Michael Feathers suggests, the accumulation of knowledge is at the root of this definition. As I outlined in Epistemology of software, tests are the source of empirical evidence. In principle it's possible to apply a rigorous testing regimen with manual testing, but in most cases this is (also) impractical, for reasons that are different from the barriers to automated testing. In the rest of this article, I'll exclusively discuss automated testing.
We may reasonably extend the definition of legacy code to a code base without adequate testing support.
When do we have enough tests? #
What, exactly, is adequate testing support? The answer is the same as in science in general. When do you have enough scientific evidence that a particular theory is widely accepted? There's no universal answer to that, and no, a p-value less than 0.05 isn't the answer, either.
In short, empirical evidence is adequate when a hypothesis is sufficiently corroborated to be accepted. Keep in mind that science can never prove a theory correct, but performing experiments against falsifiable predictions can disprove it. This applies to software, too.
"Testing shows the presence, not the absence of bugs."
The terminology of hypothesis, corroboration, etc. may be opaque to many software developers. Here's what it means in terms of software engineering: You have unspoken and implicit hypotheses about the code you're writing. Usually, once you're done with a task, your hypothesis is that the code makes the software work as intended. Anyone who's written more than a hello-world program knows, however, that believing the code to be correct is not enough. How many times have you written code that you assumed correct, only to find that it was not?
That's the lack of knowledge that testing attempts to address. Even manual testing. A test is an experiment that produces empirical evidence. As Dijkstra quipped, passing tests don't prove the software correct, but the more passing tests we have, the more confidence we gain. At some point, the passing tests provide enough confidence that you and other stakeholders consider it sensible to release or deploy the software. We say that the failure to find failing tests corroborates our hypothesis that the software works as intended.
In the context of legacy code, it's not the absolute lack of automated tests that characterizes it. Rather, it's the lack of adequate test coverage. It's that you don't have enough tests. Thus, when working with legacy code, you want to add tests after the fact.
Characterization Test recipes #
A test written after the fact against a legacy code base is called a Characterization Test, because it characterizes (i.e. describes) the behaviour of the system under test (SUT) at the time it was written. It's not a given that this behaviour is correct or desirable.
Michael Feathers gives this recipe for writing a Characterization Test:
- "Use a piece of code in a test harness.
- "Write an assertion that you know will fail.
- "Let the failure tell you what the behavior is.
- "Change the test so that it expects the behavior that the code produces.
- "Repeat."
 
Notice the second step: "Write an assertion that you know will fail." Why is that important? Why not write the "correct" assertion from the outset?
The reason is the same as I outlined in Epistemology of software: It happens with surprising regularity that you inadvertently write a tautological assertion. You could also make other mistakes, but writing a failing test is a falsifiable experiment. In this case, the implied hypothesis is that the test will fail. If it does not fail, you've falsified the implied prediction. To paraphrase Dijkstra, you've proven the test wrong.
If, on the other hand, the test fails, you've failed to falsify the hypothesis. You have not proven the test correct, but you've failed in proving it wrong. Epistemologically, that's the best result you may hope for.
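To make the recipe concrete, here's a minimal sketch in xUnit.net. The class, method, and numbers are hypothetical, invented only to illustrate the shape of the process:

[Fact]
public void CharacterizeTotalPrice()
{
    // Step 1: get the legacy code under a test harness.
    var sut = new LegacyPriceCalculator();

    var actual = sut.CalculateTotal(quantity: 3, unitPrice: 25m);

    // Step 2: an assertion you know will fail. The failure message
    // reveals the actual behaviour (step 3), e.g. "Expected: 0, Actual: 67.5".
    Assert.Equal(0m, actual);

    // Step 4: once the failure has told you what the code actually does,
    // change the expectation to match the observed behaviour:
    // Assert.Equal(67.5m, actual);
}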
I'm a little uneasy about the above recipe, because it involves a step where you change the test code to make the test pass. How can you know that you didn't, without meaning to, replace a proper assertion with a tautological one?
For that reason, I sometimes follow a variation of the recipe:
- Write a test that exercises the SUT, including the correct assertion you have in mind.
- Run the test to see it pass.
- Sabotage the SUT so that it fails the assertion. If there are several assertions, do this for each, one after the other.
- Run the test to see it fail.
- Revert the sabotage.
- Run the test again to see it pass.
- Repeat.
 
The last test run isn't strictly necessary if you've been rigorous about how you revert the sabotage, but psychologically, it gives me a better sense that all is good if I can end each cycle with a green test suite.
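As a minimal sketch of that variation, reusing the hypothetical LegacyPriceCalculator from above, the test keeps the assertion you believe to be correct:

[Fact]
public void CharacterizeDiscountedTotal()
{
    var sut = new LegacyPriceCalculator();

    var actual = sut.CalculateTotal(quantity: 3, unitPrice: 25m);

    // The assertion you believe to be correct; the test passes as written.
    Assert.Equal(67.5m, actual);
}

The falsification then happens in the SUT rather than in the test: temporarily comment out, say, the discount calculation inside CalculateTotal, watch this assertion fail, revert the sabotage, and run the test suite again.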
Example #
I don't get to interact that much with legacy code, but even so, I find myself writing Characterization Tests with surprising regularity. One example was when I was characterizing the song recommendations example. If you have the Git repository that accompanies that article series, you can see that the initial setup consists of adding one Characterization Test after the other. Even so, as I follow a policy of not adding commits with failing tests, you can't see the details of the process leading to each commit.
Perhaps a better example can be found in the Git repository that accompanies Code That Fits in Your Head. If you own the book, you also have access to the repository. In commit d66bc89443dc10a418837c0ae5b85e06272bd12b I wrote this message:
"Remove PostOffice dependency from Controller
"Instead, the PostOffice behaviour is now the responsibility of the EmailingReservationsRepository Decorator, which is configured in Startup.
"I meticulously edited the unit tests and introduced new unit tests as necessary. All new unit tests I added by following the checklist for Characterisation Tests, including seeing all the assertions fail by temporarily editing the SUT."
Notice the last paragraph, which is quite typical for how I tend to document my process when it's otherwise invisible in the Git history. Here's a breakdown of the process.
I first created the EmailingReservationsRepository without tests. This class is a Decorator, so quite a bit of it is boilerplate code. For instance, one method looks like this:
public Task<Reservation?> ReadReservation(int restaurantId, Guid id)
{
    return Inner.ReadReservation(restaurantId, id);
}
That's usually the case with such Decorators, but then one of the methods turned out like this:
public async Task Update(int restaurantId, Reservation reservation)
{
    if (reservation is null)
        throw new ArgumentNullException(nameof(reservation));

    var existing =
        await Inner.ReadReservation(restaurantId, reservation.Id)
            .ConfigureAwait(false);
    if (existing is { } && existing.Email != reservation.Email)
        await PostOffice
            .EmailReservationUpdating(restaurantId, existing)
            .ConfigureAwait(false);

    await Inner.Update(restaurantId, reservation)
        .ConfigureAwait(false);
    await PostOffice.EmailReservationUpdated(restaurantId, reservation)
        .ConfigureAwait(false);
}
I then realized that I should probably cover this class with some tests after all, which I proceeded to do in the above commit.
Consider one of the state-based Characterisation Tests I added to cover the Update method.
[Theory]
[InlineData(32, "David")]
[InlineData(58, "Robert")]
[InlineData(58, "Jones")]
public async Task UpdateSendsEmail(int restaurantId, string newName)
{
    var postOffice = new SpyPostOffice();
    var existing = Some.Reservation;
    var db = new FakeDatabase();
    await db.Create(restaurantId, existing);
    var sut = new EmailingReservationsRepository(postOffice, db);

    var updated = existing.WithName(new Name(newName));
    await sut.Update(restaurantId, updated);

    var expected = new SpyPostOffice.Observation(
        SpyPostOffice.Event.Updated,
        restaurantId,
        updated);
    Assert.Contains(updated, db[restaurantId]);
    Assert.Contains(expected, postOffice);
    Assert.DoesNotContain(
        postOffice,
        o => o.Event == SpyPostOffice.Event.Updating);
}
This test immediately passed when I added it, so I had to sabotage the Update method to see the assertions fail. Since there are three assertions, I had to sabotage the SUT in three different ways.
To see the first assertion fail, the most obvious sabotage was to simply comment out or delete the delegation to Inner.Update:
//await Inner.Update(restaurantId, reservation)
//    .ConfigureAwait(false);
This caused the first assertion to fail. I made sure to actually look at the error message and follow the link to the test failure, to confirm that it was, indeed, that assertion that was failing, and not something else. Once I had verified that, I undid the sabotage.
With the SUT back to its unedited state, it was time to sabotage the second assertion. Just like FakeDatabase inherits from ConcurrentDictionary, SpyPostOffice inherits from Collection, which means that the assertion can simply verify whether postOffice contains the expected observation. Sabotaging that part was as easy as the first one:
//await PostOffice.EmailReservationUpdated(restaurantId, reservation)
//    .ConfigureAwait(false);
The test failed, but again I meticulously verified that the error was the expected error at the expected line. Once I'd done that, I again reverted the SUT to its virgin state, and ran the test suite to verify that all tests passed.
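For readers without access to the repository, here's a rough sketch of what a Test Spy along those lines might look like. It's an assumption for illustration only; the actual SpyPostOffice differs in its details, and it would also have to implement whatever post-office abstraction EmailingReservationsRepository depends on:

// Illustrative sketch, not the actual code from the book's repository.
// Only the members relevant to the test shown above are included.
public sealed class SpyPostOffice : Collection<SpyPostOffice.Observation>
{
    public enum Event { Updating, Updated }

    public sealed record Observation(
        Event Event, int RestaurantId, Reservation Reservation);

    public Task EmailReservationUpdating(int restaurantId, Reservation reservation)
    {
        Add(new Observation(Event.Updating, restaurantId, reservation));
        return Task.CompletedTask;
    }

    public Task EmailReservationUpdated(int restaurantId, Reservation reservation)
    {
        Add(new Observation(Event.Updated, restaurantId, reservation));
        return Task.CompletedTask;
    }
}

In this sketch the record type gives Observation structural equality, which is what allows Assert.Contains to compare an expected observation directly against the collection.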
The last assertion is a bit different, because it checks that no Updating message is being sent. This should only happen if the user updates the reservation by changing his or her email address. In that case, but only in that case, should the system send an Updating message to the old address, and an Updated message to the new address. There's a separate test for that, but as it follows the same overall template as the one shown here, I'm not showing it. You can see it in the Git repository.
Here's how to sabotage the SUT to see the third assertion fail:
if (existing is { } /*&& existing.Email != reservation.Email*/)
    await PostOffice
        .EmailReservationUpdating(restaurantId, existing)
        .ConfigureAwait(false);
It's enough to comment out (or delete) the second Boolean check to make the assertion fail. Again, I made sure to check that the test failed on the exact line of the third assertion. Once I'd made sure of that, I undid the change, ran the tests again, and committed the changes.
Conclusion #
When working with automated tests, a classic conundrum is that you're writing code to test some other code. How do you know that the test code is correct? After all, you're writing test code because you don't trust your abilities to produce perfect production code. The way out of that quandary is to first predict that the test will fail and run that experiment. If you haven't touched the production code, but the test passes, odds are that there's something wrong with the test.
When you are adding tests to an existing code base, you can't perform that experiment without jumping through some hoops. After all, the behaviour you want to observe is already implemented. You must therefore either write the test with an assertion that you know will fail (as Michael Feathers recommends), or temporarily sabotage the system under test so that you can verify that the new test fails as expected.
The example shows how to proceed empirically with a Characterisation Test of a C# class that I'd earlier added without tests. Perhaps, however, I should rather have approached the situation in another way.
Next: Empirical software prototyping.