How etcd Solved Its Knowledge Drain With Deterministic Testing

The loss of institutional knowledge when people leave an organization can be tough. When longtime maintainers leave an open source project, it can be nearly impossible to recapture that knowledge.

That’s what happened to etcd, an open source, distributed key-value store that’s “older than Kubernetes itself,” said etcd’s lead maintainer, Marek Siarkowicz, in this episode of The New Stack Makers.

Siarkowicz, a senior software engineer at Google, joined me for this On the Road episode of Makers, recorded at KubeCon + CloudNativeCon North America in Atlanta last month.

The Challenge of Maintainer Turnover and Knowledge Loss

Siarkowicz moved four years ago from Google’s Kubernetes team to its etcd team. Roughly three years ago, he told me, the etcd project hit some reliability challenges.

As the team of maintainers worked on rolling out a new release, “a lot of maintainers left the project and were replaced with new maintainers, and there was a drain of knowledge. So all the properties that could not be written into the code were lost with those people. All the procedures, how to test, how to guarantee correctness that was done before were not done for the new release.”

As a result, the team released a version that “has had multiple issues that were critical, like if the application crashed, it could cause an inconsistency.”

Achieving the ‘Holy Grail’ for a Distributed System

To remedy the situation, the new crew of maintainers implemented what it called “robustness testing.” To validate not only the project’s basic correctness but also its correctness as a distributed system, the team built its own framework “inspired by” the open source Jepsen project.

The goal, Siarkowicz said, was to achieve linearizability — the ability to “have a distributed system that should behave like a single node. This is like a Holy Grail of distributed systems. And validating this is a very hard problem.”
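To make the idea of linearizability concrete, here is a minimal sketch of a brute-force checker for a single register. It is an illustration only, not etcd’s robustness framework or Jepsen; the `Op` type and every function name are hypothetical. A history counts as linearizable if its operations can be placed in some order that respects real time (an operation that finished before another started must come first) and in which every read returns the most recent write.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Optional


@dataclass(frozen=True)
class Op:
    kind: str             # "write" or "read"
    value: Optional[int]  # value written, or value the read returned
    start: int            # time the client sent the request
    end: int              # time the client received the response


def respects_real_time(order):
    # If b completed before a even started, b may not be placed after a.
    for i, a in enumerate(order):
        for b in order[i + 1:]:
            if b.end < a.start:
                return False
    return True


def is_legal_register(order, initial=None):
    # Replay the order on one register: each read must see the latest write.
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history):
    # Brute force over all orderings: fine for a handful of operations only.
    return any(
        respects_real_time(order) and is_legal_register(order)
        for order in permutations(history)
    )


# A read that overlaps the write may return the new value.
overlapping = [Op("write", 1, 0, 10), Op("read", 1, 5, 15)]
# A read that starts after the write finished must not return the old value.
stale = [Op("write", 1, 0, 10), Op("read", None, 11, 20)]

print(is_linearizable(overlapping))
print(is_linearizable(stale))
```

The second history is the kind of anomaly such a checker exists to catch: the write had already been acknowledged, yet a later read behaved as if it never happened. Trying every permutation grows factorially with the number of operations, which hints at why Siarkowicz calls validating this property “a very hard problem.”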

Solving it, the maintainers learned, meant they needed to build their own failure injection mechanism. “We needed to teach people, the community, how to debug it, and all those challenges were immense,” Siarkowicz said.

Underlying it all, he suggested, was a wish to create a knowledge base that wouldn’t disappear if team members left the project.

Using Deterministic Simulation Testing to Recapture Knowledge

Seeking a solution to all this, the etcd team reached out to Antithesis, a company that specializes in deterministic simulation testing. Without this approach to software testing, locating and reproducing a bug in a distributed system can get dicey.

“You have some hypothesis, you try to reproduce it, but you need to get lucky to sometimes find some race between multiple components or multiple logs and multiple, separate processes, communicating by network to find the bug,” Siarkowicz said.

By contrast, he said, “deterministic simulation testing allows you to linearize everything, so there will be only one execution path and it’ll always be reproducible.”
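The core idea can be sketched in a few lines. The toy below is not how Antithesis works internally and uses no real Antithesis API; `incrementer`, `run` and `find_failing_seed` are hypothetical names. It assumes every scheduling decision is drawn from one seeded random number generator, so a seed fully determines the interleaving of two workers that race on a shared counter.

```python
import random


def incrementer(state, name, trace):
    # Non-atomic read-modify-write. The yield marks the point where the
    # scheduler is allowed to switch to another worker.
    local = state["counter"]
    trace.append(f"{name} read {local}")
    yield
    state["counter"] = local + 1
    trace.append(f"{name} wrote {local + 1}")


def run(seed, workers=2):
    rng = random.Random(seed)  # the only source of nondeterminism
    state = {"counter": 0}
    trace = []
    tasks = [incrementer(state, f"w{i}", trace) for i in range(workers)]
    while tasks:
        task = rng.choice(tasks)  # the seed decides who runs next
        try:
            next(task)
        except StopIteration:
            tasks.remove(task)
    return state["counter"], trace


def find_failing_seed(max_seeds=1000):
    # Search for a schedule in which one increment is lost.
    for seed in range(max_seeds):
        counter, _ = run(seed)
        if counter != 2:
            return seed
    return None


seed = find_failing_seed()
counter, trace = run(seed)  # replaying the seed replays the exact same bug
print(seed, counter)
for step in trace:
    print(step)
```

Once a failing seed is found, running it again yields the same trace every time, which is what turns “you need to get lucky” into a bug that can be reproduced on demand.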

The collaboration with Antithesis, Siarkowicz said, made it easier to capture knowledge. The team could “define the properties that were just in documentation or just in maintainers’ heads.”

An advantage of using the Antithesis platform, he said, was the ability to test engineers’ assertions more robustly. “Previously, we already had assertions, but those were never tripped. So it seemed, Oh, like if it never trips, it should be good.”

But that no-news-is-good-news approach, he suggested, deprived the team of deeper knowledge that more robust testing could reveal. Antithesis’s testing and failure injection went beyond what the maintainer team could build on its own, Siarkowicz said. “The failure combination that you need to do to trip is very hard to implement yourself, and it’s unique for every such property.”
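A toy example shows why an assertion that never trips proves little without failure injection. This is a sketch under stated assumptions, not etcd’s storage layer or the Antithesis platform: the `disk` layout, `FaultInjector` and `apply_entry` are all hypothetical. The update writes two fields in two steps, and the property under test only fails if the process “crashes” between them.

```python
class Crash(Exception):
    """Simulated process crash."""


class FaultInjector:
    def __init__(self, crash_at=None):
        self.crash_at = crash_at  # index of the fault point to crash at, or None
        self.points_seen = 0

    def point(self):
        here = self.points_seen
        self.points_seen += 1
        if here == self.crash_at:
            raise Crash(f"injected crash at fault point {here}")


def apply_entry(disk, faults, entry):
    # Two separate writes with a fault point after each one.
    disk["log"].append(entry)
    faults.point()
    disk["applied_index"] = len(disk["log"])
    faults.point()


def invariant_holds(disk):
    # Property that previously lived only in a maintainer's head:
    # the applied index always matches the log length.
    return disk["applied_index"] == len(disk["log"])


def run_once(crash_at):
    disk = {"log": [], "applied_index": 0}
    faults = FaultInjector(crash_at)
    try:
        apply_entry(disk, faults, "put k=v")
    except Crash:
        pass  # the process died; recovery sees whatever reached the disk
    return invariant_holds(disk)


print(run_once(None))  # no faults: the assertion never trips
print(run_once(0))     # crash between the two writes
print(run_once(1))     # crash after both writes
```

Here the only failing case is a crash at fault point 0, and finding it takes nothing more than trying each point in turn. In a real distributed system the tripping condition is usually a combination of crashes, network faults and timing, which is the part Siarkowicz describes as “very hard to implement yourself.”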

Addressing the Unique Testing Challenges in Open Source

As the lead maintainer of an open source project, Siarkowicz said, teaching community members how to do more robust testing is a big challenge.

Open source projects, he noted, “are like a tree. … at the beginning, the main part is the most important. But as the project grows, there is more community, they build out new features, new things. There are a lot of people who can work on the leaves, but the core is usually very sensitive, because it’s connected to everything.”

When it comes to long-running projects like etcd or Kubernetes, he likes working on the core, the trunk, of those “trees.” But, he acknowledged, those core parts are “not very accessible to most contributors, so having such an approach to testing can allow maintainers to write rules that will ensure that, even if a maintainer makes a mistake, or doesn’t have enough time to review something in full detail, we’ll still be able to catch it in the testing.”

Check out the full episode for more about testing open source software, including the role AI may play in the future, and what’s on the etcd road map.
