Before diving in, we need to define what we mean by reliable. The dictionary defines reliable as “consistently good in quality or performance; able to be trusted.” In other words, does the code or system work, and more importantly, does it work well over time under various conditions? This is the only measure we will use, and it will be the focus of this post. Arguably, not much else matters if the system works.
Creating reliable systems is easier said than done, and with so much variance in the type of code that is “in the wild”, it’s hard to make universal statements. However, we will discuss a few guiding principles we hope will translate into the systems you build: idempotency, simplicity, and adaptability. Together, these three principles support what I’d consider the most critical property of a reliable system: determinism. While we won’t delve into determinism on its own, designing systems with the above in mind will get us closer to that goal. Some might wonder how I could forget to mention observability. I did not, and it is essential, but it is covered so often in posts like this that I didn’t see much value in repeating what so many others have already written.
It’s important to note that we will be discussing these terms at a higher level, from the perspective of building a complete system or feature. These concepts do translate to a lower level, and in fact it is hard to create a reliable system without applying them at the lowest levels, but we won’t be discussing low-level code, such as the interfaces and functions you define. We’ll save that for another post.
Simplicity
Anytime simplicity is mentioned, it’s almost obligatory to bring up Dijkstra’s famous line on the relationship between simplicity and reliability: “Simplicity is prerequisite for reliability.” On the surface, this is easy to agree with, but why is it so hard to do? And why is it really that important? The answer to both questions is, in a way, the same. Software engineering is challenging for many reasons: shifting requirements, the diverse group of people involved over time, deadlines, organizational politics, and other factors. But that chaos is exactly why it’s critical to make the software itself as simple as possible. Reasoning about, and more importantly debugging, something simple is far easier than troubleshooting something complex. With that said, let’s stop talking in abstract terms and look at some concrete ways to simplify our systems.
Minimize Components
A component can be an application, an API, a third-party vendor, a cloud provider service, or any other relevant entity. Especially in the cloud, if you’ve ever read any of the providers’ recommended solutions, it can be pretty easy to go overboard with all the fancy services. Similarly, it can be easy to over-abstract and break things apart into far too many microservices, entry points, branches, and so on. The more components we depend on, the more points of failure we introduce. Or put another way, when things do go wrong, how easy is it to trace which component broke and why? How easy would it be to fix that particular issue? If we think of our components and their relationship with one another as a graph, it becomes easy to visualize how troublesome this could be. Any number of “edges” could be the problem; the more we have, the harder this causality problem becomes.
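To make the graph intuition concrete, here is a minimal Go sketch (the component counts are arbitrary) of how quickly the number of edges, each a potential place for a failure to originate or propagate, can grow in a fully connected dependency graph.

```go
package main

import "fmt"

func main() {
	// Illustrative component counts: in the worst case (a fully connected
	// dependency graph), n components can have up to n*(n-1)/2 edges, and
	// every edge is another place a failure can originate or propagate.
	for _, n := range []int{3, 6, 12} {
		fmt.Printf("%2d components -> up to %2d edges to investigate\n", n, n*(n-1)/2)
	}
}
```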
Minimize Options
When we refer to options, we mean the mechanisms that support non-default or optional paths within the system. Simplicity comes from optimizing for the most common path of execution. Software that’s singularly focused on a specific task is not only more straightforward to manage but, paradoxically, often more flexible. Over-engineering a component to support too many optional behaviors leads to the same problem as having too many components. Instead of a dependency graph of components, think of a graph of the possible execution paths the software can take. This becomes especially problematic from a QE or testability perspective. Maintaining feature parity with automated end-to-end testing is already challenging, and the problem grows exponentially: every independent option can double the number of paths to cover. Worse, it increases the risk of a subtle bug; a seemingly unrelated change can easily break an untested or rarely used path.
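If that sounds abstract, a tiny Go sketch with made-up flag names shows how fast the path count grows: each independent boolean option doubles the number of distinct execution paths an end-to-end suite would need to exercise.

```go
package main

import "fmt"

// Hypothetical feature flags for a single deployment workflow.
// Each independent boolean option doubles the number of distinct
// execution paths an end-to-end suite would need to cover.
var options = []string{
	"useSpotInstances",
	"enableCanary",
	"legacyNetworking",
	"dryRun",
}

func main() {
	paths := 1 << len(options) // 2^n combinations
	fmt.Printf("%d options -> %d possible execution paths to test\n",
		len(options), paths)
	// Output: 4 options -> 16 possible execution paths to test
}
```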
Minimize States
Most applications track state in some form. By state, we usually mean an internal representation of some external truth. Concretely, this typically involves recording some information in our database. For example, if we deploy a VPC, we might store the provider’s VPC ID as part of our internal state. State can also represent steps in a process. Using the deployment example above, these steps could include deploying, deployed, destroying, and so on. So why does minimizing state matter? What’s the big deal? The main issue with excessive state is that it can create conflicting sources of truth. This is especially problematic if what we are representing is often subject to change or is inherently dynamic in nature. Tracking the VPC ID is perfectly fine; how else would we know what that ID was? Every situation is different, but there’s a good chance it won’t change often either. Trying to track the deployment progress, however, doesn’t really make sense: we can verify the state of the VPC directly with the cloud provider rather than relying on an internal representation. Furthermore, as we will explore in the next section, we can leverage the idempotency of our systems to reduce the need for tracking states such as the deployment progress of a VPC.
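As a sketch of what this looks like in practice, the snippet below stores only the stable link (the VPC ID) and derives the lifecycle status from the provider on demand. The CloudProvider interface and DescribeVPC method are stand-ins for illustration, not a real SDK.

```go
package main

import "fmt"

// CloudProvider is a stand-in for a real provider SDK, not an actual API.
type CloudProvider interface {
	DescribeVPC(id string) (status string, err error)
}

// Record is the only state we persist: a stable link to the external
// resource, not its ever-changing lifecycle status.
type Record struct {
	VPCID string
}

// Status derives the deployment state on demand from the source of truth
// instead of tracking "deploying"/"deployed" columns that can drift.
func Status(p CloudProvider, r Record) (string, error) {
	return p.DescribeVPC(r.VPCID)
}

// fakeProvider simulates the provider for the sake of the example.
type fakeProvider struct{}

func (fakeProvider) DescribeVPC(id string) (string, error) { return "available", nil }

func main() {
	status, _ := Status(fakeProvider{}, Record{VPCID: "vpc-123"})
	fmt.Println("vpc status:", status)
}
```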
To summarize the points above, the reason we should minimize state is to reduce the risk of conflicting sources of truth. If something is usually static or exists to link us to an external system, it’s likely okay to store state internally. However, if the state changes frequently, can be modified by external actors, or can be easily retrieved from another system, we should avoid storing it ourselves and instead rely on the source that already maintains it.
Hopefully, it has become clear that building for simplicity is about the art of removal. Often, when creating a system, the goal of the first iteration is to simply get the job done. As a system matures, we should ask what can be removed to further simplify without sacrificing any functionality.
Idempotency
A system that is idempotent is one that we can count on to consistently produce the same result, or even better, to consistently produce an expected, deterministic result. So, what do we mean by an expected deterministic result? For certain creation workflows, we might want the first call to succeed and all subsequent calls to return an “already exists” error or a similar response. The deterministic nature of both of these return types allows us to handle them predictably in code. At a higher level, an idempotent system should be retry-resilient. In other words, if the system were to retry a path 100 times, it should succeed or produce a deterministic result. This is critical because a system could have N different creation calls of this kind. When we retry, there is no point in failing just because we have already created some resource in the early part of the flow; we can proceed to the next steps.
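Here is a minimal sketch of such a creation call, using a hypothetical in-memory store: the first call succeeds, and every later call returns a named “already exists” error the caller can branch on deterministically.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrAlreadyExists is the deterministic "second call" outcome described above.
var ErrAlreadyExists = errors.New("already exists")

// store is a hypothetical in-memory stand-in for whatever backs creation.
var store = map[string]bool{}

// CreateNetwork succeeds on the first call and returns ErrAlreadyExists on
// every subsequent call, so callers can handle both outcomes predictably.
func CreateNetwork(name string) error {
	if store[name] {
		return ErrAlreadyExists
	}
	store[name] = true
	return nil
}

func main() {
	for i := 0; i < 3; i++ {
		err := CreateNetwork("prod-net")
		switch {
		case err == nil:
			fmt.Println("created")
		case errors.Is(err, ErrAlreadyExists):
			fmt.Println("already there, safe to continue") // non-fatal on retry
		default:
			fmt.Println("fatal:", err)
		}
	}
}
```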
With that in mind, what techniques can we use to ensure an idempotent system? The easy part is ensuring our outputs return deterministic results. This tends to manifest as structured responses, predictable status codes, named errors, and the like. Deterministic outputs go a long way toward this goal. To ensure our system is retry-resilient, imagine a path or workflow visualized as a directed graph, where each node represents a step in which some action is taken, such as creation or retrieval. To test idempotency as a whole, we can inject errors into various nodes of this graph and retry our workflow N times. If we inject non-fatal errors into random nodes for attempts 1–99, then attempt 100, with no injected errors, should succeed.
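A rough sketch of that fault-injection loop over a made-up three-step workflow: attempts 1–99 inject a failure at a random node, completed steps are remembered so retries proceed past them, and attempt 100 must succeed.

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
)

// errInjected simulates a transient, non-fatal failure at a workflow node.
var errInjected = errors.New("injected failure")

// done records which steps have already completed across attempts, so a
// retried run proceeds past work it has already performed.
var done = map[string]bool{}

// runWorkflow executes each step in order; failNode, if non-empty, is the
// node where we inject an error for this attempt.
func runWorkflow(steps []string, failNode string) error {
	for _, s := range steps {
		if done[s] {
			continue // completed on an earlier attempt; skip and move on
		}
		if s == failNode {
			return errInjected
		}
		done[s] = true // a hypothetical create/retrieve action would run here
	}
	return nil
}

func main() {
	steps := []string{"create-vpc", "create-subnet", "attach-gateway"}

	// Attempts 1-99: inject a non-fatal failure at a random node, then retry.
	for attempt := 1; attempt < 100; attempt++ {
		_ = runWorkflow(steps, steps[rand.Intn(len(steps))])
	}

	// Attempt 100: no injected errors. A retry-resilient workflow must now succeed.
	if err := runWorkflow(steps, ""); err != nil {
		fmt.Println("not retry-resilient:", err)
		return
	}
	fmt.Println("workflow succeeded despite repeated injected failures")
}
```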
In a nutshell, idempotency is about achieving deterministic results through repetition. A system that behaves in a way that we can reliably predict, no matter the number of times it’s run, is one we can trust.
Adaptability
Adaptability can mean many things in the context of software engineering, but here we will focus on what it means in relation to reliability. The goal is to design systems with knobs we can turn to fix, change, or optimize in production. For anything we have deployed, how easy is it to implement a change? Common strategies include hot-fixes, feature flags, and dynamic configuration. Dynamic configuration can typically be sourced from environment variables or external sources such as a database, and changes driven through database calls or feature flags in particular can often go live without re-deploying the application.
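As a minimal sketch, assuming a hypothetical ENABLE_NEW_BILLING_PATH flag, the check below is evaluated on every request, so behavior can be flipped from configuration rather than a code change; a database- or flag-service-backed lookup would follow the same shape.

```go
package main

import (
	"fmt"
	"os"
)

// flagEnabled reads a hypothetical feature flag at call time rather than at
// startup, so a configuration change takes effect without shipping new code.
// Swapping this for a database or flag-service lookup keeps the same shape.
func flagEnabled(name string) bool {
	return os.Getenv(name) == "true"
}

func handleRequest() string {
	if flagEnabled("ENABLE_NEW_BILLING_PATH") {
		return "new billing path"
	}
	return "stable billing path"
}

func main() {
	fmt.Println(handleRequest())
}
```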
Sometimes this isn’t so simple and requires a more nuanced design. Consider an agent that communicates with a central control plane. If the agent contains too much complex business logic or makes too many decisions, upgrades quickly become painful: every logic change requires a new agent release and rollout. Instead, we can offload that decision-making to the control plane, making the agent little more than a receiver of its duties or tasks. This isn’t always possible, but a flexible agent that retrieves from a control plane the information that dictates its actions or responsibilities can make life much more manageable.
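A rough sketch of that shape, with an invented ControlPlane interface and Task type standing in for whatever API the real system exposes: the agent only fetches a task, performs it, and reports back, so decision-making lives server-side.

```go
package main

import (
	"fmt"
	"time"
)

// Task is whatever unit of work the control plane hands out; the fields
// here are illustrative, not a real API.
type Task struct {
	ID     string
	Action string
}

// ControlPlane hides where decisions are made; the agent only fetches work
// and reports results, so business-logic changes ship on the server side.
type ControlPlane interface {
	NextTask(agentID string) (*Task, error)
	Report(taskID, result string) error
}

// runOnce is one iteration of a thin agent loop: poll, execute, report.
// A real agent would wrap this in a loop with backoff.
func runOnce(cp ControlPlane, agentID string) {
	t, err := cp.NextTask(agentID)
	if err != nil || t == nil {
		time.Sleep(time.Second) // nothing to do yet
		return
	}
	fmt.Println("executing", t.Action) // the agent only performs what it is told
	_ = cp.Report(t.ID, "ok")
}

// fakePlane stands in for a real control plane for the sake of the example.
type fakePlane struct{}

func (fakePlane) NextTask(string) (*Task, error) { return &Task{ID: "1", Action: "rotate-certs"}, nil }
func (fakePlane) Report(id, result string) error { fmt.Println("reported", id, result); return nil }

func main() {
	runOnce(fakePlane{}, "agent-42")
}
```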
Luckily, adaptability is the most straightforward principle to reason about. When building our software, we always need ways to change or adapt quickly and safely. Software engineers will always write bugs; it’s what we do, so any reliable system needs to embrace this fact and know how to act accordingly.
Building reliable systems is an art form, a craft that all engineers should practice and refine. There’s always room for improvement. The most reliable systems are those that just work. A reliable system is consistent. It’s simple enough to reason about because it keeps complexity at bay. Its idempotency creates predictable behavior. And if things do go wrong, it’s adaptable enough to recover quickly.