Once upon a time, our software was simple. All it took to go online was a database, a user interface, and some glue code between them, running behind a web server farm. Most of the business-critical data stayed in the database. People rented space in colocation centers and bought servers to run their databases and code. Those colocation centers were not reliable, and your business could go down due to power or cooling failures. The servers were not reliable either, and you could lose data to disk failures. A nasty cable cut somewhere, and the network could be out for hours or days. To get around such booboos, IT teams rented space in secondary colocation centers, bought servers and expensive replication systems to copy data to those secondary sites, and took steps to bring up the secondary sites in the event of a disaster.
Businesses worried about working with other companies whose infrastructure failures could hurt them. So legal people got involved to add contractual language to their agreements. Those clauses required vendors to implement disaster recovery policies. In some cases, the contracts also required declaring procedures for recovering critical systems at the secondary site, testing them once or twice a year, and sharing the results. Since breaches of such contracts could lead to termination of business agreements, companies hired compliance professionals to draft rules for their tech teams to follow. Tech teams begrudgingly followed those rules to ensure their backups and replication were working and that they could bring up the most critical systems in the secondary colocation centers. Paying such a disaster recovery tax made sense because the infrastructure was unreliable.
It was okay, sort of, until the cloud happened. Cloud providers started offering a variety of infrastructure services to build software. Open source gave us even more flexibility to introduce new software architecture patterns. As engineers, we loved that choice and started binging on it, adopting whatever cloud companies began offering. Of course, this choice gave us a tremendous speed advantage without requiring years of engineering investment to build complex distributed systems. Those services also began offering higher fault tolerance and availability than most enterprises could fathom. Moving systems from on-prem to the cloud became necessary to take advantage of that flexibility. Many people have built their careers around cloud transformation, and that work is still ongoing.
For a while, cloud services were flaky. It fell to customers to keep their businesses running even when a cloud provider’s systems were down. It seemed sensible to work around those problems and build redundancy in a second cloud region. Many people tried that pattern. Some, like Netflix, succeeded, or at least are known to have succeeded; I don’t know if that is still the case today. Many had partial success to the extent of getting some stateless parts of the business running from multiple cloud regions.
Around the same time, the SaaS industry took off. The proliferation of online systems increased complexity and fueled the hunger for automation across enterprises. This created opportunities for SaaS companies to fill that gap and offer a variety of services, from infrastructure to customer service to finance to sales and marketing. Relying on third-party SaaS became a necessity for every enterprise. You can no longer take code to production without depending on a subscription or pay-as-you-go service from another company. The net result of this flexibility and abundance is that almost everything is now interconnected.

We are now a tangled mess. There are no more independent systems. Our systems share fate with large swaths of the Internet. Almost every business now depends on other companies, mostly by consuming their services. Thus, all bets on building redundant secondary sites are off. You not only have to get your part right; you also need all your dependencies to do the same to be fully redundant. Most companies can't even make their own software redundant across multiple locations, given the variety of services they build, their interconnectedness, and their varied infrastructure needs. After all, building highly available and fault-tolerant systems requires more discipline, talent, and time than most enterprises can afford. Let's not kid ourselves.
Where does this leave us?
First, get unstuck from the old paradigm of redundancy in secondary sites. That is over-simplified thinking. It no longer makes sense for most companies to waste their precious resources on building redundancy across multiple cloud regions. Yes, cloud providers will fail from time to time, as last week's AWS us-east-1 outage showed. Yet they are still incentivized to invest billions of dollars and years of engineering time in the resilience of their infrastructure, services, and processes. As for yourself, instead of focusing on redundancy, invest in learning to use those cloud services correctly. These days, most cloud services offer knobs to help their customers survive failures, such as automated backups, database failover, and deployment across multiple availability zones. Know what those knobs are, and use them meticulously.
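To make that concrete, here is a minimal, hypothetical sketch of what turning those knobs can look like on AWS, using boto3. The instance name, sizing, and credentials are placeholders, not recommendations: the point is a managed Postgres instance with a standby in another availability zone and automated backups enabled.

```python
import boto3

# Hypothetical example: a managed Postgres instance with the resilience
# knobs turned on. Identifiers, sizing, and credentials are placeholders.
rds = boto3.client("rds", region_name="us-east-1")

rds.create_db_instance(
    DBInstanceIdentifier="orders-db",    # placeholder name
    Engine="postgres",
    DBInstanceClass="db.m6g.large",
    AllocatedStorage=100,
    MasterUsername="app_admin",
    MasterUserPassword="change-me",      # keep real secrets in a secrets manager
    MultiAZ=True,                # synchronous standby in another availability zone
    BackupRetentionPeriod=7,     # automated backups, retained for 7 days
    DeletionProtection=True,     # guard against accidental deletion
)
```

The specific API call matters less than the lesson: surviving an availability zone failure or a fat-fingered delete is often one parameter away, and knowing those parameters is far cheaper than building a second site.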
Second, if you truly need and care about five or more nines of absolute (i.e., not brown-out) availability for your business, make sure your business can afford the cost. To achieve such availability, you have to do several things right. You need people who understand how to build highly available, fault-tolerant systems; in most cases, you have to develop that talent in-house because it is rare. Then you need to standardize patterns like cells, global caches, replication systems, and eventual consistency for every critical piece of code you create. You will need to invest in paved paths that make those patterns easy to follow. Implementing those patterns takes time, and you need to get them right. Most importantly, you need a disciplined engineering culture that prioritizes high availability and fault tolerance in every decision, one that embraces constrained choices and sacrifices flexibility to get there.
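As one illustration of what those patterns look like in code, here is a minimal sketch of cell routing, assuming hypothetical cell endpoints: each tenant is pinned to a single isolated cell, so a cell-level failure only affects the tenants mapped to it.

```python
import hashlib

# Hypothetical cell endpoints; each cell is an isolated, independently
# deployed copy of the service stack.
CELLS = [
    "https://cell-1.example.internal",
    "https://cell-2.example.internal",
    "https://cell-3.example.internal",
]

def cell_for_tenant(tenant_id: str) -> str:
    """Pin a tenant to one cell so a failure in that cell only
    affects the tenants mapped to it (bounded blast radius)."""
    digest = hashlib.sha256(tenant_id.encode("utf-8")).hexdigest()
    return CELLS[int(digest, 16) % len(CELLS)]

if __name__ == "__main__":
    for tenant in ("acme", "globex", "initech"):
        print(tenant, "->", cell_for_tenant(tenant))
```

Even this toy version hints at the discipline involved: the hash-modulo mapping reshuffles tenants whenever the cell count changes, so a real implementation needs a persisted tenant-to-cell assignment, per-cell capacity management, and tooling to drain and migrate tenants.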
Third, somehow cajole your compliance and legal people into refining or dropping dangerous contract language like “secondary sites.” Unless your company's architecture is stuck twenty years in the past, such language no longer makes sense. Refining it can be easier said than done, since some contracts are old and fixing them may be the least important priority for your legal teams. Don't get me wrong: you still need to invest in game days and similar failure-embracing exercises to build resilience into your culture. But how you practice resilience needs to evolve.
We are indeed in a tangled mess. Resilience is costly. Know what you need, figure out the business cost, and do what you can.