How to Evolve Software for Minimum Disruptions: The Architect's Two Hats

7 min read14 hours ago

This article draws on our experience evolving large software systems and explores common change scenarios and how to anticipate and manage backward compatibility.

I originally published this article on the WSO2 site.

Introduction

Good software has a long lifespan, but the technological landscape constantly changes. To remain useful and relevant, your software must evolve, adapting to new demands, much like FAANG companies. This continuous evolution isn’t just expected; it’s often beneficial, leading to better, more resilient systems over time.

Changes come in many forms: evolving requirements, new features, and shifting dependencies, such as updated libraries or external APIs. We can’t avoid them. Effectively managing these changes is often the difference between long-term success and failure, requiring a deliberate strategy to navigate the process gracefully.

However, change requires effort from both system providers and users. As designers and architects, how can we navigate these necessary transformations? When implementing change, we must balance several critical factors: minimizing user impact, ensuring a painless transition if user action is required, and minimizing technical debt, like maintaining multiple parallel systems for too long.

To handle evolution well, we must adopt two distinct approaches:

Reactive Evolution: The “first hat” — How do we implement changes with the least disruption to current users?
Preventive Design: The “second hat” — How do we design systems so future changes are easier to implement or even unnecessary?

This article draws on our experience evolving large software systems that handle millions to billions of transactions in production across the globe. It explores common change scenarios and how to anticipate and manage backward compatibility. The first two sections focus on the reactive strategies (first hat), and the final section covers preventive design (second hat).

First hat: Reactively evolving SaaS products

In software-as-a-service (SaaS) environments, changes reach users almost instantly. To ensure a seamless transition, we often have to support both new and old versions simultaneously, giving users time to adapt.

A SaaS system interacts with users primarily through websites or APIs.

Managing website and UI changes

For websites, which are used by humans who are more adaptable, we can roll out most incremental changes directly to the current system. For significant user interface (UI) overhauls, the best practice is to:

Provide a switch to toggle between the old and new UI for a transition period.
Allow time for staff training and user adoption before forcing the transition.
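
As a minimal sketch of such a toggle, the function below picks the UI for a user during the transition period. The cutoff date, the function name, and the preference flag are all hypothetical; a real product would read them from its own settings.

```python
from datetime import date

# Hypothetical end of the transition period; after this the old UI is gone.
TRANSITION_ENDS = date(2030, 1, 1)


def select_ui(user_prefers_old: bool, today: date) -> str:
    """Return which UI to serve: the user's choice until the cutoff, then the new UI."""
    if today >= TRANSITION_ENDS:
        return "new"
    return "old" if user_prefers_old else "new"
```

Keeping the cutoff in one place makes the end of the transition an explicit, announced event rather than a surprise.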

Managing API changes

APIs present a harder challenge. Even minor changes can break client applications, making backward compatibility mandatory, at least during a transition period.

Version all APIs from the start, following Semantic Versioning (SemVer) best practices (major, minor, patch).
Only expose major versions to users. All minor and patch versions should be backward compatible.
When a breaking change is unavoidable, release a new major version. Support for old versions often needs to continue for six months to a year, as client-side code updates take time.
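
The versioning rules above can be sketched as follows. This is an illustration only: it handles plain MAJOR.MINOR.PATCH strings (no pre-release or build metadata), and the helper names and the `/v2/...` path layout are assumptions, not a prescribed scheme.

```python
import re
from typing import NamedTuple


class SemVer(NamedTuple):
    major: int
    minor: int
    patch: int


# Only the core MAJOR.MINOR.PATCH form is accepted in this sketch.
_SEMVER = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def parse_semver(text: str) -> SemVer:
    match = _SEMVER.match(text)
    if match is None:
        raise ValueError(f"not a MAJOR.MINOR.PATCH version: {text!r}")
    return SemVer(*(int(part) for part in match.groups()))


def public_path(resource: str, version: str) -> str:
    """Build the user-facing path, which exposes only the major version."""
    return f"/v{parse_semver(version).major}/{resource}"


def is_breaking_upgrade(old: str, new: str) -> bool:
    """Under SemVer, only a major version bump may break clients."""
    return parse_semver(new).major > parse_semver(old).major
```

With this layout, releases 2.4.1 and 2.5.0 are both served under `/v2/`, so clients never need to track minor or patch releases.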

To limit complexity and costs, API providers must eventually phase out older APIs.

Incremental changes to APIs

Adding new fields, operations, or resources is generally safe and maintains backward compatibility.
Removing fields or operations breaks backward compatibility, which necessitates versioning.
The goal is to support multiple versions using a single codebase by checking the API version and accepting the correct input.
Crucially, never change behavior without changing the API version, as this creates hard-to-detect bugs.
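
One way to serve several versions from a single codebase is to normalize the version-specific input at the edge and share everything else. The sketch below assumes a hypothetical customer API in which v1 sent a single `name` field and v2 split it in two; the field names and version numbers are illustrative.

```python
from typing import Any

SUPPORTED_VERSIONS = {1, 2}


def create_customer(api_version: int, payload: dict[str, Any]) -> dict[str, Any]:
    """Accept the input shape of the requested version, then run shared logic."""
    if api_version not in SUPPORTED_VERSIONS:
        raise ValueError(f"unsupported API version: {api_version}")

    if api_version == 1:
        # Hypothetical v1 shape: one combined "name" field.
        first, _, last = payload["name"].partition(" ")
    else:
        # Hypothetical v2 shape: the field was split, a breaking change.
        first, last = payload["first_name"], payload["last_name"]

    # An optional field added later is safe for both versions.
    return {"first_name": first, "last_name": last, "email": payload.get("email")}
```

The version check is the only place where the two contracts differ, which keeps behavior tied to the version number instead of changing silently.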

Major overhauls to APIs

For significant changes, a separate implementation may be necessary. Running versions in parallel, with an API gateway or load balancer routing messages to the correct version, is a common strategy until the old version can be safely retired.
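
The routing decision a gateway makes here can be sketched in a few lines. The backend URLs and the path layout are made up for illustration; a real gateway or load balancer would express the same mapping in its own configuration.

```python
# Hypothetical mapping from the major version in the path to a deployment.
ROUTES = {
    "v1": "http://orders-v1.internal:8080",
    "v2": "http://orders-v2.internal:8080",
}


def resolve_upstream(path: str) -> str:
    """Map '/v1/orders/42' to the backend that implements that major version."""
    segments = path.lstrip("/").split("/", 1)
    version = segments[0]
    backend = ROUTES.get(version)
    if backend is None:
        raise LookupError(f"no backend for API version {version!r}")
    rest = segments[1] if len(segments) > 1 else ""
    return f"{backend}/{rest}"
```

Retiring the old implementation then amounts to removing its entry from the routing table once traffic to it has stopped.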

A difficult scenario is a critical bug (e.g., a security flaw) that requires a breaking API change because the old API itself presents a security risk. While a feature flag can let users decide the behavior, this increases cognitive load. We should try to avoid the need for this kind of feature flag by striving for high-quality, bug-free releases in the first place.

Finally, be cautious of forward compatibility. If a critical issue forces a rollback of a new version, maintaining support for that version becomes increasingly complex. To avoid this, try to minimize rollbacks.

First hat: Reactively evolving on-premises products

When users run our software on-premises, they control the upgrade timing. While this flexibility means changes will not immediately affect their running systems, it introduces significant challenges and a higher long-term support burden for us.

The on-premises model carries the following downsides:

The user must handle the change process themselves, often without vendor/developer assistance, increasing the risk of mistakes.
The natural instinct to adhere to the principle, “do not fix what is not broken,” works against timely upgrades.
The longer a user delays the upgrade, the harder the process becomes, which further reduces the likelihood of the upgrade happening.
Unlike SaaS, older versions live much longer, significantly increasing our internal support and maintenance costs as well as the risks customers face.

Easy, safe, and painless upgrades are essential for on-premises products, saving everyone time and money in the long run. If you must break backward compatibility, the following areas require special care:

1. APIs (network and internal libraries)

Maintain backward compatibility as much as possible, perhaps by emulating older behavior through compatibility layers.
Clearly mark deprecated APIs and only drop support in the next major release.
If breakage is unavoidable, introduce it in a new major version, and make the breakage explicit.
It’s better to change the API signature than to silently change the behavior with the same signature, which can break key workflows unexpectedly.
When migrating, proactively detect incompatibilities and warn the user before the upgrade.
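
For library APIs, marking a deprecation and keeping the old entry point as a thin compatibility layer might look like the sketch below. The function names and the "3.0" removal version are hypothetical; `warnings.warn` with `DeprecationWarning` is the standard Python mechanism.

```python
import functools
import warnings


def deprecated(replacement: str, removed_in: str):
    """Mark a function as deprecated and warn callers every time it is used."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            warnings.warn(
                f"{func.__name__} is deprecated and will be removed in "
                f"{removed_in}; use {replacement} instead",
                DeprecationWarning,
                stacklevel=2,  # point the warning at the caller's line
            )
            return func(*args, **kwargs)
        return wrapper
    return decorator


def fetch_report(report_id: str, *, fmt: str = "json") -> dict:
    """New API (hypothetical): the output format is now explicit."""
    return {"id": report_id, "format": fmt}


@deprecated(replacement="fetch_report", removed_in="3.0")
def get_report(report_id: str) -> dict:
    """Old entry point, kept as a compatibility layer that emulates old behavior."""
    return fetch_report(report_id, fmt="json")
```

Because the new function has a different name and signature, the change is explicit, and the warning gives users a chance to find affected call sites before they upgrade.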

2. Configuration files

Configuration file changes are a frequent source of friction.

Aim for backward compatibility with older formats.
Include a version field in the configuration file to allow the software to apply necessary transformations or defaults.
Avoid silent changes in behavior; if a setting is unsupported, the product must fail clearly and inform the user.
Deprecate and remove options sparingly, only in major releases.
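
The version field makes it possible to upgrade old files step by step and to fail loudly on anything unrecognized. The following is a sketch under stated assumptions: the setting names, the three format versions, and the rule that a file without a version field is treated as version 1 are all invented for illustration.

```python
from typing import Any, Callable

CURRENT_VERSION = 3
KNOWN_KEYS = {"version", "timeout_seconds", "tls_enabled"}


def _v1_to_v2(cfg: dict[str, Any]) -> dict[str, Any]:
    cfg = dict(cfg)
    # Hypothetical rename: "timeout" became "timeout_seconds" in format 2.
    cfg["timeout_seconds"] = cfg.pop("timeout", 30)
    cfg["version"] = 2
    return cfg


def _v2_to_v3(cfg: dict[str, Any]) -> dict[str, Any]:
    cfg = dict(cfg)
    # New option in format 3; default to the behavior old files relied on.
    cfg.setdefault("tls_enabled", False)
    cfg["version"] = 3
    return cfg


MIGRATIONS: dict[int, Callable[[dict[str, Any]], dict[str, Any]]] = {
    1: _v1_to_v2,
    2: _v2_to_v3,
}


def load_config(raw: dict[str, Any]) -> dict[str, Any]:
    """Upgrade an older configuration to the current format, or fail clearly."""
    cfg = dict(raw)
    version = cfg.get("version", 1)  # assumption: missing field means format 1
    if not isinstance(version, int) or not 1 <= version <= CURRENT_VERSION:
        raise ValueError(f"unsupported configuration version: {version!r}")
    while version < CURRENT_VERSION:
        cfg = MIGRATIONS[version](cfg)
        version = cfg["version"]
    unknown = set(cfg) - KNOWN_KEYS
    if unknown:
        # Never ignore a setting silently; tell the user what is unsupported.
        raise ValueError(f"unsupported settings: {sorted(unknown)}")
    return cfg
```

Each migration step handles exactly one format change, so supporting a new format means adding one function rather than touching the older ones.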

3. Databases and data migration

Changes to databases are the most complex due to the inherent risk of migrating persistent user data.

There are two approaches: offline migration requires downtime, while online migration can happen on the fly. Each has tradeoffs.

The safest path is to avoid data migrations altogether when possible. If unavoidable, they require significant effort, clear documentation, and tooling for verification.
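
When a migration cannot be avoided, building verification into the migration itself is one option. The sketch below uses Python's built-in `sqlite3` module for an offline migration; the `users` table and its columns are hypothetical, and it assumes the connection was opened with `isolation_level=None` so that the transaction is controlled explicitly.

```python
import sqlite3


def migrate_users(conn: sqlite3.Connection) -> int:
    """Add and backfill display_name, verifying the result before committing.

    Assumes conn was opened with isolation_level=None (explicit transactions).
    Returns the number of migrated rows.
    """
    conn.execute("BEGIN")
    try:
        before = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        conn.execute("ALTER TABLE users ADD COLUMN display_name TEXT")
        conn.execute(
            "UPDATE users SET display_name = first_name || ' ' || last_name"
        )
        after = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
        missing = conn.execute(
            "SELECT COUNT(*) FROM users WHERE display_name IS NULL"
        ).fetchone()[0]
        # Verification: no rows lost, and every row was backfilled.
        if after != before or missing:
            raise RuntimeError(
                f"verification failed: {before} rows before, "
                f"{after} after, {missing} not backfilled"
            )
        conn.execute("COMMIT")
        return after
    except Exception:
        # Leave the database exactly as it was before the migration started.
        conn.execute("ROLLBACK")
        raise
```

Running the schema change, the backfill, and the checks in one transaction means a failed verification leaves the user's data untouched. Production migrations would typically add backups and more thorough checks.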

Second hat: Designing for future evolution

The goal here is to design systems so that the hard-to-change parts are done right from the start, and the rest is built for adaptability. Prevention is far more effective than reactive fixes.

Prioritize hard-to-change elements

As discussed in How to approach Software Architecture? A First Principle Perspective, certain system aspects are inherently resistant to change and require upfront investment and care. These include the following:

Public APIs
Database schema and choice (involves persistent user data)
Core frameworks and platforms
Integration points
Scalability design
User and security model

Best practices for future-proof design

Get APIs and Schemas Right Early: Invest time in gathering feedback, getting multiple perspectives, and aligning with well-established standards (e.g., HTTP, OAuth, SQL). Standards are stable, widely understood, and reduce the likelihood of disruptive changes.
Minimize Foundational Bugs: The most painful migration scenarios arise from shipping a bug in a foundational layer like the API or schema that requires a breaking fix later. Prevention is the best cure.
Encapsulate Dependencies Judiciously: While abstracting dependencies behind internal interfaces is sound, abstraction can add unnecessary complexity if users directly interact with a dependency. This is known as the “inner-platform effect” antipattern. Use this approach with care.
Keep Defaults in Code: Avoid storing default values in the database, as this creates an unnecessary dependency and often triggers migrations when the default changes.
Isolate Extensions: If your software allows user-defined extensions (plug-ins, scripts), take great care not to expose internal APIs or unstable data structures. Limit what is exposed and isolate extensions by running user code within a container or as a service. This protects extensions even if product dependencies change.
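
To illustrate the "Keep Defaults in Code" practice above, here is a minimal sketch in which the database stores only what the user explicitly chose and the default is resolved at read time. The setting name and its value are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# The default lives in code: changing it needs a release, not a data migration.
DEFAULT_PAGE_SIZE = 50


@dataclass
class StoredSettings:
    """A row as persisted; None means the user never chose a value."""
    page_size: Optional[int] = None


def effective_page_size(stored: StoredSettings) -> int:
    """Resolve the value to use, falling back to the in-code default."""
    if stored.page_size is not None:
        return stored.page_size
    return DEFAULT_PAGE_SIZE
```

Because unset values stay unset in storage, a future change to the default reaches every user who never overrode it without touching their data.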

Remember the old proverb: “Measure twice, cut once.” This is the essential mindset for architectural design.

Key takeaways: Sustaining trust and reliability

Software must evolve to remain relevant. Whether you’re building SaaS or on-premises products, backward compatibility is an essential engineering principle for maintaining trust, reliability, and long-term success.

Backward compatibility is core: Supporting older versions during an upgrade is mandatory for maintaining user trust and operational continuity.
SaaS requires immediate support: Changes impact users instantly, so SaaS systems must support both old and new versions, especially for APIs, and provide a clear migration path.
On-premises upgrades are complex: Since users upgrade at their own pace, changes to APIs, configuration files, and especially databases, must be handled with meticulous care to avoid major disruptions.
Database migrations are high-risk: Unlike code, data must be preserved and transformed safely. Avoid data migrations unless absolutely necessary, and if required, treat them as a serious engineering effort.
Prevention is the best strategy: Architectural decisions must prioritize getting hard-to-change parts (like APIs and schemas) right early on to make the rest of the system adaptable.

Ultimately, this is about building and keeping trust with your users. Trust takes years to build and minutes to lose. What is your organization’s biggest pain point in managing major software upgrades?

If you enjoyed this post, you might also like my new book, Software Architecture and Decision-Making, where you can find more examples.

Get the Book, or find more details from the Blog.

Please note that as an Amazon Associate, I earn from qualifying purchases.
