Having spent some time across quite distinct infrastructure codebases and developer environments, from isolated, simple platforms to large monolithic behemoths, I have seen a common pattern to solving new infrastructure problems. First, an engineer performs effective discovery to determine whether someone else has already solved the problem they are facing. Next, if the organization requires a domain-specific solution, they create an MVP that demonstrates the merits of the new system.
Ideally this would be delivered hand-in-hand with metrics that can show business alignment with said merits, but this is not always easy at first glance (quantifying developer productivity and happiness is a topic for another post :)). Finally, if the engineer is able to gain buy-in in this general direction, the team iterates!
It’s a tale as old as time as teams ship to meet the needs of their developers and customers. However, there is a thematic problem that tends to crop up in this process – technical debt. In a vacuum, it’s easy to accumulate more and more complex tools and services to meet business objectives. After all, product roadmaps keep growing at a faster and faster pace. Those AI-generated shell scripts and one-off microservices made developers more productive at a given moment in time, but improving them was never on the product roadmap! To keep this complexity in check, I want to emphasize two words from the first step of the process: “effective discovery”.
Far too often, engineers are not aware of the tools and systems that exist outside of their domain. There is seldom time to research unique solutions, pore over technical blogs, and network with colleagues throughout the industry to understand how their challenges map to potential solutions. Aside from the time constraint, there are additional confounding incentives that limit the discovery process. These include promotion (building something new is often perceived as more impressive than leveraging an existing solution; see the chat apps at killedbygoogle.com), management pressure to ship ever faster, and the bias towards leaning on a technology one already knows – using the same hammer on any nail in the nearby vicinity.
In the semiconductor industry, this accumulation of custom infrastructure is further exacerbated by a lack of OSS software (which is getting better, but the gap is still very real; another topic for its own post) and a lack of internal experience applying SWE development practices (which could be attributed to “hardware engineers using software in a brute force way”? Perhaps that’s a bit of an unfair assessment).
However, I strongly believe that spending the time to learn, explore, and exercise the creative muscle can be the most powerful tool for an infrastructure organization – problems once thought of as entirely unique and challenging are, if not isomorphic, at least similar to problems that others have solved. If a problem seems completely unsolved, I’m inclined to believe that I just wasn’t being creative enough. I’m very fortunate to have witnessed this story play out across organizations of various sizes and degrees of existing infrastructure depth. From Google to Tenstorrent to NVIDIA, each company’s history has played a major role in how I’ve seen developer tools and infrastructure be designed and perceived. And so I want to focus on bringing attention to this growing field of study: what I will refer to as taxonomies of infrastructure.
[Figure: Mock system taxonomy]
When I use the word “taxonomy”, I simply mean a hierarchical categorization based on the different properties that a system could provide, and the tradeoffs entailed by those properties. Think of it as similar to fault tolerance models in distributed systems, where you can show the relationship between different properties of replicated systems based on the CAP theorem.
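As a toy sketch of what I mean (the systems and one-line characterizations below are my own rough simplifications, not claims from any paper), a taxonomy can start as nothing more than a structured mapping from categories to systems to the properties they trade off:

```python
from dataclasses import dataclass, field

@dataclass
class SystemProfile:
    """One leaf of a taxonomy: a system and the properties it trades off."""
    name: str
    gains: list[str] = field(default_factory=list)      # properties it optimizes for
    gives_up: list[str] = field(default_factory=list)   # properties it sacrifices

# A CAP-style slice of a taxonomy for replicated data stores.
# The characterizations are deliberately coarse and illustrative.
REPLICATED_STORES = {
    "CP (consistency + partition tolerance)": [
        SystemProfile("etcd",
                      gains=["linearizable reads/writes"],
                      gives_up=["availability during partitions"]),
    ],
    "AP (availability + partition tolerance)": [
        SystemProfile("Cassandra",
                      gains=["writes accepted during partitions"],
                      gives_up=["strong consistency by default"]),
    ],
}

for category, systems in REPLICATED_STORES.items():
    print(category)
    for s in systems:
        print(f"  {s.name}: gains {s.gains}, gives up {s.gives_up}")
```

The point is not the data structure itself, but that writing the tradeoffs down forces the categories and their boundaries to be made explicit.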
Beyond an individual system, I also think it’s useful to take a macro lens to the types of systems that can exist and the overall function they serve within a larger organizational context.
A great example of this framing is the book Designing Data-Intensive Applications. It provides a pedagogical framework for comparing and contrasting components for data-intensive systems across a broad range of data ingestion and sharing use cases: ETL pipelines, traditional SQL databases, messaging technologies, etc. Along with each type of technology, it provides a group of relevant case studies and diagrams mapping out the connections between each type of technology. Any time I want to brainstorm ideas on how to orchestrate data flows between components, I reference it to get inspiration and share ideas with my teammates.
However, in terms of “infrastructure systems”, I have yet* to find a general equivalent. There are, of course, fragments of this story scattered across infrastructure blogs and software engineering methodology books that help weave it together.
These blogs and books provide significant depth and context behind infrastructure engineering challenges and approaches to software methodology. But in terms of a concrete breakdown of the differences and hierarchies between different types of infrastructure systems, I’m not sure the broader DevOps ecosystem has one.
One resource that does do this in the realm of infrastructure is the classic paper Build Systems à la Carte. Its organized approach to categorizing build systems and showing their tradeoffs makes it easier for build enthusiasts to speak the same language when referring to builds – without such language, I’ve found that build system discussions are often drastically simplified to “is it make-like?” or “is it Bazel-like?”. With the depth provided by this paper, we can examine both the holistic functionality that a build system provides and the nuances of its properties to make educated tooling decisions. I’ll admit that having a Haskell representation of the design space might not be feasible for every area of infrastructure, but I greatly appreciate the formalism of their approach.
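To give a flavor of the paper's framing (paraphrased from memory, so the paper itself is the authoritative source), it decomposes a build system into a choice of scheduler and a choice of rebuilder, and familiar tools fall out as points in that two-dimensional grid. A minimal sketch:

```python
# A rough reproduction of the "scheduler x rebuilder" design space from
# Build Systems à la Carte, written down from memory -- consult the paper
# for the authoritative classification.
DESIGN_SPACE = {
    # (scheduler, rebuilder): well-known example
    ("topological", "dirty bit"): "Make",
    ("restarting", "dirty bit"): "Excel",
    ("restarting", "constructive traces"): "Bazel",
    ("suspending", "verifying traces"): "Shake",
    ("suspending", "deep constructive traces"): "Nix",
}

def classify(scheduler: str, rebuilder: str) -> str:
    """Look up which well-known system (if any) sits at this point in the space."""
    return DESIGN_SPACE.get((scheduler, rebuilder), "no well-known example (yet)")

if __name__ == "__main__":
    print(classify("restarting", "constructive traces"))  # -> Bazel
```

Even this crude lookup table is more precise than “make-like vs. Bazel-like”: two tools can share a scheduler while differing entirely in how they decide what to rebuild.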
Going beyond build systems, though, I think there are many other areas of infrastructure for which we could construct useful groupings and breakdowns of distinct properties, along with the tradeoffs between them. These include:
Development & Build
- Developer environments
- Build systems
- Developer tools

CI/CD & Deployment
- CI systems
- CD systems
- Releases / deployments

Monitoring & Data
- Telemetry systems
- Alerting systems
- Metrics collection / analysis
- Data streaming

Collaboration
- Version control
On top of covering the nuances of each of these components in isolation, I think it would be invaluable to see how they connect with each other to build larger developer flows across an organization (a small sketch follows the list below), from the start of a project through the development process to each release – a cycle which you could then examine more carefully to understand what the org needs to prioritize. When one component is lacking in capability, the classic temptation is to build features into one of the other components in the larger picture or to spin up a new service. I’ve seen it happen many times, accelerating the process of accruing technical debt. A common framing and language for talking about this domain makes it much easier to make calculated decisions that stand the test of time, with:
- explicit awareness of the tradeoffs involved
- intentional engagement with the pedagogy
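As a minimal sketch of that larger picture (the components and edges here are hypothetical placeholders, not a prescription), a developer flow can be written down as edges between categories and then queried to see where a weak component ripples downstream:

```python
# Hypothetical end-to-end developer flow expressed as directed edges between
# infrastructure categories, from commit to production feedback.
FLOW = [
    ("version control", "CI system"),
    ("CI system", "build system"),
    ("build system", "CD system"),
    ("CD system", "release / deployment"),
    ("release / deployment", "telemetry"),
    ("telemetry", "alerting"),
    ("alerting", "version control"),   # incidents feed back into new changes
]

def downstream(component: str) -> list[str]:
    """Components that directly consume the output of `component`."""
    return [dst for src, dst in FLOW if src == component]

if __name__ == "__main__":
    print(downstream("CI system"))  # -> ['build system']
```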
With these aspirations in mind, I’m quite aware of the practical constraints on this kind of potentially verbose analysis – time to market, the size of the development team, or the depth of infrastructure expertise. But I think that for a sufficiently sized organization invested in building tools and platforms that serve it for the long run, exploring infrastructure taxonomization is critical. It will improve the quality of the infrastructure being built to best serve the needs of the team and reduce technical debt, because research and planning went into the careful organization of tools and services (rather than a scrappy collection of scripts and monolithic services).
Actionable Recommendations
Two practices that could provide relatively high impact for low cost:
1. Taxonomize your own internal infrastructure
By mapping the high-level goals of each system onto a broad category of infrastructure (e.g. “the build tool”, “the common developer scripts”, “the CI/CD system”), it should become rapidly apparent whether clear boundaries exist to begin with. The harder it is to draw these lines and isolate the modular behavior of each component, the more difficult it will be to maintain those systems and integrate with OSS tools/platforms.
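A hedged sketch of what a first pass might look like (the tool names and categories below are hypothetical placeholders): a flat inventory mapping each internal tool to the broad categories it serves, with anything spanning several categories flagged for a closer look.

```python
# Hypothetical inventory: each internal tool mapped to the broad categories it serves.
# Tools spanning more than one category are the first candidates for inspection.
INVENTORY = {
    "bazel": ["build system"],
    "run_ci.sh": ["CI system"],
    "deploy_bot": ["CD system", "release management"],
    "devtool.py": ["developer scripts", "build system", "telemetry"],  # does too much?
}

def blurry_boundaries(inventory: dict[str, list[str]]) -> list[str]:
    """Return tools whose responsibilities span multiple categories."""
    return [tool for tool, categories in inventory.items() if len(categories) > 1]

if __name__ == "__main__":
    for tool in blurry_boundaries(INVENTORY):
        print(f"{tool}: spans {INVENTORY[tool]} -- consider splitting or consolidating")
```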
2. Prioritize creativity and learning
Whenever there is a large gap between what your infrastructure provides today (slow builds, flaky CI, inaccurate alerts) and what developers need, they seldom have time to dig beyond a “quick fix” solution. Sometimes the hot patch is what will provide the most value. Sometimes, taking a fresh look at the broader horizon of how others have solved such challenges will reveal a glaring bias in the design. Being eager to think critically about developer productivity/ops and patient in the design process will make it easier to build infrastructure that is enjoyable (not “what are all these random Jenkins jobs/bash scripts?”) and long-lasting. Even if you don’t have problems with your tools and infrastructure today, bias towards going against the grain: brainstorm the issues with today’s systems, think about the technical and business needs of the future, and learn outside of a narrow focus.
* A Note on Completeness
It’s perfectly possible I’m missing existing comprehensive resources in this space. For example, I’m looking forward to reading “Continuous Delivery” by Jez Humble and David Farley, which may provide the kind of systematic approach to deployment and release infrastructure that I’m advocating for here. If you know of other resources that provide this kind of infrastructure taxonomy, I’d love to hear about them!