A taxonomy is a low-fidelity semantic structure with high returns on investment. Through hierarchical relationships alone, taxonomies support definitions, establish context, create meaning, and enable inference, without the complexity or the overhead of full ontological modeling. Taxonomy is the third stage in the **Ontology Pipeline® framework**, yet it’s often the first semantic structure organizations attempt to build. It’s also the structure many enterprises rush to automate, absent a controlled vocabulary. This creates problems.
Without controlled vocabularies and metadata standards in place, taxonomies become unwieldy, inconsistent, and difficult to maintain. But when built correctly—as part of a rigorous, iterative process—taxonomies deliver measurable value for both human navigation and machine reasoning.
This is why taxonomies do not have to be static or blind to new concepts and definitions. If built correctly, with defined processes and workflows, a taxonomy becomes more robust and resilient over time. Just as a library system constantly updates its classification schemes and controlled vocabularies to account for new resources, materials, and formats, so should an organization attend to the care and feeding of its own classification schemes and controlled vocabularies.
And yes—a taxonomy is a classification scheme.
This three-part series focuses specifically on taxonomy construction. We’ll examine what taxonomies are, how they function within the larger semantic ecosystem, when to build them with and without AI assistance, and how to implement them systematically. The examples come from real-world taxonomies spanning multiple domains: a non-governmental organization, government, education, marketing and advertising, software products, and comprehensive knowledge management systems.
This essay is focused on prep work—the essential phase that helps you design and plan for taxonomy construction. The prep work becomes the scope for the work and the framework for the taxonomy build itself. Without planning and a solid design framework, taxonomy construction is more difficult and far less predictable, often failing to meet organizational needs or lacking sufficient coverage.
While the prep work may seem arduous, it’s well worth the effort, because the resulting taxonomy will deliver accurate, data-driven coverage for domain concepts. And who doesn’t love processes and informed decision making?
A taxonomy is a hierarchical classification system that structures controlled vocabulary terms from broad to specific, establishing parent-child relationships between concepts. The hierarchy creates navigable paths through knowledge domains, enabling both discovery and context.
Let’s examine this four-level taxon (sub-hierarchy) from one of my projects, the Bee Taxonomy, part of the Shock knowledge graph:
This four-level hierarchy tells us that ‘Grocery shopping’ is a subcategory of ‘Eldercare’, which is a type of ‘Caregiving’, which is a type of ‘Community empowerment’ (in set-theoretic terms, each subcategory is a subset of its parent). Each level in the hierarchy narrows the scope, defining each concept as part of a larger whole. The structure provides classification for content and events, navigation for users, reasoning paths for algorithms, and rich sensemaking for AI systems.
A taxonomy is not:
→ A flat list of terms (that’s a list and maybe a controlled vocabulary)
→ A network of associative relationships (that’s a thesaurus)
→ An ontological system consisting of complex relations governed by rules and constraints
Taxonomy construction is when we take our controlled vocabulary—a flat list or index of defined, disambiguated terms—and introduce a hierarchical structure. Nothing more, nothing less. Also important to note: a taxonomy is itself a type of controlled vocabulary, structured as a hierarchy to support categorization and inference.
Taxonomies serve three essential functions in knowledge management systems. Beyond knowledge management, there are many other uses for taxonomies. For today, we will focus on classification, navigation, and reasoning.
When we create categories, we are building a classification scheme. If we classify “Book clubs” under “Literary events,” which sits under “Library activities,” which belongs to “Cultural activities,” we’re asserting that book clubs share properties with other literary events, and that literary events share properties with library activities. The hierarchy becomes a classification engine.
Users can browse from general to specific, discovering related concepts along each path. A taxonomy provides multiple entry points into content and can power website navigation trees. Searching broadly for “Cultural activities”, a user can narrow search results by navigating down through “Library activities” to “Literary events” to find “Book clubs.” The structure guides discovery while also supporting the accurate tagging of assets.
AI systems can traverse the hierarchy to infer relationships. If the system knows that clinical trials require participant consent, and “Human Drug Trials” is a type of “clinical trial,” it can infer that human drug trials also require participant consent. The taxonomy enables semantic reasoning without explicit rules for every concept.
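To make the reasoning function concrete, here is a minimal sketch of hierarchy-based inference in Python. The concept names, the child-to-parent mapping, and the consent rule are illustrative assumptions, not a real system:

```python
# Minimal sketch of taxonomy-based inference; the hierarchy and the
# requirement below are illustrative assumptions.
PARENT = {
    "Human Drug Trials": "Clinical Trials",
    "Clinical Trials": "Medical Research",
}

# A rule attached once, at the broadest level where it applies.
REQUIREMENTS = {"Clinical Trials": ["participant consent"]}

def inherited_requirements(concept):
    """Walk up the hierarchy, inheriting requirements from ancestors."""
    found = []
    while concept is not None:
        found.extend(REQUIREMENTS.get(concept, []))
        concept = PARENT.get(concept)
    return found

print(inherited_requirements("Human Drug Trials"))
# ['participant consent'] -- inferred via the Clinical Trials parent
```

No rule was ever written for human drug trials; the requirement is inherited through the parent-child relation, which is exactly the inference the hierarchy buys you.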
These three functions—classification, navigation, reasoning—make taxonomies essential infrastructure for search systems, content management platforms, recommendation engines, and AI applications.
While I would love for folks to jump into taxonomy construction without pause, it requires planning and thoughtful research. The reason: you are building an artifact for digital systems, and you intend this artifact to be used for specific purposes and use cases. Just as we shape software features and products with user requirements, use cases, and product requirement documents (PRDs), a taxonomy has its own set of processes to justify its construction and guide architectural decisions.
⚪️ Metadata schema(s) and Metadata Application Profile (MAP)
⚪️ A controlled vocabulary (see the Controlled Vocabulary series)
⚪️ ANSI/NISO Z39.19, Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies (reference for guidance)
⚪️ Use cases
⚪️ Domain research—internal and external
⚪️ Coverage model
⚪️ Taxonomy requirements document
🔵 OpenRefine or Excel for data cleaning and manipulation
🔵 Taxonomy editor or Excel
Let’s walk through these steps and prepare for taxonomy construction. But first, we will find our bearings within the Ontology Pipeline® framework.
I created the Ontology Pipeline® as a framework for building semantic knowledge infrastructures, to guide builders and thinkers through the logical steps that lead us towards reliable and logical knowledge systems. Each stage of the Ontology Pipeline® prepares us for the next stage of building, and provides a repeatable and measurable process for organizing information and knowledge.
We start with controlled vocabularies because we need clean, disambiguated concepts to model coherent, machine- and human-readable knowledge. Poor term quality can result in circular logic, where we end up with self-referential data.
Another common error in taxonomy building is the failure to reconcile and represent synonyms and acronyms, structured and associated with their preferred terms. In other words, not controlling the vocabulary. A machine (and some people) will understand these unresolved entities to be duplicates, or will get confused by multiple first-class terms sharing the same definition and meaning.
For the full build process, see Controlled Vocabularies, Part I (October 10, 2025): a controlled vocabulary is the first step in building a semantic knowledge ecosystem, and for many organizations, simply establishing operationalized controlled vocabularies delivers immense benefit, with or without artificial intelligence entering the equation.
There are all sorts of bad things that can happen with self-referential data and unresolved, poorly defined concepts: fractured search, broken inference, and inaccurate AI workflows. Therefore, the controlled vocabulary work will never go away. New terms and concepts will always enter systems. Controlled vocabulary work becomes habit, and it impacts each stage of the pipeline. Knowledge work is not static; it’s dynamic.
After we’ve created a controlled vocabulary, we move on to metadata schemas. For more about metadata schemas, read my series, Metadata as a Data Model. Metadata schemas inform our knowledge architectures as to how data is represented using metadata.
Metadata schemas also tell us how data is transported through systems, successfully or unsuccessfully. Metadata itself gives us greater insight, clarifying ownership of data assets and metadata, while also exposing where there are gaps within metadata systems and data catalogs. Metadata schemas inform our building while also illuminating gaps in machine and human understanding.
The third stage of the Ontology Pipeline® is taxonomy and yes—here we are.
We have our controlled vocabulary, our analysis of metadata schemas and hopefully a metadata application profile (MAP) to account for the metadata highways and byways within an organization. When we design and construct a taxonomy, we use our controlled vocabulary as a foundation, from which we will structure a hierarchy.
The metadata schema and MAP are used to build use cases and harvest more preferred labels, synonyms and acronyms. The MAP is used to account for the movement of data, capturing data and metadata architectures, pipelines and data ownership, amongst other things. We use the MAP to guide our taxonomy research and design processes, so that we may apply rigor in analysis and design tasks.
Metadata schemas become architectural blueprints for the taxonomy, while controlled vocabularies become the material from which we construct taxonomies. Together, the first two stages of the Ontology Pipeline® prepare builders for taxonomy construction, leading to more reliable taxonomy build processes that take into account how language and metadata are leveraged within existing systems.
With our controlled vocabulary, metadata schemas, and MAP at the ready, we can now create a coverage model, to ensure that we are building a taxonomy that is fit for purpose. A coverage model includes the controlled vocabulary as an input, so that we may assess gaps in coverage, which amount to opportunities for category expansion in our taxonomy building phase.
A coverage model utilizes metadata schemas and the MAP to determine data and metadata ownership. Moreover, metadata and the MAP are used to determine the necessary depth of a hierarchy. How many levels of categorization are needed to fairly represent concepts? Finally, the MAP and metadata schemas highlight taxonomy risks such as where silos exist, mismatches in vocabularies and variances in how data is imagined within a system.
As a first step in assembling a coverage model, we start by gathering use cases and defining taxonomy requirements.
There’s likely a reason, or more than one, for needing a taxonomy. What are the use cases? For most organizations, the reasons for a taxonomy are often simple: improving search and retrieval, supporting a machine learning project that seeks to detect overpayments, or improving the accuracy of LLM output for a customer-facing product.
Start by collecting any and all use cases where a taxonomy is being used or will be used. Don’t be afraid to collect existing use cases from information and knowledge management workspaces such as SharePoint and Confluence. Use cases are about identifying problems and opportunities.
Create a spreadsheet, where you can organize each use case, the problem(s) to be solved, and how taxonomy can solve the stated problems, like this:
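A minimal illustrative layout (only ‘Streamlining Content Discovery’ comes from this project’s requirements; the other cells are hypothetical):

| Use case | Problem(s) to be solved | How taxonomy helps |
| --- | --- | --- |
| Streamlining Content Discovery | Assets are scattered and hard to find across repositories | Hierarchical browse paths and consistent tagging |
| Improving LLM output accuracy | Responses use inconsistent, ambiguous terminology | Preferred labels and synonyms ground output in controlled terms |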
Your requirements document will guide you in determining the starting point for your taxonomy build, in addition to the appropriate level of granularity required to represent concepts. If we keep our scope constrained and start with the first listed item, ‘Streamlining Content Discovery’, we have enough input to start the coverage model research, discovery, and reconciliation processes.
Pick one use case, or several if possible, choosing according to priority, scope, and feasibility of taxonomy implementation and integration within systems. This means a fair amount of diplomacy: creating agreements with teams and orgs, and coordinating with workflows and product management.
You will likely have to engage in formal product and project discussions, with proposals and presentations. In some organizations, proposing a taxonomy project as a pilot or proof-of-concept (POC) can fast-track a taxonomy project, bypassing formal organizational processes.
And with that, we are off to the races.
Let’s talk coverage and coverage models—the planning phases of taxonomy construction.
A coverage model is a structured assessment of what your taxonomy must cover to serve its intended purpose. It maps your domain boundaries, identifies required concept areas, and establishes completeness criteria before you begin hierarchical construction.
Without a coverage model, you’re building blind—you don’t know if you’ve captured 40% or 90% of your domain, whether critical gaps exist, or if you’re overinvesting in areas that don’t matter to users. The coverage model answers four essential questions:
✅ What topics must be included?
✅ How deep should each branch go?
✅ What level of granularity do users and digital systems need?
✅ What are competitors or comparable organizations covering that we need to match or exceed?
This is where domain research is required, in order to have enough input to guide the coverage model.
The coverage model itself will include a spreadsheet workbook, documents, and assets, organized in a dedicated folder or shared drive where you will collocate your research. This will become your collection of information and knowledge assets.
A coverage model also serves as the foundational research framework that ensures your taxonomy reflects the actual language, concepts, and organizational needs of your domain. Rather than designing a taxonomy based on assumptions or a single stakeholder’s perspective, the coverage model systematically captures terminology from the full breadth of sources that your taxonomy will eventually serve.
This approach grounds your controlled vocabulary in evidence, revealing both the explicit terms people use and the implicit conceptual structures that organize their thinking.
Because a coverage model relies upon domain research as input, let’s be specific about what we are looking for in our research tasks. Before embarking on our quest for more input, have your controlled vocabulary on hand, always. Our controlled vocabulary tells us what is already accounted for and represented.
We are looking to discover where there are gaps in coverage, and where the business departs from how the rest of the world defines things and concepts. Aligning on definitions and the representation of things on the Web and in the real world is critical, especially where an AI system will be an end user. Mismatches between internal definitions and real-world definitions often result in AI system failures, as a model’s training data comes from the open internet, whose sense of reality is shaped by crowd-sourced understandings of what things are and what those things mean.
This misalignment—between the real world and internal vocabularies—should be captured and lightly documented, to flag concepts whose definitions contradict more widely accepted definitions.
For example, if your organization defines Journey as a product feature used to construct customer journeys for marketing campaigns, the taxonomy will want to support that unique concept definition of Journey through definitions and parent-child relations. This is what is meant by “context” relative to the value proposition of taxonomies.
Here, the Journey Taxonomy for Acme Products defines the concept, Journey, through its parent and child relations:
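To sketch the contrast in text (every label besides ‘Journey’ is hypothetical): the Acme path might read Acme Products → Marketing automation → Journey → Journey templates, while a general-purpose path might read Travel → Trip planning → Journey.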
The Journey Taxonomy by Claude Opus 4.5 is not wrong; it’s just not how Acme defines Journey as a feature in a product.
As you can see, the concept ‘Journey’ can assume different meanings by way of a hierarchy, whereby the parent-child relations add context and meaning to each category in the classification scheme. Journey could also mean the stadium rock band, but we will save that taxonomy for another day.
Effective coverage modeling begins with casting a wide net across the documentary landscape of your organization. Sample documents should include unstructured content such as reports, policy documents, marketing materials, and internal communications in formats ranging from Word documents and PDFs to presentation decks.
Structured sources prove equally valuable—spreadsheets containing product catalogs, customer data fields, or inventory classifications often reveal hidden taxonomic logic embedded in column headers and categorical values. Legacy databases, CRM systems, and enterprise applications contain metadata schemas that represent previous attempts to organize information. Even email subject lines and ticket categories from help desk systems offer insight into how people naturally categorize their work.
The goal is not exhaustive collection but representative sampling—enough variety to surface terminology patterns.
Internal coverage should extend across departmental silos, since different teams often develop their own vocabularies for overlapping concepts. Marketing may describe products using customer-facing language while engineering uses technical specifications; legal employs regulatory terminology while sales focuses on competitive positioning.
Gathering term lists from each stakeholder group reveals these dialectal variations and highlights where synonymy, homonymy, and conceptual gaps exist. Interview subject matter experts, review departmental documentation, and examine the taxonomies already embedded in team wikis, SharePoint sites, and project management tools. This cross-functional approach ensures your taxonomy can serve as a bridging vocabulary rather than privileging one department’s worldview.
Beyond internal sources, publicly available taxonomies and classification schemes provide invaluable benchmarks for coverage and structure. These external resources offer tested conceptual frameworks and reveal industry-standard terminology that your stakeholders may expect. For a list of open, available taxonomies, consult the Taxonomy and Thesaurus Catalog, organized by theme.
Competitor analysis also belongs in external coverage work. Examine how similar organizations structure their navigation, tag their content, and categorize their offerings. This competitive intelligence reveals both opportunities for differentiation and baseline expectations within your industry.
With sources identified, the practical work of term harvesting begins. Capture involves extracting candidate terms through manual review, automated text analysis, corpus analysis, or a combination of all three. Natural language processing tools can surface frequently occurring noun phrases and named entities, while human review catches contextual nuances that algorithms miss. Maintain provenance throughout—record which terms came from which sources, preserving the connection between vocabulary and origin.
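As a hedged sketch of the automated side, here is one way to surface candidate noun phrases and entities with spaCy in Python (assumes the en_core_web_sm model is installed; the sample text is invented):

```python
from collections import Counter

import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

text = "The journey builder lets marketers assemble customer journeys."
doc = nlp(text)

# Noun chunks and named entities are common sources of candidate terms.
candidates = Counter(chunk.text.lower() for chunk in doc.noun_chunks)
candidates.update(ent.text.lower() for ent in doc.ents)

print(candidates.most_common(10))
```

Human review still follows; frequency alone doesn’t tell you whether a phrase is a concept worth keeping.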
Let’s focus on corpus analysis, as I find it to be the most reliable and easiest way to gather high quality vocabulary terms.
Corpus analysis involves TF (Term Frequency) and TF-IDF (Term Frequency–Inverse Document Frequency) algorithms. TF measures how often a word appears in a document; IDF discounts words that appear across many documents, so TF-IDF surfaces the terms that are distinctive to a given source. To run corpus analysis, refer to this guide, with my suggestions for various corpus analysis methods and tools.
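Here is a minimal sketch using scikit-learn’s TfidfVectorizer (the documents below are placeholders; in practice, each entry would be the text of one source document from your corpus):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Placeholder corpus -- substitute the text of your source documents.
docs = [
    "eldercare caregiving grocery shopping support services",
    "campaign journey marketing automation templates",
    "clinical trial participant consent protocol",
]

vectorizer = TfidfVectorizer(ngram_range=(1, 2), stop_words="english")
tfidf = vectorizer.fit_transform(docs)

# Rank terms by summed TF-IDF weight across the corpus.
scores = tfidf.sum(axis=0).A1  # .A1 flattens the matrix to 1-D
terms = vectorizer.get_feature_names_out()
top = sorted(zip(terms, scores), key=lambda t: t[1], reverse=True)

print(top[:10])  # candidate terms, highest-weighted first
```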
Another option is to partner with an engineer who can assist with the corpus analysis. The most important part of corpus analysis is to keep all assets together, per source, so that we can maintain the provenance and source of every term extracted.
To reconcile the corpus results, populate the extracted terms in a spreadsheet workbook, with a dedicated tab for each corpus source. For example, the terms extracted from the collection of marketing documents and resources will populate a workbook tab named ‘marketing’, as in the sketch below.
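Here’s a hedged sketch of that workbook layout using pandas (the sources and terms are invented; writing .xlsx requires the openpyxl package):

```python
import pandas as pd

# Hypothetical extraction results, keyed by corpus source.
harvested = {
    "marketing": [("customer journey", 42), ("campaign", 37)],
    "engineering": [("journey builder", 18), ("api endpoint", 29)],
}

# One tab per source keeps provenance intact through reconciliation.
with pd.ExcelWriter("term_harvest.xlsx") as writer:
    for source, terms in harvested.items():
        df = pd.DataFrame(terms, columns=["term", "frequency"])
        df["source"] = source  # provenance recorded on every row
        df.to_excel(writer, sheet_name=source, index=False)
```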
This is one of the core grind tasks in vocabulary harmonization.
The mechanics of worksheet reconciliation deserve their own essay—it’s a methodical process with decision points and edge cases that benefit from detailed treatment.
For guidance on running your own corpus analysis, these tutorials cover the fundamentals:
“Analyzing Documents with TF-IDF” from Programming Historian: a comprehensive walkthrough of TF-IDF foundations using Python
“TF-IDF with Scikit-Learn” from Introduction to Cultural Analytics: a practical implementation using scikit-learn’s TfidfVectorizer
“Text Mining with R: TF-IDF” from Text Mining with R: covers term frequency analysis using the tidytext approach
For now, here’s the high-level approach.
Your reconciliation workbook needs to account for three things: term provenance (which source each term came from), term frequency (how often it appeared in that source), and mapping status (how the term relates to your existing controlled vocabulary).
Work through the harvested terms in three passes.
First, resolve exact matches between harvested terms and your existing controlled vocabulary. These are terms you already have—the corpus analysis simply confirms they’re in active use. Note the match, capture the frequency data, and move on. High frequency across multiple sources validates your vocabulary choices. Exact matches that appear rarely might signal that your vocabulary uses terminology that doesn’t reflect actual usage—worth flagging for review.
Second, for terms that don’t match exactly, use clustering to gather like terms together. These are synonyms, spelling variants, acronyms, and related expressions that orbit the same concept. Map these clusters to existing categories in your controlled vocabulary. The clustered terms become candidates for alternative labels—different ways of expressing concepts you already have. The clustered terms can also be candidates for subcategories in your taxonomy—narrower terms that help to round out the description of a category. Term frequency within clusters helps you see which variants are common enough to merit inclusion.
Third, terms that neither match exactly nor cluster thematically with existing concepts are your new candidates. These represent potential gaps in your vocabulary—concepts the domain uses that you haven’t yet captured. Note each new candidate term along with its source, marking it as a candidate for expansion. Don’t add these to your vocabulary yet; flag them for evaluation during the structuring phase.
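Here is a minimal sketch of the three passes using only the Python standard library. The cutoff value is an assumption, and string similarity only catches spelling variants; true synonyms and acronyms need richer methods (curated synonym rings, embeddings):

```python
import difflib

def reconcile(harvested, vocabulary, cutoff=0.85):
    """Pass 1: exact matches. Pass 2: near-match clusters.
    Pass 3: everything left becomes a new candidate term."""
    vocab = [term.lower() for term in vocabulary]
    exact = [t for t in harvested if t.lower() in vocab]

    clusters, candidates = {}, []
    for term in (t for t in harvested if t not in exact):
        # Cluster string-similar variants near an existing preferred term.
        near = difflib.get_close_matches(term.lower(), vocab, n=1, cutoff=cutoff)
        if near:
            clusters.setdefault(near[0], []).append(term)
        else:
            candidates.append(term)  # a potential coverage gap
    return exact, clusters, candidates

exact, clusters, candidates = reconcile(
    ["Eldercare", "elder care", "Grocery shopping", "Respite care"],
    ["Eldercare", "Grocery shopping", "Caregiving"],
)
print(exact, clusters, candidates)
```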
At the end of reconciliation, you have your material to work with: your original controlled vocabulary (validated and annotated with frequency data), clustered terms ready to become alternative labels, and a list of new candidate terms, each with its provenance noted.
To further bolster a vocabulary, map your existing controlled vocabulary to publicly available industry standard vocabularies and taxonomies. This is an additional or alternative way to capture more terms to be considered.
Standard vocabularies represent the accumulated work of domain experts and standards bodies. Mapping to them reveals terminology your organization hasn’t adopted, concepts you haven’t named, and relationships you haven’t articulated. Refer to the Taxonomy and Thesaurus Catalog, published on my Substack.
Many of these vocabularies are published in SKOS format and available as linked data, making automated mapping possible using SKOS mapping properties like exactMatch, closeMatch, broadMatch, and narrowMatch. Alignment with industry-standard vocabularies is a powerful mechanism for staying relevant to a domain and an industry as a whole, especially with AI.
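As a hedged sketch, here’s how such a mapping might be asserted with rdflib in Python (the namespaces and concept URIs are hypothetical placeholders, not real vocabularies):

```python
from rdflib import Graph, Literal, Namespace
from rdflib.namespace import SKOS

# Hypothetical URIs standing in for your vocabulary and an external
# standard vocabulary.
ACME = Namespace("https://example.com/vocab/")
EXT = Namespace("https://example.org/standard/")

g = Graph()
g.bind("skos", SKOS)

journey = ACME["journey"]
g.add((journey, SKOS.prefLabel, Literal("Journey", lang="en")))
# Assert alignment with the external standard's concepts.
g.add((journey, SKOS.closeMatch, EXT["customer-journey"]))
g.add((journey, SKOS.broadMatch, EXT["marketing-automation"]))

print(g.serialize(format="turtle"))
```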
While an organization may have its own marketing and branding terminology, that same terminology may not make sense outside the walls of the organization. In practice, this may mean adopting industry-standard terminology as alternative labels, ultimately enabling more robust category definitions and improved entity reconciliation.
If I’m covering a domain—any domain—I want coverage for the domain, not just my company. This matters for LLMs, for discovering new opportunities in the marketplace, and for maintaining a handle on the competitive landscape: we need to cover not just what we know but also what we don’t know.
A taxonomy that includes terminology uncommon in company speak is valuable for creating a tech radar of the domain or industry. This supports signal detection for upcoming trends, forecasting and competitive analysis. It also helps describe domains more fully and contextualize concepts by grounding them in facts beyond internal assumptions.
This aspect of taxonomy building always catches folks off guard: why cover concepts that are outside of the things an organization cares about right now?
I once worked at a company where the subject matter experts (SMEs) took it upon themselves to remove half of the concepts from a vocabulary because they deemed them irrelevant to the domain or company. The SMEs viewed the extra concepts as noise, but really, that breadth was a competitive advantage.
As a librarian, I understand that distribution and fair coverage of subject areas, topics, and domains is critical to understanding. Comprehensive coverage mitigates bias. A vocabulary built only from internal sources reflects internal blind spots. A vocabulary built from domain-wide corpus analysis and mapped to industry standards reveals the full landscape—including the parts your organization hasn’t noticed yet, the terminology competitors use that you don’t, and the emerging concepts that haven’t made it into your operational language.
This is how taxonomies become strategic assets.
From the gathered research and prep work, the next step is to assemble the actual coverage model framework, which will serve as your design framework for taxonomy construction.
The complete coverage model includes:
the existing controlled vocabulary
metadata schemas
metadata application profile
use cases
taxonomy requirements document
collections of documents and resources used for corpus analysis
the workbook with harvested terms, one tab per source
the reconciliation work (exact match, clusters and candidate terms)
The framework itself will be a requirements document that becomes your taxonomy blueprint, informing hierarchy depth decisions (how many levels are needed), preventing scope creep, and giving stakeholders concrete metrics for completeness. You’ll know you’re done when you’ve achieved the coverage targets you defined upfront.
The coverage model documentation becomes your framework for informed taxonomy design decisions. By tracking which sources contributed which terms, you can identify emergent themes. Perhaps customer service documentation emphasizes problem categories that product documentation ignores, or recent materials introduce terminology absent from legacy content. These patterns inform scope decisions about what your taxonomy must cover comprehensively versus peripherally.
Concept frequency analysis reveals depth requirements. Domains where many fine-grained distinctions appear in source materials warrant deeper hierarchical development, while areas with sparse or generic terminology may need only shallow coverage. Trend analysis across document dates can show evolving vocabulary, helping you anticipate where your taxonomy will need flexibility for emerging concepts.
Finally, source documentation supports governance. When stakeholders question why certain terms were included or excluded, or why particular hierarchical relationships were established, the coverage model provides evidence-based justification. This transparency builds trust in the taxonomy as a representation of organizational reality rather than an arbitrary imposition, and it establishes the baseline against which future taxonomy evolution can be measured.
People don’t care about concepts, naming conventions and taxonomies until THEIR concepts and naming are challenged. Retaining term sources will come in handy when there is a word war. Best to come prepared with citations.
If you would like me to publish a coverage model template, drop a message and let me know.
The coverage model IS the necessary prep work that drives key structural decisions. It helps determine how many levels your hierarchy needs—where depth serves precision and where it creates unnecessary complexity. It identifies where alternative labels are essential to capture the varying ways people describe the same concepts, ensuring your vocabulary meets users where they are rather than forcing them to guess your preferred terminology. And it highlights areas requiring attention: gaps where domain coverage is thin, ambiguities where a single term carries multiple meanings that need disambiguation, and overlaps where concepts blur into each other and require clearer boundaries.
The second essay in this series will walk through the nuts and bolts of structuring your vocabulary—taking the reconciled terms, clustered candidates, and new additions and organizing them into a hierarchy with parent-child relations, alternative labels, and definitions. The coverage model will guide every decision.
Taxonomy building is patient work. It requires listening to how people actually talk about their domains, respecting the accumulated wisdom of industry standards, and holding space for what your organization doesn’t yet know it needs to name. The payoff is a vocabulary that doesn’t just organize what exists—it illuminates what’s emerging, reveals blind spots, and positions your organization to see the full landscape of its domain. A well-built taxonomy is quiet infrastructure with loud impact. It makes search work, makes AI systems smarter, and makes knowledge findable. Start with coverage. The structure will follow.
Author’s Note: This article is part of the Intentional Arrangement series on building semantic knowledge management systems.
Connect with me on LinkedIn or subscribe to Intentional Arrangement for upcoming articles in this three-part taxonomy series.
**About me.** I’m an information architect, semantic strategist, and lifelong student of systems, meaning, and human understanding. For over 25 years, I’ve worked at the intersection of knowledge frameworks and digital infrastructure—helping both large organizations and cultural institutions build information systems that support clarity, interoperability, and long-term value.
I’ve designed semantic information and knowledge architectures across a wide range of industries and institutions, from enterprise tech to public service to the arts. I’ve held roles at Overstock.com, Pluralsight, GDIT, Amazon, System1, Battelle, the Oregon Health Authority, the Department of Justice, and, most recently, Adobe. I built an NGO infrastructure for Shock the System, which I continue to maintain and scale.
Throughout the years, I’ve worked a bunch at GLAM organizations (Galleries, Libraries, Archives, and Museums), including the Smithsonian Institution, The Shoah Foundation for Visual History, Twinka Thiebaud and the Art of the Pose, Nritya Mandala Mahavihara, the Shogren Museum, and the Oregon College of Art and Craft.
And through it all, I am a librarian.