This is post 3 out of my series on how I structure my data pipelines. If you’re interested in reading the first two, you can find them here:
Overview
You don’t have to read the prior posts, but the first post gives a mile-high overview of it all, and the rest of this series dissects individual layers.
Silver is where that clean data becomes a business model. This is where Kimball lives, with facts at the finest usable grain, dimensions with attributes for filtering and grouping, surrogate keys, and conformed dimensions that resolve the same entity across multiple sources.
Dimensional modeling is more important than ever.
The methodology has decades of literature behind it. The patterns are documented, the edge cases are known, and there’s no need to invent solutions from scratch. Facts and dimensions are composable primitives that mix and match to answer questions nobody has thought of yet. Paired with an ERD, tests, and naming conventions, Silver becomes something people can navigate without asking questions.
Gold models are the primary consumers of Silver. Every metric view, every wide table, every consumption artifact in Gold starts by referencing Silver facts and dimensions. But Silver also serves analysts/data scientists who need more granularity than Gold provides.
If querying Silver directly leads to defining a new metric, we maintain a clear-cut path for consumers to contribute it back to the Gold layer.
Regardless of everything in the previous posts and anything in future posts, let me make this explicit right now, ten toes down: facts and dimensions are the most important part of any of these pipelines. Without a robust, predefined data modeling methodology, all upstream effort goes to waste, and everything downstream is built on a foundation of jelly.
Facts are measurable business events at a specific grain. Each row represents something that happened, whether that’s a transaction, a click, a conversion, or a daily spend total. Facts contain numeric measures (amounts, counts, durations) and foreign keys to dimensions. Think of them as verbs: a customer purchased, a user clicked, a campaign spent.
Every fact has a declared grain: one row per transaction, one row per campaign per day, one row per user per session. My principle is to go as granular as possible, because rolling up is easy, but rolling down is impossible once aggregation has already happened. People should definitely consider pre-aggregating when the use case warrants it (e.g., tracking every click is expensive), but they should never prematurely aggregate.
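To make that asymmetry concrete, here’s a minimal dbt-style sketch (the click-grain source and its columns are illustrative names, not real models): rolling a click-level fact up to a daily grain is a one-line GROUP BY, but nothing can recover the clicks from the daily table.

```sql
-- Rolling up is cheap: click grain -> one row per campaign per day.
-- clicked_at and click_cost are illustrative column names.
select
    campaign_key,
    cast(clicked_at as date) as click_date,
    count(*)                 as clicks,
    sum(click_cost)          as spend_amount
from {{ ref('fct_clicks') }}
group by campaign_key, cast(clicked_at as date)
-- The reverse is impossible: once only daily totals are stored,
-- the individual click rows are gone for good.
```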
Dimensions are descriptive context. They contain the attributes used for filtering and grouping, such as campaign names, channel categories, customer segments, and dates. Dimensions give meaning and context to the facts. These are the nouns: the customer who purchased, the product they bought, the campaign that drove the click, and the date it happened.
The power is in the composability. A single dimension can join onto many facts. When an attribute in the dimension updates, every fact that references that dimension is immediately able to use that attribute. On the flip side, when a new fact table goes live, it immediately leverages all existing dimensions.
The Silver layer becomes a comprehensive dictionary. With this, the team can combine nouns with verbs to make cohesive sentences. However, without a good data modeling strategy, you will always find yourself speaking gibberish.
I always use surrogate keys for dimensions, even when they’re derived from a single natural key.
This is because natural keys are stable until they’re not. The product team may decide emails can change, the ad platform may introduce a new ID format, or the transaction number assumed to be unique may turn out to have duplicates. Surrogate keys create an abstraction layer that lets the dimension absorb these changes without refactoring every fact that references it.
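Here’s what that abstraction can look like, as a minimal sketch using the dbt_utils.generate_surrogate_key macro; the staging model and column names are assumptions:

```sql
-- Hash the natural key (plus its source system) into a surrogate key.
-- If the natural key's format ever changes, only this dimension changes;
-- facts keep joining on campaign_key.
select
    {{ dbt_utils.generate_surrogate_key(['source_system', 'campaign_id']) }}
        as campaign_key,
    source_system,
    campaign_id,      -- natural key, kept for lineage and debugging
    campaign_name
from {{ ref('stg_google_ads__campaigns') }}
```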
When multiple sources describe the same entity, Silver resolves them into one authoritative dimension. This is conformance. If Google’s, Meta’s, and LinkedIn’s ad services each have their own concept of a “campaign,” Silver produces one dim_campaigns that assigns surrogate keys, maps naming conventions to an internal taxonomy, and gives every fact table a single campaign dimension to reference.
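A sketch of one possible shape for that conformance (the staging model names and attribute list are illustrative):

```sql
-- dim_campaigns: union each platform's notion of a campaign, then
-- assign one surrogate key per (source_system, campaign_id) pair.
with unioned as (
    select 'google' as source_system, campaign_id, campaign_name, campaign_status
    from {{ ref('stg_google_ads__campaigns') }}
    union all
    select 'meta' as source_system, campaign_id, campaign_name, campaign_status
    from {{ ref('stg_meta_ads__campaigns') }}
    union all
    select 'linkedin' as source_system, campaign_id, campaign_name, campaign_status
    from {{ ref('stg_linkedin_ads__campaigns') }}
)
select
    {{ dbt_utils.generate_surrogate_key(['source_system', 'campaign_id']) }}
        as campaign_key,
    source_system,
    campaign_id,
    campaign_name,
    campaign_status
from unioned
```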
Conformed dimensions take this further: they bridge facts that measure different things.
Consider dim_campaigns in a marketing attribution model. It bridges fct_ad_spend (how much was spent), fct_clicks (how users engaged), and fct_conversions (what resulted in revenue). These facts have different grains and measure different events, but they all relate to campaigns. The campaign dimension is the connective tissue that lets an analyst ask “what was the return on ad spend for this campaign?” by joining spend to conversions through a shared key.
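In dbt-flavored SQL, that question is a small drill-across query: aggregate each fact to the shared campaign grain first, then join through the dimension (column names are illustrative):

```sql
-- Return on ad spend per campaign, joined through dim_campaigns.
with spend as (
    select campaign_key, sum(spend_amount) as total_spend
    from {{ ref('fct_ad_spend') }}
    group by campaign_key
),
conversions as (
    select campaign_key, sum(revenue) as total_revenue
    from {{ ref('fct_conversions') }}
    group by campaign_key
)
select
    c.campaign_name,
    v.total_revenue / nullif(s.total_spend, 0) as roas
from {{ ref('dim_campaigns') }} as c
left join spend       as s on s.campaign_key = c.campaign_key
left join conversions as v on v.campaign_key = c.campaign_key
```

Aggregating each fact before the join matters: joining two facts with different grains row-to-row would fan out and double-count the measures.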
That bridging is what makes the dimensional model feel like a coherent whole rather than a collection of disconnected tables.
The previous section covered what goes into Silver. This section covers how Silver relates to the surrounding layers and how the project is organized.
Each layer in the architecture makes assumptions about what’s upstream, and those assumptions are what make development predictable.
When building in Silver, I get to assume that Bronze has already handled the messy parts. Column names are consistent, data types are correct, nulls are normalized, and deduplication is done. I don’t have to think about whether Google sends dates as strings or whether Meta’s campaign IDs have leading zeros because I handled that in the Bronze layer. The Silver layer has the privilege of focusing solely on the business logic.
The same pattern flows downstream. When someone builds a new metric in Gold, they get to assume that Silver facts and dims are clean, tested, and well-documented. The grain is set in stone, the keys are stable, and the conformed dimensions mean that dim_campaigns works the same way whether it’s joined to fct_ad_spend or fct_conversions. People developing the Gold layer get to focus on consumption patterns rather than wrangling data into entities that make sense.
Silver also marks a shift in how the project is organized. Bronze was structured by the data’s source system because that’s how data arrives. Silver pivots to business domains. The marketing domain contains dim_campaigns, dim_channels, fct_ad_spend, and fct_conversions. It doesn’t matter that the underlying data came from three different ad platforms. An analyst looking for campaign performance goes to the marketing domain and finds it. Some assets, like date dimensions, are cross-domain and live in a shared area called silver_common and are referenced by facts and dims across the project.
The first tier of access control we established was the medallion’s layers themselves: only people who need to run custom analyses or contribute to our semantic layer get access to the facts and dimensions. But that tier alone is too coarse. As we move closer to where end consumers and business users interact with our data, we have to be more diligent, detailed, and granular about security and access control. In the Bronze layer, access was granted based on source systems and roles; now that the data is split into business domains like sales, marketing, and operations, we need finer control.
In the Silver layer, we continue the pattern by granting access on a per-domain basis.
Users get access to schemas, and schemas map to domains. Someone with access to silver_marketing can query the marketing facts and dims. Someone without access can’t. This is straightforward to administer and works well for broad boundaries.
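As a sketch, using Snowflake-flavored grant syntax (the exact DDL varies by warehouse, and the role name is illustrative):

```sql
-- Map a role to the marketing domain's schema.
grant usage on schema silver_marketing to role marketing_analyst;
grant select on all tables in schema silver_marketing to role marketing_analyst;
-- Cover models added in future runs without re-granting.
grant select on future tables in schema silver_marketing to role marketing_analyst;
```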
Row- and column-level security supplement schema access for finer-grained control. Column-level security handles sensitive attributes, such as a PII column in dim_customers that only certain roles can see, or an internal tax rate column in fct_transactions that is restricted to finance.
Row-level security (RLS) handles data partitioning so a regional manager only sees rows for their region, and a brand manager only sees rows for their brands. The foreign keys in facts point to dimensions, and those dimensions carry the attributes that drive access rules. We filter on both the facts and the dimensions so the right people see only the data they should.
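A sketch of what that can look like, again in Snowflake-style syntax; the entitlements table mapping users to regions is an assumption:

```sql
-- Allow a row through only if the current user is entitled to its region.
create row access policy region_rls as (row_region varchar)
returns boolean ->
    exists (
        select 1
        from security.region_entitlements as e
        where e.user_name   = current_user()
          and e.region_code = row_region
    );

-- Attach the policy to a dimension carrying the region attribute.
alter table dim_customers add row access policy region_rls on (region_code);
```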
Column-level masking (CLM) is used less often because it is usually easier to remove sensitive columns from the original model and move them into separate models. However, CLM provides a good balance of control and sustainability when managing many user roles.
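A matching sketch for CLM (Snowflake-style masking policy; the role name is illustrative):

```sql
-- Reveal the raw value only to approved roles; mask it for everyone else.
create masking policy email_mask as (val varchar)
returns varchar ->
    case
        when current_role() in ('PII_READER') then val
        else '***MASKED***'
    end;

alter table dim_customers modify column email
    set masking policy email_mask;
```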
Silver is the natural place for these additional security levers because the data is structured in business terms. In Bronze, the data isn’t yet organized around business concepts. In Gold, the data might be pre-aggregated, which can lose the granularity needed for row-level filtering.
The previous sections covered what Silver is and how it fits into the architecture. This section covers the practical details of building it.
Between each pair of layers sits an optional utility called intermediates. These models exist purely to make the layer they feed into easier, more performant, or more maintainable to build. In this case, I have intermediate models that take Bronze data and do some computation before building Silver models.
To be clear, intermediates are optional. Many pipelines won’t need them at all, and people shouldn’t reach for them right out of the gate. Folks should first try building the facts and dimensions directly. If you find yourself duplicating logic across models or waiting on expensive transformations that run multiple times, that’s when intermediates earn their place. The value is having a predefined pattern ready when the need arises. The team doesn’t have to invent a solution or debate where the model belongs. The architecture already has a spot for it.
The benefits are practical. Intermediates enable reuse without duplication. When the same transformation feeds multiple downstream models, you write it once. The channel mapping example serves both fct_ad_spend and dim_channels. If the taxonomy changes, I update one model instead of hunting through every fact and dimension that touches channel data.
Intermediates can also serve as performance checkpoints. Some transformations are expensive, such as complex joins, window functions across large datasets, or fuzzy matching logic. Materializing an intermediate as a table means downstream models query the cached result instead of recomputing it every time. I trade storage for compute. This is especially valuable when the same expensive operation feeds multiple downstream models. The cost is paid once during the pipeline run, not multiplied by every model that needs the result.
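In dbt terms, the checkpoint is just a materialization config on the intermediate. Everything in this sketch besides the config line is illustrative:

```sql
-- int_silver__marketing__cumulative_spend (hypothetical model):
-- persisted as a table so the window scan runs once per pipeline run.
{{ config(materialized='table') }}

select
    campaign_id,
    spend_date,
    sum(spend_amount) over (
        partition by campaign_id
        order by spend_date
        rows between unbounded preceding and current row
    ) as cumulative_spend
from {{ ref('stg_google_ads__spend') }}
```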
The key distinction is that intermediates are not products. They don’t get tests, documentation, or YAML specs. They exist to serve the models that do. If an intermediate starts feeling like it deserves its own documentation, that’s a signal it should probably be promoted to a proper model, and something in the layer has to get refactored.
Intermediates are plumbing. In this case, the facts and dimensions are the fixtures people actually use.
Naming:
Facts follow fct_{entity}.sql.
Dimensions follow dim_{entity}.sql.
Intermediates follow int_silver__{domain}__{name}.sql.
The prefixes make it immediately clear what kind of model I’m looking at.
Materialization: Facts and dimensions are tables. They’re the products of the layer and need to be persisted for querying. Intermediates are views by default, with the option of persisting as a table for expensive computations worth caching.
Schema: Facts and dims live in silver_{domain}. Intermediates live in silver_intermediates. The schema name signals both the layer and the domain.
Folder structure: Facts and dims go in silver/{domain}/ and intermediates also exist as a pseudo-domain in silver/_intermediates/. The underscore prefix keeps intermediates sorted to the top and visually separated.
YAML: Each domain has a _{domain}__silver.yml file with model descriptions, grain documentation, column descriptions, and tests (see the sketch after this list). Intermediates get no YAML because they’re not part of the interface.
Reference rules: Silver models reference Bronze staging models (stg_*) and Silver intermediates (int_silver__*). Facts reference dimensions via surrogate keys. If a Silver model needs to reference another Silver fact or dim directly, that’s usually a sign that something should be refactored into an intermediate.
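Here’s a trimmed sketch of what a domain YAML can look like; the columns and tests shown are illustrative, not a complete spec:

```yaml
# _marketing__silver.yml (excerpt)
version: 2

models:
  - name: fct_ad_spend
    description: >
      Ad spend by campaign and channel.
      Grain: one row per campaign per channel per day.
    tests:
      # Enforce the declared grain with dbt_utils.
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - campaign_key
            - channel_key
            - date_key
    columns:
      - name: campaign_key
        description: Surrogate key to dim_campaigns.
        tests:
          - not_null
          - relationships:
              to: ref('dim_campaigns')
              field: campaign_key
```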
Here’s how this plays out with the marketing attribution example from the series.
Intermediate:
int_silver__marketing__channel_mapping consolidates channel naming conventions from Google, Meta, and LinkedIn into a single taxonomy. It maps “Paid Search,” “Search Ads,” and “Sponsored Search” all to “Paid Search.” This intermediate serves both fct_ad_spend and dim_channels.
If these mappings existed in a weird nested JSON object that required a series of lateral self-joins and JSON array explosions, I would suggest materializing the intermediate as a table, but by default, it is a view.
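A sketch of the happy-path version (staging models and the exact mappings are illustrative):

```sql
-- int_silver__marketing__channel_mapping: one internal taxonomy over
-- every platform's channel naming.
with source_channels as (
    select 'google' as source_system, channel_name
    from {{ ref('stg_google_ads__channels') }}
    union all
    select 'meta' as source_system, channel_name
    from {{ ref('stg_meta_ads__channels') }}
    union all
    select 'linkedin' as source_system, channel_name
    from {{ ref('stg_linkedin_ads__channels') }}
)
select
    source_system,
    channel_name as source_channel_name,
    case
        when channel_name in ('Paid Search', 'Search Ads', 'Sponsored Search')
            then 'Paid Search'
        when channel_name in ('Display', 'Audience Network') then 'Display'
        else 'Other'
    end as channel_category
from source_channels
```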
Facts:
fct_ad_spend has a grain of one row per campaign per channel per day. Measures include spend amount, impressions, and clicks. Foreign keys point to dim_campaigns, dim_channels, and dim_dates.
fct_conversions has a grain of one row per conversion event. Measures include revenue and conversion count. Foreign keys point to dim_campaigns, dim_channels, dim_users, and dim_dates.
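A sketch of how fct_ad_spend might assemble that grain; the unioned staging model and join columns are assumptions:

```sql
-- One row per campaign per channel per day, with surrogate keys
-- swapped in for the platforms' natural keys.
select
    c.campaign_key,
    ch.channel_key,
    d.date_key,
    s.impressions,
    s.clicks,
    s.spend_amount
from {{ ref('stg_ad_spend_unioned') }} as s   -- hypothetical unioned staging model
join {{ ref('dim_campaigns') }} as c
    on s.source_system = c.source_system
   and s.campaign_id   = c.campaign_id
join {{ ref('dim_channels') }} as ch
    on s.channel_category = ch.channel_category
join {{ ref('dim_dates') }} as d
    on s.spend_date = d.calendar_date
```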
Dimensions:
dim_campaigns is the conformed campaign dimension across all ad platforms. It has a surrogate key, business keys from each platform, and campaign attributes like name, status, and start date. It bridges all three facts together.
dim_channels uses the channel mapping intermediate to provide a consistent channel taxonomy. Each row represents a channel category, with attributes such as channel name and channel group.
dim_dates is a standard date dimension with fiscal periods, day of week, and other calendar attributes. It lives in a shared area since it’s used across domains.
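And a sketch of dim_channels built on the intermediate (the attribute derivations are illustrative):

```sql
-- Distinct channel categories from the mapping intermediate, each
-- with a surrogate key and a coarse grouping attribute.
select
    {{ dbt_utils.generate_surrogate_key(['channel_category']) }} as channel_key,
    channel_category as channel_name,
    case
        when channel_category in ('Paid Search', 'Display') then 'Paid'
        else 'Organic'
    end as channel_group
from (
    select distinct channel_category
    from {{ ref('int_silver__marketing__channel_mapping') }}
) as channels
```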
Gold doesn’t need to know about the channel mapping intermediate at all. It consumes only fct_ad_spend, fct_conversions, and the dimensions. The upfront investment in the interface creates a clean separation and a predictable development experience.
The next post covers Gold: the semantic layer. I’ll explain what metric views actually are, why the semantic layer deserves its own tier, and how Gold intermediates prepare data for consumption.
Frankly, this is the one I’m most excited for because I am a really big fan of warehouse-native semantic layers stored as database objects, and I’m excited to talk more about that.