Six more steps for a disciplined, repeatable approach to building robust, scalable data pipelines.
While every engineer’s implementation is shaped by their business objectives, data source technology, and governance requirements, most pipeline projects share a consistent sequence of steps. Here are six more steps engineers should follow to ensure a crisp, clean, standardized data pipeline.
To read the first article in this two-part series, click here.
Data cleansing and standardization
Based on the design, the project team can develop the required custom software or configure a software package to cleanse and standardize the data across all data sources in the pipeline staging environment.
The cost and complexity of data cleansing and standardization are largely proportional to the number of data columns that require attention across all data sources.
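To make the step concrete, here is a minimal sketch of cleansing and standardizing one staged extract, assuming pandas is the tooling; the column names and the country mapping are hypothetical examples, not prescribed rules:

```python
import pandas as pd

# Hypothetical staging extract: customer rows pulled from one of several sources.
staged = pd.DataFrame({
    "customer_id": ["  101", "102", "102", None],
    "country":     ["usa", "U.S.A.", "USA", "canada"],
})

# Map free-text country variants onto one standard code set.
COUNTRY_MAP = {"usa": "US", "u.s.a.": "US", "canada": "CA"}

cleansed = (
    staged
    .dropna(subset=["customer_id"])                  # reject rows missing the key
    .assign(
        customer_id=lambda d: d["customer_id"].str.strip(),
        country=lambda d: d["country"].str.lower().map(COUNTRY_MAP),
    )
    .drop_duplicates()                               # collapse exact duplicate rows
)

print(cleansed)
```

The same pattern repeats for every column flagged during data profiling, which is why the effort scales with the number of columns that need attention.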
Data transformation design
The pipeline’s data transformation design considers the following frequently occurring situations:
- Merging data for identical rows from multiple data sources.
- Normalizing repeating data groups into multiple rows in new tables.
- Defining rules for selecting the highest quality value when identical columns exist in multiple data sources.
- Aggregating data to a uniform level of summarization.
- Defining new data columns for calculated values, such as sales or margin amounts, to enrich the data and ensure congruity across queries and reports.
- Defining new data columns that will be populated by parsing text columns and columns with multiple values.
- Defining new data confidence columns that will be populated by data confidence values for adjacent columns.
- Defining new tables for calculated metrics, such as production quality, schedule variance or scrap rates.
The team focuses the data transformation design on only the required data columns and rows from each data source. Most data pipelines do not process all the data in every data source.
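To illustrate several of these design rules together, the sketch below merges the same products from two hypothetical sources, applies a source-preference rule for an overlapping column, records the chosen source in an adjacent column, and adds a calculated margin column. The table and column names are assumptions, and pandas stands in for whatever transformation tooling the project selects:

```python
import pandas as pd

# Hypothetical extracts of the same products from two sources, already cleansed.
erp = pd.DataFrame({
    "product_id": [1, 2],
    "unit_cost":  [4.10, None],       # ERP is the preferred source for cost
    "unit_price": [6.00, 9.50],
})
crm = pd.DataFrame({
    "product_id": [1, 2],
    "unit_cost":  [4.25, 5.80],       # CRM cost is a fallback only
})

merged = erp.merge(crm, on="product_id", suffixes=("_erp", "_crm"))

transformed = merged.assign(
    # Rule: take the ERP value when present, otherwise fall back to CRM.
    unit_cost=lambda d: d["unit_cost_erp"].fillna(d["unit_cost_crm"]),
    # Record which source supplied the value in an adjacent column.
    unit_cost_source=lambda d: d["unit_cost_erp"].notna().map(
        {True: "erp", False: "crm"}
    ),
).assign(
    # Enrich with a calculated margin so every report computes it the same way.
    margin=lambda d: d["unit_price"] - d["unit_cost"],
)[["product_id", "unit_cost", "unit_cost_source", "unit_price", "margin"]]

print(transformed)
```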
Data transformation
Based on the design, the project team can develop the required custom software or configure a software package to transform all data in the staging environment and populate the target datastore.
In most cases, the target datastore will use a relational DBMS and one of the following architectures:
- A data lakehouse – a more flexible design that can accommodate both structured and unstructured data.
- A data warehouse – a more rigid design that is suited for structured data analytics.
An alternative to building and operating a target datastore is to implement a complex database view over the data in the staging environment. This approach is appealing because it avoids the cost of creating and maintaining another datastore, but it is only feasible when the required data transformation is modest.
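Here is a minimal sketch of the view-based alternative, using an in-memory SQLite database to stand in for the staging environment; the table, view, and column names are hypothetical:

```python
import sqlite3

# An in-memory SQLite database stands in for the staging environment.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE staging_orders (order_id INTEGER, qty INTEGER, unit_price REAL);
    INSERT INTO staging_orders VALUES (1, 3, 9.50), (2, 1, 4.10);

    -- The "target datastore" is just a view: light transformation, no extra copy.
    CREATE VIEW v_order_facts AS
    SELECT order_id,
           qty,
           unit_price,
           qty * unit_price AS order_amount
    FROM staging_orders;
""")

for row in conn.execute("SELECT * FROM v_order_facts"):
    print(row)
```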
An alternative to a relational DBMS is a graph DBMS. This newer technology offers dramatically faster query performance for highly connected data, such as multi-hop relationship queries. The disadvantages are slower update performance and a more complex query language.
Irrespective of the chosen storage technology, the target datastore can be hosted on-premises or in the cloud. Housing the staging environment and the target datastore in different computing environments adds avoidable complexity to the data pipeline design.
The cost and complexity of data transformation are largely proportional to the number of tables that must be restructured or added across all data sources.
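Putting the step together, the sketch below writes a small transformed dataset into a relational target. SQLite and pandas stand in for whatever warehouse or lakehouse platform and loading tool the project actually selects, and the table and column names are assumptions:

```python
import sqlite3
import pandas as pd

# Hypothetical transformed output ready to land in the target datastore.
facts = pd.DataFrame({
    "product_id": [1, 2],
    "unit_price": [6.00, 9.50],
    "margin":     [1.90, 3.70],
})

# SQLite stands in for the relational target; a production pipeline would point
# this connection at the warehouse or lakehouse SQL endpoint instead.
target = sqlite3.connect("target_warehouse.db")
facts.to_sql("product_facts", target, if_exists="append", index=False)

print(pd.read_sql("SELECT COUNT(*) AS row_count FROM product_facts", target))
```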
Testing and validation
Before the team can release the data pipeline and its target datastore into production, both must undergo rigorous testing. The scope of testing includes:
- Unit tests of the individual software components.
- Review of the data profiling and quality assessment results.
- Confirmation of the data transformation logic.
- An integration test of the end-to-end data pipeline.
- A performance evaluation under expected loads.
- Tests of the various artifacts that the target datastore will produce.
The cost and elapsed time of testing and validation are largely proportional to the number of data sources, the complexity of the data transformation and the number of artifacts the data pipeline will produce.
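Much of this testing can be automated. As one example, the sketch below shows two pytest-style unit tests that confirm a piece of transformation logic; the function and column names are hypothetical:

```python
import pandas as pd

def add_margin(df: pd.DataFrame) -> pd.DataFrame:
    """Enrich a fact table with a calculated margin column."""
    return df.assign(margin=df["unit_price"] - df["unit_cost"])


def test_margin_is_price_minus_cost():
    facts = pd.DataFrame({"unit_price": [10.0, 8.0], "unit_cost": [6.0, 8.0]})
    result = add_margin(facts)
    assert result["margin"].tolist() == [4.0, 0.0]


def test_margin_preserves_row_count():
    facts = pd.DataFrame({"unit_price": [10.0], "unit_cost": [6.0]})
    assert len(add_margin(facts)) == len(facts)
```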
Implementation
Once the tests have been completed satisfactorily and the data pipeline is updating the target datastore routinely, the team turns its attention to people change management. Unfortunately, this work is often neglected or under-resourced. Helping end-users, such as engineers and business analysts, become familiar with the target datastore ensures adoption of the new capabilities, which include:
- The available data, its meaning and its structure.
- Data query, visualization and reporting tools.
The team may also assist end-users in converting existing reports and other outputs to run against the new datastore.
The cost and elapsed time of implementation are largely proportional to the number of end-users who will access the new datastore.
Operation
Data pipelines require ongoing monitoring during operation to ensure that:
- Unsuccessful data update jobs are investigated and restarted.
- Puzzling results are investigated to confirm their accuracy or to better understand underlying data or software defects.
- Degradation in update and query performance is investigated and translated into actions to optimize the datastore performance and upgrade the computing environment.
- Data or software defects are corrected.
- The impact of planned schema changes to data sources is well understood and planned for.
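A minimal monitoring sketch along these lines appears below. The job-log table, its columns, and the freshness threshold are hypothetical, and a production pipeline would route the resulting alerts into its own scheduling and alerting tools:

```python
import sqlite3
from datetime import datetime, timedelta, timezone

conn = sqlite3.connect("target_warehouse.db")
# Hypothetical job log; finished_at is stored as an ISO-8601 timestamp with a
# UTC offset (e.g. "2024-05-01T02:00:00+00:00").
conn.execute("""
    CREATE TABLE IF NOT EXISTS pipeline_job_log (
        job_name TEXT, status TEXT, finished_at TEXT, rows_loaded INTEGER
    )
""")

def check_last_run(job_name: str, max_age_hours: int = 26) -> list[str]:
    """Return alerts when a job failed, has gone stale, or loaded zero rows."""
    alerts = []
    row = conn.execute(
        "SELECT status, finished_at, rows_loaded FROM pipeline_job_log "
        "WHERE job_name = ? ORDER BY finished_at DESC LIMIT 1",
        (job_name,),
    ).fetchone()
    if row is None:
        return [f"{job_name}: no runs recorded"]
    status, finished_at, rows_loaded = row
    if status != "success":
        alerts.append(f"{job_name}: last run ended with status {status!r}")
    age = datetime.now(timezone.utc) - datetime.fromisoformat(finished_at)
    if age > timedelta(hours=max_age_hours):
        alerts.append(f"{job_name}: last successful update is {age} old")
    if rows_loaded == 0:
        alerts.append(f"{job_name}: last run loaded zero rows")
    return alerts

print(check_last_run("daily_sales_load"))
```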
Data pipeline risks
The significant risks that data pipelines can encounter include:
- Difficulties integrating data from various data sources.
- Uneven data quality that is difficult or expensive to correct.
- Schema changes in data sources crashing the data pipeline.
- Unavailability of data sources causing cascading failures.
- Unnoticed issues where the pipeline appears to work but produces incorrect data.
- Data breaches due to the number of data sources and data movements involved.
- Data security and compliance issues due to unauthorized access.
When these risks become reality, the result is increased project costs, cancelled projects or inaccurate analytics that can lead to lost revenue, reputational damage, and legal penalties.
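One common mitigation for the schema-change and silent-incorrect-data risks is a defensive check that fails fast before each load. The sketch below shows one way to do that with pandas; the expected-schema definition is a hypothetical example that would normally live in the pipeline’s configuration:

```python
import pandas as pd

# Hypothetical expected schema for one staged table.
EXPECTED_SCHEMA = {"order_id": "int64", "qty": "int64", "unit_price": "float64"}

def validate_schema(df: pd.DataFrame) -> None:
    """Fail fast, with a clear message, when source schema drift would break the pipeline."""
    missing = set(EXPECTED_SCHEMA) - set(df.columns)
    unexpected = set(df.columns) - set(EXPECTED_SCHEMA)
    wrong_types = {
        col: str(df[col].dtype)
        for col, expected in EXPECTED_SCHEMA.items()
        if col in df.columns and str(df[col].dtype) != expected
    }
    problems = []
    if missing:
        problems.append(f"missing columns: {sorted(missing)}")
    if unexpected:
        problems.append(f"unexpected columns: {sorted(unexpected)}")
    if wrong_types:
        problems.append(f"type changes: {wrong_types}")
    if problems:
        raise ValueError("source schema drift detected; " + "; ".join(problems))

validate_schema(pd.DataFrame({"order_id": [1], "qty": [2], "unit_price": [9.5]}))
print("schema check passed")
```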
This sequence of steps describes a disciplined, repeatable approach to building robust, scalable pipelines aligned with business needs.