Anyone who has worked in data integration for a significant length of time has almost certainly been confronted with the following question:
“Where did this data come from?”
Confronting this question brings to the front burner the need for establishing data lineage tracking in ETL processes.
Data Lineage Tracking in ETL Processes
Extract, transform, and load (ETL) operations frequently involve the movement and consolidation of data from multiple sources through numerous transformation steps before being routed to its final resting place. As the ETL touchpoints increase in number and complexity, so does the difficulty in tracking the data back to its origins. The concept of data lineage in ETL is intended to make this process easier. Data lineage tracking involves building ETL elements in such a way that each row of data in the destination tables is unambiguously traceable back to it source, including any transformation process through which it passed during its travels.
Building ETL processes that include data lineage tracking takes extra work and careful planning. Sadly, the majority of ETL processes I’ve found in the wild do not have provisions for capturing data lineage. Because data lineage tracking does not add to the core functions of extraction, transformation, and loading, this design is often skipped during architecture or build. Much like the related ETL best practices of logging and auditing, data lineage is a typically unseen yet still valuable component of a well-designed ETL architecture.
Why Use Data Lineage Tracking?
There are several benefits that are realized with proper ETL data lineage tracking:
- Trustworthiness. When the origin of each row of data and the path it took to arrive is systematically tracked, users and administrators of the data will have more reason to trust the data.
- Easier troubleshooting. When the ETL data path is self-describing, it makes testing and troubleshooting far easier.
- Expose leaky processes. When each row of data is trackable from source to destination, it helps to reveal any holes in the process where data might be lost.
- Visibility. I can’t count the number of times I have been asked by a client to help them find and document business rules in their ETL processes. A side benefit of data lineage tracking is that the places where the business rules hide become more evident.
Not all load mechanisms require data lineage tracking. Very simple processes, those that load volatile tables, and some single-use, “throwaway” code does not have the same need for tracking data lineage. However, these are the exception rather than the rule.
Conclusion
Data lineage tracking in ETL processes is a best practice for most loads. Although it takes time to properly design and implement this pattern, the value gained is almost always worth the effort.