When it comes to managing data, there is often a difference between what should happen and what can happen. That space between should and can is often challenging, forcing data professionals to balance risk with business value. A couple of common examples:
“There should not be any sales transactions without a valid customer, but because the OLTP system doesn’t use foreign keys, it could theoretically happen.”
“Location IDs should be less than 20 characters, but those aren’t curated so they could exceed 20 characters.”
“This list of product IDs from System A should match those in System B, but because they are entered manually it is possible to have a few typos.”
“This data warehouse should only contain data loaded as part of our verified ETL process, but since our entire data team has read-write permission on the data warehouse database, it’s possible that manual data imports or transformations can be done.”
“This field in the source system should always contain a date, but the data type is set to ASCII text so it might be possible to find other data in there.”
“The front-end application should validate user input, but since it’s a vendor application we don’t know for certain that it does.”
Managing data and the processes that move data from one system to another requires careful attention to the data safeguards and the things that can happen during input, storage, and movement. As a consultant, I spend a lot of time in design sessions with clients, discussing where data comes from, how (if at all) it is curated in those source systems, and what protections should be built into the process to ensure data integrity. In that role, I’ve had this conversation, almost verbatim, on dozens of occasions:
Me: “Is it possible that <data entity> might not actually be <some data attribute>?”
Client: “No, that should never happen.”
Me: “I understand that it shouldn’t. But could it?”
Client (after a long silence): “Well…. maybe.”
Building robust systems requires planning not just for what should happen, but for what could happen. Source systems may not include referential integrity to avoid situations that are impossible in business but technically possible inside the data store. Fields that appear to store one type of data might be structured as a more generic type, such as text. Data that should be in curated lists can sometimes contain unvalidated user input. None of these things should happen, but they do. When designing a data model or ETL process, be sure that you’re asking questions about what protections in place to make sure that the things that shouldn’t happen, don’t.