A universal truth in business is that bad data costs money, and it can even be dangerous. Data quality processes are often an afterthought rather than a central component of the architecture, due in part to fear of the complexity involved in checking and cleansing data. In many cases, that fear is warranted; the longer data quality is delayed (in time as well as in data lineage), the more time and money it costs to make it right.
Some data quality needs require a formal project and specialized tools to address. Other issues can be mitigated or avoided entirely with small changes to the data model or load processes. We like to look for these “low-hanging fruit” opportunities when designing a data or ETL infrastructure, because some of these simple modifications can save dozens of hours (or more) and significantly improve data quality very quickly.
Keeping Data Quality Simple
Among the passive changes that can help prevent or fix data quality issues:
Use of proper data types. Many of the data quality issues we fix are related to numerical data stored as text, where non-numerical data ends up inadvertently loaded alongside it. Even harder to detect is numerical data stored at an incorrect precision. The same goes for date values, date-and-time values, and geolocation data, among others. Storing data in the type that most closely represents its real use avoids a lot of downstream problems that are often hard to diagnose and expensive to fix.
Non-nullable fields. Every major RDBMS platform supports non-nullable fields, which require a value in said field before an insert or update operation will complete. If a particular field must have a value before that record can be considered valid, marking the column as non-nullable can avoid data consistency issues where that value is missing.
Foreign key validation. The use of foreign keys for data validation is a best practice in almost any relational database architecture, and that is doubly true when improving data quality is a main objective. Using a foreign key to limit a column to the values explicitly allowed in the referenced table keeps stray values from sneaking into a constrained field.
Check constraints. Preventing the insertion of data with values outside a defined range can be accomplished through check constraints, which are found in every major database platform. Like foreign keys, these limit the values that can be entered for a column, but a check constraint does not use a separate lookup table, and you have the flexibility to set a range of allowable entries rather than a discrete list of values. An example would be using a check constraint to enforce that all dates of birth are on or after a certain date. (The first sketch after this list shows non-nullable columns, foreign keys, and check constraints working together.)
ETL cleanup. Most ETL tools have built-in functionality for lightweight data cleansing. Assuming the data in question is being processed through a structured ETL tool, adding logic to correct minor data quality issues is relatively easy to do with low risk. Emphasize a light touch here: you aren’t going to want to attempt address standardization or name deduplication without a formal, codified process. (The second sketch after this list illustrates this kind of light-touch cleanup.)
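To make the schema-level items concrete, here is a minimal sketch using Python’s standard-library sqlite3 module. The table and column names (country, customer, date_of_birth, and so on) are hypothetical, and exact DDL syntax varies by platform, but the NOT NULL, FOREIGN KEY, and CHECK clauses shown exist in every major RDBMS.

```python
# Minimal sketch of schema-level data quality safeguards using SQLite.
# Table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default

conn.execute("""
    CREATE TABLE country (
        country_code TEXT PRIMARY KEY
    )
""")

conn.execute("""
    CREATE TABLE customer (
        customer_id   INTEGER PRIMARY KEY,
        full_name     TEXT NOT NULL,                          -- non-nullable field
        country_code  TEXT NOT NULL
                      REFERENCES country (country_code),      -- foreign key validation
        date_of_birth TEXT NOT NULL
                      CHECK (date_of_birth >= '1900-01-01')   -- check constraint on a range
    )
""")

conn.execute("INSERT INTO country (country_code) VALUES ('US')")

# A valid row loads cleanly.
conn.execute(
    "INSERT INTO customer (full_name, country_code, date_of_birth) VALUES (?, ?, ?)",
    ("Jane Doe", "US", "1985-06-15"),
)

# Each of these rows violates one safeguard and is rejected at insert time.
bad_rows = [
    (None,     "US", "1985-06-15"),  # NULL in a non-nullable column
    ("J. Roe", "ZZ", "1985-06-15"),  # country code not in the lookup table
    ("J. Poe", "US", "1850-01-01"),  # date of birth outside the allowed range
]
for row in bad_rows:
    try:
        conn.execute(
            "INSERT INTO customer (full_name, country_code, date_of_birth) VALUES (?, ?, ?)",
            row,
        )
    except sqlite3.IntegrityError as err:
        print(f"Rejected {row}: {err}")
```

Note that server platforms such as SQL Server, PostgreSQL, and Oracle enforce these constraints by default; the PRAGMA above is only needed because SQLite ships with foreign key checks turned off.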
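For the ETL cleanup item, here is a small, hypothetical cleansing step written as plain Python for illustration. Most ETL tools can express the same trims and type coercions natively; the field names used here are assumptions, not any specific tool’s API.

```python
# Lightweight, hypothetical cleansing step: trim whitespace, normalize a code
# field, and coerce a numeric-as-text amount, rejecting rows that can't be fixed.
from decimal import Decimal, InvalidOperation

def clean_row(row: dict) -> dict | None:
    """Apply light-touch fixes; return None to route the row to a reject queue."""
    cleaned = dict(row)

    # Trim stray whitespace and normalize casing on a code field.
    cleaned["country_code"] = (cleaned.get("country_code") or "").strip().upper()

    # Coerce a numeric-as-text amount to a proper decimal, rejecting junk values.
    try:
        cleaned["order_amount"] = Decimal(str(cleaned["order_amount"]).strip())
    except (InvalidOperation, KeyError):
        return None  # hand off to a formal data quality process instead of guessing

    return cleaned

rows = [
    {"country_code": " us ", "order_amount": " 19.99 "},
    {"country_code": "CA",   "order_amount": "N/A"},  # will be rejected
]
good = [r for r in (clean_row(r) for r in rows) if r is not None]
print(good)
```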
No Substitute for Formal Data Quality Processes
Even when taking these precautions to prevent or correct some issues, you’ll still run into cases where a more rigorous and comprehensive data quality initiative is needed. The suggestions above will not eliminate the need for proper data quality tooling, but they can reduce the pain of, or delay the need for, in-depth data quality remediation.
Conclusion
Data quality requires a multifaceted strategy. Taking care of some of the simple problems with easy-to-use tools already at your fingertips can have a significant and immediate impact on the quality of your data.