The Trellance Data Blog

The Dirty Data Debacle

Posted by Peter Keers, PMP on Aug 26, 2015 4:12:00 PM


Credit unions today are increasingly aware of the mountain of valuable data accumulating in their core and other operational systems. With technology evolving at a rapid pace, opportunities to leverage this data are becoming not only more available but also more affordable than ever before.

As a result, credit union decision makers are eager to ramp up Big Data & Analytics initiatives. However, before plowing ahead with an investment in hardware, software, and services, it is important to consider a risk lurking within the data itself: dirty data.

Dishing Up the Dirt

The official term for dirty data is “poor data quality.” However, that phrase seems too clinical given the mess dirty data can make of an important Big Data & Analytics program. Yet when program champions are warned about the dangers of dirty data, they are often skeptical. They point to the source systems from which they plan to pull data and note, correctly, that those systems show few visible problems caused by dirty data. In fact, they doubt there is much of a data quality issue at all, since the systems run just fine.

This is a common misconception. Operational systems often have many dirty data secrets. There are several reasons for this.

  1. Transaction processing may allow too much flexibility in accepting data. While many input filters force correct data entry, there are often instances where dirty data slips through. For example, suppose the data entry field for “State” accepts “NY”, “New York”, or even “New Yrk”. These variations will cause load failures in the Big Data & Analytics data warehouse unless additional, costly software code is written to account for them.
  2. Current data and legacy data don’t match. Historical data is typically wanted in a data warehouse to support trend analysis. Attempts to load data archived from previous versions of operational systems often fail due to incompatibility with current data. This is especially true where a past merger has mixed data from different core systems.
  3. Integrating data from non-integrated systems. One of the premier attractions of building a Big Data & Analytics capability is the prospect of integrating data from disparate operational systems. Most credit unions run best-of-breed stand-alone systems that perform work outside the core processor. These systems have limited, if any, data integration functionality. Yet, from a decision-making perspective, data from these multiple systems must be integrated to provide meaningful information. In many cases this is not easy, and careful effort is required to bring everything together into “one source of truth”.
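The “State” field problem above can be sketched as a small normalization step in an extract-transform-load job. This is a minimal, hypothetical example; the alias table, field names, and quarantine policy are illustrative, not taken from any particular core system:

```python
# Hypothetical sketch: normalizing free-text "State" values before a
# warehouse load, so "NY", "New York", and "New Yrk" all land as one value.

STATE_ALIASES = {
    "NY": "NY",
    "NEW YORK": "NY",
    "NEW YRK": "NY",  # a typo variant seen in the source data
}

def normalize_state(raw: str) -> str:
    """Return the canonical state code, or raise so the row can be quarantined."""
    key = raw.strip().upper()
    if key in STATE_ALIASES:
        return STATE_ALIASES[key]
    raise ValueError(f"Unrecognized state value: {raw!r}")

# Illustrative source rows pulled from an operational system.
rows = [
    {"member_id": 1, "state": "New York"},
    {"member_id": 2, "state": "ny "},
    {"member_id": 3, "state": "New Yrk"},
]

clean, quarantined = [], []
for row in rows:
    try:
        row["state"] = normalize_state(row["state"])
        clean.append(row)
    except ValueError:
        quarantined.append(row)  # bad rows are set aside, not silently loaded
```

Quarantining unrecognized values, rather than guessing, keeps the load honest: someone has to look at the bad rows instead of letting a fourth spelling slip into the warehouse.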

Dealing With the Mess

It is very common for Big Data & Analytics initiatives to uncover dirty data even at credit unions that pride themselves on strict data quality standards. What is the best means of dealing with the mess?

  1. Expect data quality problems. In planning for Big Data & Analytics programs, apply risk management practices to data quality. Credit unions that underestimate this risk and fail to prepare contingencies will suffer the consequences.
  2. Clean up operational systems. Take time in the early stages to perform a data quality audit on operational systems. Identifying and correcting dirty data at the source avoids headaches later in the Big Data & Analytics project.
  3. Avoid loading expensive junk. Having a wealth of historical data sounds wonderful, but it may be cost-prohibitive. It can make more sense to load less (dirty) historical data and accept shorter trend windows when the Big Data & Analytics project is first launched.
  4. Go slow when integrating data from multiple sources. Ambitious plans to bring everything together at once may need to be tempered with the idea that successful integration takes time. Choose integration projects not only based on strategic importance but also on project complexity. Big Data & Analytics initiatives are long-term learning processes. Taking on the more manageable integration projects in the early years may allow a tradition of success to take hold and drive the program forward in the future.

Credit Union Data Analytics: Beginning The Journey Whitepaper

Topics: Big Data, Data Integrity, Data Quality