Credit unions aiming to build Big Data & Analytics capabilities face many decisions. One of the most fundamental is how much source data to capture. The two dimensions of “how much” are depth and breadth.
Depth refers to the amount of historical data to be loaded into the data warehouse at inception. One option is to load only a small amount of historical data, which can make the Big Data & Analytics launch quicker. For most credit unions, however, trending data over time is a major requirement, and this option means it will take years to accumulate the volume of data needed to support trending.
The other option is to load as much historical data as possible. While this is the preferred approach for most credit unions, there are important factors to keep in mind.
First, it is often the case that at least some of the historical data will be in an archival state. This typically means data that is stored on tape or some other offline medium. It can take significant effort to locate the data, restore it to a temporary repository, and then load it into the data warehouse.
Second, the quality of historical data frequently decreases with its vintage. As data screening improves over time, newer data is less prone to error. Mistakes that a system previously allowed (e.g., text in a zip code field) are blocked by improved data entry controls. Older data that was never subject to these business rules, however, must undergo a costly cleansing process before it can be loaded into the data warehouse.
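As a minimal sketch of what such a cleansing pass involves, the snippet below screens legacy records for the zip code problem described above. The field name `zip`, the record layout, and the US ZIP/ZIP+4 pattern are illustrative assumptions, not a prescribed cleansing standard.

```python
import re

# US ZIP or ZIP+4; an assumed rule standing in for a modern entry control.
ZIP_RE = re.compile(r"^\d{5}(-\d{4})?$")

def flag_bad_zips(records):
    """Return records whose 'zip' field would fail modern entry controls."""
    return [r for r in records if not ZIP_RE.match(str(r.get("zip", "")))]

# Hypothetical legacy rows restored from an archive.
legacy = [
    {"member_id": 1, "zip": "60601"},
    {"member_id": 2, "zip": "Chicago"},   # text entered in a zip code field
    {"member_id": 3, "zip": "60601-1234"},
]

bad = flag_bad_zips(legacy)
```

In practice each flagged record still needs remediation (lookup, correction, or exclusion), which is where the cost of cleansing older vintages accumulates.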
Third, it is not uncommon for a credit union to have important historical data that was part of a now defunct legacy system. Conversion of this data is a necessary evil if the data is judged to be valuable.
In all three factors, the value of the historical data must be weighed against the cost it will take to make it usable.
Breadth is the measure of the number of data elements to be captured from the total available number of data elements. Like depth, there is a consideration of speed to launch versus the number of data elements available at inception. There is an additional consideration of predicting future reporting and analytics needs.
There is a strong motivation among credit union decision makers to limit the number of data elements captured when the Big Data & Analytics program is first rolled out. The frame of reference is usually the data elements used in current reporting. This seems to be a logical choice since current reporting is the minimum level of utility required from the project. Yet, placing this limitation on the data at this stage is a very expensive mistake.
A nearly immediate effect of a successful Big Data & Analytics launch is a flood of requests for additional data elements. Unfortunately, the effort to bring in additional data elements after the launch is many times greater than if more were captured in the first place. Why? The development team is typically busy supporting the post-launch growth so resources may not be available to take on the new effort. Also, there is a problem stemming from the depth discussion above. If a costly effort was undertaken to pull history for a limited breadth of data elements, it must be repeated to bring in the historical data for the additional data elements.
When trying to answer the breadth question, credit unions often are at a loss when it comes to choosing which data to bring in beyond current reporting needs. However, there are some approaches that can simplify the process.
First, adopt an Operational Data Store (ODS) methodology. An ODS is a repository that stores current and historical data extracted from the source systems without converting it to the data warehouse’s architecture. It is, in effect, the interim, bulk storage area from which data is extracted and loaded into the data warehouse. The data warehouse uses only a fraction of the ODS data. Yet, adding data elements to the data warehouse from the ODS is orders of magnitude faster and less costly than returning to the source system.
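The payoff of this pattern can be sketched in a few lines. In the hypothetical SQLite example below, the ODS table retains every source column while the warehouse table initially models only a subset; promoting a new element later is a local backfill rather than a return trip to the source system. All table and column names here are invented for illustration.

```python
import sqlite3

# Hypothetical schemas: an ODS table holding raw source rows, and a
# warehouse table that initially models only a subset of the columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ods_member (member_id INT, name TEXT, zip TEXT, branch_id INT);
    CREATE TABLE dw_member  (member_id INT, name TEXT);
    INSERT INTO ods_member VALUES (1, 'A. Smith', '60601', 7);
""")

# Initial warehouse load uses only a fraction of the ODS columns.
conn.execute("INSERT INTO dw_member SELECT member_id, name FROM ods_member")

# Later, adding 'zip' to the warehouse is a local backfill from the ODS,
# not a new extraction effort against the source system.
conn.execute("ALTER TABLE dw_member ADD COLUMN zip TEXT")
conn.execute("""
    UPDATE dw_member SET zip =
        (SELECT zip FROM ods_member
         WHERE ods_member.member_id = dw_member.member_id)
""")
```

Because the ODS already holds the history for every captured column, the backfill above also carries the historical depth forward without re-running the archival restoration discussed earlier.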
Second, if a source table is “touched” even once, capture all the data in that table. This means that even if only one data element is required to support forecasted reporting needs, all the data elements from that table are extracted into the ODS as well. The assumption is that the data elements in a table are likely to be related, so the likelihood that they will be needed in future reporting is high.
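The rule can be expressed as a simple extraction plan: derive the set of touched tables from the required elements, then extract every column of each touched table. The catalog and the required elements below are invented examples.

```python
# Hypothetical source catalog: table name -> all of its columns.
catalog = {
    "member":  ["member_id", "name", "zip", "join_date"],
    "account": ["account_id", "member_id", "balance", "open_date"],
    "audit":   ["event_id", "user", "timestamp"],
}

# Elements required by forecasted reporting, as (table, column) pairs.
required = {("member", "zip"), ("account", "balance")}

# Rule: if a table is touched even once, extract every column it has.
touched = {table for table, _ in required}
extract_plan = {table: catalog[table] for table in touched}
```

Here `member` and `account` are each touched by a single element, so the plan pulls all of their columns into the ODS, while the untouched `audit` table is left out entirely.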
Third, evaluate all remaining tables in the system for reporting value. Once the tables in #2 have been selected, the remaining tables must be inspected to determine their utility for reporting and, if useful, captured. This can be a daunting prospect, but some simplifying rules make it easier. There are many “administrative” tables in any system; these handle security and other background tasks. Eliminating them from consideration is liable to significantly reduce the list of likely candidates. Also, large tables, whether in terms of rows, columns, or both, deserve close scrutiny. Finally, ask the system vendor for documentation about their data. Vendors vary in their willingness to divulge this type of information, but it is worth it to ask.
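These triage rules amount to a filter followed by a ranking, as in the sketch below. The table metadata (names, row and column counts, the `admin` flag) is hypothetical, and real systems will need their own way of identifying administrative tables.

```python
# Hypothetical metadata describing the remaining source tables.
tables = [
    {"name": "member_detail",  "rows": 250_000,   "cols": 40, "admin": False},
    {"name": "share_history",  "rows": 9_000_000, "cols": 12, "admin": False},
    {"name": "user_security",  "rows": 300,       "cols": 8,  "admin": True},
]

# Rule 1: drop administrative tables (security and background tasks).
candidates = [t for t in tables if not t["admin"]]

# Rule 2: large tables, by rows and/or columns, deserve close scrutiny,
# so sort them to the top of the review queue.
review_queue = sorted(candidates, key=lambda t: (t["rows"], t["cols"]),
                      reverse=True)
```

The output of a pass like this is not a final answer but a prioritized list for the human review, supplemented by whatever data documentation the vendor is willing to provide.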
Comprehensive Data Capture
Taking the time and effort to carefully consider the depth and breadth questions is a crucial component of a robust Big Data & Analytics initiative. Limiting depth and breadth may achieve a faster implementation. However, a more prudent course is to take the time to perform the necessary analysis and planning to maximize the amount of data captured in both these dimensions. Achieving comprehensive data capture at inception will increase the probability of organizational utility and minimize the total cost of ownership of the credit union’s Big Data & Analytics program.