In two previous OnApproach blogs, the concept of a data lake was defined and differentiated from a traditional data warehouse. Yet a key point was that a data lake and a data warehouse are not mutually exclusive. In fact, a structured data warehouse could be a subset of an overall data lake architecture.
Simply stated, a data lake is an effective way to store and access very large quantities of data.
What does this mean for credit union decision makers?
Delving into these questions, OnApproach Senior Engagement Manager Pete Keers recently discussed the world of data lakes and other cutting-edge ideas with Bill Preachuk, a Solutions Engineer at Hortonworks. With a mission "to manage the world's data," Hortonworks describes itself as an innovator in creating, distributing, and supporting enterprise-ready open data platforms and modern data applications. Among "Big Data" players, Hortonworks ranks in the top tier.
PK: Bill, what do you consider to be the major themes regarding data lakes?
BP: I think of them as a synthesis of structured data and unstructured data. Yet, they are not just a dumping ground for all kinds of data. It’s all about the business value and answering the questions that credit unions have.
PK: Why data lakes? Credit unions are just now starting to embrace the idea of the traditional structured data warehouse.
BP: Credit unions are used to fixed queries and fixed data. There’s only “rear view mirror reporting” now. Data lakes can help set a larger vision for the credit union industry. In a data lake, large volumes of structured and unstructured data can be brought into one place. When you bring transactional data from multiple sources into one place you can expand that reporting. It’s like having multiple data warehouses where you will be able to drill across and drill down.
Where it really gets exciting is where you have disparate data sources – weather, time, geography, external financial feeds, or credit rating information. It can be structured and unstructured. For example, imagine having geographic information about homes and neighborhoods sorted by geo-location or zip code. What if all that was in one place? It could be integrated and used to expand and augment existing reports. This is a step along the road but the real value comes when you can go from rearview reporting to predictive analysis.
All this data allows for forward thinking and integrated forecasting. It allows extrapolation of past information with external data, resulting in faster analysis and faster forecasting. It means being able to make forward-looking decisions and see patterns.
It’s a situation where data scientists can prepare and build out predictive analyses. It also allows the use of machine learning and augmentation via artificial intelligence. Augmentation and machine learning can not only add value but also set business direction.
PK: How can this translate into actual business benefits for a credit union?
BP: The benefits can be such things as cost savings, cost avoidance, or finding new opportunities that were not previously known. You end up with more products you can sell. A specific example for credit unions would be to look at mortgage information for particular customers and augment it with real estate information to forecast home sales in a particular area. The analysis would involve applying historical patterns to project forward. You may be able to see potential cash crunches in certain zip codes.
Another possible area of value might be risk avoidance: being able to apply patterns, mine information, and see potential risk threats at the credit union level, by zip code, or even for individual members. It enables fraud detection and prevention. There's also huge potential for security. By taking disparate log information from servers and external rating agencies, bringing it all together, and applying machine learning, credit unions can look for patterns that identify service attacks and brute-force attacks. It could stop threats as they happen.
PK: If a data lake can have both unstructured and structured data, what does this mean for the traditional data warehouse?
BP: A data lake does not necessarily do away with the data warehouse. The data warehouse can be replicated in the data lake, and doing so extends its utility. You can run your data warehouse on the cluster.
PK: Can you define a cluster?
BP: A Hortonworks Data Platform (HDP) Hadoop cluster refers to a group of commodity computers that are connected and centrally coordinated – each with their own processors and disk. Processing and storage are all redundant and fault-tolerant by default. (A really nice intro to clusters can be found at https://hortonworks.com/apache/hadoop/).
Jobs are divided into tasks and tasks are sent to the nodes in the cluster. Each node completes tasks in parallel and the results are brought together. Massive amounts of data can be processed at a low cost using this method.
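The divide-and-combine pattern Bill describes can be sketched in plain Python. This is only an illustration of the idea, not Hadoop itself; the data and function names here are hypothetical, and a thread pool stands in for the cluster's nodes:

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(records):
    """One task: count transactions per branch in its slice of the data."""
    counts = {}
    for branch, _amount in records:
        counts[branch] = counts.get(branch, 0) + 1
    return counts

def merge(partials):
    """Coordinator step: bring the per-task results back together."""
    totals = {}
    for partial in partials:
        for branch, n in partial.items():
            totals[branch] = totals.get(branch, 0) + n
    return totals

# Toy transaction feed of (branch, amount) pairs. A real cluster splits
# terabytes of data into blocks across nodes; here we split a list 4 ways.
data = [("north", 100), ("south", 50), ("north", 25), ("east", 75)] * 1000
chunks = [data[i::4] for i in range(4)]

# Each "node" works on its chunk in parallel; results are then merged.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(process_partition, chunks))

totals = merge(partials)
```

The same map-then-merge shape is what lets a cluster turn one large job into many small, parallel tasks.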
Suppose a credit union attempts to build a very large data lake on a single server and is experiencing 6-to-8-hour processing cycles. In that case, one server is tasked with all the computational requirements and cannot easily scale out. Bring those same data feeds onto a 20-to-30-node HDP cluster and, all of a sudden, the same ETL processing to load the dimensional data warehouse may require perhaps only a half hour, since you have so much more parallel horsepower to throw at it.
The cost of storage and processing in a cluster is also much lower, since it uses commodity disk, allowing you to keep full fidelity of your historical source data and all of your transaction feeds in the cluster. You can restate or add information to your data warehouse because the source data is there and available; you didn't have to purge it, since storing it is no longer cost-prohibitive.
Within the cluster you can bring all of your structured data warehouse data and other unstructured data together in one place. Multiple tools are available that allow you to explore these different types of data very quickly and derive value with little overhead. Later you can choose to store it in a highly optimized format.
PK: When you talk about clusters handling large amounts of data, how large are we talking about?
BP: Hadoop clusters scale to petabyte levels.
But a big difference is how Hadoop handles data ingestion. Conventional relational database data has to be loaded into a single specific schema/format. This is where a huge amount of time is spent during ETL processing. But Hadoop gives you “schema-on-read” capabilities. You need only ingest your file as-is into the cluster, and if your data has any kind of structure (CSV, tab-separated, JSON, etc.) – you can instantly define a schema on the file and use the data immediately.
As soon as you have that, which only takes a second after the file shows up, you can issue SQL queries against it, and that can validate your data. At that point you have your data available. Data scientists love it. They can analyze data quickly without load processing overhead.
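The schema-on-read workflow can be sketched as follows. To keep the example self-contained, sqlite3 stands in for a SQL-on-Hadoop engine like Hive; on a real HDP cluster you would point an external table at the raw file and query it in place. The feed contents and column names are hypothetical:

```python
import csv
import io
import sqlite3

# A raw member-transaction feed lands "as-is" -- no load-time schema,
# no heavyweight ETL before the data can be touched.
raw_feed = """member_id,zip,amount
1001,55401,250.00
1002,55104,75.50
1001,55401,19.99
"""

# Schema-on-read: column names and types are declared only at query time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txns (member_id INTEGER, zip TEXT, amount REAL)")
reader = csv.DictReader(io.StringIO(raw_feed))
conn.executemany(
    "INSERT INTO txns VALUES (?, ?, ?)",
    [(r["member_id"], r["zip"], r["amount"]) for r in reader],
)

# The data is immediately queryable, e.g. to validate it or total by zip:
total_by_zip = dict(
    conn.execute("SELECT zip, SUM(amount) FROM txns GROUP BY zip")
)
```

The point of the pattern is that the file itself never had to be restructured: the schema is a lens applied when you read, which is why analysts can start querying seconds after a feed arrives.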
PK: It sounds like this technology could evolve to the point where these traditional data stores migrate entirely to Hadoop clusters, since it's a more efficient environment.
BP: I came originally from the traditional SQL relational database world, and what I see is a kind of convergence. Relational databases are adding more Big Data capabilities, and at the same time Big Data SQL offerings (like Hive LLAP, Spark SQL, and Phoenix) are adding more and more of the capabilities long available in relational databases.
But it's not just performance improvements; it's that open source software overall improves at a pace far quicker than closed source software. For example, four years ago security, metadata handling, and stream processing were rather immature in Hadoop. Now you've got Ranger and Kerberos built in and integrated for security, and Atlas for metadata. You also have massive improvements in the ability to stream in your data in real time, process it, and cleanse it in-stream with Storm, Kafka, and NiFi. All built in the open, with contributors and committers from dozens of organizations and companies.
I recall there were 12 ecosystem tools in a Hortonworks cluster 3 years ago, and now there are almost 30.
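The in-stream cleansing idea Bill mentions can be sketched as a generator pipeline in plain Python. This only illustrates the pattern; real deployments would use NiFi, Kafka, or Storm, and the record fields and threshold below are hypothetical:

```python
def incoming(records):
    """Stand-in for a live feed (a Kafka topic, a NiFi flow, etc.)."""
    yield from records

def cleanse(stream):
    """Drop malformed records and normalize fields as they flow through."""
    for rec in stream:
        if rec.get("amount") is None:
            continue  # discard in-stream rather than loading bad data
        rec["zip"] = str(rec.get("zip", "")).strip()
        yield rec

def flag_suspicious(stream, threshold=10_000):
    """Tag large transactions for review while the stream is in flight."""
    for rec in stream:
        rec["review"] = rec["amount"] >= threshold
        yield rec

feed = [
    {"amount": 120.0, "zip": " 55401 "},
    {"amount": None, "zip": "55104"},      # malformed: dropped in-stream
    {"amount": 25_000.0, "zip": "55104"},  # large: flagged for review
]
processed = list(flag_suspicious(cleanse(incoming(feed))))
```

Each stage handles one record at a time as it arrives, which is the essence of processing and cleansing data in-stream rather than after it lands.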
PK: As I said before, credit unions are just now embracing the concept of the traditional data warehouse. Isn’t this asking the industry to move a bit too fast?
BP: And perhaps they aren't able to move that quickly.
PK: So, it seems to me that there's an opportunity for a vendor to create a product that will help with a standardized solution. It would take some of the complexity out of the situation. OnApproach is interested in partnering with companies like Hortonworks to bring the concept of data lakes and unstructured data to credit unions, reducing the complexity so they can more quickly get the value out of it. A credit union wouldn't have to get its own data scientists or hire a whole crew of employees who know how to run this. It would be an effort to bring value without big cost or having to hire their own experts.
BP: Yes, this would be a situation where credit unions that already have a data warehouse can augment what they have and make it better. It would provide them the opportunity to develop and sell entirely new products that are built upon this data that they would never have been able to do themselves.
PK: A huge component of this that we haven’t touched on yet is the prospect for multiple credit unions to pool their data in a data lake. OnApproach has been looking for a way to pool data across the industry. As we’ve been discussing, the technology to do this is readily available. I think the challenge is to convince credit unions to contribute their data and articulate why it is valuable to pool data across credit unions. My sense is if the immense value of the opportunity can be clearly communicated, credit unions will gladly join in to get that value. Again, one of the value opportunities is not only sharing data in a pool but sharing a centralized group of data scientists.
BP: There’s great value in being able to take information from multiple credit unions as an anonymized aggregate and find patterns in the customers and in the states – then being able to package that up for them across the country. Individual credit unions cannot do this alone. There’s such a value in bringing all that data together. It’s a terrific opportunity.