The Trellance Data Blog

What is a Data Lake? - Part 2: Sink or Swim

Posted by Mark Portz on Jul 26, 2017 12:07:00 PM

In my previous blog post, “What is a Data Lake? Part 1”, I discussed how to define a data lake and how it differs from a data warehouse. To briefly recap, a data lake is a massive repository for raw data in its native format. To better understand the idea, let’s dive a bit deeper into the advantages and disadvantages of data lakes.

To start, data lakes offer financial institutions a number of advantages:

  1. Storage/Scalability – As mentioned in the definition, data lakes are capable of holding vast amounts of data, and can easily be scaled up if needed.
  2. Input Form – Because data lakes accept data in its native form, it is easy for credit unions to load data from as many sources as desired. Data from disparate sources, internal or external, structured or unstructured, can be stored in the data lake regardless of format.
  3. Price – Data lakes are a relatively affordable form of data repository. Because of the flat architecture, there is no need for special hardware systems, and the reduction in servers and licenses actually helps to cut costs.
  4. Enterprise Solution – Data lakes are designed so that data can be retrieved by query enterprise-wide. Anyone in the organization with a question should be able to access the data lake and extract the necessary data for analysis.

These advantages work together to create a very compelling product for certain purposes. Before declaring a data lake your primary tool for analytics, though, there are several other characteristics of this technology to consider:

  1. Connectivity – As previously mentioned, data is stored in its native form. This means the various data sources are collected in the same place, but not integrated. The data silos still exist, which is a major pain point for enterprise analytics, as there is no single source of truth.
  2. Organization – Rather than a filing system, data lakes rely on metadata and a consistent tagging system. While this can be a very successful method, it requires strict governance. Especially if the lake is used as an enterprise system, everyone within the organization must understand how to tag data properly and must know the system well enough to retrieve the data they need. If not, you run the risk of ending up with a “data swamp”.
  3. Data Quality – Without proper governance and organization, it can become very difficult to determine data quality. Some users may be successfully finding value in the lake, but there is not necessarily a way to track data lineage or take advantage of the reports being built by other users.
  4. Training – While marketed as an enterprise solution, data lakes do not generally act as such in the real world. A true enterprise solution assumes every user in the organization has the training needed to analyze and manipulate the various data sources contained within the lake. Unfortunately, this does not tend to be the case, and most “users” will rely on others within the organization to find and analyze the data for them.
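
To make the Organization point a little more concrete, here is a minimal sketch in Python of what "governance through consistent tagging" might look like. Everything in it is an assumption made for illustration: the `Catalog` class, the `REQUIRED_TAGS` rule, and the file paths are hypothetical and do not come from any particular data lake product. The idea is simply that data cannot be registered without the agreed-upon tags, so it can always be found again later.

```python
from dataclasses import dataclass, field

# Hypothetical governance rule: every object in the lake must carry these tags.
REQUIRED_TAGS = {"source", "owner", "format"}


@dataclass
class Catalog:
    """Toy in-memory metadata catalog (an illustration, not a real product API)."""

    entries: dict = field(default_factory=dict)  # maps path -> tag dictionary

    def register(self, path: str, tags: dict) -> None:
        # Refuse untagged or partially tagged data; this is the governance step.
        missing = REQUIRED_TAGS - set(tags)
        if missing:
            raise ValueError(f"{path}: missing required tags {sorted(missing)}")
        self.entries[path] = dict(tags)

    def find(self, **criteria) -> list:
        # Return the paths whose tags match every requested key/value pair.
        return sorted(
            path
            for path, tags in self.entries.items()
            if all(tags.get(key) == value for key, value in criteria.items())
        )


catalog = Catalog()
catalog.register(
    "raw/core/members.csv",
    {"source": "core", "owner": "lending", "format": "csv"},
)
catalog.register(
    "raw/web/clicks.json",
    {"source": "web", "owner": "marketing", "format": "json"},
)
print(catalog.find(source="core"))
```

In this sketch, a file registered with only a `source` tag would be rejected outright. Without a rule like that, files pile up that nobody can find or trust, which is how a lake drifts toward a swamp.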

This is Part 2 of a blog series on data lakes. Part 1, “What is a Data Lake?”, covers how to define a data lake and how it differs from a data warehouse. Also, don't forget to subscribe to our blog to stay up to date on the latest in data analytics for credit unions.

Topics: Data Pool, Analytic Data Model, Data Lake