When we talk to credit unions about their plans for big data/analytics we try to explain that the word “data warehouse” is a loosely defined term and that a true enterprise data warehouse requires a significant amount of planning and a robust architecture to meet the needs of the users. The architecture we have chosen for our OnApproach M360™ data model is the star schema developed by Ralph Kimball.
Our data model is designed specifically for the credit union industry using the Kimball methodology. Why did we choose the Kimball star schema architecture? One of our employees came across this blog that does a great job of explaining the three prevailing data warehouse architecture options. I have provided a link and the actual blog below. It is a must read for any credit union data architect or business intelligence developer.
Corporate Information Factory (CIF) data warehouse architecture was pioneered by Bill Inmon. It is also referred to as top-down approach because it represents a complete view of enterprise data. Data is stored in 3rd normal form (3NF) closely resembling source system structures. To capture historic data changes timestamps are added to each table key columns. Model does not support direct reporting and requires a layer of dimensional data marts.
Kimball Dimensional Data Mart approach was developed by Ralph Kimball. It is sometimes called bottom-up approach because it recommends building integrated reporting data marts sequentially, based on business priority. Data is stored in star or snowflake structure: large fact tables representing numerical measures and counts are in 3NF, smaller dimensions are denormalized to 2NF. Model is optimized for direct reporting – it does not require any additional layers on top of it.
Data Vault is a data warehousing architecture developed by Dan Linstedt in 1990s. It is based on the concept of Hubs, Links and Satellites. Hubs represent source system business keys in master tables (e.g. Customer, Product, etc.). Links represent associations/transactions between Hubs with validity period. Satellites point to links and contain detailed attributes of transaction and their validity period. Data is loaded from source systems “as is”, with no quality checks, validation or cleansing. The structure is highly normalized (4NF+) and is not designed for direct reporting. Most of the industry-standard BI vendor reporting tools would not support direct reporting out of the model. Similar to Inmon methodology, Data Vault model relies on dimensional data marts (Kimball approach) to expose data to users.
Kimball approach is typically least complex, easiest and fastest to implement, provides best combination of loading and querying performance. It is ideally structured to support iterative agile development (subject area sprints) and would be optimal approach for most organizations.
Inmon’s key advantage is in completeness of enterprise data model – although it is very difficult to achieve in real life and requires substantial investment and long-term commitment.
Data Vault is innovative concept and it has merits when compliance demands are very high and auditing and traceability requirements frequently change.
Below is high level comparison of 3 data warehousing methodologies across key relevant categories:
Data is stored in 3NF with structure that closely resembles source system with addition of timestamp keys that allow capturing of changes over time. The structure is not optimized for direct queries and requires dimensional data marts for reporting. These data marts need to be persisted (physical) for all but smallest data volumes for performance reasons. This essentially doubles storage requirements compared to Kimball approach
|Data is stored in 3NF hub-link-satellite structure with several time-stamped copies of data that capture of changes over time. The structure is very difficult to query directly and requires dimensional data marts for reporting. These data marts need to be persisted (physical) for all but smallest data volumes for performance reasons. This essentially doubles storage requirements compared to Kimball approach||Data is stored in final consumption format optimized for reporting. Changes are tracked via slowly changing dimensions||Kimball|
Model requires 2 ETL processes :- loading from source systems- building reporting data marts
|Model requires 2 ETL processes :- loading from source systems- building reporting data marts||Model requires single ETL process that loads final data model used for reporting||Kimball|
|ETL scalability and loading performance||Architecture supports loading of multiple data sources in parallel. Within each source table loads need to be sequenced based on dependencies in source 3NF model||Architecture supports loading of multiple data sources in parallel. Within each source the model structure supports better parallelism (hubs loaded first, then links, then satellites)
|Architecture supports loading of multiple subject areas in parallel (dimensions first, then facts). 2 tier architecture requires single ETL layer which delivers faster processing||Kimball|
|Auditing, traceability and compliance||Historical information is captured by inserting new record each time source data changes. Change tracking is easy as data warehouse structure closely resembles source||Historical changes are capturing by inserting new links and satellites. Provides the most detailed and auditable capture of changes. Change tracking is fairly complex due to highly normalized structure
|Uses concept of slowly changing dimensions to track historic changes. Requires business to identify attributes requiring tracking prior to load. Adding new attributes is possible but will not re-create historic changes||Data Vault|
|Modeling||Modeling is generally not complex as the structure essentially copies source system with some denormalizations and addition of timestamps to each primary key to track changes. Requires additional modeling for dimensional datamarts||Modeling can be fairly complex as link/satellite structure can be modeled in multiple different ways. Requires additional modeling for dimensional datamarts||Modeling complexity varies by industry/subject area. Well-established methodology exists with guidelines and modeling frameworks for each subject area type by industry. Since the model itself is exposed for final consumption it does not need additional modeling to capture source data as in Inmon and Data Vault approach.
|Query performance||Direct query is very slow due to 3NF structure of the data. Requires data marts (virtual or materialized) for querying and reporting||Direct query is very slow due to highly normalized structure of the data. Every join requires date component which makes queries very complex Requires data marts (virtual or materialized) for querying and reporting||Model is designed for highest query performance by denormalizing dimensions containing filter attributes and hierarchies and keeping fact data in 3NF for optimal performance and storage. Large fact tables can be easily partitioned (typically on date key) and indexed to support high performance. Newest columnstore in-memory indexing technologies work very well for fact tables||Kimball|
Article written by: Paul Ablack, Founder and Chief Executive Officer at OnApproach