This article was published as a part of the Data Science Blogathon.
Hello, data-enthusiast! In this article let’s discuss “Data Modelling” right from the traditional and classical ways and aligning to today’s digital way, especially for analytics and advanced analytics. Yes! Of course, last 40+ years we all worked for OLTP, and followed by we started focusing on OLAP.
After cloud ear come into the picture the Data become very crazy level and every industry started zooming them and looking at different levels and perspectives. So, Big Data, Data Platform, Data Analytics, Data Science, and many more buzzwords are popping out of the window
“This is the technique is used to characterize the data and help us to know how it is stored in the available tables and alongside with other table and association between them”
image designed by author shanthababu
Before getting into Data Modelling, let’s understand the few terminologies which is the ground for DATA architecting and modeling, which are nothing but OLTP and OLAP.
OLTP is nothing but Online Transaction Processing, and we can call this database workload used for transactional systems, which we use to play around with DDL, DML, and DCL.
OLAP is Online Analytical Processing, database workloads are used for modern data warehousing systems, in which we use to play around SELECT queries with simple or complex queries by filtering, grouping, aggregating, and portioning a large data set quickly for reporting/visualization for Data Analyst and Dataset for Data Scientists for specific reasons.
OLTP |
OLAP | |
Focus | Day-To-Day Operations |
Analysis and Analytics |
DB Design |
Application-Specific | Business Driven |
Nature Of the Data |
Current [RDBMS] |
Historical and Dimensional |
DB Size |
In GB |
In TB |
What is Data Modelling
This defines the abstract model that organizes the Description, Semantics, and Consistency constraints of data.
DWH Data Model is like an architect’s building plan, which helps to build conceptual models and set a relationship between data-item, let’s say Dimension and Fact, and how they are linked together.
While deriving the data model, there are several factors that need to be considered, these factors vary based on the different stages of the Data Lifecycle.
Yes! The Modern Data Warehouse systems solve many problems in business challenges
So far, we have discussed the concepts around the Modern DWH system, Let’s move on to data modelling components and techniques.
Generally, before building the model, each table would undergo the below stages, conceptual, logical, and physical, so exactly in the last stage only we would realize the model as accepted by the business.
Source: image designed by author shanthababu
The main components are Fact and Dimension tables are the main two tables that are used when designing a data warehouse. The fact table contains the measures of columns and a special key called surrogate, that link to the dimensions tables.
Facts: To define FACTS in one word that is nothing but Measures
It can be measured attributes of the fields, it can be Quantitatively Measured, and in Numerical Quantities. Generally, it would be a number of orders received and products sold.
Dimensions: It has the attributes and basically “Category Values” or “Descriptive Definition” would be the Product Name, Description, Category, and so on.
Source: image designed by author shanthababu
For most of the scenarios, while developing the data modelling for DWH, we use to follow the Star Schema or Snowflake Schema, or Kimball’s Dimensional Data Modelling.
Source: image designed by author shanthababu
Star Schema: This is the most common technique and basic modelling type and is easy to understand. In which Fact table is connected with other all Dimension tables and considerably accepted architectural model and used to develop DWH and Data marts. Each dimension table in the star schema has a Primary-Key and which is related to a Foreign-Key. In the Fact table. joining the tables and querying a little complex and performance a bit slow.
The representation of this model seems like a star with the Fact table at the center and dimensions-tables connecting from all other sides of it, constructing a STAR-like model
Source: image designed by author shanthababu
Snowflake Schema: This is an extension of the Star Schema with little modification and reduced load and improved performance. here the dimensions tables are normalized into multiple related tables as sub-dimension. So, it minimizes data redundancy. Apparently, it has multiple levels of joins which leads to less query complexity and ultimately improves query performance.
Tables are arranged logically and a many-to-one relationship hierarchy structure and it is resembling a SNOWFLAKE-like pattern. It has more joins between dimension tables, so performance issues might be in place, which leads to the slow query processing times for data retravel.
Let’s do a quick comparison of Star & Snowflake Schema
Star Schema | Snowflake Schema |
Simplified design and easy to understand | Complex design and a little difficult to understand |
Top-Down model | Bottom-Up model |
Required more space | Less Space |
The fact table is surrounded by Dimension tables | The fact table is connected with dimension tables and dimension tables are connected with sub-dimension tables in normalized |
Low query complexity | Complex query complexity |
Not normalized, so there is a lesser number of relationships and foreign keys. |
Normalized, so required number of foreign keys and the well-defined relationship between tables |
Since not normalized, a High volume of data redundancy | Since normalized, Low volume data redundancy. |
Fast query execution time | Low query execution time due to more joins |
One Dimensional | Multidimensional |
Everything is fine with the star schema, as we understood that this is Flexible, Extensible, and many more. But not answered business process and questions from DWH.
Kimball’s answer to below dimensional data modelling.
Then why do we need Kimball’s Approach? Obviously, we need them to Expedite the business value and Performance enhancement.
Expedite the business value: When you want to speed to business value, the data needs to be denormalized, so that BI teams can deliver to the business quickly and reliably and improve analytical workloads and performance.
Let’s quickly go through Inmon DWH Modelling, it follows a top-down approach. In this model, OLTP systems are a data source for DWH and play as a central repository of data in 3NF. Followed by this Datamart is plugged in and in 3NF. Comparatively with Kimball’s model, this Inmon is not that great option while dealing with BI and AI and data provisioning.
Kimball | Inmon |
De-normalized data model. | Normalized data model. |
Bottom-Up Approach | Top-Down Approach |
Data Integration mainly focuses on Individual business-area(s). |
Data Integration focuses on Enterprise specific |
Data source systems are highly stable since the Datamart stage will take care of the challenges |
Data source systems have a high rate of change Since DWH is plugged with the Data source directly. |
Building time-lime takes less time. | Little complex and required more time. |
Involves an iterative mode and is very cost-effective. | Building the blocks might consume a high cost. |
Functional and Business knowledge is enough to build the model. |
Understanding of Database, Table, Columns and key relationship knowledge is required to build the model. |
Challenge in maintenance | Comparatively easy to maintenance |
Less DB space is enough |
Comparatively more DB space is required |
So far, we have discussed various data modelling techniques and their benefits around them.
Data Vault Model (DVM): What had discussed models earlier are predominantly focused on Classical or Modern Data Warehousing and Reporting systems. All we know now is we’re in the digital world delivering a Data Analytics Service to support enterprise-level systems like rich BI, Modern DWH, and Advanced Analytics like Data Science, Machine Learning, and extensive AI. This methodology is an agile way of designing and building modern, efficient and effective DWHs.
DVM is composed of multiple components like Model, Methodology, and Architecture, this is quite different from other DWH modelling techniques in current use. Another way around this is simply we can say that this is NOT a framework, product, and any service, instead, we can say this is Very Consistency, Scalability, highly Flexibility, easily Auditability, and specifically AGILITY. Yes! It is a modern agile way of designing DWH for various systems as mentioned earlier. Along with we can incorporate and implement the standards, policies, and best practices with the help of a well-defined process.
This model consists of three elements Hub, Link, and Satellite.
Hubs: This is one of the core building blocks in DVM. Which is to record a unique list of all the business keys for a single entity. Let’s say, for example, an It may contain a list of all Customer IDs, Employee IDs, Product IDs, and Order IDs in the business.
Links: Is fundamental component in a DVM is Links, which form the core of the raw vault along with other elements Hubs, and Satellites. Generally speaking, this is an association or link, between two business keys in the model. A typical example is Orders and the Customers in the respective table which is associated with customers and orders. And one more I can say store and employee working in store under various department so the link would be link_employee_store
Satellites: In DVM, Satellites connect to other elements in DVM (Hubs or Links). Satellite tables hold attributes related to a link or hub and update them as they change. For example, SAT_EMPLOYEE may feature attributes such as the employee’s Name, Role, Dob, Salary, or Doj. Simply say “The Point in Time Record in the table”. In simple language, we can say Satellites contain data about their parent Hub or Link and Metadata along with when the data has been loaded, from where, and effective business date details. Where the actual data resides for our business entities in the other elements discussed earlier (Hubs and Links).
In DVM architecture each Hub and Link record may have one or more child Satellite records, all the changes to that Hubs or Link.
Source: image designed by author shanthababu
Pros
Cons
So far, we discussed data and modelling concepts in the below items in detail,
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.
Lorem ipsum dolor sit amet, consectetur adipiscing elit,