Building a data warehouse for an E-commerce website to answer business questions. The project consists of :
·Schema Design and Data Modeling
·Staging Area Implementation
·Data Warehouse Construction
·ETL Process Implementation
·Answering Business questions
you can find the source data used in this GitHub link.
we have developed a dimensional model design life cycle (DMDL) which consists of the following phases:
In this phase, we prioritize the business processes. The basic idea here is to identify the most and least feasible processes for building a dimensional model, so based on the data we have These are some processes for e-commerce business.
Business processes | Date | Time | Product | Customer | seller | feedback | …more |
---|---|---|---|---|---|---|---|
sales | x | x | x | x | x | x | |
Payment | x | x | x | x | |||
Shipping | x | x | x | x | |||
Customer Service | x | x | x | x |
The grain conveys the level of detail associated with the fact table measurements. Identifying the grain also means deciding on the level of detail you want to be made available in the dimensional model.The chosen grain is "Order Lifetime". It provides a comprehensive view of the entire lifecycle of each order. This level of detail allows for thorough analysis of customer behavior, order fulfillment, and payment processing. It enables tracking changes in order status over time and facilitates insights into the performance of both users and sellers throughout the order lifecycle.
During this phase, we identify the dimensions that are true to the grain we chose
Dimension table | Granularity of the dimension table |
---|---|
Time | This dimension stores information about hour, and a description of the time, such as morning, afternoon, and night |
Date | This dimension stores information down to the day level including month, quarter, and year |
Users | This dimension stores information about the user. |
Sellers | This dimension stores information about the seller. |
Feedback | This dimension stores information about the feedback for orders. |
Order_Fact Table Measurements:
Total price FKs from Dimensions: User Time Payment Seller Feedback Data
Choosing Primary Keys of Dimensions:
Dimension tables (UserDim, PaymentDim, Seller, Feedback) use surrogate keys. A surrogate key is:
An automatically generated unique identifier for each row It has no intrinsic meaning - just provides a unique id Usually an integer (identity) column For example: -UserDim - user_key int identity(1,1) -PaymentDim - payment_key int identity(1,1)
·This allows adding/updating rows flexibly without constraints from a natural key.
TimeDim table: TimeKey column is used as the primary key.
It is an integer identity column.
This makes it a surrogate key. We have used a surrogate key for TimeDim because: Time is a continuous measure and there may be duplicate times like 12:00 AM. A natural key based on time values alone would not be unique. An integer key provides a simple, unique identifier for each time.
DateDim table: DateKey column is used as the primary key
It is calculated as YEAR(@Date) * 10000 + MONTH(@Date) * 100 + DAY(@Date) This makes it a composite natural key We have used a composite natural key for DateDim
Why did we use a star schema?
The schema revolves around a central fact table called Order_Fact which contains the numeric facts like total price.
It has multiple dimension tables like UserDim, PaymentDim etc that provide attributes about the measures in the fact table.
The dimension tables are normalized with a primary key and connected to the fact table through foreign keys.
There are no connections between the dimension tables themselves, they only connect to the central fact table.
This structure forms a star shape with the fact table at the center and dimension tables radiating out from it.
We have Loaded the data from Excel sheets to a temporary database so that we can clean it and prepare it Data warehouse construction by SQL Server:
the code of creating it is found in this repo link .
we implemented ETL (Extract - Transform -Load) process using ssis tool , you can found the whole project here link .
After creating the dataware house, and loading data into it now we can answer business questions using sql queries sample of questions the bussness need to answer it :
When is the peak season of our e-commerce?
Select top 1 count(f.order_id) , d.season
From order_fact as f
Inner join dateDim as d
on d.[DateKey] = f.[order_date_key]
Group by d.season
Order by count(f.order_id) desc
and creating a dashboard represent these result :
to see the answers on Business questions and all the insights , see this queries file