Building data pipelines with Airflow

Build production-ready data pipelines with Airflow, working with two different input datasets that will need to be processed and cleaned before loading the information separately into a data warehouse (using ELT pipelines) and a data mart for analytical purposes.

Part 1: Design a data warehouse

Design the architecture of a data warehouse (star schema) on Snowflake with 4 layers (raw, staging, warehouse, datamart). The warehouse layer needs a fact table that contains only IDs and metrics, plus at least 4 dimension tables (e.g. listing, host, suburb, LGA, etc.). In addition, the warehouse will also contain the two census tables as dimension tables.
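As a rough illustration, the warehouse-layer DDL could be kept as plain SQL strings that the Airflow DAG later sends to Snowflake. This is only a sketch: the schema, table and column names (warehouse.fact_listing, warehouse.dim_host, etc.) are assumptions, not the project's actual design.

```python
# warehouse_ddl.py -- hypothetical DDL strings executed on Snowflake by the DAG.
# Schema, table and column names are illustrative assumptions only.

FACT_LISTING_DDL = """
CREATE TABLE IF NOT EXISTS warehouse.fact_listing (
    listing_id           INTEGER,
    host_id              INTEGER,
    suburb_id            INTEGER,
    lga_id               INTEGER,
    month_year           DATE,            -- grain of the fact table
    is_active            BOOLEAN,         -- derived during staging (e.g. from has_availability)
    price                NUMBER(10, 2),   -- metric
    availability_30      INTEGER,         -- metric (drives stays / revenue estimates)
    number_of_reviews    INTEGER,         -- metric
    review_scores_rating NUMBER(5, 2)     -- metric
);
"""

DIM_HOST_DDL = """
CREATE TABLE IF NOT EXISTS warehouse.dim_host (
    host_id            INTEGER,
    host_name          VARCHAR,
    host_neighbourhood VARCHAR,
    host_is_superhost  BOOLEAN
);
"""
```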

For the data mart, create the 3 following tables/views:

  1. Per “listing_neighbourhood” and “month/year” (a sketch of this view follows the list):
     - Active listings rate
     - Minimum, maximum, median and average price for active listings
     - Number of distinct hosts
     - Superhost rate
     - Average of review_scores_rating for active listings
     - Percentage change for active listings
     - Percentage change for inactive listings
     - Total number of stays
     - Average estimated revenue per active listing
     Name it “dm_listing_neighbourhood”.

  2. Per “property_type”, “room_type”, “accommodates” and “month/year”:
     - Active listings rate
     - Minimum, maximum, median and average price for active listings
     - Number of distinct hosts
     - Superhost rate
     - Average of review_scores_rating for active listings
     - Percentage change for active listings
     - Percentage change for inactive listings
     - Total number of stays
     - Average estimated revenue per active listing
     Name it “dm_property_type”.

  3. Per “host_neighbourhood_lga”, which is “host_neighbourhood” mapped to an LGA (e.g. host_neighbourhood = 'Bondi' becomes host_neighbourhood_lga = 'Waverley'), and “month/year”:
     - Number of distinct hosts
     - Estimated revenue
     - Estimated revenue per distinct host
     Name it “dm_host_neighbourhood”.
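A partial sketch of what the first of these views might look like, again as a SQL string that Airflow can send to Snowflake. The column names (is_active, availability_30, …) and the KPI formulas are assumptions built on the sketch above, and the two percentage-change KPIs are left out because they require window functions over the previous month.

```python
# Hypothetical (partial) definition of dm_listing_neighbourhood.
# KPI formulas and column names are assumptions, not the actual implementation.
DM_LISTING_NEIGHBOURHOOD_SQL = """
CREATE OR REPLACE VIEW datamart.dm_listing_neighbourhood AS
SELECT
    l.listing_neighbourhood,
    f.month_year,
    COUNT_IF(f.is_active) / COUNT(*)                        AS active_listings_rate,
    MIN(IFF(f.is_active, f.price, NULL))                    AS min_price_active,
    MAX(IFF(f.is_active, f.price, NULL))                    AS max_price_active,
    MEDIAN(IFF(f.is_active, f.price, NULL))                 AS median_price_active,
    AVG(IFF(f.is_active, f.price, NULL))                    AS avg_price_active,
    COUNT(DISTINCT f.host_id)                               AS distinct_hosts,
    COUNT(DISTINCT IFF(h.host_is_superhost, f.host_id, NULL))
        / COUNT(DISTINCT f.host_id)                         AS superhost_rate,
    AVG(IFF(f.is_active, f.review_scores_rating, NULL))     AS avg_review_score_active,
    SUM(IFF(f.is_active, 30 - f.availability_30, 0))        AS total_stays,
    AVG(IFF(f.is_active, (30 - f.availability_30) * f.price, NULL)) AS avg_est_revenue_active
FROM warehouse.fact_listing f
JOIN warehouse.dim_listing  l ON l.listing_id = f.listing_id
JOIN warehouse.dim_host     h ON h.host_id    = f.host_id
GROUP BY 1, 2;
-- The percentage-change KPIs (active / inactive listings) would be added with
-- LAG() over the previous month and are omitted from this sketch.
"""
```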

Part 2: Populate the data warehouse following an ELT pattern with Airflow

  1. Use an Airflow DAG to load both input datasets in full into the defined star schema using an ELT pattern (data processing is done on Snowflake through instructions sent by Airflow). A minimal DAG skeleton follows this list.

  2. Manually export the 3 tables/views of the data mart as CSVs after loading the entire dataset. The results must be ordered as follows:
     - “dm_listing_neighbourhood.csv”, ordered by “listing_neighbourhood” and “month/year”
     - “dm_property_type.csv”, ordered by “property_type”, “room_type”, “accommodates” and “month/year”
     - “dm_host_neighbourhood.csv”, ordered by “host_neighbourhood_lga” and “month/year”
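A minimal DAG skeleton for this ELT load, assuming a Snowflake connection called snowflake_conn, a stage holding the CSVs, and the table names from the Part 1 sketch. None of these names come from the actual repository; the sketch only shows the shape of the pipeline and the order of operations.

```python
# airbnb_elt_dag.py -- hypothetical DAG skeleton for the ELT load.
# Connection id, stage, schema and column names are assumptions for illustration.
from datetime import datetime

from airflow import DAG
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="airbnb_elt",
    start_date=datetime(2021, 1, 1),
    schedule_interval=None,  # triggered per monthly file in this sketch
    catchup=False,
) as dag:
    # 1. Load the raw CSVs into the raw layer (extract + load happen in Snowflake).
    load_raw = SnowflakeOperator(
        task_id="load_raw",
        snowflake_conn_id="snowflake_conn",
        sql="COPY INTO raw.listings FROM @raw.listings_stage "
            "FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '\"');",
    )

    # 2. Clean and standardise into the staging layer (the "T" runs on Snowflake).
    build_staging = SnowflakeOperator(
        task_id="build_staging",
        snowflake_conn_id="snowflake_conn",
        sql="CREATE OR REPLACE TABLE staging.listings AS "
            "SELECT * FROM raw.listings;  -- cleaning rules omitted in this sketch",
    )

    # 3. Dimensions before the fact table, so foreign keys can be resolved.
    load_dimensions = SnowflakeOperator(
        task_id="load_dimensions",
        snowflake_conn_id="snowflake_conn",
        sql=[
            "INSERT INTO warehouse.dim_host "
            "SELECT DISTINCT host_id, host_name, host_neighbourhood, host_is_superhost "
            "FROM staging.listings;",
            "INSERT INTO warehouse.dim_listing "
            "SELECT DISTINCT listing_id, listing_neighbourhood, property_type, room_type, accommodates "
            "FROM staging.listings;",
        ],
    )

    # 4. Fact table after the dimensions (joins to resolve suburb_id / lga_id omitted).
    load_fact = SnowflakeOperator(
        task_id="load_fact",
        snowflake_conn_id="snowflake_conn",
        sql="INSERT INTO warehouse.fact_listing "
            "SELECT listing_id, host_id, NULL, NULL, month_year, is_active, price, "
            "availability_30, number_of_reviews, review_scores_rating "
            "FROM staging.listings;",
    )

    # 5. Rebuild the data mart views last so the KPIs reflect the newly ingested month.
    refresh_datamart = SnowflakeOperator(
        task_id="refresh_datamart",
        snowflake_conn_id="snowflake_conn",
        # In practice the three dm_* view definitions (see the Part 1 sketch) go here.
        sql="SELECT 1;",
    )

    load_raw >> build_staging >> load_dimensions >> load_fact >> refresh_datamart
```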

Important:

- Truncate all tables before running your DAG for the first time (see the snippet after this list).
- Be careful with the order of operations, especially when loading dimension and fact data.
- The KPIs in the data mart must be kept up to date as new data is ingested.
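The one-off reset before the first run can be as simple as a list of TRUNCATE statements run from a Snowflake worksheet or a dedicated task; the table names below follow the assumed schema from the sketches above.

```python
# Hypothetical one-off reset executed before the first DAG run.
# Table names follow the assumed schema used in the sketches above.
TRUNCATE_ALL_SQL = [
    "TRUNCATE TABLE IF EXISTS raw.listings;",
    "TRUNCATE TABLE IF EXISTS staging.listings;",
    "TRUNCATE TABLE IF EXISTS warehouse.dim_host;",
    "TRUNCATE TABLE IF EXISTS warehouse.dim_listing;",
    "TRUNCATE TABLE IF EXISTS warehouse.fact_listing;",
]
```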

Part 3: Ad-hoc analysis

Answer the following questions with supporting results (you can only use SQL):

  1. What are the main differences from a population point of view (e.g. a higher population of under-30s) between the best-performing “listing_neighbourhood” and the worst one (in terms of estimated revenue per active listing) over the last 12 months?

  2. What is the best type of listing (property type, room type and accommodates) for the top 5 “listing_neighbourhood” (in terms of estimated revenue per active listing) to have the highest number of stays?

  3. Are hosts with multiple listings more inclined to have their listings in the same LGA as the one they live in?

  4. For hosts with a single listing, can their estimated revenue over the last 12 months cover the annualised median mortgage repayment of their listing’s “listing_neighbourhood”?
