Data Pipelines course material #56

Open · wants to merge 6 commits into main
Conversation

bhupatiraju commented:

This pull request includes tutorial notes for the Data Pipelines topic (part of the advanced topics section of the Python course). It uses the Kenya BOOST financial data to illustrate constructing a data pipeline with the medallion architecture and then automating that pipeline.

It contains the following:

  1. Introduction file: a notebook named intro-to-data-pipelines that gives an overview of data pipelines and illustrates their importance in data processing workflows.

  2. Data processing walk-through: files named Bronze, Silver and Gold that contain the data processing code for the Kenya data, following the medallion architecture (a minimal sketch of this flow is included below).

  3. Additional data: a file named subnational_population that retrieves data from the World Bank API and restricts it to the columns needed for merging with the cleaned Kenya data.

  4. Aggregation: a simple aggregation combining the subnational population data with the cleaned Kenya data, illustrating a basic use case.

  5. Orchestration: a section on orchestration using Databricks Workflows, detailing how to automate and manage the data processing pipeline effectively (contained in the intro-to-data-pipelines file).
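For orientation, here is a minimal, untested sketch of the bronze → silver → gold flow the notebooks implement. The file path, table names, and column names below are illustrative placeholders, not the actual ones used in the course material:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("kenya-boost-pipeline").getOrCreate()

# Bronze: ingest the raw BOOST extract as-is (path is a placeholder)
raw = (spark.read.format("csv")
       .option("header", "true")
       .option("inferSchema", "true")
       .load("/data/raw/kenya_boost.csv"))
raw.write.format("delta").mode("overwrite").saveAsTable("bronze_kenya_boost")

# Silver: clean and standardize the bronze table
silver = (spark.read.table("bronze_kenya_boost")
          .dropDuplicates()
          .withColumn("year", F.col("year").cast("int")))
silver.write.format("delta").mode("overwrite").saveAsTable("silver_kenya_boost")

# Gold: aggregate to an analysis-ready table
gold = (spark.read.table("silver_kenya_boost")
        .groupBy("county", "year")
        .agg(F.sum("expenditure").alias("total_expenditure")))
gold.write.format("delta").mode("overwrite").saveAsTable("gold_kenya_boost")
```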

weilu (Member) commented Oct 10, 2024:

Thanks @bhupatiraju! Following the convention of other modules, can you add a README.md and clear the outputs of the notebooks?

@bhupatiraju bhupatiraju marked this pull request as draft December 10, 2024 18:54
@weilu weilu marked this pull request as ready for review December 17, 2024 16:32
## Contact

If you have any questions you can contact one of the teams behind this training
on [email protected].

@bhupatiraju this README seems identical to what's currently at the root of this repo. Did you mean to copy and modify?

weilu (Member) left a comment:

@bhupatiraju I think the key thing to consider is to re-organize the flow to have a smaller code example (subnational population) moved up front, right after the motivations for data pipelines – see my inline comments.

General: consider running the content through ChatGPT for proofreading. It's generally well written, but I feel there are some potential improvements that could easily be picked up by ChatGPT.

Some questions to anticipate:

  • Why do we need to use pyspark to process the Kenya BOOST data? Can we not just use pandas? When to use pyspark vs plain pandas?
  • Why do we have separate scripts for different stages of the medallion architecture? Can we not just have a single script?
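One way to pre-empt the first question in the notebook is a short side-by-side: pandas is simplest when the data fits in one machine's memory, while PySpark distributes the work and persists results as Delta tables for reuse. A rough sketch, with hypothetical paths and column names:

```python
import pandas as pd

# pandas: simplest when the dataset fits in a single machine's memory
pdf = pd.read_csv("/dbfs/data/raw/kenya_boost.csv")  # hypothetical path
by_county = pdf.groupby("county")["expenditure"].sum()

# PySpark: the same aggregation, but distributed across the cluster and
# able to persist the result as a Delta table for other notebooks to reuse.
# (`spark` is the session Databricks provides in every notebook.)
from pyspark.sql import functions as F

sdf = spark.read.csv("/data/raw/kenya_boost.csv", header=True, inferSchema=True)
sdf.groupBy("county").agg(F.sum("expenditure").alias("total_expenditure")) \
   .write.format("delta").mode("overwrite").saveAsTable("boost_by_county")
```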

"\n",
" In this tutorial we will use the workflows feature in **databricks** for orchestration, although it's possible to consrtuct it in pySpark itself. \n",
"\n",
"**Monitoring and Logging:**\n",

Add a new line after this section title, so it is consistent with the rest

}
},
"source": [
"#### The Core of the Data Pipeline Process\n",

This section title doesn't quite read right. How about something like "What are the building blocks of a data pipeline?" to keep the question format used for the previous 2 sections?

}
},
"source": [
"## Introduction to Data Pipelines"

Consider using # as this is the h1 level heading for this notebook

}
},
"source": [
"#### What is a data pipeline?\n",

This is the next level of section heading, so should use ## (h2). Same consistency should be applied to all subsequent headings.

"#### The Core of the Data Pipeline Process\n",
"A data pipeline is a structured framework that enables the flow of data from source to destination, encompassing several key processes. Specific implementation may vary but the fundamental components of a data pipeline can be abstracted as follows:\n",
"\n",
"**Data Ingestion:**\n",

Use proper headings (e.g. ###) so the table of content will nest properly. Same below.

}
},
"source": [
"The flow diagram below shows the flow of data through these layers, and we will illstrae this model using the Kenya BOOST data. \n",

s/illstrae/illustrate

" .option(\"inferSchema\", \"true\")\n",
" .load(Data_DIR))\n",
"\n",
"# Clean column names by replacing spaces and special characters\n",

Can you check if this step is still necessary? I think I read somewhere that the latest version of Delta(?) accommodates spaces, so perhaps special characters too?
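For reference, recent Delta Lake releases do support column mapping, which tolerates spaces and special characters in column names; if the workspace runtime has it, the renaming step may be optional. An untested sketch (table name hypothetical):

```python
# Untested: enable column mapping so the Delta table accepts column names
# containing spaces and special characters (requires a recent Delta version).
spark.sql("""
    ALTER TABLE bronze_kenya_boost SET TBLPROPERTIES (
        'delta.columnMapping.mode' = 'name',
        'delta.minReaderVersion' = '2',
        'delta.minWriterVersion' = '5'
    )
""")
```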

"source": [
"#### Silver\n",
"\n",
"In the silver stage (again the script is found in the data_pipelines project folder as silver.py), we read the data produced in the bronze stage and transform and refine the data. \n",

link to the silver script instead of (again....)

}
},
"source": [
"##### Subnational Population\n",

I'd suggest moving this example up to be right before the "The Core of the Data Pipeline..." section. Not only is this a good stand-alone example of a simple data pipeline covering ingestion, processing, and writing, it also provides a quick recap of basic data processing using pandas. The only thing new to someone who already knows pandas is writing to the data lake in Delta format – a good opportunity to introduce the concepts of Spark and Delta. Then you can move on to the next section on the basic components / building blocks of a data pipeline, refer back to this simple example, and add the orchestration-with-scheduling demo to show how this can automate updating and saving the subnational population dataset, which can be reused. Then the later BOOST example can focus on illustrating the medallion architecture and showcasing the re-use/merging with the subnational dataset. This way, we build the concepts on top of each other and introduce them gradually to the participants. Thoughts?


It also breaks up the wall of instruction/concept-only text up front.
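To make the suggestion concrete, the subnational population example could introduce Spark and Delta in just a few lines. A rough sketch, with a placeholder URL and column names standing in for the real World Bank API call:

```python
import pandas as pd

# Familiar pandas territory: fetch and trim the subnational population data.
# The URL and column names below are placeholders for the real WB API call.
pop = pd.read_csv("https://example.org/kenya_subnational_population.csv")
pop = pop[["admin1_region", "year", "population"]]

# The only new concept for a pandas user: convert to a Spark DataFrame and
# persist it as a Delta table so other projects can reuse it.
# (`spark` is predefined in Databricks notebooks.)
spark.createDataFrame(pop).write.format("delta") \
     .mode("overwrite").saveAsTable("subnational_population")
```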

Comment on lines +970 to +976
"source": [
"### Lakehouse\n",
"\n",
"A lakehouse is a unified architecture which enables storage of structured and unstructured data in a unified system. The databricks lakehouse is a specific implementation, which offers tools to process this data in the lakehouse setup. \n",
"\n",
"It allows for unified data management, and more importantly avoids **data silos**. In our setting, the delta tables constructed are not necessarily tied to the project, and can be accessed across multiple projects. For instance, the table containing subnational population for Kenya can be accessed for the purposes of a different different project. "
]

Not sure it's necessary to introduce the lakehouse, especially towards the end. Consider removing it, or moving it up to where we first read from or write to the lake.
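If the section is kept (wherever it lands), a one-liner could ground the cross-project reuse claim; the table name here is hypothetical:

```python
# Any notebook in the same workspace can reuse the shared Delta table directly
pop = spark.read.table("subnational_population")
```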
