Data Pipelines course material #56
base: main
Conversation
Thanks @bhupatiraju! Following the convention of other modules, can you add a README.md and clear the outputs of the notebooks?
## Contact

If you have any questions you can contact one of the teams behind this training
on [email protected].
@bhupatiraju this README seems identical to what's currently at the root of this repo. Did you mean to copy and modify?
@bhupatiraju I think the key thing to consider is to re-organize the flow to have a smaller code example (subnational population) moved up front, right after the motivations for data pipelines – see my inline comments.
General:
Consider running the content through ChatGPT for proofreading. It's generally well written, but I feel there are some potential improvements that could easily be picked up by ChatGPT.
Some questions to anticipate:
- Why do we need to use PySpark to process the Kenya BOOST data? Can we not just use pandas? When should we use PySpark vs. plain pandas? (See the sketch after this list.)
- Why do we have separate scripts for different stages of the medallion architecture? Can we not just have a single script?
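On the first question, a short contrast along these lines might pre-empt it in the notebook. This is only a rough sketch: the file path is a placeholder and `spark` is the session Databricks provides in a notebook.

```python
# Rough sketch only – the path is a placeholder. pandas loads everything into
# the driver's memory, which is fine for a single BOOST workbook; Spark reads
# lazily and distributes the work, which matters once files get large or we
# want to write Delta tables.
import pandas as pd

pdf = pd.read_csv("/dbfs/data/kenya_boost.csv")      # single-machine, eager

sdf = (spark.read.format("csv")                      # distributed, lazy
       .option("header", "true")
       .option("inferSchema", "true")
       .load("dbfs:/data/kenya_boost.csv"))
```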
"\n", | ||
" In this tutorial we will use the workflows feature in **databricks** for orchestration, although it's possible to consrtuct it in pySpark itself. \n", | ||
"\n", | ||
"**Monitoring and Logging:**\n", |
Add a blank line after this section title so it is consistent with the rest.
}
},
"source": [
"#### The Core of the Data Pipeline Process\n",
This section title doesn't quite read right. How about something like "What are the building blocks of a data pipeline?" to keep the question format used for the previous two sections?
}
},
"source": [
"## Introduction to Data Pipelines"
Consider using `#`, as this is the h1-level heading for this notebook.
}
},
"source": [
"#### What is a data pipeline?\n",
This is the next level of section heading, so it should use `##` (h2). The same convention should be applied to all subsequent headings.
"#### The Core of the Data Pipeline Process\n", | ||
"A data pipeline is a structured framework that enables the flow of data from source to destination, encompassing several key processes. Specific implementation may vary but the fundamental components of a data pipeline can be abstracted as follows:\n", | ||
"\n", | ||
"**Data Ingestion:**\n", |
Use proper headings (e.g. `###`) so the table of contents nests properly. Same below.
}
},
"source": [
"The flow diagram below shows the flow of data through these layers, and we will illstrae this model using the Kenya BOOST data. \n",
s/illstrae/illustrate
" .option(\"inferSchema\", \"true\")\n", | ||
" .load(Data_DIR))\n", | ||
"\n", | ||
"# Clean column names by replacing spaces and special characters\n", |
Can you check whether this step is still necessary? I think I read somewhere that the latest version of Delta(?) accommodates spaces in column names, and perhaps special characters too.
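For reference, a minimal sketch of the rename step next to the column-mapping alternative; `df` and the table name are placeholders, not the notebook's actual code.

```python
# Minimal sketch, assuming `df` is the raw DataFrame loaded above; the table
# name "boost_kenya_bronze" is a hypothetical placeholder.
import re

# Option A: explicit clean-up, replacing characters Delta has historically
# rejected in column names.
clean_df = df.toDF(*[re.sub(r"[ ,;{}()\n\t=]+", "_", c) for c in df.columns])
clean_df.write.format("delta").mode("overwrite").saveAsTable("boost_kenya_bronze")

# Option B: if the cluster runs a recent Delta Lake release, column mapping
# (TBLPROPERTIES 'delta.columnMapping.mode' = 'name') reportedly allows spaces
# and special characters in column names, which would make the rename above
# unnecessary – worth verifying against the Delta version on the cluster.
```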
"source": [ | ||
"#### Silver\n", | ||
"\n", | ||
"In the silver stage (again the script is found in the data_pipelines project folder as silver.py), we read the data produced in the bronze stage and transform and refine the data. \n", |
Link to the silver script here instead of the "(again....)" phrasing.
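Separately, for illustration, a minimal sketch of the kind of transformation a silver step could hold; table and column names are placeholders, not the actual silver.py code.

```python
# Minimal sketch of a silver-stage step, assuming the bronze table exists in
# the metastore; table and column names are hypothetical placeholders.
from pyspark.sql import functions as F

bronze_df = spark.table("boost_kenya_bronze")

silver_df = (bronze_df
             .dropDuplicates()
             .withColumn("year", F.col("year").cast("int"))
             .withColumn("executed", F.col("executed").cast("double"))
             .filter(F.col("executed").isNotNull()))

silver_df.write.format("delta").mode("overwrite").saveAsTable("boost_kenya_silver")
```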
}
},
"source": [
"##### Subnational Population\n",
I'd suggest moving this example up so it sits right before the "The Core of the Data Pipeline..." section. Not only is this a good stand-alone example of a simple data pipeline covering ingestion, processing, and writing, it also provides a quick recap of basic data processing with pandas. The only thing new to someone who already knows pandas is writing to the data lake in Delta format, which makes it a good opportunity to introduce the concepts of Spark and Delta. Then you can move on to the next section on the basic components / building blocks of a data pipeline, referring back to this simple example, and add the orchestration-with-scheduling demo to show how this can automate updating and saving the subnational population dataset, which can then be reused. The later BOOST example can then focus on illustrating the medallion architecture and showcasing the reuse/merging with the subnational dataset. This way, we build the concepts on top of each other and introduce them to the participants gradually. Thoughts?
It also breaks up the instruction/concept-only wall of text up front.
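To make the suggestion concrete, a rough sketch of what that stand-alone example could look like; the endpoint, columns, and table name are hypothetical placeholders, not the notebook's actual code.

```python
# Rough sketch only – endpoint, columns, and table name are placeholders.
import pandas as pd
import requests

# Ingest: pull subnational population into pandas.
resp = requests.get("https://example.org/api/kenya-subnational-population")
resp.raise_for_status()
pop_df = pd.DataFrame(resp.json())

# Process: keep only the columns the later merge needs.
pop_df = pop_df[["region", "year", "population"]].dropna()

# Write: hand over to Spark and save as a reusable Delta table.
(spark.createDataFrame(pop_df)
     .write.format("delta")
     .mode("overwrite")
     .saveAsTable("kenya_subnational_population"))
```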
"source": [ | ||
"### Lakehouse\n", | ||
"\n", | ||
"A lakehouse is a unified architecture which enables storage of structured and unstructured data in a unified system. The databricks lakehouse is a specific implementation, which offers tools to process this data in the lakehouse setup. \n", | ||
"\n", | ||
"It allows for unified data management, and more importantly avoids **data silos**. In our setting, the delta tables constructed are not necessarily tied to the project, and can be accessed across multiple projects. For instance, the table containing subnational population for Kenya can be accessed for the purposes of a different different project. " | ||
] |
Not sure if it's necessary to introduce the lakehouse, especially towards the end. Consider removing it, or moving it up to where we first read from or write to the lake.
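If the section stays, a one-liner like the following (the table name is a placeholder) would make the cross-project reuse point concrete:

```python
# Sketch: any other project on the same workspace can pick up the shared
# Delta table by name – the table name is a hypothetical placeholder.
pop_df = spark.table("kenya_subnational_population")
```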
This pull request includes tutorial notes for the Data Pipelines topic (part of the advanced topics section of the Python course), using the BOOST financial data from Kenya to illustrate the construction of a data pipeline with the medallion architecture and then automating that pipeline.
It contains the following:
- Introduction file: created a file named intro-to-data-pipelines that provides an overview of the data pipeline topic and illustrates its importance in data processing workflows.
- Data processing walk-through: created files called Bronze, Silver and Gold, which contain the data processing code using the medallion architecture for the Kenya data.
- Additional data: created a file called subnational_population that retrieves data from the WB API and restricts it to the columns needed for merging with the cleaned Kenya data.
- Aggregation: a simple aggregation using the subnational population and the cleaned Kenya data to illustrate a basic use case.
- Orchestration: added a section on orchestration using Databricks Workflows, detailing how to automate and manage the data processing pipeline effectively (contained in the intro-to-data-pipelines file).
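For orientation, the kind of job definition the orchestration section describes could be sketched roughly as below; notebook paths, cluster id, and schedule are placeholders, not the tutorial's actual configuration, and the same chain can be set up through the Workflows UI instead of the Jobs API.

```python
# Rough sketch of a Databricks Jobs (API 2.1) payload chaining the medallion
# notebooks – paths, cluster id, and schedule are hypothetical placeholders.
job_spec = {
    "name": "kenya-boost-pipeline",
    "tasks": [
        {
            "task_key": "bronze",
            "notebook_task": {"notebook_path": "/Repos/<user>/data_pipelines/bronze"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "silver",
            "depends_on": [{"task_key": "bronze"}],
            "notebook_task": {"notebook_path": "/Repos/<user>/data_pipelines/silver"},
            "existing_cluster_id": "<cluster-id>",
        },
        {
            "task_key": "gold",
            "depends_on": [{"task_key": "silver"}],
            "notebook_task": {"notebook_path": "/Repos/<user>/data_pipelines/gold"},
            "existing_cluster_id": "<cluster-id>",
        },
    ],
    "schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "UTC"},
}
```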