From d25d017e3d510dee8c7eabb7515649594d34fb1a Mon Sep 17 00:00:00 2001
From: Vijay Janapa Reddi
Date: Tue, 10 Oct 2023 20:28:27 -0400
Subject: [PATCH] added learning objectives

---
 data_engineering.qmd | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/data_engineering.qmd b/data_engineering.qmd
index 75b474be..f252306e 100644
--- a/data_engineering.qmd
+++ b/data_engineering.qmd
@@ -1,14 +1,27 @@
 # Data Engineering
 
-::: {.callout-note collapse="true"}
+Data is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.
+
+::: {.callout-tip collapse="true"}
 ## Learning Objectives
 
-* coming soon.
+* Understand the importance of clearly defining the problem statement and objectives when embarking on an ML project.
+
+* Recognize various data sourcing techniques like web scraping, crowdsourcing, and synthetic data generation, along with their advantages and limitations.
+
+* Appreciate the need for thoughtful data labeling, using manual or AI-assisted approaches, to create high-quality training datasets.
+
+* Briefly learn different methods for storing and managing data such as databases, data warehouses, and data lakes.
+
+* Comprehend the role of transparency through metadata and dataset documentation, as well as tracking data provenance to facilitate ethics, auditing, and reproducibility.
+
+* Understand how licensing protocols govern legal data access and usage, necessitating careful compliance.
+
+* Recognize key challenges in data engineering, including privacy risks, representation gaps, legal restrictions around data access, and balancing competing priorities.
 
 :::
 
 ## Introduction
 
-Data is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.
 
 Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy$^{1}$, aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.