Skip to content

Commit

Permalink
Push dev branch build
Browse files Browse the repository at this point in the history
  • Loading branch information
Naeemkh committed Jun 8, 2024
1 parent ba6d85a commit 8b94a6a
Show file tree
Hide file tree
Showing 3 changed files with 10 additions and 10 deletions.
10 changes: 5 additions & 5 deletions docs/contents/data_engineering/data_engineering.html
Original file line number Diff line number Diff line change
Expand Up @@ -559,7 +559,7 @@ <h1 class="title"><span id="sec-data_engineering" class="quarto-section-identifi
<figcaption><em>DALL·E 3 Prompt: Create a rectangular illustration visualizing the concept of data engineering. Include elements such as raw data sources, data processing pipelines, storage systems, and refined datasets. Show how raw data is transformed through cleaning, processing, and storage to become valuable information that can be analyzed and used for decision-making.</em></figcaption>
</figure>
</div>
<p>Data is the lifeblood of AI systems. Without good data, even the most advanced machine-learning algorithms will not succeed. This section will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering involves collecting, storing, processing, and managing data to train machine learning models.</p>
<p>Data is the lifeblood of AI systems. Without good data, even the most advanced machine-learning algorithms will not succeed. However, TinyML models operate on devices with limited processing power and memory. This section explores the intricacies of building high-quality datasets to fuel our AI models. Data engineering involves collecting, storing, processing, and managing data to train machine learning models.</p>
<div class="callout callout-style-default callout-tip callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
Expand All @@ -583,10 +583,10 @@ <h1 class="title"><span id="sec-data_engineering" class="quarto-section-identifi
</div>
<section id="introduction" class="level2" data-number="5.1">
<h2 data-number="5.1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">5.1</span> Introduction</h2>
<p>Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy<span class="math inline">\(^{1}\)</span>, aggregation, and reducing detail provide alternatives to balance privacy and utility but have downsides. Creators must strike a thoughtful balance based on the use case.</p>
<p>Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases. It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can only work when certain intersections are present. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but it lacks enough cases to capture older women with a specific condition. Such <a href="https://blog.google/technology/health/healthcare-ai-systems-put-people-center/">higher-order gaps</a> are not immediately obvious but can critically impact model performance.</p>
<p>Creating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling of overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted but surmountable with thoughtful effort.</p>
<p>We begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, and utilizing sensors and IoT devices to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the tradeoffs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases need extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it has its own advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!</p>
<p>Imagine a world where AI can diagnose diseases with unprecedented accuracy, but only if the data used to train it is unbiased and reliable. This is where data engineering comes in. While over 90% of the world’s data has been created in the past two decades, this vast amount of information is only helpful for building effective AI models with proper processing and preparation. Data engineering bridges this gap by transforming raw data into a high-quality format that fuels AI innovation. In today’s data-driven world, protecting user privacy is paramount. Whether mandated by law or driven by user concerns, anonymization techniques like differential privacy and aggregation are vital in mitigating privacy risks. However, careful implementation is crucial to ensure these methods don’t compromise data utility. Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy<span class="math inline">\(^{1}\)</span>, aggregation, and reducing detail provide alternatives to balance privacy and utility but have downsides. Creators must strike a thoughtful balance based on the use case.</p>
<p>While privacy is paramount, ensuring fair and robust AI models requires addressing representation gaps in the data. It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. These combinations, sometimes called higher-order gaps, can significantly impact model performance. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but it lacks enough cases to capture older women with a specific condition. Such <a href="https://blog.google/technology/health/healthcare-ai-systems-put-people-center/">higher-order gaps</a> are not immediately obvious but can critically impact model performance.</p>
<p>Creating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Elusive perfect solutions necessitate conscientious data engineering practices like anonymization, aggregation, under-sampling of overrepresented groups, and synthesized data generation to balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted but surmountable with thoughtful effort.</p>
<p>We begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, and utilizing sensors and IoT devices to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases need extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative with advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!</p>
</section>
<section id="problem-definition" class="level2 page-columns page-full" data-number="5.2">
<h2 data-number="5.2" class="anchored" data-anchor-id="problem-definition"><span class="header-section-number">5.2</span> Problem Definition</h2>
Expand Down
8 changes: 4 additions & 4 deletions docs/contents/frameworks/frameworks.html
Original file line number Diff line number Diff line change
Expand Up @@ -1409,7 +1409,7 @@ <h3 data-number="6.8.3" class="anchored" data-anchor-id="library"><span class="h
<section id="choosing-the-right-framework" class="level2" data-number="6.9">
<h2 data-number="6.9" class="anchored" data-anchor-id="choosing-the-right-framework"><span class="header-section-number">6.9</span> Choosing the Right Framework</h2>
<p>Choosing the right machine learning framework for a given application requires carefully evaluating models, hardware, and software considerations. By analyzing these three aspects—models, hardware, and software—ML engineers can select the optimal framework and customize it as needed for efficient and performant on-device ML applications. The goal is to balance model complexity, hardware limitations, and software integration to design a tailored ML pipeline for embedded and edge devices.</p>
<div id="fig-tf-comparison" class="quarto-figure quarto-figure-center quarto-float anchored" data-align="center" data-caption="TensorFlow Framework Comparison - General">
<div id="fig-tf-comparison" class="quarto-figure quarto-figure-center quarto-float anchored" data-caption="TensorFlow Framework Comparison - General" data-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-tf-comparison-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="images/png/image4.png" style="width:100.0%" data-align="center" data-caption="TensorFlow Framework Comparison - General" class="figure-img">
Expand All @@ -1425,7 +1425,7 @@ <h3 data-number="6.9.1" class="anchored" data-anchor-id="model"><span class="hea
</section>
<section id="software" class="level3" data-number="6.9.2">
<h3 data-number="6.9.2" class="anchored" data-anchor-id="software"><span class="header-section-number">6.9.2</span> Software</h3>
<div id="fig-tf-sw-comparison" class="quarto-figure quarto-figure-center quarto-float anchored" data-align="center" data-caption="TensorFlow Framework Comparison - Model">
<div id="fig-tf-sw-comparison" class="quarto-figure quarto-figure-center quarto-float anchored" data-caption="TensorFlow Framework Comparison - Model" data-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-tf-sw-comparison-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="images/png/image5.png" style="width:100.0%" data-align="center" data-caption="TensorFlow Framework Comparison - Model" class="figure-img">
Expand All @@ -1439,7 +1439,7 @@ <h3 data-number="6.9.2" class="anchored" data-anchor-id="software"><span class="
</section>
<section id="hardware" class="level3" data-number="6.9.3">
<h3 data-number="6.9.3" class="anchored" data-anchor-id="hardware"><span class="header-section-number">6.9.3</span> Hardware</h3>
<div id="fig-tf-hw-comparison" class="quarto-figure quarto-figure-center quarto-float anchored" data-align="center" data-caption="TensorFlow Framework Comparison - Hardware">
<div id="fig-tf-hw-comparison" class="quarto-figure quarto-figure-center quarto-float anchored" data-caption="TensorFlow Framework Comparison - Hardware" data-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-tf-hw-comparison-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="images/png/image3.png" style="width:100.0%" data-align="center" data-caption="TensorFlow Framework Comparison - Hardware" class="figure-img">
Expand Down Expand Up @@ -1486,7 +1486,7 @@ <h2 data-number="6.10" class="anchored" data-anchor-id="future-trends-in-ml-fram
<section id="decomposition" class="level3" data-number="6.10.1">
<h3 data-number="6.10.1" class="anchored" data-anchor-id="decomposition"><span class="header-section-number">6.10.1</span> Decomposition</h3>
<p>Currently, the ML system stack consists of four abstractions as shown in <a href="#fig-mlsys-stack" class="quarto-xref">Figure&nbsp;<span>6.10</span></a>, namely (1) computational graphs, (2) tensor programs, (3) libraries and runtimes, and (4) hardware primitives.</p>
<div id="fig-mlsys-stack" class="quarto-figure quarto-figure-center quarto-float anchored" data-align="center" data-caption="Four Abstractions in Current ML System Stack">
<div id="fig-mlsys-stack" class="quarto-figure quarto-figure-center quarto-float anchored" data-caption="Four Abstractions in Current ML System Stack" data-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-mlsys-stack-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="images/png/image8.png" class="img-fluid figure-img" data-align="center" data-caption="Four Abstractions in Current ML System Stack">
Expand Down
Loading

0 comments on commit 8b94a6a

Please sign in to comment.