Commit

Updated figures and figure captions in data_engineering.qmd
mpstewart1 committed Oct 10, 2023
1 parent 4ba9f2a commit cd2b6d1
Showing 1 changed file with 6 additions and 9 deletions.
15 changes: 6 additions & 9 deletions data_engineering.qmd
@@ -32,7 +32,7 @@ Take, for example, Keyword Spotting (KWS). KWS serves as a prime example of Tiny

It is important to appreciate that these keyword spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as the breaking of glass. This evolution is geared towards creating intelligent devices capable of understanding and responding to a myriad of vocal commands, heralding a future where even household appliances can be controlled through voice interactions.

-![The seamless integration of Keyword Spotting technology allows users to command their devices with simple voice prompts, even in ambient noise environments](images/data_engineering_kws.png)
+![The seamless integration of Keyword Spotting technology allows users to command their devices with simple voice prompts, even in ambient noise environments.](images/data_engineering_kws.png)

Building a reliable KWS model is not a straightforward task. It demands a deep understanding of the deployment scenario, encompassing where and how these devices will operate. For instance, a KWS model's effectiveness is not just about recognizing a word; it's about discerning it among various accents and background noises, whether in a bustling cafe or amid the blaring sound of a television in a living room or a kitchen where these devices are commonly found. It's about ensuring that a whispered "Alexa" in the dead of night or a shouted "Ok Google" in a noisy marketplace are both recognized with equal precision.

@@ -196,10 +196,7 @@ The stored data is often accompanied by metadata, which is defined as 'data abou

Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.

-![Data Governance](images/data_engineering_governance.png)
-
-Figure source:
-[[https://www.databricks.com/discover/data-governance]{.underline}](https://www.databricks.com/discover/data-governance)
+![Comprehensive overview of the data governance framework.](https://www.databricks.com/en-website-assets/static/b9963e8f428f6bb9e0d3fc6f7b8b9453/c742b/key-elements-of-data-governance.webp)

Some examples of data governance across different sectors include:

@@ -241,7 +238,7 @@ Data often comes from diverse sources and can be unstructured or semi-structured

Data validation serves a broader role than simply enforcing standards, such as preventing temperature values from falling below absolute zero. These issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. It is therefore imperative to catch data errors early, before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
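
A minimal sketch of such validation checks, assuming a plain list of Celsius sensor readings with `None` marking missing values (the function name and thresholds here are illustrative, not part of any particular library):

```python
ABSOLUTE_ZERO_C = -273.15  # physical lower bound for Celsius readings

def validate_and_impute(readings):
    """Flag physically impossible values, then fill gaps with the mean."""
    # Treat out-of-range transients as missing rather than letting them
    # propagate through the rest of the pipeline.
    cleaned = [r if r is not None and r > ABSOLUTE_ZERO_C else None
               for r in readings]
    valid = [r for r in cleaned if r is not None]
    mean = sum(valid) / len(valid)  # mean imputation from the valid subset
    return [r if r is not None else mean for r in cleaned]

# A -400 °C transient and a dropped reading are both replaced by the mean.
print(validate_and_impute([21.5, -400.0, None, 22.5]))  # → [21.5, 22.0, 22.0, 22.5]
```

In a production pipeline the same idea would typically run as a schema or range check at ingestion time, so bad readings never reach training data at all.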

-![A detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training](images/data_engineering_kws2.png)
+![A detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training.](images/data_engineering_kws2.png)

Let’s take a look at an example of a data processing pipeline. In the context of TinyML, the Multilingual Spoken Words Corpus (MSWC) illustrates such a pipeline: a systematic, automated workflow for data transformation, storage, and processing. By streamlining the flow from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and growing collection of audio recordings of spoken words in 50 different languages, which are collectively spoken by over 5 billion people. The dataset is intended for academic study and commercial use in areas like keyword spotting and speech-based search, and it is openly licensed under Creative Commons Attribution 4.0 for broad usage.
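
The keyword-extraction stage of such a pipeline can be sketched as follows. This is a toy illustration, not the actual MSWC code: it assumes a forced aligner has already produced word-level `(word, start, end)` timestamps, and the `extract_keyword_clips` helper is hypothetical.

```python
def extract_keyword_clips(samples, alignment, sample_rate, keywords):
    """Slice per-word audio clips out of an utterance using alignment timestamps."""
    clips = {}
    for word, start_s, end_s in alignment:  # forced-aligner word boundaries (seconds)
        if word in keywords:
            start = int(start_s * sample_rate)
            end = int(end_s * sample_rate)
            clips.setdefault(word, []).append(samples[start:end])
    return clips

# One fake 1-second utterance at 16 kHz with a word-level alignment.
sr = 16000
utterance = list(range(sr))
alignment = [("hey", 0.0, 0.3), ("device", 0.3, 0.8)]
clips = extract_keyword_clips(utterance, alignment, sr, {"hey"})
print(len(clips["hey"][0]))  # 0.3 s at 16 kHz → 4800 samples
```

Running this stage over millions of crowdsourced utterances is what turns raw audio and transcripts into a keyword-spotting training set.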

@@ -259,7 +256,7 @@ Data labeling is an important step in creating high-quality training datasets fo
**Label Types**
Labels capture information about key tasks or concepts. Common label types include binary classification, bounding boxes, segmentation masks, transcripts, captions, etc. The choice of label format depends on the use case and resource constraints, as more detailed labels require greater effort to collect (@Johnson-Roberson_Barto_Mehta_Sridhar_Rosaen_Vasudevan_2017).
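
For illustration, the same image of a car could carry any of these label formats, in rough order of annotation effort (all values here are made up):

```python
# Classification: cheapest label — one value for the whole image.
binary_label = 1  # 1 = "car present", 0 = "no car"

# Bounding box: localizes the object, in pixel coordinates.
bounding_box = {"label": "car", "xmin": 48, "ymin": 30, "xmax": 112, "ymax": 80}

# Segmentation mask: most detailed — a per-pixel label (1 = car, 0 = background).
segmentation_mask = [[0, 0, 1, 1],
                     [0, 1, 1, 1]]

# Caption: free-text description, useful for multimodal tasks.
caption = "A red car parked on the street."

print(bounding_box["xmax"] - bounding_box["xmin"])  # box width: 64 pixels
```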

-![](images/CS249r_Labels.png)
+![An overview of common label types.](images/CS249r_Labels.png)

Unless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels, given their unique resource constraints. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for in the future, such as pedestrian detection.

@@ -275,7 +272,7 @@ After deciding on their labels' desired content and format, creators begin the a
**Ensuring Label Quality:**
There is no guarantee that the data labels are actually correct. It is possible that despite the best instructions being given to labelers, they still mislabel some images (@Northcutt_Athalye_Mueller_2021). Strategies like quality checks, training annotators, and collecting multiple labels per datapoint can help ensure label quality. For ambiguous tasks, multiple annotators can help identify controversial datapoints and quantify disagreement levels.
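
Aggregating multiple labels per datapoint can be sketched with simple majority voting plus a disagreement score (the helper name and score definition are illustrative; real projects often use richer measures such as Krippendorff's alpha):

```python
from collections import Counter

def aggregate_labels(annotations):
    """Majority-vote label plus a simple disagreement score per datapoint."""
    results = []
    for labels in annotations:  # one list of annotator labels per datapoint
        counts = Counter(labels)
        label, votes = counts.most_common(1)[0]
        disagreement = 1 - votes / len(labels)  # 0.0 = unanimous agreement
        results.append((label, disagreement))
    return results

# Three annotators label two images; the second datapoint is controversial.
print(aggregate_labels([["cat", "cat", "cat"], ["cat", "dog", "cat"]]))
```

Datapoints with high disagreement scores are good candidates for review, relabeling, or exclusion.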

-![](images/Hard_Labeling_cases.png)
+![Some examples of hard labeling cases.](https://raw.githubusercontent.com/cleanlab/assets/master/cleanlab/label-errors-examples.png)

When working with human annotators, it is important to offer fair compensation and otherwise prioritize ethical treatment, as annotators can be exploited or otherwise harmed during the labeling process (Perrigo, 2023). For example, if a dataset is likely to contain disturbing content, annotators may benefit from having the option to view images in grayscale (@Google).

@@ -292,7 +289,7 @@ Here are some examples of how AI-assisted annotation has been proposed to be use
- **Self-driving cars:** AI-assisted annotation is being used to label images and videos from self-driving cars. This can help to train AI models to identify objects on the road, such as other vehicles, pedestrians, and traffic signs.
- **Social media:** AI-assisted annotation is being used to label social media posts, such as images and videos. This can help to train AI models to identify and classify different types of content, such as news, advertising, and personal posts.
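
A common pattern behind these applications is model pre-labeling with a human-review queue: the model's confident predictions are accepted as draft labels, and only uncertain items go to annotators. A minimal sketch (the threshold, function name, and data format are illustrative):

```python
def triage_for_review(predictions, confidence_threshold=0.9):
    """Auto-accept confident model labels; route the rest to human annotators."""
    auto_labeled, needs_review = [], []
    for item_id, label, confidence in predictions:
        if confidence >= confidence_threshold:
            auto_labeled.append((item_id, label))  # model label kept as a draft
        else:
            needs_review.append(item_id)           # a human annotator verifies
    return auto_labeled, needs_review

preds = [("img1", "pedestrian", 0.97), ("img2", "vehicle", 0.55)]
auto, review = triage_for_review(preds)
print(auto, review)  # [('img1', 'pedestrian')] ['img2']
```

This concentrates scarce human effort on exactly the datapoints where the model is least reliable.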

-![](images/AI_Labeling_approaches.png)
+![Strategies for acquiring additional labeled training data in machine learning.](https://dawn.cs.stanford.edu/assets/img/2017-07-16-weak-supervision/WS_mapping.png)

## Data Version Control

