From 6898a5de0842ff90a2c0b9aa8d6353a672ec14d7 Mon Sep 17 00:00:00 2001 From: mrdragonbear Date: Tue, 10 Oct 2023 16:27:39 -0400 Subject: [PATCH] Built site for gh-pages --- .DS_Store | Bin 6148 -> 6148 bytes .nojekyll | 2 +- contributors.html | 18 +++++++++--------- data_engineering.html | 22 +++++++++++----------- references.html | 3 ++- search.json | 14 +++++++------- 6 files changed, 30 insertions(+), 29 deletions(-) diff --git a/.DS_Store b/.DS_Store index b74dcb9dc75ceccf02fdd2e610ba4f121c8cfdad..bc57f6bd468b13f9c19329000e59ea2effeb7ba7 100644 GIT binary patch delta 137 zcmZoMXffE}!NSUWq&(^CdK_ASw|fw z2e67w?qlU*1q)1`%j(S(>^9ksO>7by5382oN1$dwHdQ!FlD{CsFgQ6sw}1fzcpq#m MT*A1So#QV*0E!bY^8f$< delta 137 zcmZoMXffE}!NTgv_FVDP7by59{l+KIzYs1=&>LEJ^-?48!2${M-Tt5a2zq Nv2Y3FW_FIh`~VK=F7W^W diff --git a/.nojekyll b/.nojekyll index 2435800d..dc5c4e1d 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -1dc291bb \ No newline at end of file +ba10447e \ No newline at end of file diff --git a/contributors.html b/contributors.html index d4e2aaaf..4a074f77 100644 --- a/contributors.html +++ b/contributors.html @@ -426,33 +426,33 @@

Contributors

-sjohri20
sjohri20

📖 +oishib
oishib

📖 -ishapira
ishapira

📖 +Marcelo Rovai
Marcelo Rovai

📖 -Jessica Quaye
Jessica Quaye

📖 +Ikechukwu Uchendu
Ikechukwu Uchendu

📖 -Matthew Stewart
Matthew Stewart

📖 +sjohri20
sjohri20

📖 -Ikechukwu Uchendu
Ikechukwu Uchendu

📖 +Jessica Quaye
Jessica Quaye

📖 -Vijay Janapa Reddi
Vijay Janapa Reddi

📖 +Matthew Stewart
Matthew Stewart

📖 -oishib
oishib

📖 +Vijay Janapa Reddi
Vijay Janapa Reddi

📖 -Shvetank Prakash
Shvetank Prakash

📖 +ishapira
ishapira

📖 -Marcelo Rovai
Marcelo Rovai

📖 +Shvetank Prakash
Shvetank Prakash

📖 diff --git a/data_engineering.html b/data_engineering.html index e72d7b23..eca7432e 100644 --- a/data_engineering.html +++ b/data_engineering.html @@ -482,7 +482,7 @@

6  6.1 Introduction

Data is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.

Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\(^{1}\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.

-

Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.[^2] It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.

+

Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases. It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.

Creating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.

We begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases don’t have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!

@@ -645,7 +645,7 @@

The stored data is often accompanied by metadata, which is defined as ‘data about data’. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results.

-

Data Governance[^1]: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.

+

Data Governance: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.

Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.

@@ -663,13 +663,13 @@

Special data storage considerations for tinyML

Efficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:

    -
  • Extracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)[^2] are commonly used to represent important audio characteristics.

  • +
  • Extracting acoustic features - Mel-frequency cepstral coefficients (MFCCs) are commonly used to represent important audio characteristics.

  • Creating Embeddings- Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that’s more manageable and efficient for computation and storage.

  • -
  • Vector quantization[^3] - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.

  • +
  • Vector quantization - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.

  • Sequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.

This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.

-

Selective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2018[^4]), the authors discuss adaptation of the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications.

+

Selective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2018), the authors discuss adaptation of the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications.

6.5 Data Processing

@@ -687,7 +687,7 @@

Encoding categorical variables
  • Using techniques like dimensionality reduction
  • -

    Data validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation[^3], contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.

    +

    Data validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.

    @@ -723,7 +723,7 @@

    When working with human annotators, it is important to offer fair compensation and otherwise prioritize ethical treatment, as annotators can be exploited or otherwise harmed during the labeling process (Perrigo, 2023). For example, if a dataset is likely to contain disturbing content, annotators may benefit from having the option to view images in grayscale (Google (n.d.)).

    -Google. n.d. Google. Google. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/. +Google. n.d. “Information Quality & Content Moderation.” https://blog.google/documents/83/.

    AI-Assisted Annotation: ML has an insatiable demand for data. Therefore, no amount of data is sufficient data. This raises the question of how we can get more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. This can be done in various ways, such as the following:

    • Pre-annotation: AI models can generate preliminary labels for a dataset using methods such as semi-supervised learning (Chapelle, Scholkopf, and Zien (2009)), which humans can then review and correct. This can save a significant amount of time, especially for large datasets.
    • @@ -798,8 +798,8 @@

      6.10 Licensing

      Many high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.

      -

      For instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020[^6]). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.

      -

      The legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.

      +

      For instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.

      +

      The legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.

      New Data Regulations and Their Implications

      New data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU’s Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:

        @@ -809,10 +809,10 @@

        Challenges in Assembling ML Training Datasets

        -

        Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing[^7] or public-private data collaborations could greatly accelerate industry progress and ethical standards.

        +

        Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing or public-private data collaborations could greatly accelerate industry progress and ethical standards.

        In some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may have names, contact details, and other identifying data that may need to be removed from the dataset, this is well after the dataset has already been actively sourced and used for training models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.

        Data collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.

        -

        Having the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate[^8],[^9],[^10] its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.

        +

        Having the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.

        Dataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering.

    diff --git a/references.html b/references.html index 16c1c771..2b718b88 100644 --- a/references.html +++ b/references.html @@ -493,7 +493,8 @@

    References

    the ACM 63 (11): 139–44.

    -Google. n.d. Google. Google. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/. +Google. n.d. “Information Quality & Content +Moderation.” https://blog.google/documents/83/.
    Han, Song, Huizi Mao, and William J. Dally. 2016. “Deep diff --git a/search.json b/search.json index 0b28606b..5acdbe70 100644 --- a/search.json +++ b/search.json @@ -25,7 +25,7 @@ "href": "contributors.html", "title": "Contributors", "section": "", - "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, and time to enhance both the content and codebase of this project. Below you will find a list of all contributors. If you would like to contribute to this project, please see our GitHub page.\n\n\n\n\n\n\n\n\nsjohri20📖\n\n\nishapira📖\n\n\nJessica Quaye📖\n\n\nMatthew Stewart📖\n\n\nIkechukwu Uchendu📖\n\n\nVijay Janapa Reddi📖\n\n\noishib📖\n\n\n\n\nShvetank Prakash📖\n\n\nMarcelo Rovai📖" + "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, and time to enhance both the content and codebase of this project. Below you will find a list of all contributors. If you would like to contribute to this project, please see our GitHub page.\n\n\n\n\n\n\n\n\noishib📖\n\n\nMarcelo Rovai📖\n\n\nIkechukwu Uchendu📖\n\n\nsjohri20📖\n\n\nJessica Quaye📖\n\n\nMatthew Stewart📖\n\n\nVijay Janapa Reddi📖\n\n\n\n\nishapira📖\n\n\nShvetank Prakash📖" }, { "objectID": "copyright.html", @@ -284,7 +284,7 @@ "href": "data_engineering.html#introduction", "title": "6  Data Engineering", "section": "6.1 Introduction", - "text": "6.1 Introduction\nData is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.\nDataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\\(^{1}\\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.\nLooking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.[^2] It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.\nCreating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.\nWe begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases don’t have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!" + "text": "6.1 Introduction\nData is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.\nDataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\\(^{1}\\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.\nLooking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases. It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.\nCreating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.\nWe begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases don’t have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!" }, { "objectID": "data_engineering.html#problem-definition", @@ -305,21 +305,21 @@ "href": "data_engineering.html#data-storage", "title": "6  Data Engineering", "section": "6.4 Data Storage", - "text": "6.4 Data Storage\nData sourcing and data storage go hand-in-hand and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.\n\n\n\n\n\n\n\n\n\n\nDatabase\nData Warehouse\nData Lake\n\n\n\n\nPurpose\nOperational and transactional\nAnalytical\nAnalytical\n\n\nData type\nStructured\nStructured\nStructured, semi-structured and/or unstructured\n\n\nScale\nSmall to large volumes of data\nLarge volumes of integrated data\nLarge volumes of diverse data\n\n\nExamples\nMySQL\nGoogle BigQuery, Amazon Redshift, Microsoft Azure Synapse.\nGoogle Cloud Storage, AWS S3, Azure Data Lake Storage\n\n\n\nThe stored data is often accompanied by metadata, which is defined as ‘data about data’. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results.\nData Governance[^1]: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.\nData governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.\n\n\n\nData Governance\n\n\nFigure source: https://www.databricks.com/discover/data-governance\nSome examples of data governance across different sectors include:\n\nMedicine: Health Information Exchanges(HIEs) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information.\nFinance: Basel III Framework is an international regulatory framework for banks. It ensures that banks establish clear policies, practices, and responsibilities for data management, ensuring data accuracy, completeness, and timeliness. Not only does it enable banks to meet regulatory compliance, it also prevents financial crises by more effective management of risks.\nGovernment: Governments agencies managing citizen data, public records, and administrative information implement data governance to manage data transparently and securely. Social Security System in the US, and Aadhar system in India are good examples of such governance systems.\n\nSpecial data storage considerations for tinyML\nEfficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:\n\nExtracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)[^2] are commonly used to represent important audio characteristics.\nCreating Embeddings- Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that’s more manageable and efficient for computation and storage.\nVector quantization[^3] - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.\nSequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.\n\nThis format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.\nSelective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2018[^4]), the authors discuss adaptation of the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications." + "text": "6.4 Data Storage\nData sourcing and data storage go hand-in-hand and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.\n\n\n\n\n\n\n\n\n\n\nDatabase\nData Warehouse\nData Lake\n\n\n\n\nPurpose\nOperational and transactional\nAnalytical\nAnalytical\n\n\nData type\nStructured\nStructured\nStructured, semi-structured and/or unstructured\n\n\nScale\nSmall to large volumes of data\nLarge volumes of integrated data\nLarge volumes of diverse data\n\n\nExamples\nMySQL\nGoogle BigQuery, Amazon Redshift, Microsoft Azure Synapse.\nGoogle Cloud Storage, AWS S3, Azure Data Lake Storage\n\n\n\nThe stored data is often accompanied by metadata, which is defined as ‘data about data’. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results.\nData Governance: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.\nData governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.\n\n\n\nData Governance\n\n\nFigure source: https://www.databricks.com/discover/data-governance\nSome examples of data governance across different sectors include:\n\nMedicine: Health Information Exchanges(HIEs) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information.\nFinance: Basel III Framework is an international regulatory framework for banks. It ensures that banks establish clear policies, practices, and responsibilities for data management, ensuring data accuracy, completeness, and timeliness. Not only does it enable banks to meet regulatory compliance, it also prevents financial crises by more effective management of risks.\nGovernment: Governments agencies managing citizen data, public records, and administrative information implement data governance to manage data transparently and securely. Social Security System in the US, and Aadhar system in India are good examples of such governance systems.\n\nSpecial data storage considerations for tinyML\nEfficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:\n\nExtracting acoustic features - Mel-frequency cepstral coefficients (MFCCs) are commonly used to represent important audio characteristics.\nCreating Embeddings- Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that’s more manageable and efficient for computation and storage.\nVector quantization - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.\nSequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.\n\nThis format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.\nSelective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2018), the authors discuss adaptation of the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications." }, { "objectID": "data_engineering.html#data-processing", "href": "data_engineering.html#data-processing", "title": "6  Data Engineering", "section": "6.5 Data Processing", - "text": "6.5 Data Processing\nData processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. “Data preparation accounts for about 60-80% of the work of a data scientist.”\n\n\n\nA breakdown of tasks that data scientists allocate their time to, highlighting the significant portion spent on data cleaning and organizing.\n\n\nProper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty - it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.\nData often comes from diverse sources and can be unstructured or semi-structured. Thus, it’s essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include:\n\nNormalizing numerical variables\nEncoding categorical variables\nUsing techniques like dimensionality reduction\n\nData validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation[^3], contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.\n\n\n\nA detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training\n\n\nLet’s take a look at an example of a data processing pipeline. In the context of tinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. By streamlining the data flow, from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.\nThe MSWC used a forced alignment method to automatically extract individual word recordings to train keyword-spotting models from the Common Voice project, which features crowdsourced sentence-level recordings. Forced alignment refers to a group of long-standing methods in speech processing that are used to predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowd-sourced recordings often feature background noises, such as static and wind. Depending on the model’s requirements, these noises can be removed or intentionally retained.\nMaintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.\nThere is a boom of data processing pipelines, these are commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud. It provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management, and there are several mechanisms that specifically focus on data processing which is an integral part of these systems." + "text": "6.5 Data Processing\nData processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. “Data preparation accounts for about 60-80% of the work of a data scientist.”\n\n\n\nA breakdown of tasks that data scientists allocate their time to, highlighting the significant portion spent on data cleaning and organizing.\n\n\nProper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty - it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.\nData often comes from diverse sources and can be unstructured or semi-structured. Thus, it’s essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include:\n\nNormalizing numerical variables\nEncoding categorical variables\nUsing techniques like dimensionality reduction\n\nData validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.\n\n\n\nA detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training\n\n\nLet’s take a look at an example of a data processing pipeline. In the context of tinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. By streamlining the data flow, from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.\nThe MSWC used a forced alignment method to automatically extract individual word recordings to train keyword-spotting models from the Common Voice project, which features crowdsourced sentence-level recordings. Forced alignment refers to a group of long-standing methods in speech processing that are used to predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowd-sourced recordings often feature background noises, such as static and wind. Depending on the model’s requirements, these noises can be removed or intentionally retained.\nMaintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.\nThere is a boom of data processing pipelines, these are commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud. It provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management, and there are several mechanisms that specifically focus on data processing which is an integral part of these systems." }, { "objectID": "data_engineering.html#data-labeling", "href": "data_engineering.html#data-labeling", "title": "6  Data Engineering", "section": "6.6 Data Labeling", - "text": "6.6 Data Labeling\nData labeling is an important step in creating high-quality training datasets for machine learning models. Labels provide the ground truth information that allows models to learn relationships between inputs and desired outputs. This section covers key considerations around selecting label types, formats, and content to capture the necessary information for given tasks. It discusses common annotation approaches, from manual labeling to crowdsourcing to AI-assisted methods, and best practices for ensuring label quality through training, guidelines, and quality checks. Ethical treatment of human annotators is also something we emphasize. The integration of AI to accelerate and augment human annotation is also explored. Understanding labeling needs, challenges, and strategies is essential for constructing reliable, useful datasets that can train performant, trustworthy machine learning systems.\nLabel Types Labels capture information about key tasks or concepts. Common label types include binary classification, bounding boxes, segmentation masks, transcripts, captions, etc. The choice of label format depends on the use case and resource constraints, as more detailed labels require greater effort to collect (Johnson-Roberson et al. (2017)).\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. 2017. “Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?” 2017 IEEE International Conference on Robotics and Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\nUnless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels, given their unique resource constraints. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for in the future, such as pedestrian detection.\nAdditionally, annotators can potentially provide metadata that provides insight into how the dataset represents different characteristics of interest (see: Data Transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented (Ardila et al. (2020)). They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see the breakdown of who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. Additionally, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages.\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. “Common Voice: A Massively-Multilingual Speech Corpus.” Proceedings of the 12th Conference on Language Resources and Evaluation, May, 4218–22.\nNext, creators must determine the format of those labels. For example, a creator interested in car detection might choose between binary classification labels that say whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format may depend both on their use case and their resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire.\nAnnotation Methods: Common annotation approaches include manual labeling, crowdsourcing, and semi-automated techniques. Manual labeling by experts yields high quality but lacks scalability. Crowdsourcing enables distributed annotation by non-experts, often through dedicated platforms (Sheng and Zhang (2019)). Weakly supervised and programmatic methods can reduce manual effort by heuristically or automatically generating labels (Ratner et al. (2018)).\n\nSheng, Victor S., and Jing Zhang. 2019. “Machine Learning with Crowdsourcing: A Brief Summary of the Past Research and Future Directions.” Proceedings of the AAAI Conference on Artificial Intelligence 33 (01): 9837–43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and Christopher RĂ©. 2018. “Snorkel Metal: Weak Supervision for Multi-Task Learning.” Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\nAfter deciding on their labels’ desired content and format, creators begin the annotation process. To collect large numbers of labels from human annotators, creators frequently rely on dedicated annotation platforms, which can connect them to teams of human annotators. When using these platforms, creators may have little insight to annotators’ backgrounds and levels of experience with topics of interest. However, some platforms offer access to annotators with specific expertise (e.g. doctors).\nEnsuring Label Quality: There is no guarantee that the data labels are actually correct. It is possible that despite the best instructions being given to labelers, they still mislabel some images (Northcutt, Athalye, and Mueller (2021)). Strategies like quality checks, training annotators, and collecting multiple labels per datapoint can help ensure label quality. For ambiguous tasks, multiple annotators can help identify controversial datapoints and quantify disagreement levels.\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021. “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks.” arXiv, March. https://doi.org/  https://doi.org/10.48550/arXiv.2103.14749 arXiv-issued DOI via DataCite.\n\nWhen working with human annotators, it is important to offer fair compensation and otherwise prioritize ethical treatment, as annotators can be exploited or otherwise harmed during the labeling process (Perrigo, 2023). For example, if a dataset is likely to contain disturbing content, annotators may benefit from having the option to view images in grayscale (Google (n.d.)).\n\nGoogle. n.d. Google. Google. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/.\nAI-Assisted Annotation: ML has an insatiable demand for data. Therefore, no amount of data is sufficient data. This raises the question of how we can get more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. This can be done in various ways, such as the following:\n\nPre-annotation: AI models can generate preliminary labels for a dataset using methods such as semi-supervised learning (Chapelle, Scholkopf, and Zien (2009)), which humans can then review and correct. This can save a significant amount of time, especially for large datasets.\nActive learning: AI models can identify the most informative data points in a dataset, which can then be prioritized for human annotation. This can help improve the labeled dataset’s quality while reducing the overall annotation time.\nQuality control: AI models can be used to identify and flag potential errors in human annotations. This can help to ensure the accuracy and consistency of the labeled dataset.\n\n\nChapelle, O., B. Scholkopf, and A. Zien Eds. 2009. “Semi-Supervised Learning (Chapelle, o. Et Al., Eds.; 2006) [Book Reviews].” IEEE Transactions on Neural Networks 20 (3): 542–42. https://doi.org/10.1109/tnn.2009.2015974.\nHere are some examples of how AI-assisted annotation has been proposed to be useful:\n\nMedical imaging: AI-assisted annotation is being used to label medical images, such as MRI scans and X-rays (Krishnan, Rajpurkar, and Topol (2022)). Carefully annotating medical datasets is extremely challenging, especially at scale, since domain experts are both scarce and it becomes a costly effort. This can help to train AI models to diagnose diseases and other medical conditions more accurately and efficiently.\n\nSelf-driving cars: AI-assisted annotation is being used to label images and videos from self-driving cars. This can help to train AI models to identify objects on the road, such as other vehicles, pedestrians, and traffic signs.\nSocial media: AI-assisted annotation is being used to label social media posts, such as images and videos. This can help to train AI models to identify and classify different types of content, such as news, advertising, and personal posts.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022. “Self-Supervised Learning in Medicine and Healthcare.” Nature Biomedical Engineering 6 (12): 1346–52. https://doi.org/10.1038/s41551-022-00914-1." + "text": "6.6 Data Labeling\nData labeling is an important step in creating high-quality training datasets for machine learning models. Labels provide the ground truth information that allows models to learn relationships between inputs and desired outputs. This section covers key considerations around selecting label types, formats, and content to capture the necessary information for given tasks. It discusses common annotation approaches, from manual labeling to crowdsourcing to AI-assisted methods, and best practices for ensuring label quality through training, guidelines, and quality checks. Ethical treatment of human annotators is also something we emphasize. The integration of AI to accelerate and augment human annotation is also explored. Understanding labeling needs, challenges, and strategies is essential for constructing reliable, useful datasets that can train performant, trustworthy machine learning systems.\nLabel Types Labels capture information about key tasks or concepts. Common label types include binary classification, bounding boxes, segmentation masks, transcripts, captions, etc. The choice of label format depends on the use case and resource constraints, as more detailed labels require greater effort to collect (Johnson-Roberson et al. (2017)).\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. 2017. “Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?” 2017 IEEE International Conference on Robotics and Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\nUnless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels, given their unique resource constraints. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for in the future, such as pedestrian detection.\nAdditionally, annotators can potentially provide metadata that provides insight into how the dataset represents different characteristics of interest (see: Data Transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented (Ardila et al. (2020)). They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see the breakdown of who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. Additionally, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages.\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. “Common Voice: A Massively-Multilingual Speech Corpus.” Proceedings of the 12th Conference on Language Resources and Evaluation, May, 4218–22.\nNext, creators must determine the format of those labels. For example, a creator interested in car detection might choose between binary classification labels that say whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format may depend both on their use case and their resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire.\nAnnotation Methods: Common annotation approaches include manual labeling, crowdsourcing, and semi-automated techniques. Manual labeling by experts yields high quality but lacks scalability. Crowdsourcing enables distributed annotation by non-experts, often through dedicated platforms (Sheng and Zhang (2019)). Weakly supervised and programmatic methods can reduce manual effort by heuristically or automatically generating labels (Ratner et al. (2018)).\n\nSheng, Victor S., and Jing Zhang. 2019. “Machine Learning with Crowdsourcing: A Brief Summary of the Past Research and Future Directions.” Proceedings of the AAAI Conference on Artificial Intelligence 33 (01): 9837–43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and Christopher RĂ©. 2018. “Snorkel Metal: Weak Supervision for Multi-Task Learning.” Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\nAfter deciding on their labels’ desired content and format, creators begin the annotation process. To collect large numbers of labels from human annotators, creators frequently rely on dedicated annotation platforms, which can connect them to teams of human annotators. When using these platforms, creators may have little insight to annotators’ backgrounds and levels of experience with topics of interest. However, some platforms offer access to annotators with specific expertise (e.g. doctors).\nEnsuring Label Quality: There is no guarantee that the data labels are actually correct. It is possible that despite the best instructions being given to labelers, they still mislabel some images (Northcutt, Athalye, and Mueller (2021)). Strategies like quality checks, training annotators, and collecting multiple labels per datapoint can help ensure label quality. For ambiguous tasks, multiple annotators can help identify controversial datapoints and quantify disagreement levels.\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021. “Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks.” arXiv, March. https://doi.org/  https://doi.org/10.48550/arXiv.2103.14749 arXiv-issued DOI via DataCite.\n\nWhen working with human annotators, it is important to offer fair compensation and otherwise prioritize ethical treatment, as annotators can be exploited or otherwise harmed during the labeling process (Perrigo, 2023). For example, if a dataset is likely to contain disturbing content, annotators may benefit from having the option to view images in grayscale (Google (n.d.)).\n\nGoogle. n.d. “Information Quality & Content Moderation.” https://blog.google/documents/83/.\nAI-Assisted Annotation: ML has an insatiable demand for data. Therefore, no amount of data is sufficient data. This raises the question of how we can get more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. This can be done in various ways, such as the following:\n\nPre-annotation: AI models can generate preliminary labels for a dataset using methods such as semi-supervised learning (Chapelle, Scholkopf, and Zien (2009)), which humans can then review and correct. This can save a significant amount of time, especially for large datasets.\nActive learning: AI models can identify the most informative data points in a dataset, which can then be prioritized for human annotation. This can help improve the labeled dataset’s quality while reducing the overall annotation time.\nQuality control: AI models can be used to identify and flag potential errors in human annotations. This can help to ensure the accuracy and consistency of the labeled dataset.\n\n\nChapelle, O., B. Scholkopf, and A. Zien Eds. 2009. “Semi-Supervised Learning (Chapelle, o. Et Al., Eds.; 2006) [Book Reviews].” IEEE Transactions on Neural Networks 20 (3): 542–42. https://doi.org/10.1109/tnn.2009.2015974.\nHere are some examples of how AI-assisted annotation has been proposed to be useful:\n\nMedical imaging: AI-assisted annotation is being used to label medical images, such as MRI scans and X-rays (Krishnan, Rajpurkar, and Topol (2022)). Carefully annotating medical datasets is extremely challenging, especially at scale, since domain experts are both scarce and it becomes a costly effort. This can help to train AI models to diagnose diseases and other medical conditions more accurately and efficiently.\n\nSelf-driving cars: AI-assisted annotation is being used to label images and videos from self-driving cars. This can help to train AI models to identify objects on the road, such as other vehicles, pedestrians, and traffic signs.\nSocial media: AI-assisted annotation is being used to label social media posts, such as images and videos. This can help to train AI models to identify and classify different types of content, such as news, advertising, and personal posts.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022. “Self-Supervised Learning in Medicine and Healthcare.” Nature Biomedical Engineering 6 (12): 1346–52. https://doi.org/10.1038/s41551-022-00914-1." }, { "objectID": "data_engineering.html#data-version-control", @@ -347,7 +347,7 @@ "href": "data_engineering.html#licensing", "title": "6  Data Engineering", "section": "6.10 Licensing", - "text": "6.10 Licensing\nMany high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.\nFor instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020[^6]). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.\nThe legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.\nNew Data Regulations and Their Implications\nNew data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU’s Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:\n\nClassifies AI systems by risk.\nMandates development and usage prerequisites.\nEmphasizes data quality, transparency, human oversight, and accountability.\n\nAdditionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance. Key elements include the prohibition of AI systems posing \"unacceptable\" risks, stringent conditions for high-risk systems, and minimal obligations for \"limited risk\" AI systems. The proposed European AI Board will oversee and ensure efficient regulation implementation.\nChallenges in Assembling ML Training Datasets\nComplex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing[^7] or public-private data collaborations could greatly accelerate industry progress and ethical standards.\nIn some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may have names, contact details, and other identifying data that may need to be removed from the dataset, this is well after the dataset has already been actively sourced and used for training models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.\nData collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.\nHaving the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate[^8],[^9],[^10] its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.\nDataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering." + "text": "6.10 Licensing\nMany high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.\nFor instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.\nThe legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.\nNew Data Regulations and Their Implications\nNew data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU’s Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:\n\nClassifies AI systems by risk.\nMandates development and usage prerequisites.\nEmphasizes data quality, transparency, human oversight, and accountability.\n\nAdditionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance. Key elements include the prohibition of AI systems posing \"unacceptable\" risks, stringent conditions for high-risk systems, and minimal obligations for \"limited risk\" AI systems. The proposed European AI Board will oversee and ensure efficient regulation implementation.\nChallenges in Assembling ML Training Datasets\nComplex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing or public-private data collaborations could greatly accelerate industry progress and ethical standards.\nIn some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may have names, contact details, and other identifying data that may need to be removed from the dataset, this is well after the dataset has already been actively sourced and used for training models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.\nData collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.\nHaving the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.\nDataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering." }, { "objectID": "data_engineering.html#conclusion", @@ -1341,7 +1341,7 @@ "href": "references.html", "title": "References", "section": "", - "text": "Aledhari, Mohammed, Rehma Razzak, Reza M. Parizi, and Fahad Saeed. 2020.\n“Federated Learning: A Survey on Enabling Technologies, Protocols,\nand Applications.” IEEE Access 8: 140699–725. https://doi.org/10.1109/access.2020.3013541.\n\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael\nKohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers,\nand Gregor Weber. 2020. “Common Voice: A Massively-Multilingual\nSpeech Corpus.” Proceedings of the 12th Conference on\nLanguage Resources and Evaluation, May, 4218–22.\n\n\nARM.com. “The Future Is Being Built on Arm: Market Diversification\nContinues to Drive Strong Royalty and Licensing Growth as Ecosystem\nReaches Quarter of a Trillion Chips Milestone – ArmÂź.” https://www.arm.com/company/news/2023/02/arm-announces-q3-fy22-results.\n\n\nBank, Dor, Noam Koenigstein, and Raja Giryes. 2023.\n“Autoencoders.” Machine Learning for Data Science\nHandbook: Data Mining and Knowledge Discovery Handbook, 353–74.\n\n\nBarroso, Luiz AndrĂ©, Urs Hölzle, and Parthasarathy Ranganathan. 2019.\nThe Datacenter as a Computer: Designing Warehouse-Scale\nMachines. Springer Nature.\n\n\nBender, Emily M., and Batya Friedman. 2018. “Data Statements for\nNatural Language Processing: Toward Mitigating System Bias and Enabling\nBetter Science.” Transactions of the Association for\nComputational Linguistics 6: 587–604. https://doi.org/10.1162/tacl_a_00041.\n\n\nChapelle, O., B. Scholkopf, and A. Zien Eds. 2009.\n“Semi-Supervised Learning (Chapelle, o. Et Al., Eds.; 2006) [Book\nReviews].” IEEE Transactions on Neural Networks 20 (3):\n542–42. https://doi.org/10.1109/tnn.2009.2015974.\n\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman\nVaughan, Hanna Wallach, Hal DaumĂ© III, and Kate Crawford. 2021.\n“Datasheets for Datasets.” Communications of the\nACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\nGoodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David\nWarde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020.\n“Generative Adversarial Networks.” Communications of\nthe ACM 63 (11): 139–44.\n\n\nGoogle. n.d. Google. Google. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/.\n\n\nHan, Song, Huizi Mao, and William J. Dally. 2016. “Deep\nCompression: Compressing Deep Neural Networks with Pruning, Trained\nQuantization and Huffman Coding.” https://arxiv.org/abs/1510.00149.\n\n\nHe, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.\n“Deep Residual Learning for Image Recognition.” In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 770–78.\n\n\nHolland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia\nChmielinski. 2020. “The Dataset Nutrition Label.” Data\nProtection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.\n\n\nHoward, Andrew G, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun\nWang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017.\n“Mobilenets: Efficient Convolutional Neural Networks for Mobile\nVision Applications.” arXiv Preprint arXiv:1704.04861.\n\n\nIandola, Forrest N, Song Han, Matthew W Moskewicz, Khalid Ashraf,\nWilliam J Dally, and Kurt Keutzer. 2016. “SqueezeNet:\nAlexNet-Level Accuracy with 50x Fewer Parameters and< 0.5 MB Model\nSize.” arXiv Preprint arXiv:1602.07360.\n\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur\nSridhar, Karl Rosaen, and Ram Vasudevan. 2017. “Driving in the\nMatrix: Can Virtual Worlds Replace Human-Generated Annotations for Real\nWorld Tasks?” 2017 IEEE International Conference on Robotics\nand Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017. “In-Datacenter\nPerformance Analysis of a Tensor Processing Unit.” In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1–12.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022.\n“Self-Supervised Learning in Medicine and Healthcare.”\nNature Biomedical Engineering 6 (12): 1346–52. https://doi.org/10.1038/s41551-022-00914-1.\n\n\nKrizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.\n“Imagenet Classification with Deep Convolutional Neural\nNetworks.” Advances in Neural Information Processing\nSystems 25.\n\n\nLeCun, Yann, John Denker, and Sara Solla. 1989. “Optimal Brain\nDamage.” Advances in Neural Information Processing\nSystems 2.\n\n\nLi, En, Liekang Zeng, Zhi Zhou, and Xu Chen. 2019. “Edge AI:\nOn-Demand Accelerating Deep Neural Network Inference via Edge\nComputing.” IEEE Transactions on Wireless Communications\n19 (1): 447–57.\n\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021.\n“Pervasive Label Errors in Test Sets Destabilize Machine Learning\nBenchmarks.” arXiv, March. https://doi.org/ \nhttps://doi.org/10.48550/arXiv.2103.14749 arXiv-issued DOI via\nDataCite.\n\n\nPushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022.\n“Data Cards: Purposeful and Transparent Dataset Documentation for\nResponsible Ai.” 2022 ACM Conference on Fairness,\nAccountability, and Transparency. https://doi.org/10.1145/3531146.3533231.\n\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and\nChristopher RĂ©. 2018. “Snorkel Metal: Weak Supervision for\nMulti-Task Learning.” Proceedings of the Second Workshop on\nData Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\n\n\nRosenblatt, Frank. 1957. The Perceptron, a Perceiving and\nRecognizing Automaton Project Para. Cornell Aeronautical\nLaboratory.\n\n\nRumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.\n“Learning Representations by Back-Propagating Errors.”\nNature 323 (6088): 533–36.\n\n\nSheng, Victor S., and Jing Zhang. 2019. “Machine Learning with\nCrowdsourcing: A Brief Summary of the Past Research and Future\nDirections.” Proceedings of the AAAI Conference on Artificial\nIntelligence 33 (01): 9837–43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\n\nVaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\nJones, Aidan N Gomez, Ɓukasz Kaiser, and Illia Polosukhin. 2017.\n“Attention Is All You Need.” Advances in Neural\nInformation Processing Systems 30.\n\n\nWarden, Pete, and Daniel Situnayake. 2019. Tinyml: Machine Learning\nwith Tensorflow Lite on Arduino and Ultra-Low-Power\nMicrocontrollers. O’Reilly Media." + "text": "Aledhari, Mohammed, Rehma Razzak, Reza M. Parizi, and Fahad Saeed. 2020.\n“Federated Learning: A Survey on Enabling Technologies, Protocols,\nand Applications.” IEEE Access 8: 140699–725. https://doi.org/10.1109/access.2020.3013541.\n\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael\nKohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers,\nand Gregor Weber. 2020. “Common Voice: A Massively-Multilingual\nSpeech Corpus.” Proceedings of the 12th Conference on\nLanguage Resources and Evaluation, May, 4218–22.\n\n\nARM.com. “The Future Is Being Built on Arm: Market Diversification\nContinues to Drive Strong Royalty and Licensing Growth as Ecosystem\nReaches Quarter of a Trillion Chips Milestone – ArmÂź.” https://www.arm.com/company/news/2023/02/arm-announces-q3-fy22-results.\n\n\nBank, Dor, Noam Koenigstein, and Raja Giryes. 2023.\n“Autoencoders.” Machine Learning for Data Science\nHandbook: Data Mining and Knowledge Discovery Handbook, 353–74.\n\n\nBarroso, Luiz AndrĂ©, Urs Hölzle, and Parthasarathy Ranganathan. 2019.\nThe Datacenter as a Computer: Designing Warehouse-Scale\nMachines. Springer Nature.\n\n\nBender, Emily M., and Batya Friedman. 2018. “Data Statements for\nNatural Language Processing: Toward Mitigating System Bias and Enabling\nBetter Science.” Transactions of the Association for\nComputational Linguistics 6: 587–604. https://doi.org/10.1162/tacl_a_00041.\n\n\nChapelle, O., B. Scholkopf, and A. Zien Eds. 2009.\n“Semi-Supervised Learning (Chapelle, o. Et Al., Eds.; 2006) [Book\nReviews].” IEEE Transactions on Neural Networks 20 (3):\n542–42. https://doi.org/10.1109/tnn.2009.2015974.\n\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman\nVaughan, Hanna Wallach, Hal DaumĂ© III, and Kate Crawford. 2021.\n“Datasheets for Datasets.” Communications of the\nACM 64 (12): 86–92. https://doi.org/10.1145/3458723.\n\n\nGoodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David\nWarde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020.\n“Generative Adversarial Networks.” Communications of\nthe ACM 63 (11): 139–44.\n\n\nGoogle. n.d. “Information Quality & Content\nModeration.” https://blog.google/documents/83/.\n\n\nHan, Song, Huizi Mao, and William J. Dally. 2016. “Deep\nCompression: Compressing Deep Neural Networks with Pruning, Trained\nQuantization and Huffman Coding.” https://arxiv.org/abs/1510.00149.\n\n\nHe, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.\n“Deep Residual Learning for Image Recognition.” In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 770–78.\n\n\nHolland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia\nChmielinski. 2020. “The Dataset Nutrition Label.” Data\nProtection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.\n\n\nHoward, Andrew G, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun\nWang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017.\n“Mobilenets: Efficient Convolutional Neural Networks for Mobile\nVision Applications.” arXiv Preprint arXiv:1704.04861.\n\n\nIandola, Forrest N, Song Han, Matthew W Moskewicz, Khalid Ashraf,\nWilliam J Dally, and Kurt Keutzer. 2016. “SqueezeNet:\nAlexNet-Level Accuracy with 50x Fewer Parameters and< 0.5 MB Model\nSize.” arXiv Preprint arXiv:1602.07360.\n\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur\nSridhar, Karl Rosaen, and Ram Vasudevan. 2017. “Driving in the\nMatrix: Can Virtual Worlds Replace Human-Generated Annotations for Real\nWorld Tasks?” 2017 IEEE International Conference on Robotics\nand Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017. “In-Datacenter\nPerformance Analysis of a Tensor Processing Unit.” In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1–12.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022.\n“Self-Supervised Learning in Medicine and Healthcare.”\nNature Biomedical Engineering 6 (12): 1346–52. https://doi.org/10.1038/s41551-022-00914-1.\n\n\nKrizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.\n“Imagenet Classification with Deep Convolutional Neural\nNetworks.” Advances in Neural Information Processing\nSystems 25.\n\n\nLeCun, Yann, John Denker, and Sara Solla. 1989. “Optimal Brain\nDamage.” Advances in Neural Information Processing\nSystems 2.\n\n\nLi, En, Liekang Zeng, Zhi Zhou, and Xu Chen. 2019. “Edge AI:\nOn-Demand Accelerating Deep Neural Network Inference via Edge\nComputing.” IEEE Transactions on Wireless Communications\n19 (1): 447–57.\n\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021.\n“Pervasive Label Errors in Test Sets Destabilize Machine Learning\nBenchmarks.” arXiv, March. https://doi.org/ \nhttps://doi.org/10.48550/arXiv.2103.14749 arXiv-issued DOI via\nDataCite.\n\n\nPushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022.\n“Data Cards: Purposeful and Transparent Dataset Documentation for\nResponsible Ai.” 2022 ACM Conference on Fairness,\nAccountability, and Transparency. https://doi.org/10.1145/3531146.3533231.\n\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and\nChristopher RĂ©. 2018. “Snorkel Metal: Weak Supervision for\nMulti-Task Learning.” Proceedings of the Second Workshop on\nData Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\n\n\nRosenblatt, Frank. 1957. The Perceptron, a Perceiving and\nRecognizing Automaton Project Para. Cornell Aeronautical\nLaboratory.\n\n\nRumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.\n“Learning Representations by Back-Propagating Errors.”\nNature 323 (6088): 533–36.\n\n\nSheng, Victor S., and Jing Zhang. 2019. “Machine Learning with\nCrowdsourcing: A Brief Summary of the Past Research and Future\nDirections.” Proceedings of the AAAI Conference on Artificial\nIntelligence 33 (01): 9837–43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\n\nVaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\nJones, Aidan N Gomez, Ɓukasz Kaiser, and Illia Polosukhin. 2017.\n“Attention Is All You Need.” Advances in Neural\nInformation Processing Systems 30.\n\n\nWarden, Pete, and Daniel Situnayake. 2019. Tinyml: Machine Learning\nwith Tensorflow Lite on Arduino and Ultra-Low-Power\nMicrocontrollers. O’Reilly Media." }, { "objectID": "tools.html#hardware-kits",