From 4a8e1b81d1a246554e02099323e4d41d4d82ca7d Mon Sep 17 00:00:00 2001
From: mrdragonbear
Date: Tue, 10 Oct 2023 16:06:40 -0400
Subject: [PATCH] Built site for gh-pages

---
 .DS_Store             | Bin 6148 -> 6148 bytes
 .nojekyll             |   2 +-
 data_engineering.html | 100 +-----
 mlops.html            | 720 ---------------------------------------
 mlworkflow.html       | 757 ------------------------------------------
 search.json           |  15 +-
 6 files changed, 19 insertions(+), 1575 deletions(-)
 delete mode 100644 mlops.html
 delete mode 100644 mlworkflow.html

diff --git a/.DS_Store b/.DS_Store
index 630313f460c226f2e96517934e946d8fe6bd50e1..b74dcb9dc75ceccf02fdd2e610ba4f121c8cfdad 100644
GIT binary patch
(binary delta omitted)

Table of contents
  • 6.9 Data Transparency
  • 6.10 Licensing
  • 6.11 Conclusion
  • -
  • 6.12 Helpful References
  • @@ -479,11 +478,11 @@

    6  +

    6.1 Introduction

    Data is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.

    Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\(^{1}\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.
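To make the privacy-utility tradeoff concrete, here is a minimal sketch of the Laplace mechanism, one common way to release an aggregate statistic with differential privacy. The data, bounds, and epsilon value below are illustrative assumptions, not a prescription from this chapter.

```python
import numpy as np

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a statistic with epsilon-differential privacy by adding Laplace noise."""
    scale = sensitivity / epsilon          # noise grows as the privacy budget shrinks
    return true_value + np.random.laplace(loc=0.0, scale=scale)

# Hypothetical example: average age over a 1,000-record health dataset.
ages = np.random.randint(18, 90, size=1_000)          # stand-in for real, sensitive records
true_mean = ages.mean()

# Replace-one-record sensitivity of the mean of n values bounded in [18, 90].
sensitivity = (90 - 18) / len(ages)
private_mean = laplace_mechanism(true_mean, sensitivity, epsilon=0.5)

print(f"true mean: {true_mean:.2f}, privately released mean: {private_mean:.2f}")
```

Smaller epsilon values give stronger privacy but noisier statistics, which is exactly the utility tradeoff described above.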

    -

    Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.1 It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.

  • 1 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. "Mel Frequency Cepstral Coefficient and its applications: A Review." IEEE Access (2022).

  • +

    Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.[^2] It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.

    Creating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.
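One practical way to surface the higher-order representation gaps described above is to count examples per combination of attributes rather than per attribute. Below is a minimal sketch with pandas; the column names and toy records are hypothetical.

```python
import pandas as pd

# Hypothetical metadata table describing each record in a medical dataset.
df = pd.DataFrame({
    "gender":    ["F", "M", "F", "M", "F", "M", "F", "M"],
    "age_group": ["65+", "65+", "18-40", "18-40", "41-64", "41-64", "65+", "18-40"],
    "diagnosis": ["X", "X", "X", "Y", "Y", "Y", "Y", "X"],
})

# Per-attribute counts can look balanced even when intersections are missing.
print(df["gender"].value_counts())

# Counting every (gender, age_group, diagnosis) combination exposes empty or thin cells.
intersection_counts = (
    df.groupby(["gender", "age_group", "diagnosis"])
      .size()
      .unstack(fill_value=0)
)
print(intersection_counts)
```

Cells with zero or very few examples, such as elderly women with a particular diagnosis, are candidates for targeted collection, undersampling of other groups, or synthesis.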

    We begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases don’t have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!

    @@ -600,7 +599,7 @@

    While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases.
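One common sanity check along these lines is to compare each synthetic feature's distribution against its real counterpart, for example with a two-sample Kolmogorov-Smirnov test. The sketch below uses generated stand-in data and an arbitrary acceptance threshold; both are assumptions for illustration only.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(seed=0)
real      = rng.normal(loc=0.0, scale=1.0, size=5_000)   # stand-in for a real feature column
synthetic = rng.normal(loc=0.1, scale=1.1, size=5_000)   # stand-in for its synthetic counterpart

# The KS statistic measures the largest gap between the two empirical CDFs.
result = ks_2samp(real, synthetic)
print(f"KS statistic: {result.statistic:.3f}, p-value: {result.pvalue:.3g}")

# Illustrative rule of thumb; real projects should set thresholds per use case.
if result.statistic > 0.1:
    print("Warning: synthetic feature drifts noticeably from the real distribution.")
```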

    -
    +

    6.4 Data Storage

Data sourcing and data storage go hand in hand: data must be stored in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.

    @@ -646,7 +645,7 @@

The stored data is often accompanied by metadata, which is defined as ‘data about data’. It provides detailed contextual information about the data, such as the means of data creation, the time of creation, the attached data use license, etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results.
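As a concrete illustration, metadata of this kind can be kept in a small, machine-readable record alongside the data. The fields below form a hypothetical minimal dataset card, not the exact schema Hugging Face uses.

```python
import json
from datetime import date

# Hypothetical minimal "dataset card" capturing data about the data.
dataset_card = {
    "name": "keyword-spotting-snippets",        # illustrative dataset name
    "created": date(2023, 10, 10).isoformat(),
    "creation_method": "crowdsourced voice recordings, reviewed by two annotators",
    "license": "CC-BY-4.0",
    "known_biases": [
        "accents skew toward North American English speakers",
        "few recordings from speakers over 65",
    ],
    "intended_use": "on-device keyword spotting research",
    "out_of_scope_use": "speaker identification or surveillance",
}

# Store the card next to the dataset so tools and people can retrieve it quickly.
with open("dataset_card.json", "w") as f:
    json.dump(dataset_card, f, indent=2)
```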

    -

    Data Governance2: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.

  • 2 Janssen, Marijn, et al. "Data governance: Organizing data for trustworthy Artificial Intelligence." Government Information Quarterly 37.3 (2020): 101493.

  • +

Data Governance[^1]: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that help manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim of upholding its quality, ensuring compliance, maintaining security, and deriving value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.

    Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.

    @@ -664,15 +663,15 @@

    Special data storage considerations for tinyML

    Efficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:

      -
    • Extracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)3 are commonly used to represent important audio characteristics.

    • +
    • Extracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)[^2] are commonly used to represent important audio characteristics.

• Creating Embeddings - Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that’s more manageable and efficient for computation and storage.

    • -
    • Vector quantization4 - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.

    • +
    • Vector quantization[^3] - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.

    • Sequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.

    -
  • 3 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. "Mel Frequency Cepstral Coefficient and its applications: A Review." IEEE Access (2022).

  • 4 Vasuki, A., and P. T. Vanathi. "A review of vector quantization techniques." IEEE Potentials 25.4 (2006): 39-47.

  • This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.

    -

    Selective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 20185), the authors discuss adaptation of the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications.

  • 5 Rybakov, Oleg, et al. "Streaming keyword spotting on mobile devices." arXiv preprint arXiv:2005.06720 (2020).

  • +

    This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.
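A rough sketch of the feature-extraction and quantization workflow above is shown below. It assumes the librosa library for MFCC extraction and uses a simple k-means codebook for vector quantization; the audio file name, frame handling, and codebook size are illustrative, and a real codebook would be trained on frames pooled from the whole training set.

```python
import numpy as np
import librosa                      # assumed here for MFCC extraction
from sklearn.cluster import KMeans  # assumed here for building the codebook

# 1. Extract compact acoustic features (MFCCs) instead of storing raw audio.
signal, sr = librosa.load("keyword_clip.wav", sr=16_000)   # hypothetical audio file
mfccs = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13)   # shape: (13, n_frames)
frames = mfccs.T                                           # one 13-dim vector per frame

# 2. Vector quantization: learn a small codebook from training frames...
codebook_size = min(64, len(frames))
kmeans = KMeans(n_clusters=codebook_size, n_init=10, random_state=0).fit(frames)

# ...then store only the index of the nearest codeword per frame, in temporal order.
codes = kmeans.predict(frames).astype(np.uint8)            # sequential storage, 1 byte/frame

# 3. At search time, frames are reconstructed (approximately) from the codebook.
reconstructed = kmeans.cluster_centers_[codes]
print(codes[:10], reconstructed.shape)
```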

    +

Selective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training but not required during inference. The network is run on the full audio during training; however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2020[^4]), the authors discuss adapting the model’s intermediate data storage structure to accommodate the streaming models that are prevalent in tinyML applications.

    -
    +

    6.5 Data Processing

    Data processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. “Data preparation accounts for about 60-80% of the work of a data scientist.”

    @@ -688,7 +687,7 @@

    Encoding categorical variables
  • Using techniques like dimensionality reduction
  • -

    Data validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation6, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.

  • 6 Vasuki, A., and P. T. Vanathi. "A review of vector quantization techniques." IEEE Potentials 25.4 (2006): 39-47.

  • +

Data validation serves a broader role than just ensuring adherence to certain standards, like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. Therefore, it is imperative to catch data errors early, before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation[^3], contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
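A minimal sketch of this kind of validation for sensor data is shown below, using pandas; the physical bounds, column names, and outlier rule are illustrative assumptions.

```python
import pandas as pd
import numpy as np

# Hypothetical raw sensor readings, including a transient fault and a missing value.
readings = pd.DataFrame({
    "timestamp": pd.date_range("2023-10-10", periods=5, freq="s"),
    "temp_c":    [21.3, 21.4, -999.0, np.nan, 21.6],   # -999.0 is a sensor glitch
})

# 1. Range check: flag physically impossible values (below absolute zero, in Celsius).
invalid = readings["temp_c"] < -273.15
readings.loc[invalid, "temp_c"] = np.nan

# 2. Outlier check: flag values far from the column median (simple, illustrative rule).
median = readings["temp_c"].median()
outliers = (readings["temp_c"] - median).abs() > 10
readings.loc[outliers, "temp_c"] = np.nan

# 3. Mean imputation: fill remaining gaps with the column mean.
readings["temp_c"] = readings["temp_c"].fillna(readings["temp_c"].mean())
print(readings)
```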

    @@ -796,10 +795,10 @@

There are several current challenges in ensuring data transparency, especially because it requires significant time and financial resources. Data systems are also quite complex, and full transparency can be difficult to achieve in these cases. Full transparency may also overwhelm consumers with too much detail. Finally, it is important to balance the tradeoff between transparency and privacy.

    -
    +

    6.10 Licensing

    Many high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.

    -

    For instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 20207). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.

  • 7 Birhane, Abeba, and Vinay Uday Prabhu. "Large image datasets: A pyrrhic win for computer vision?." 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2021.

  • +

For instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020[^6]). For corporations, accessing the ImageNet dataset requires registration and adherence to its terms of use, which restrict commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.

    The legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.

    New Data Regulations and Their Implications

    New data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU’s Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:

    @@ -810,91 +809,20 @@

    Challenges in Assembling ML Training Datasets

    -

    Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing8 or public-private data collaborations could greatly accelerate industry progress and ethical standards.

  • 8 Sonnenburg, Soren, et al. "The need for open source software in machine learning." (2007): 2443-2466.

  • +

Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing[^7] or public-private data collaborations could greatly accelerate industry progress and raise ethical standards.

In some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may contain names, contact details, and other identifying data that need to be removed, sometimes well after the dataset has already been actively sourced and used to train models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.

    Data collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.

    -

    For instance, below is an example request from Common Voice users to remove their information:


    Thank you for downloading the Common Voice dataset. Account holders are free to request deletion of their voice clips at any time. We action this on our side for all future releases and are legally obligated to inform those who have downloaded a historic release so that they can also take action.

    -

    You are receiving this message because one or more account holders have requested that their voice clips be deleted. Their clips are part of the dataset that you downloaded and are associated with the hashed IDs listed below. Please delete them from your downloads in order to fulfill your third party data privacy obligations.

    -

    Thank you for your timely completion.

    -
      -
    • 4497f1df0c6c4e647fa4354ad07a40075cc95a210dafce49ce0c35cd252 e4ec0fad1034e0cc3af869499e6f60ce315fe600ee2e9188722de906f909a21e0ee57

    • -
    • 97a8f0a1df086bd5f76343f5f4a511ae39ec98256a0ca48de5c54bc5771 d8c8e32283a11056147624903e9a3ac93416524f19ce0f9789ce7eef2262785cf3af7

    • -
    • 969ea94ac5e20bdd7a098747f5dc2f6d203f6b659c0c3b6257dc790dc34 d27ac3f2fafb3910f1ec8d7ebea38c120d4b51688047e352baa957cc35f0f5c69b112

    • -
    • 6b5460779f644ad39deffeab6edf939547f206596089d554984abff3d36 a4ecc06e66870958e62299221c09af8cd82864c626708371d72297eaea5955d8e46a9

    • -
    • 33275ff207a27708bd1187ff950888da592cac507e01e922c4b9a07d3f6 c2c3fe2ade429958c3702294f446bfbad8c4ebfefebc9e157d358ccc6fcf5275e7564

    • -
    -

    Having the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate9,10,11 its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.

  • 9 Ginart, Antonio, et al. "Making ai forget you: Data deletion in machine learning." Advances in neural information processing systems 32 (2019).

  • 10 Sekhari, Ayush, et al. "Remember what you want to forget: Algorithms for machine unlearning." Advances in Neural Information Processing Systems 34 (2021): 18075-18086.

  • 11 Guo, Chuan, et al. "Certified data removal from machine learning models." arXiv preprint arXiv:1911.03030 (2019).

  • +

Having the ability to update the dataset by removing data from it enables dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to consider that some models may have already been trained on the dataset, and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. This raises the question: should the model be retrained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate[^8],[^9],[^10] its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.
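At the dataset level, honoring a deletion request like the Common Voice notice above is mechanically straightforward. Below is a hedged sketch of filtering out clips whose hashed IDs appear on a removal list; the file layout and column names are hypothetical. The harder, open problem remains removing a sample's influence from models already trained on it.

```python
import pandas as pd

# Hypothetical manifest of a downloaded speech dataset: one row per clip.
manifest = pd.read_csv("clips_manifest.csv")          # columns: client_id_hash, path, sentence

# Hashed IDs from a deletion request.
with open("deletion_request.txt") as f:
    ids_to_remove = {line.strip() for line in f if line.strip()}

# Drop the affected clips and record how many were removed for audit purposes.
before = len(manifest)
manifest = manifest[~manifest["client_id_hash"].isin(ids_to_remove)]
print(f"Removed {before - len(manifest)} clips; {len(manifest)} remain.")

# Persist the filtered manifest as a new dataset version.
manifest.to_csv("clips_manifest_v2.csv", index=False)
```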

    Dataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering.

    6.11 Conclusion

Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing and managing data to fuel the development of machine learning models.

It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means including existing datasets, web scraping, crowdsourcing and synthetic data generation. Each approach involves tradeoffs between factors like cost, speed, privacy and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development.

Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust and responsible AI systems, including for embedded and tinyML applications.

    -
    -
    -

    6.12 Helpful References

    -

    1. [3 big problems with datasets in AI and machine learning](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/)

    -

    2. [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)

    -

    3. [Data Engineering for Everyone](https://arxiv.org/abs/2102.11447)

    -

    4. [DataPerf: Benchmarks for Data-Centric AI Development](https://arxiv.org/abs/2207.10062)

    -

    5. [Deep Spoken Keyword Spotting: An Overview](https://arxiv.org/abs/2111.10592)

    -

    6. [“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)

    -

    7. [Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)](https://arxiv.org/abs/2003.12206)

    -

    8. [LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf)

    -

    9. [Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)

    -

    10. [Multilingual Spoken Words Corpus](https://openreview.net/pdf?id=c20jiJ5K2H)

    -

    11. [OpenImages](https://storage.googleapis.com/openimages/web/index.html)

    -

    12. [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749)

    -

    13. [Small-footprint keyword spotting using deep neural networks](https://ieeexplore.ieee.org/abstract/document/6854370?casa_token=XD6SL8Um1Y0AAAAA:ZxqFThJWLlwDrl1IA374t_YzEvwHNNR-pTWiWV9pyr85rsl-ZZ5BpkElyHo91d3_l8yU0IVIgg)

    -

    14. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)

    -
    -

    14  Embedded MLOps

    -
    -

    14.1 Introduction

    -

    Explanation: This subsection sets the groundwork for the discussions to follow, elucidating the fundamental concept of MLOps and its critical role in enhancing the efficiency, reliability, and scalability of embedded AI systems. It outlines the unique characteristics of implementing MLOps in an embedded context, emphasizing its significance in the streamlined deployment and management of machine learning models.

    -
      -
    • Overview of MLOps
    • -
    • The importance of MLOps in the embedded domain
    • -
    • Unique challenges and opportunities in embedded MLOps
    • -
    -
    -
    -

    14.2 Deployment Environments

    -

    Explanation: This section focuses on different environments where embedded AI systems can be deployed. It will delve into aspects like edge devices, cloud platforms, and hybrid environments, offering insights into the unique characteristics and considerations of each.

    -
      -
    • Cloud-based deployment: Features and benefits
    • -
    • Edge computing: Characteristics and applications
    • -
    • Hybrid environments: Combining the best of edge and cloud computing
    • -
    • Considerations for selecting an appropriate deployment environment
    • -
    -
    -
    -

    14.3 Deployment Strategies

    -

    Explanation: Here, readers will be introduced to various deployment strategies that facilitate a smooth transition from development to production. It discusses approaches such as blue-green deployments, canary releases, and rolling deployments, which can help in maintaining system stability and minimizing downtime during updates.

    -
      -
    • Overview of different deployment strategies
    • -
    • Blue-green deployments: Definition and benefits
    • -
    • Canary releases: Phased rollouts and monitoring
    • -
    • Rolling deployments: Ensuring continuous service availability
    • -
    • Strategy selection: Factors to consider
    • -
    -
    -
    -

    14.4 Workflow Automation

    -

    Explanation: Automation is at the heart of MLOps, helping to streamline workflows and enhance efficiency. This subsection highlights the significance of workflow automation in embedded MLOps, discussing various strategies and techniques for automating tasks such as testing, deployment, and monitoring, fostering a faster and error-free development lifecycle.

    -
      -
    • Automated testing: unit tests, integration tests
    • -
    • Automated deployment: scripting, configuration management
    • -
    • Continuous monitoring: setting up automated alerts and dashboards
    • -
    • Benefits of workflow automation: speed, reliability, repeatability
    • -
    -
    -
    -

    14.5 Model Versioning

    -

    Explanation: Model versioning is a pivotal aspect of MLOps, facilitating the tracking and management of different versions of machine learning models throughout their lifecycle. This subsection emphasizes the importance of model versioning in embedded systems, where memory and computational resources are limited, offering strategies for effective version management and rollback.

    -
      -
    • Importance of versioning in machine learning pipelines
    • -
    • Tools for model versioning: DVC, MLflow
    • -
    • Strategies for version control: naming conventions, metadata tagging
    • -
    • Rollback strategies: handling model regressions and rollbacks
    • -
    -
    -
    -

    14.6 Model Monitoring and Maintenance

    -

    Explanation: The process of monitoring and maintaining deployed models is crucial to ensure their long-term performance and reliability. This subsection underscores the significance of proactive monitoring and maintenance in embedded systems, discussing methodologies for monitoring model health, performance metrics, and implementing routine maintenance tasks to ensure optimal functionality.

    -
      -
    • The importance of monitoring deployed AI models
    • -
    • Setting up monitoring systems: tools and techniques
    • -
    • Tracking model performance: accuracy, latency, resource usage
    • -
    • Maintenance strategies: periodic updates, fine-tuning
    • -
    • Alerts and notifications: Setting up mechanisms for timely responses to issues
    • -
    • Over the air updates
    • -
    • Responding to anomalies: troubleshooting and resolution strategies
    • -
    -
    -
    -

    14.7 Security and Compliance

    -

    Explanation: Security and compliance are paramount in MLOps, safeguarding sensitive data and ensuring adherence to regulatory requirements. This subsection illuminates the critical role of implementing security measures and ensuring compliance in embedded MLOps, offering insights into best practices for data protection, access control, and regulatory adherence.

    -
      -
    • Security considerations in embedded MLOps: data encryption, secure communications
    • -
    • Compliance requirements: GDPR, HIPAA, and other regulations
    • -
    • Strategies for ensuring compliance: documentation, audits, training
    • -
    • Tools for security and compliance management: SIEM systems, compliance management platforms
    • -
    -
    -
    -

    14.8 Conclusion

    -

    Explanation: As we wrap up this chapter, we consolidate the key takeaways regarding the implementation of MLOps in the embedded domain. This final section seeks to furnish readers with a holistic view of the principles and practices of embedded MLOps, encouraging a thoughtful approach to adopting MLOps strategies in their projects, with a glimpse into the potential future trends in this dynamic field.

    -
      -
    • Recap of key concepts and best practices in embedded MLOps
    • -
    • Challenges and opportunities in implementing MLOps in embedded systems
    • -
    • Future directions: emerging trends and technologies in embedded MLOps
    • -
\ No newline at end of file
diff --git a/mlworkflow.html b/mlworkflow.html
deleted file mode 100644
index a8d574c84..000000000
--- a/mlworkflow.html
+++ /dev/null
@@ -1,757 +0,0 @@
-Embedded AI: Principles, Algorithms, and Applications - 5  ML Workflow

    5  ML Workflow

    -

In this chapter, we're going to learn about the machine learning workflow. It will set the stage for the later chapters that dive into the details. But to prevent ourselves from missing the forest for the trees, this chapter gives a high-level overview of the steps involved in the ML workflow.

    -

    The ML workflow is a systematic and structured approach that guides professionals and researchers in developing, deploying, and maintaining ML models. This workflow is generally delineated into several critical stages, each contributing towards the effective development of intelligent systems.

    -

    Here’s a broad outline of the stages involved:

    -
    -

    5.1 Overview

    -

    A machine learning (ML) workflow is the process of developing, deploying, and maintaining ML models. It typically consists of the following steps:

    -
      -
1. Define the problem. What are you trying to achieve with your ML model? Do you want to classify images, predict customer churn, or generate text? Once you have a clear understanding of the problem, you can start to collect data and choose a suitable ML algorithm.
-
2. Collect and prepare data. ML models are trained on data, so it's important to collect a high-quality dataset that is representative of the real-world problem you're trying to solve. Once you have your data, you need to clean it and prepare it for training. This may involve tasks such as removing outliers, imputing missing values, and scaling features.
-
3. Choose an ML algorithm. There are many different ML algorithms available, each with its own strengths and weaknesses. The best algorithm for your project will depend on the type of data you have and the problem you're trying to solve.
-
4. Train the model. Once you have chosen an ML algorithm, you need to train the model on your prepared data. This process can take some time, depending on the size and complexity of your dataset.
-
5. Evaluate the model. Once the model is trained, you need to evaluate its performance on a held-out test set. This will give you an idea of how well the model will generalize to new data.
-
6. Deploy the model. Once you're satisfied with the performance of the model, you can deploy it to production. This may involve integrating the model into a software application or making it available as a web service.
-
7. Monitor and maintain the model. Once the model is deployed, you need to monitor its performance and make updates as needed. This is because the real world is constantly changing, and your model may need to be updated to reflect these changes.
    -
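As a minimal, illustrative sketch of steps 2 through 5 above, the snippet below uses scikit-learn's bundled digits dataset as a stand-in for a real problem; the model choice and split are arbitrary assumptions.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# 2. Collect and prepare data (a bundled toy dataset stands in for real collection).
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# 3./4. Choose an algorithm and train it.
model = LogisticRegression(max_iter=2000)
model.fit(X_train, y_train)

# 5. Evaluate on held-out data before any deployment decision.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"held-out accuracy: {accuracy:.3f}")
```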

    The ML workflow is an iterative process. Once you have deployed a model, you may find that it needs to be retrained on new data or that the algorithm needs to be adjusted. It’s important to monitor the performance of your model closely and make changes as needed to ensure that it is still meeting your needs. In addition to the above steps, there are a number of other important considerations for ML workflows, such as:

    -
      -
    • Version control: It’s important to track changes to your code and data so that you can easily reproduce your results and revert to previous versions if necessary.
    • -
    • Documentation: It’s important to document your ML workflow so that others can understand and reproduce your work.
    • -
    • Testing: It’s important to test your ML workflow thoroughly to ensure that it is working as expected.
    • -
    • Security: It’s important to consider the security of your ML workflow and data, especially if you are deploying your model to production.
    • -
    -
    -
    -

    5.2 General vs. Embedded AI

    -

    The ML workflow delineated above serves as a comprehensive guide applicable broadly across various platforms and ecosystems, encompassing cloud-based solutions, edge computing, and tinyML. However, when we delineate the nuances of the general ML workflow and contrast it with the workflow in Embedded AI environments, we encounter a series of intricate differences and complexities. These nuances not only elevate the embedded AI workflow to a challenging and captivating domain but also open avenues for remarkable innovations and advancements.

    -

    Now, let’s explore these differences in detail:

    -
      -
1. Resource Optimization:
    -
  • General ML Workflow: Generally has the luxury of substantial computational resources available in cloud or data center environments. It focuses more on model accuracy and performance.
  • -
  • Embedded AI Workflow: Needs meticulous planning and execution to optimize the model's size and computational demands, as they have to operate within the limited resources available in embedded systems. Techniques like model quantization and pruning become essential (a rough sketch of quantization follows this list).
  • -
2. Real-time Processing:
    -
  • General ML Workflow: The emphasis on real-time processing is usually less, and batch processing of data is quite common.
  • -
  • Embedded AI Workflow: Focuses heavily on real-time data processing, necessitating a workflow where low latency and rapid execution are a priority, especially in applications like autonomous driving and industrial automation.
  • -
3. Data Management and Privacy:
    -
  • General ML Workflow: Data is typically processed in centralized locations, sometimes requiring extensive data transfer, with a focus on securing data during transit and storage.
  • -
  • Embedded AI Workflow: Promotes edge computing, which facilitates data processing closer to the source, reducing data transmission needs and enhancing privacy by keeping sensitive data localized.
  • -
4. Hardware-Software Integration:
    -
  • General ML Workflow: Often operates on general-purpose hardware platforms with software development happening somewhat independently.
  • -
  • Embedded AI Workflow: Involves a tighter hardware-software co-design where both are developed in tandem to achieve optimal performance and efficiency, integrating custom chips or utilizing hardware accelerators.
  • -
    -
    -
    -
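As referenced under Resource Optimization above, here is a rough numpy sketch of the idea behind post-training affine quantization, mapping float32 weights to int8 to cut model size roughly 4x. It is a simplified illustration of what toolchains like TensorFlow Lite perform, not their actual API, and the layer shape is hypothetical.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Affine (asymmetric) quantization of float32 weights to int8."""
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    zero_point = int(round(-128 - w_min / scale))
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

# Hypothetical layer weights.
w = np.random.randn(256, 128).astype(np.float32)
q, scale, zp = quantize_int8(w)

print(f"float32 size: {w.nbytes} bytes, int8 size: {q.nbytes} bytes")
print(f"max reconstruction error: {np.abs(dequantize_int8(q, scale, zp) - w).max():.4f}")
```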

    5.3 Roles & Responsibilities

    -

    As we work through the various tasks at hand, you will realize that there is a lot of complexity. Creating a machine learning solution, particularly for embedded AI systems, is a multidisciplinary endeavor involving various experts and specialists. Here is a list of personnel that are typically involved in the process, along with brief descriptions of their roles:

    -

    Project Manager:

    -
      -
    • Coordinates and manages the overall project.
    • -
    • Ensures all team members are working synergistically.
    • -
    • Responsible for project timelines and milestones.
    • -
    -

    Domain Experts:

    -
      -
    • Provide insights into the specific domain where the AI system will be implemented.
    • -
    • Help in defining project requirements and constraints based on domain-specific knowledge.
    • -
    -

    Data Scientists:

    -
      -
    • Specialize in analyzing data to develop machine learning models.
    • -
    • Responsible for data cleaning, exploration, and feature engineering.
    • -
    -

    Machine Learning Engineers:

    -
      -
    • Focus on the development and deployment of machine learning models.
    • -
    • Collaborate with data scientists to optimize models for embedded systems.
    • -
    -

    Data Engineers:

    -
      -
    • Responsible for managing and optimizing data pipelines.
    • -
    • Work on the storage and retrieval of data used for machine learning model training.
    • -
    -

    Embedded Systems Engineers:

    -
      -
    • Focus on integrating machine learning models into embedded systems.
    • -
    • Optimize system resources for running AI applications.
    • -
    -

    Software Developers:

    -
      -
    • Develop software components that interface with the machine learning models.
    • -
    • Responsible for implementing APIs and other integration points for the AI system.
    • -
    -

    Hardware Engineers:

    -
      -
    • Involved in designing and optimizing the hardware that hosts the embedded AI system.
    • -
    • Collaborate with embedded systems engineers to ensure compatibility.
    • -
    -

    UI/UX Designers:

    -
      -
    • Design the user interface and experience for interacting with the AI system.
    • -
    • Focus on user-centric design and ensuring usability.
    • -
    -

    Quality Assurance (QA) Engineers:

    -
      -
    • Responsible for testing the overall system to ensure it meets quality standards.
    • -
    • Work on identifying bugs and issues before the system is deployed.
    • -
    -

    Ethicists and Legal Advisors:

    -
      -
    • Consult on the ethical implications of the AI system.
    • -
    • Ensure compliance with legal and regulatory requirements related to AI.
    • -
    -

    Operations and Maintenance Personnel:

    -
      -
    • Responsible for monitoring the system after deployment.
    • -
    • Work on maintaining and upgrading the system as needed.
    • -
    -

    Security Specialists:

    -
      -
    • Focus on ensuring the security of the AI system.
    • -
    • Work on identifying and mitigating potential security vulnerabilities.
    • -
    -

    Don’t worry! You don’t have to be a one-stop ninja.

    -

    Understanding the diversified roles and responsibilities is paramount in the journey to building a successful machine learning project. As we traverse the upcoming chapters, we will wear the different hats, embracing the essence and expertise of each role described herein. This immersive method nurtures a deep-seated appreciation for the inherent complexities, thereby facilitating an encompassing grasp of the multifaceted dynamics of embedded AI projects.

    -

    Moreover, this well-rounded insight promotes not only seamless collaboration and unified efforts but also fosters an environment ripe for innovation. It enables us to identify areas where cross-disciplinary insights might foster novel thoughts, nurturing ideas and ushering in breakthroughs in the field. Additionally, being aware of the intricacies of each role allows us to anticipate potential obstacles and strategize effectively, guiding the project towards triumph with foresight and detailed understanding.

    -

    As we advance, we encourage you to hold a deep appreciation for the amalgamation of expertise that contributes to the fruition of a successful machine learning initiative. In later discussions, particularly when we delve into MLOps, we will examine these different facets or personas in greater detail. It’s worth noting at this point that the range of topics touched upon might seem overwhelming. This endeavor aims to provide you with a comprehensive view of the intricacies involved in constructing an embedded AI system, without the expectation of mastering every detail personally.

    - - - - - \ No newline at end of file diff --git a/search.json b/search.json index d33755f17..0b28606b2 100644 --- a/search.json +++ b/search.json @@ -284,7 +284,7 @@ "href": "data_engineering.html#introduction", "title": "6  Data Engineering", "section": "6.1 Introduction", - "text": "6.1 Introduction\nData is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.\nDataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\\(^{1}\\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.\nLooking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.1 It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.1 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. \"Mel Frequency Cepstral Coefficient and its applications: A Review.\" IEEE Access (2022).\nCreating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.\nWe begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases don’t have extensive data repositories readily available for curation. 
Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!" + "text": "6.1 Introduction\nData is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.\nDataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\\(^{1}\\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.\nLooking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.[^2] It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.\nCreating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.\nWe begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We’ll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we’ll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. 
This is particularly pertinent for embedded systems, as many use cases don’t have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We’ll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it’s imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you’ll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let’s embark on this journey!" }, { "objectID": "data_engineering.html#problem-definition", @@ -305,14 +305,14 @@ "href": "data_engineering.html#data-storage", "title": "6  Data Engineering", "section": "6.4 Data Storage", - "text": "6.4 Data Storage\nData sourcing and data storage go hand-in-hand and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.\n\n\n\n\n\n\n\n\n\n\nDatabase\nData Warehouse\nData Lake\n\n\n\n\nPurpose\nOperational and transactional\nAnalytical\nAnalytical\n\n\nData type\nStructured\nStructured\nStructured, semi-structured and/or unstructured\n\n\nScale\nSmall to large volumes of data\nLarge volumes of integrated data\nLarge volumes of diverse data\n\n\nExamples\nMySQL\nGoogle BigQuery, Amazon Redshift, Microsoft Azure Synapse.\nGoogle Cloud Storage, AWS S3, Azure Data Lake Storage\n\n\n\nThe stored data is often accompanied by metadata, which is defined as ‘data about data’. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results.\nData Governance2: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.2 Janssen, Marijn, et al. \"Data governance: Organizing data for trustworthy Artificial Intelligence.\" Government Information Quarterly 37.3 (2020): 101493.\nData governance utilizes three integrative approaches: planning and control, organizational, and risk-based. 
The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.\n\n\n\nData Governance\n\n\nFigure source: https://www.databricks.com/discover/data-governance\nSome examples of data governance across different sectors include:\n\nMedicine: Health Information Exchanges(HIEs) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information.\nFinance: Basel III Framework is an international regulatory framework for banks. It ensures that banks establish clear policies, practices, and responsibilities for data management, ensuring data accuracy, completeness, and timeliness. Not only does it enable banks to meet regulatory compliance, it also prevents financial crises by more effective management of risks.\nGovernment: Governments agencies managing citizen data, public records, and administrative information implement data governance to manage data transparently and securely. Social Security System in the US, and Aadhar system in India are good examples of such governance systems.\n\nSpecial data storage considerations for tinyML\nEfficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:\n\nExtracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)3 are commonly used to represent important audio characteristics.\nCreating Embeddings- Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that’s more manageable and efficient for computation and storage.\nVector quantization4 - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. 
Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.\nSequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.\n\n3 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. \"Mel Frequency Cepstral Coefficient and its applications: A Review.\" IEEE Access (2022).4 Vasuki, A., and P. T. Vanathi. \"A review of vector quantization techniques.\" IEEE Potentials 25.4 (2006): 39-47.This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.\nSelective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 20185), the authors discuss adaptation of the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications.5 Rybakov, Oleg, et al. \"Streaming keyword spotting on mobile devices.\" arXiv preprint arXiv:2005.06720 (2020)." + "text": "6.4 Data Storage\nData sourcing and data storage go hand-in-hand and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.\n\n\n\n\n\n\n\n\n\n\nDatabase\nData Warehouse\nData Lake\n\n\n\n\nPurpose\nOperational and transactional\nAnalytical\nAnalytical\n\n\nData type\nStructured\nStructured\nStructured, semi-structured and/or unstructured\n\n\nScale\nSmall to large volumes of data\nLarge volumes of integrated data\nLarge volumes of diverse data\n\n\nExamples\nMySQL\nGoogle BigQuery, Amazon Redshift, Microsoft Azure Synapse.\nGoogle Cloud Storage, AWS S3, Azure Data Lake Storage\n\n\n\nThe stored data is often accompanied by metadata, which is defined as ‘data about data’. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results.\nData Governance[^1]: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. 
Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.\nData governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.\n\n\n\nData Governance\n\n\nFigure source: https://www.databricks.com/discover/data-governance\nSome examples of data governance across different sectors include:\n\nMedicine: Health Information Exchanges (HIEs) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information.\nFinance: The Basel III Framework is an international regulatory framework for banks. It ensures that banks establish clear policies, practices, and responsibilities for data management, ensuring data accuracy, completeness, and timeliness. Not only does it help banks meet regulatory compliance requirements, it also helps prevent financial crises through more effective risk management.\nGovernment: Government agencies managing citizen data, public records, and administrative information implement data governance to manage data transparently and securely. The Social Security system in the US and the Aadhaar system in India are good examples of such governance systems.\n\nSpecial data storage considerations for tinyML\nEfficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:\n\nExtracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)[^2] are commonly used to represent important audio characteristics.\nCreating embeddings - Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that’s more manageable and efficient for computation and storage.\nVector quantization[^3] - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.\nSequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.
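As a rough, hedged illustration of the feature-extraction and quantization steps above (a toy sketch, not the storage format of any particular keyword-spotting system; it assumes librosa and scikit-learn, which the text does not prescribe), the following code computes MFCC frames and replaces each frame with the index of its nearest codeword from a small k-means codebook:

```python
# Toy sketch: keep one small codeword index per MFCC frame instead of raw audio.
# librosa and scikit-learn are assumed dependencies; all parameters are illustrative.
import numpy as np
import librosa
from sklearn.cluster import KMeans

rng = np.random.default_rng(seed=0)
y = rng.standard_normal(4 * 16000).astype(np.float32)      # stand-in for 4 s of 16 kHz audio

mfcc = librosa.feature.mfcc(y=y, sr=16000, n_mfcc=13).T    # (frames, 13) acoustic features
codebook = KMeans(n_clusters=32, n_init=10, random_state=0).fit(mfcc)
codes = codebook.predict(mfcc)                             # one integer per frame, kept in temporal order

approx = codebook.cluster_centers_[codes]                  # frame-by-frame approximate reconstruction
print(mfcc.shape, codes.shape, approx.shape)
```

Storing the per-frame codes plus the shared codebook takes far less space than the waveform, at the cost of some reconstruction error.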
\n\nThis format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.\nSelective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2020[^4]), the authors discuss adaptation of the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications."
  },
  {
    "objectID": "data_engineering.html#data-processing",
    "href": "data_engineering.html#data-processing",
    "title": "6  Data Engineering",
    "section": "6.5 Data Processing",
    "text": "6.5 Data Processing\nData processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. “Data preparation accounts for about 60-80% of the work of a data scientist.”\n\n\n\nA breakdown of tasks that data scientists allocate their time to, highlighting the significant portion spent on data cleaning and organizing.\n\n\nProper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty - it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.\nData often comes from diverse sources and can be unstructured or semi-structured. Thus, it’s essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include:\n\nNormalizing numerical variables\nEncoding categorical variables\nUsing techniques like dimensionality reduction\n\nData validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation6, contribute directly to the quality of datasets. 
This, in turn, impacts the performance, fairness, and safety of the models trained on them.6 Vasuki, A., and P. T. Vanathi. \"A review of vector quantization techniques.\" IEEE Potentials 25.4 (2006): 39-47.\n\n\n\nA detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training\n\n\nLet’s take a look at an example of a data processing pipeline. In the context of tinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. By streamlining the data flow, from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.\nThe MSWC used a forced alignment method to automatically extract individual word recordings to train keyword-spotting models from the Common Voice project, which features crowdsourced sentence-level recordings. Forced alignment refers to a group of long-standing methods in speech processing that are used to predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowd-sourced recordings often feature background noises, such as static and wind. Depending on the model’s requirements, these noises can be removed or intentionally retained.\nMaintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.\nThere is a boom of data processing pipelines, these are commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud. It provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management, and there are several mechanisms that specifically focus on data processing which is an integral part of these systems." + "text": "6.5 Data Processing\nData processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. “Data preparation accounts for about 60-80% of the work of a data scientist.”\n\n\n\nA breakdown of tasks that data scientists allocate their time to, highlighting the significant portion spent on data cleaning and organizing.\n\n\nProper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty - it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. 
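One hedged way to picture this step is the small pandas sketch below; the sensor-log table and column names are invented for illustration and are not part of the original text. It removes duplicate rows, invalidates a physically impossible temperature reading, and fills missing values with the column mean (mean imputation):

```python
# Illustrative cleaning pass over a made-up sensor log (pandas assumed).
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "device_id": ["a1", "a1", "a2", "a3", "a3"],
    "temp_c":    [21.4, 21.4, -999.0, np.nan, 22.1],        # -999.0 models a sensor glitch
})

df = df.drop_duplicates()                                   # remove exact duplicate rows
df.loc[df["temp_c"] < -273.15, "temp_c"] = np.nan           # below absolute zero -> treat as missing
df["temp_c"] = df["temp_c"].fillna(df["temp_c"].mean())     # mean imputation of missing values
print(df)
```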
By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.\nData often comes from diverse sources and can be unstructured or semi-structured. Thus, it’s essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include:\n\nNormalizing numerical variables\nEncoding categorical variables\nUsing techniques like dimensionality reduction\n\nData validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation[^3], contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.\n\n\n\nA detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training\n\n\nLet’s take a look at an example of a data processing pipeline. In the context of tinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelines—systematic and automated workflows for data transformation, storage, and processing. By streamlining the data flow, from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.\nTo train keyword-spotting models, the MSWC used a forced alignment method to automatically extract individual word recordings from the Common Voice project, which features crowdsourced sentence-level recordings. Forced alignment refers to a group of long-standing methods in speech processing that are used to predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowd-sourced recordings often feature background noises, such as static and wind. Depending on the model’s requirements, these noises can be removed or intentionally retained.\nMaintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.\nThere has been a boom in data processing pipelines; these are commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud. 
It provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management, and there are several mechanisms that specifically focus on data processing which is an integral part of these systems." }, { "objectID": "data_engineering.html#data-labeling", @@ -347,7 +347,7 @@ "href": "data_engineering.html#licensing", "title": "6  Data Engineering", "section": "6.10 Licensing", - "text": "6.10 Licensing\nMany high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.\nFor instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 20207). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.7 Birhane, Abeba, and Vinay Uday Prabhu. \"Large image datasets: A pyrrhic win for computer vision?.\" 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2021.\nThe legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.\nNew Data Regulations and Their Implications\nNew data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU’s Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:\n\nClassifies AI systems by risk.\nMandates development and usage prerequisites.\nEmphasizes data quality, transparency, human oversight, and accountability.\n\nAdditionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance. Key elements include the prohibition of AI systems posing \"unacceptable\" risks, stringent conditions for high-risk systems, and minimal obligations for \"limited risk\" AI systems. 
The proposed European AI Board will oversee and ensure efficient regulation implementation.\nChallenges in Assembling ML Training Datasets\nComplex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing8 or public-private data collaborations could greatly accelerate industry progress and ethical standards.8 Sonnenburg, Soren, et al. \"The need for open source software in machine learning.\" (2007): 2443-2466.\nIn some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may have names, contact details, and other identifying data that may need to be removed from the dataset, this is well after the dataset has already been actively sourced and used for training models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.\nData collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.\nFor instance, below is an example request from Common Voice users to remove their information:\n\n\n\n\n\n\nThank you for downloading the Common Voice dataset. Account holders are free to request deletion of their voice clips at any time. We action this on our side for all future releases and are legally obligated to inform those who have downloaded a historic release so that they can also take action.\nYou are receiving this message because one or more account holders have requested that their voice clips be deleted. Their clips are part of the dataset that you downloaded and are associated with the hashed IDs listed below. Please delete them from your downloads in order to fulfill your third party data privacy obligations.\nThank you for your timely completion.\n\n4497f1df0c6c4e647fa4354ad07a40075cc95a210dafce49ce0c35cd252 e4ec0fad1034e0cc3af869499e6f60ce315fe600ee2e9188722de906f909a21e0ee57\n97a8f0a1df086bd5f76343f5f4a511ae39ec98256a0ca48de5c54bc5771 d8c8e32283a11056147624903e9a3ac93416524f19ce0f9789ce7eef2262785cf3af7\n969ea94ac5e20bdd7a098747f5dc2f6d203f6b659c0c3b6257dc790dc34 d27ac3f2fafb3910f1ec8d7ebea38c120d4b51688047e352baa957cc35f0f5c69b112\n6b5460779f644ad39deffeab6edf939547f206596089d554984abff3d36 a4ecc06e66870958e62299221c09af8cd82864c626708371d72297eaea5955d8e46a9\n33275ff207a27708bd1187ff950888da592cac507e01e922c4b9a07d3f6 c2c3fe2ade429958c3702294f446bfbad8c4ebfefebc9e157d358ccc6fcf5275e7564\n\n\n\n\n\n\n\n\n\nHaving the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. 
We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate9,10,11 its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.9 Ginart, Antonio, et al. \"Making ai forget you: Data deletion in machine learning.\" Advances in neural information processing systems 32 (2019).10 Sekhari, Ayush, et al. \"Remember what you want to forget: Algorithms for machine unlearning.\" Advances in Neural Information Processing Systems 34 (2021): 18075-18086.11 Guo, Chuan, et al. \"Certified data removal from machine learning models.\" arXiv preprint arXiv:1911.03030 (2019).\nDataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering." + "text": "6.10 Licensing\nMany high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.\nFor instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020[^6]). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.\nThe legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. 
The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.\nNew Data Regulations and Their Implications\nNew data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU’s Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:\n\nClassifies AI systems by risk.\nMandates development and usage prerequisites.\nEmphasizes data quality, transparency, human oversight, and accountability.\n\nAdditionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance. Key elements include the prohibition of AI systems posing \"unacceptable\" risks, stringent conditions for high-risk systems, and minimal obligations for \"limited risk\" AI systems. The proposed European AI Board will oversee and ensure efficient regulation implementation.\nChallenges in Assembling ML Training Datasets\nComplex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing[^7] or public-private data collaborations could greatly accelerate industry progress and ethical standards.\nIn some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may contain names, contact details, and other identifying data that need to be removed, often well after the dataset has already been sourced and used to train models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.\nData collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.\nHaving the ability to update the dataset by removing data will enable dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. Some models may have already been trained on the dataset, and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. This raises the question: should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate[^8],[^9],[^10] its impact on the model's behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. 
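As a small, hedged illustration of the dataset-side mechanics only (the manifest, column names, and IDs below are invented; this does nothing to undo a sample's influence on models that were already trained), honoring deletion requests can be as simple as filtering a manifest by hashed account IDs before the next release:

```python
# Drop clips whose (hypothetical) hashed account IDs appear on a removal list.
import pandas as pd

manifest = pd.DataFrame({
    "clip_path":   ["clips/0001.wav", "clips/0002.wav", "clips/0003.wav"],
    "client_hash": ["hash_a", "hash_b", "hash_c"],          # made-up identifiers
})
removal_requests = {"hash_b"}                               # IDs whose owners asked for deletion

kept = manifest[~manifest["client_hash"].isin(removal_requests)].reset_index(drop=True)
# This updates future dataset releases only; it does not erase the deleted
# samples' effect from models trained on earlier releases.
print(kept)
```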
This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.\nDataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering." }, { "objectID": "data_engineering.html#conclusion", @@ -356,13 +356,6 @@ "section": "6.11 Conclusion", "text": "6.11 Conclusion\nData is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means including existing datasets, web scraping, crowdsourcing and synthetic data generation. Each approach involves tradeoffs between factors like cost, speed, privacy and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust and responsible AI systems, including for embedded and tinyML applications." }, - { - "objectID": "data_engineering.html#helpful-references", - "href": "data_engineering.html#helpful-references", - "title": "6  Data Engineering", - "section": "6.12 Helpful References", - "text": "6.12 Helpful References\n1. [3 big problems with datasets in AI and machine learning](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/)\n2. [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)\n3. [Data Engineering for Everyone](https://arxiv.org/abs/2102.11447)\n4. [DataPerf: Benchmarks for Data-Centric AI Development](https://arxiv.org/abs/2207.10062)\n5. [Deep Spoken Keyword Spotting: An Overview](https://arxiv.org/abs/2111.10592)\n6. [“Everyone wants to do the model work, not the data work”: Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)\n7. [Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)](https://arxiv.org/abs/2003.12206)\n8. [LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf)\n9. [Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)\n10. [Multilingual Spoken Words Corpus](https://openreview.net/pdf?id=c20jiJ5K2H)\n11. [OpenImages](https://storage.googleapis.com/openimages/web/index.html)\n12. 
[Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749)\n13. [Small-footprint keyword spotting using deep neural networks](https://ieeexplore.ieee.org/abstract/document/6854370?casa_token=XD6SL8Um1Y0AAAAA:ZxqFThJWLlwDrl1IA374t_YzEvwHNNR-pTWiWV9pyr85rsl-ZZ5BpkElyHo91d3_l8yU0IVIgg)\n14. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)" - }, { "objectID": "frameworks.html#introduction", "href": "frameworks.html#introduction",