6 Data Engineering
+
6.1 Introduction
-Explanation: This section establishes the groundwork, defining data engineering and explaining its importance and role in Embedded AI. A well-rounded introduction will help in establishing the foundation for the readers.
-
-- Definition and Importance of Data Engineering in AI
-- Role of Data Engineering in Embedded AI
-- Synergy with Machine Learning and Deep Learning
-
+Data is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.
+Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\(^{1}\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.
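+To make the privacy-utility trade-off concrete, below is a minimal sketch of the Laplace mechanism that underlies differential privacy, releasing a noisy mean of a numeric attribute. The column values, clipping bounds, and epsilon are illustrative assumptions, not prescriptions for any particular dataset.
+```python
+import numpy as np
+
+def private_mean(values, lower, upper, epsilon, rng=None):
+    """Release a differentially private mean via the Laplace mechanism.
+
+    Values are clipped to [lower, upper], so the sensitivity of the mean
+    is bounded by (upper - lower) / n.
+    """
+    rng = rng or np.random.default_rng()
+    values = np.clip(np.asarray(values, dtype=float), lower, upper)
+    n = len(values)
+    sensitivity = (upper - lower) / n
+    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
+    return values.mean() + noise
+
+# Illustrative usage: ages from a hypothetical health dataset.
+ages = [34, 51, 29, 62, 45, 38, 57]
+print(private_mean(ages, lower=18, upper=90, epsilon=1.0))
+```
+Smaller values of epsilon add more noise, trading utility for stronger privacy; aggregation and detail reduction follow a similar logic of deliberately coarsening what is released.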
+Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.\(^{1}\) It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.
1 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. "Mel Frequency Cepstral Coefficient and its applications: A Review." IEEE Access (2022).
+Creating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.
+We begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. We'll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, we'll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases don't have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. We'll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, it's imperative to devise strategies for managing and storing expansive datasets. By the end of this section, you'll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Let's embark on this journey!
-
-6.2 Problem
-Explanation: This section is a crucial starting point in any data engineering project, as it lays the groundwork for the project's trajectory and ultimate success. Here's a brief explanation of why each subsection within the "Problem Definition" is important:
+
+6.2 Problem Definition
+In many domains of machine learning, while sophisticated algorithms take center stage, the fundamental importance of data quality is often overlooked. This neglect gives rise to "Data Cascades" - events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities.
+
+
+
+Although many ML professionals recognize the importance of data, numerous practitioners still report facing these cascades. This highlights a systemic issue: while the allure of developing advanced models remains, the data work that underpins them is often underappreciated.
+Take, for example, Keyword Spotting (KWS). KWS serves as a prime example of TinyML in action and is a critical technology behind voice-enabled interfaces on endpoint devices such as smartphones. Typically functioning as lightweight wake-word engines, these systems are consistently active, listening for a specific phrase to trigger further actions. When we say the phrases "Ok Google" or "Alexa," this initiates a process on a microcontroller embedded within the device. Despite their limited resources, these microcontrollers play a pivotal role in enabling seamless voice interactions with devices, often operating in environments with high levels of ambient noise. The uniqueness of the wake-word helps minimize false positives, ensuring that the system is not triggered inadvertently.
+It is important to appreciate that these keyword spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as the breaking of glass. This evolution is geared towards creating intelligent devices capable of understanding and responding to a myriad of vocal commands, heralding a future where even household appliances can be controlled through voice interactions.
+
+
+
+Building a reliable KWS model is not a straightforward task. It demands a deep understanding of the deployment scenario, encompassing where and how these devices will operate. For instance, a KWS model's effectiveness is not just about recognizing a word; it's about discerning it among various accents and background noises, whether in a bustling cafe or amid the blaring sound of a television in a living room or a kitchen where these devices are commonly found. It's about ensuring that a whispered "Alexa" in the dead of night or a shouted "Ok Google" in a noisy marketplace are both recognized with equal precision.
+Moreover, many of the current KWS voice assistants support a limited number of languages, leaving a substantial portion of the worldâs linguistic diversity unrepresented. This limitation is partly due to the difficulty in gathering and monetizing data for languages spoken by smaller populations. The long-tail distribution of languages implies that many languages have limited data available, making the development of supportive technologies challenging.
+This level of accuracy and robustness hinges on the availability and quality of data, the ability to label the data correctly, and transparency of the data for the end user - all before the data is used to train the model. But it all begins with a clear understanding of the problem statement or definition.
+Generally, in ML, problem definition has a few key steps:
+
+Identifying the problem clearly
+Setting clear objectives
+Establishing success benchmarks
+Understanding end-user engagement/use
+Understanding the constraints and limitations of deployment
+Finally, performing the data collection.
+
+Laying a solid foundation for a project is essential for its trajectory and eventual success. Central to this foundation is first identifying a clear problem, such as ensuring that voice commands in voice assistance systems are recognized consistently across varying environments. Clear objectives, like creating representative datasets for diverse scenarios, provide a unified direction. Benchmarks, such as system accuracy in keyword detection, offer measurable outcomes to gauge progress. Engaging with stakeholders, from end-users to investors, provides invaluable insights and ensures alignment with market needs. Additionally, when delving into areas like voice assistance, understanding platform constraints is pivotal. Embedded systems, such as microcontrollers, come with inherent limitations in processing power, memory, and energy efficiency. Recognizing these limitations ensures that functionalities, like keyword detection, are tailored to operate optimally, balancing performance with resource conservation.
+In this context, using KWS as an example, we can break each of the steps out as follows:
+
+Identifying the Problem: At its core, KWS aims to detect specific keywords amidst a sea of ambient sounds and other spoken words. The primary problem is to design a system that can recognize these keywords with high accuracy, low latency, and minimal false positives or negatives, especially when deployed on devices with limited computational resources.
+Setting Clear Objectives: The objectives for a KWS system might include:
-- Identifying the Problem
-- Setting Clear Objectives
-- Benchmarks for Success
-- Stakeholder Engagement and Understanding
-- Understanding the Constraints and Limitations of Embedded Systems
-
+- Achieving a specific accuracy rate (e.g., 98% accuracy in keyword detection).
+- Ensuring low latency (e.g., keyword detection and response within 200 milliseconds).
+- Minimizing power consumption to extend battery life on embedded devices.
+- Ensuring the model's size is optimized for the available memory on the device.
+
+Benchmarks for Success: Establish clear metrics to measure the success of the KWS system (a short sketch of computing such metrics follows this walkthrough). This could include:
+
+- True Positive Rate: The percentage of correctly identified keywords.
+- False Positive Rate: The percentage of non-keywords incorrectly identified as keywords.
+- Response Time: The time taken from keyword utterance to system response.
+- Power Consumption: Average power used during keyword detection.
+
+Stakeholder Engagement and Understanding: Engage with stakeholders, which might include device manufacturers, hardware and software developers, and end-users. Understand their needs, capabilities, and constraints. For instance:
+
+- Device manufacturers might prioritize low power consumption.
+- Software developers might emphasize ease of integration.
+- End-users would prioritize accuracy and responsiveness.
+
+Understanding the Constraints and Limitations of Embedded Systems: Embedded devices come with their own set of challenges:
+
+- Memory Limitations: KWS models need to be lightweight to fit within the memory constraints of embedded devices. Typically, KWS models might need to be as small as 16KB to fit in the always-on island of the SoC. Moreover, this is just the model size. Additional application code for pre-processing may also need to fit within the memory constraints.
+- Processing Power: The computational capabilities of embedded devices are limited (few hundred MHz of clock speed), so the KWS model must be optimized for efficiency.
+- Power Consumption: Since many embedded devices are battery-powered, the KWS system must be power-efficient.
+- Environmental Challenges: Devices might be deployed in various environments, from quiet bedrooms to noisy industrial settings. The KWS system must be robust enough to function effectively across these scenarios.
+
+Data Collection and Analysis: For a KWS system, the quality and diversity of data are paramount. Considerations might include:
+
+- Variety of Accents: Collect data from speakers with various accents to ensure wide-ranging recognition.
+- Background Noises: Include data samples with different ambient noises to train the model for real-world scenarios.
+- Keyword Variations: People might either pronounce keywords differently or have slight variations in the wake word itself. Ensure the dataset captures these nuances.
+
+Iterative Feedback and Refinement: Once a prototype KWS system is developed, it's crucial to test it in real-world scenarios, gather feedback, and iteratively refine the model. This ensures that the system remains aligned with the defined problem and objectives. This is important because deployment scenarios change over time.
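+To ground the benchmarks above, here is a minimal sketch of how the success metrics and the memory constraint might be checked during evaluation. The evaluation records, latency figures, parameter count, and 16 KB budget are illustrative assumptions rather than numbers from any real deployment.
+```python
+import numpy as np
+
+# Hypothetical per-utterance evaluation records: (is_keyword, detected, latency_ms)
+results = [
+    (True,  True,  142.0),
+    (True,  False, 0.0),   # missed keyword
+    (False, False, 0.0),
+    (False, True,  180.0), # false trigger
+    (True,  True,  155.0),
+]
+
+tp = sum(1 for kw, det, _ in results if kw and det)
+fn = sum(1 for kw, det, _ in results if kw and not det)
+fp = sum(1 for kw, det, _ in results if not kw and det)
+tn = sum(1 for kw, det, _ in results if not kw and not det)
+
+true_positive_rate = tp / (tp + fn)    # share of keywords correctly detected
+false_positive_rate = fp / (fp + tn)   # share of non-keywords that triggered the device
+avg_response_ms = np.mean([lat for kw, det, lat in results if det])
+
+print(f"TPR={true_positive_rate:.2f}, FPR={false_positive_rate:.2f}, "
+      f"avg response={avg_response_ms:.0f} ms")
+
+# Rough memory check: an int8-quantized model stores about one byte per weight,
+# so a 16 KB always-on budget leaves room for roughly 16,000 parameters
+# (before counting activation buffers and pre-processing code).
+budget_bytes = 16 * 1024
+num_params = 16_000
+print("fits budget:", num_params <= budget_bytes)
+```
+In practice these metrics would be computed over thousands of utterances collected across the accents, noise conditions, and keyword variations described above.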
+
6.3 Data Sourcing
-Explanation: This section delves into the first step in data engineering - gathering data. Understanding various data types and sources is vital for developing robust AI systems, especially in the context of embedded systems where resources might be limited.
-
-- Data Sources: crowdsourcing, pre-existing datasets etc.
-- Data Types: Structured, Semi-Structured, and Unstructured
-- Real-time Data Processing in Embedded Systems
-
+The quality and diversity of the data gathered are important for developing accurate and robust AI systems. Sourcing high-quality training data requires careful consideration of the objectives, resources, and ethical implications. Data can be obtained from various sources depending on the needs of the project:
+
+6.3.1 Pre-Existing Datasets
+Platforms like Kaggle and UCI Machine Learning Repository provide a convenient starting point. Pre-existing datasets are a valuable resource for researchers, developers, and businesses alike. One of their primary advantages is cost-efficiency. Creating a dataset from scratch can be both time-consuming and expensive, so having access to ready-made data can save significant resources. Moreover, many of these datasets, like ImageNet, have become standard benchmarks in the machine learning community, allowing for consistent performance comparisons across different models and algorithms. This availability of data means that experiments can be started immediately without any delays associated with data collection and preprocessing. In a fast-moving field like ML, this expediency is important.
+Quality assurance is important to consider because even popular pre-existing datasets contain errors; the ImageNet dataset, for instance, was found to have an error rate of over 6.4%. Given their widespread use, errors or biases in these datasets are often identified and rectified by the community. This assurance is especially beneficial for students and newcomers to the field, as they can focus on learning and experimentation without worrying about data integrity. Supporting documentation that often accompanies existing datasets is invaluable, though this generally applies only to widely used datasets. Good documentation provides insights into the data collection process, variable definitions, and sometimes even offers baseline model performances. This information not only aids understanding but also promotes reproducibility in research, a cornerstone of scientific integrity - and one where machine learning systems currently face a reproducibility crisis. When other researchers have access to the same data, they can validate findings, test new hypotheses, or apply different methodologies, thus allowing us to build on each other's work more rapidly.
+While platforms like Kaggle and UCI Machine Learning Repository are invaluable resources, it's essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes these datasets do not reflect real-world data.
+In addition, bias, validity, and reproducibility issues may exist in these datasets, and awareness of these issues has grown in recent years.
+
+
+6.3.2 Web Scraping
+Web scraping refers to automated techniques for extracting data from websites. It typically involves sending HTTP requests to web servers, retrieving HTML content, and parsing that content to extract relevant information. Popular tools and frameworks for web scraping include Beautiful Soup, Scrapy, and Selenium. These tools offer different functionalities, from parsing HTML content to automating web browser interactions, especially for websites that load content dynamically using JavaScript.
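+As a simple illustration of the workflow described above, the following sketch fetches a page with requests, checks the site's robots.txt first, and parses image links with Beautiful Soup. The URL and user-agent string are placeholders, and any real scraping effort should also respect the site's terms of service and rate limits.
+```python
+import requests
+from bs4 import BeautifulSoup
+from urllib import robotparser
+from urllib.parse import urljoin
+
+base_url = "https://example.com"          # placeholder site
+page_url = urljoin(base_url, "/gallery")  # placeholder page
+
+# Respect robots.txt before fetching anything.
+rp = robotparser.RobotFileParser()
+rp.set_url(urljoin(base_url, "/robots.txt"))
+rp.read()
+
+if rp.can_fetch("my-dataset-bot", page_url):
+    resp = requests.get(page_url, headers={"User-Agent": "my-dataset-bot"}, timeout=10)
+    resp.raise_for_status()
+    soup = BeautifulSoup(resp.text, "html.parser")
+    image_urls = [urljoin(base_url, img["src"])
+                  for img in soup.find_all("img") if img.get("src")]
+    print(f"Found {len(image_urls)} candidate images")
+else:
+    print("robots.txt disallows scraping this page")
+```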
+Web scraping can be an effective way to gather large datasets for training machine learning models, particularly when human-labeled data is scarce. For computer vision research, web scraping enables the collection of massive volumes of images and videos. Researchers have used this technique to build influential datasets like ImageNet and OpenImages. For example, one could scrape e-commerce sites to amass product photos for object recognition, or social media platforms to collect user uploads for facial analysis. Even before ImageNet, MIT's LabelMe project scraped Flickr for over 63,000 annotated images covering hundreds of object categories.
+Beyond computer vision, web scraping supports the gathering of textual data for natural language tasks. Researchers can scrape news sites for sentiment analysis data, forums and review sites for dialogue systems research, or social media for topic modeling. For example, the training data for the chatbot ChatGPT was obtained by scraping much of the public internet. GitHub repositories were scraped to train GitHub's Copilot AI coding assistant.
+Web scraping can also collect structured data like stock prices, weather data, or product information for analytical applications. Once data is scraped, it is essential to store it in a structured manner, often using databases or data warehouses. Proper data management ensures the usability of the scraped data for future analysis and applications.
+However, while web scraping offers numerous advantages, there are significant limitations and ethical considerations to bear in mind. Not all websites permit scraping, and violating these restrictions can lead to legal repercussions. It is also unethical and potentially illegal to scrape copyrighted material or private communications. Ethical web scraping mandates adherence to a website's "robots.txt" file, which outlines the sections of the site that can be accessed and scraped by automated bots. To deter automated scraping, many websites implement rate limits. If a bot sends too many requests in a short period, it might be temporarily blocked, restricting the speed of data access. Additionally, the dynamic nature of web content means that data scraped at different intervals might lack consistency, posing challenges for longitudinal studies. That said, there are emerging trends like web navigation, where machine learning algorithms can automatically navigate a website to access dynamic content.
+For niche subjects, the volume of pertinent data available for scraping might be limited. For example, while scraping for common topics like images of cats and dogs might yield abundant data, searching for rare medical conditions might not be as fruitful. Moreover, the data obtained through scraping is often unstructured and noisy, necessitating thorough preprocessing and cleaning. It is crucial to understand that not all scraped data will be of high quality or accuracy. Employing verification methods, such as cross-referencing with alternate data sources, can enhance data reliability.
+Privacy concerns arise when scraping personal data, emphasizing the need for anonymization. Therefore, it is paramount to adhere to a website's Terms of Service, confine data collection to public domains, and ensure the anonymity of any personal data acquired.
+While web scraping can be a scalable method to amass large training datasets for AI systems, its applicability is confined to specific data types. For example, sourcing data for Inertial Measurement Units (IMU) for gesture recognition is not straightforward through web scraping. At most, one might be able to scrape an existing dataset.
+
+
+6.3.3 Crowdsourcing
+Crowdsourcing for datasets is the practice of obtaining data by using the services of a large number of people, either from a specific community or the general public, typically via the internet. Instead of relying on a small team or specific organization to collect or label data, crowdsourcing leverages the collective effort of a vast, distributed group of participants. Services like Amazon Mechanical Turk enable the distribution of annotation tasks to a large, diverse workforce. This facilitates the collection of labels for complex tasks like sentiment analysis or image recognition that specifically require human judgment.
+Crowdsourcing has emerged as an effective approach for many data collection and problem-solving needs. One major advantage of crowdsourcing is scalability: by distributing tasks to a large, global pool of contributors on digital platforms, projects can process huge volumes of data in a short timeframe. This makes crowdsourcing ideal for large-scale data labeling, collection, and analysis.
+In addition, crowdsourcing taps into a diverse group of participants, bringing a wide range of perspectives, cultural insights, and language abilities that can enrich data and enhance creative problem-solving in ways that a more homogenous group may not. Because crowdsourcing draws from a large audience beyond traditional channels, it also tends to be more cost-effective than conventional methods, especially for simpler microtasks.
+Crowdsourcing platforms also allow for great flexibility, as task parameters can be adjusted in real-time based on initial results. This creates a feedback loop for iterative improvements to the data collection process. Complex jobs can be broken down into microtasks and distributed to multiple people, with cross-validation of results by assigning redundant versions of the same task. Ultimately, when thoughtfully managed, crowdsourcing enables community engagement around a collaborative project, where participants find reward in contributing.
+However, while crowdsourcing offers numerous advantages, it's essential to approach it with a clear strategy. While it provides access to a diverse set of annotators, it also introduces variability in the quality of annotations. Additionally, platforms like Mechanical Turk might not always capture a complete demographic spectrum; often tech-savvy individuals are overrepresented, while children and the elderly may be underrepresented. It's crucial to provide clear instructions and possibly even training for the annotators. Periodic checks and validations of the labeled data can help maintain quality. This ties back to the topic of clear problem definition that we discussed earlier. Crowdsourcing for datasets also requires careful attention to ethical considerations. It's crucial to ensure that participants are informed about how their data will be used and that their privacy is protected. Quality control through detailed protocols, transparency in sourcing, and auditing is essential to ensure reliable outcomes.
+For TinyML, crowdsourcing can pose some unique challenges. TinyML devices are highly specialized for particular tasks within tight constraints. As a result, the data they require tends to be very specific. It may be difficult to obtain such specialized data from a general audience through crowdsourcing. For example, TinyML applications often rely on data collected from certain sensors or hardware. Crowdsourcing would require participants to have access to very specific and consistent devices - like microphones with the same sampling rates. Even for simple audio tasks like keyword spotting, these hardware nuances present obstacles.
+Beyond hardware, the data itself needs high granularity and quality, given the limitations of TinyML. It can be hard to ensure this when crowdsourcing from those unfamiliar with the application's context and requirements. There are also potential issues around privacy, real-time collection, standardization, and technical expertise. Moreover, the narrow nature of many TinyML tasks makes accurate data labeling difficult without the proper understanding. Participants may struggle to provide reliable annotations without full context.
+Thus, while crowdsourcing can work well in many cases, the specialized needs of TinyML introduce unique data challenges. Careful planning is required for guidelines, targeting, and quality control. For some applications, crowdsourcing may be feasible, but others may require more focused data collection efforts to obtain relevant, high-quality training data.
+
+
+6.3.4 Synthetic Data
+Synthetic data generation can be useful for addressing some of the limitations of data collection. It involves creating data that wasn't originally captured or observed, but is generated using algorithms, simulations, or other techniques to resemble real-world data. It has become a valuable tool in various fields, particularly in scenarios where real-world data is scarce, expensive, or ethically challenging to obtain (e.g., TinyML). Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data that is almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable.
+In many domains, especially emerging ones, there may not be enough real-world data available for analysis or training machine learning models. Synthetic data can fill this gap by producing large volumes of data that mimic real-world scenarios. For instance, detecting the sound of breaking glass might be challenging in security applications where a TinyML device is trying to identify break-ins. Collecting real-world data would require breaking numerous windows, which is impractical and costly.
+Moreover, in machine learning, especially in deep learning, having a diverse dataset is crucial. Synthetic data can augment existing datasets by introducing variations, thereby enhancing the robustness of models. For example, SpecAugment is an excellent data augmentation technique for Automatic Speech Recognition (ASR) systems.
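+Below is a minimal sketch of SpecAugment-style augmentation, masking random frequency bands and time steps of a spectrogram with NumPy. The mask sizes and the random spectrogram are illustrative assumptions; a real pipeline would apply this to log-mel spectrograms of actual recordings and tune the mask parameters.
+```python
+import numpy as np
+
+def spec_augment(spec, num_freq_masks=1, num_time_masks=1, max_f=8, max_t=20, rng=None):
+    """Mask random frequency bands and time steps of a (mel_bins, frames) spectrogram."""
+    rng = rng or np.random.default_rng()
+    spec = spec.copy()
+    n_mels, n_frames = spec.shape
+    for _ in range(num_freq_masks):                 # frequency masking
+        f = int(rng.integers(0, max_f + 1))
+        f0 = int(rng.integers(0, max(1, n_mels - f)))
+        spec[f0:f0 + f, :] = 0.0
+    for _ in range(num_time_masks):                 # time masking
+        t = int(rng.integers(0, max_t + 1))
+        t0 = int(rng.integers(0, max(1, n_frames - t)))
+        spec[:, t0:t0 + t] = 0.0
+    return spec
+
+# Illustrative usage on a random stand-in for a log-mel spectrogram.
+dummy_spec = np.random.rand(40, 100)
+augmented = spec_augment(dummy_spec)
+```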
+Privacy and confidentiality are also big issues. Datasets containing sensitive or personal information pose privacy concerns when shared or used. Synthetic data, being artificially generated, doesn't have these direct ties to real individuals, allowing for safer use while preserving essential statistical properties.
+Generating synthetic data, especially once the generation mechanisms have been established, can be a more cost-effective alternative. In the aforementioned security application scenario, synthetic data eliminates the need for breaking multiple windows to gather relevant data.
+Many embedded use-cases deal with unique situations, such as manufacturing plants, that are difficult to simulate. Synthetic data allows researchers complete control over the data generation process, enabling the creation of specific scenarios or conditions that are challenging to capture in real life.
+While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases.
+
-
+
6.4 Data Storage
-Explanation: Data must be stored and managed efficiently to facilitate easy access and processing. This section will provide insights into different data storage options and their respective advantages and challenges in embedded systems.
+Data sourcing and data storage go hand in hand, and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.
+
+|           | Database                       | Data Warehouse                                            | Data Lake                                             |
+|-----------|--------------------------------|-----------------------------------------------------------|-------------------------------------------------------|
+| Purpose   | Operational and transactional  | Analytical                                                | Analytical                                            |
+| Data type | Structured                     | Structured                                                | Structured, semi-structured, and/or unstructured      |
+| Scale     | Small to large volumes of data | Large volumes of integrated data                          | Large volumes of diverse data                         |
+| Examples  | MySQL                          | Google BigQuery, Amazon Redshift, Microsoft Azure Synapse | Google Cloud Storage, AWS S3, Azure Data Lake Storage |
+
+The stored data is often accompanied by metadata, which is defined as "data about data". It provides detailed contextual information about the data, such as the means of data creation, the time of creation, the attached data use license, etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates, or analytical results.
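+As a simple illustration, the record below shows the kind of contextual metadata that might accompany a stored dataset. The field names and values are hypothetical and are not taken from any specific Dataset Card schema.
+```python
+# Illustrative metadata record for a stored dataset; field names are hypothetical.
+dataset_metadata = {
+    "name": "kws-commands-v1",
+    "created": "2023-06-01",
+    "creator": "speech-data-team",
+    "collection_method": "crowdsourced smartphone recordings",
+    "license": "CC-BY-4.0",
+    "languages": ["en", "es"],
+    "known_limitations": "few speakers over age 65; indoor recordings only",
+    "intended_use": "keyword spotting research",
+}
+```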
+Data Governance\(^{2}\): With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that help manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim of upholding its quality, ensuring compliance, maintaining security, and deriving value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.
2 Janssen, Marijn, et al. "Data governance: Organizing data for trustworthy Artificial Intelligence." Government Information Quarterly 37.3 (2020): 101493.
+Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.
+
+
+
+Figure source: https://www.databricks.com/discover/data-governance
+Some examples of data governance across different sectors include:
-- Data Warehousing
-- Data Lakes
-- Metadata Management
-- Data Governance
+Medicine: Health Information Exchanges (HIEs) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information.
+Finance: The Basel III Framework is an international regulatory framework for banks. It ensures that banks establish clear policies, practices, and responsibilities for data management, ensuring data accuracy, completeness, and timeliness. Not only does it enable banks to meet regulatory compliance, it also helps prevent financial crises through more effective risk management.
+Government: Government agencies managing citizen data, public records, and administrative information implement data governance to manage data transparently and securely. The Social Security system in the US and the Aadhaar system in India are good examples of such governance systems.
-
-
-6.5 Data Processing
-Explanation: Data processing is a pivotal step in transforming raw data into a usable format. This section provides a deep dive into the necessary processes, which include cleaning, integration, and establishing data pipelines, all crucial for streamlining operations in embedded AI systems.
+Special data storage considerations for tinyML
+Efficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:
-- Data Cleaning and Transformation
-- Data Pipelines
-- Batch vs. Stream Processing
+Extracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)\(^{3}\) are commonly used to represent important audio characteristics.
+Creating embeddings - Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that's more manageable and efficient for computation and storage.
+Vector quantization\(^{4}\) - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.
+Sequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.
+3 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. "Mel Frequency Cepstral Coefficient and its applications: A Review." IEEE Access (2022).
4 Vasuki, A., and P. T. Vanathi. "A review of vector quantization techniques." IEEE Potentials 25.4 (2006): 39-47.
This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.
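+Below is a minimal sketch of this feature-extraction-and-quantization workflow using librosa and scikit-learn. The file name, number of MFCCs, and codebook size are illustrative assumptions, and in practice the codebook would be learned once from a training corpus rather than from a single clip.
+```python
+import numpy as np
+import librosa
+from sklearn.cluster import KMeans
+
+# Load a short clip (the file path and parameters are illustrative).
+audio, sr = librosa.load("wake_word_clip.wav", sr=16000)
+
+# 1. Extract compact acoustic features (MFCCs), one vector per frame.
+mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)  # shape: (13, num_frames)
+frames = mfcc.T
+
+# 2./3. Build a small codebook and quantize each frame to its nearest code vector.
+n_codes = min(16, frames.shape[0])
+codebook = KMeans(n_clusters=n_codes, n_init=10, random_state=0).fit(frames)
+codes = codebook.predict(frames)                        # one small integer per frame
+
+# 4. Store the codes sequentially; they preserve temporal order and take far
+#    less space than the raw waveform or the float MFCC matrix.
+np.save("clip_codes.npy", codes.astype(np.uint8))
+```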
+Selective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training; however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2020\(^{5}\)), the authors discuss adapting the model's intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications.
5 Rybakov, Oleg, et al. "Streaming keyword spotting on mobile devices." arXiv preprint arXiv:2005.06720 (2020).
-
-6.6 Data Quality
-Explanation: Ensuring data quality is critical to developing reliable AI models. This section outlines various strategies to assure and evaluate data quality.
+
+6.5 Data Processing
+Data processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. "Data preparation accounts for about 60-80% of the work of a data scientist."
+
+
+
+Proper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty - it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.
+Data often comes from diverse sources and can be unstructured or semi-structured. Thus, it's essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include the following (a brief sketch follows the list below):
-- Data Validation
-- Handling Missing Values
-- Outlier Detection
-- Data Provenance
+- Normalizing numerical variables
+- Encoding categorical variables
+- Using techniques like dimensionality reduction
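+Here is a minimal sketch of these transformations with scikit-learn, combining numeric scaling, categorical encoding, and an optional dimensionality-reduction step. The toy sensor columns and the choice of two principal components are illustrative assumptions.
+```python
+import pandas as pd
+from sklearn.compose import ColumnTransformer
+from sklearn.decomposition import PCA
+from sklearn.pipeline import Pipeline
+from sklearn.preprocessing import OneHotEncoder, StandardScaler
+
+# Hypothetical tabular sensor data.
+df = pd.DataFrame({
+    "temperature_c": [21.5, 22.1, 19.8, 23.4],
+    "humidity_pct": [40.0, 42.5, 38.1, 45.0],
+    "device_type": ["doorbell", "thermostat", "doorbell", "camera"],
+})
+
+preprocess = ColumnTransformer(
+    [
+        ("scale", StandardScaler(), ["temperature_c", "humidity_pct"]),       # normalize numeric
+        ("encode", OneHotEncoder(handle_unknown="ignore"), ["device_type"]),  # encode categorical
+    ],
+    sparse_threshold=0.0,  # keep the output dense so PCA can consume it
+)
+
+pipeline = Pipeline([
+    ("preprocess", preprocess),
+    ("reduce", PCA(n_components=2)),  # optional dimensionality reduction
+])
+
+features = pipeline.fit_transform(df)
+print(features.shape)
+```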
+Data validation serves a broader role than just ensuring adherence to certain standards, like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation\(^{6}\), contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
6 Vasuki, A., and P. T. Vanathi. "A review of vector quantization techniques." IEEE Potentials 25.4 (2006): 39-47.
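+The sketch below illustrates this kind of validation and imputation on a toy sensor column with pandas; the value range, the sentinel value for failed reads, and the readings themselves are made-up assumptions.
+```python
+import numpy as np
+import pandas as pd
+
+# Hypothetical temperature readings in kelvin; -999 marks a failed sensor read.
+readings = pd.DataFrame({"temperature_k": [293.1, 294.0, -999.0, 291.7, np.nan, 5000.0]})
+
+# Validation: physically impossible or implausible values are treated as errors.
+valid = readings["temperature_k"].between(0.0, 400.0)
+readings.loc[~valid, "temperature_k"] = np.nan
+
+# Mean imputation: fill missing readings with the mean of the valid ones.
+readings["temperature_k"] = readings["temperature_k"].fillna(readings["temperature_k"].mean())
+print(readings)
+```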
+
+
+
+Let's take a look at an example of a data processing pipeline. In the context of tinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of a data processing pipeline - a systematic and automated workflow for data transformation, storage, and processing. By streamlining the data flow, from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.
+The MSWC used a forced alignment method to automatically extract individual word recordings for training keyword-spotting models from the Common Voice project, which features crowdsourced sentence-level recordings. Forced alignment refers to a group of long-standing methods in speech processing that are used to predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowd-sourced recordings often feature background noises, such as static and wind. Depending on the model's requirements, these noises can be removed or intentionally retained.
+Maintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.
+Data processing pipelines are proliferating and are commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud, which provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management. Several of these mechanisms focus specifically on data processing, which is an integral part of these systems.
-
-6.7 Feature Engineering
-Explanation: Feature engineering involves selecting and transforming variables to improve the performance of AI models. It's vital in embedded AI systems where computational resources are limited, and optimized feature sets can significantly improve performance.
+
+6.6 Data Labeling
+Data labeling is an important step in creating high-quality training datasets for machine learning models. Labels provide the ground truth information that allows models to learn relationships between inputs and desired outputs. This section covers key considerations around selecting label types, formats, and content to capture the necessary information for given tasks. It discusses common annotation approaches, from manual labeling to crowdsourcing to AI-assisted methods, and best practices for ensuring label quality through training, guidelines, and quality checks. Ethical treatment of human annotators is also something we emphasize. The integration of AI to accelerate and augment human annotation is also explored. Understanding labeling needs, challenges, and strategies is essential for constructing reliable, useful datasets that can train performant, trustworthy machine learning systems.
+Label Types: Labels capture information about key tasks or concepts. Common label types include binary classification, bounding boxes, segmentation masks, transcripts, captions, etc. The choice of label format depends on the use case and resource constraints, as more detailed labels require greater effort to collect (Johnson-Roberson et al. (2017)).
+
+Johnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. 2017. "Driving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?" 2017 IEEE International Conference on Robotics and Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.
+
+Unless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels, given their unique resource constraints. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for in the future, such as pedestrian detection.
+Additionally, annotators can potentially provide metadata that provides insight into how the dataset represents different characteristics of interest (see: Data Transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented (Ardila et al. (2020)). They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see the breakdown of who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. Additionally, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages.
+
+Ardila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. "Common Voice: A Massively-Multilingual Speech Corpus." Proceedings of the 12th Conference on Language Resources and Evaluation, May, 4218-22.
+Next, creators must determine the format of those labels. For example, a creator interested in car detection might choose between binary classification labels that say whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format may depend both on their use case and their resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire.
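+To make the trade-off concrete, the snippet below shows the same car-detection example labeled at three levels of detail; the field names, coordinates, and image size are hypothetical.
+```python
+import numpy as np
+
+image_id = "street_0042.jpg"  # hypothetical image
+
+# 1. Binary classification label: is a car present at all?
+classification_label = {"image": image_id, "car_present": True}
+
+# 2. Bounding-box label: approximate location as (x_min, y_min, x_max, y_max) in pixels.
+bbox_label = {"image": image_id, "class": "car", "boxes": [(112, 85, 340, 260)]}
+
+# 3. Segmentation mask: one class id per pixel (0 = background, 1 = car).
+segmentation_label = np.zeros((480, 640), dtype=np.uint8)
+segmentation_label[85:260, 112:340] = 1
+```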
+Annotation Methods: Common annotation approaches include manual labeling, crowdsourcing, and semi-automated techniques. Manual labeling by experts yields high quality but lacks scalability. Crowdsourcing enables distributed annotation by non-experts, often through dedicated platforms (Sheng and Zhang (2019)). Weakly supervised and programmatic methods can reduce manual effort by heuristically or automatically generating labels (Ratner et al. (2018)).
+
+Sheng, Victor S., and Jing Zhang. 2019. "Machine Learning with Crowdsourcing: A Brief Summary of the Past Research and Future Directions." Proceedings of the AAAI Conference on Artificial Intelligence 33 (01): 9837-43. https://doi.org/10.1609/aaai.v33i01.33019837.
+
+Ratner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and Christopher Ré. 2018. "Snorkel Metal: Weak Supervision for Multi-Task Learning." Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.
+After deciding on their labels' desired content and format, creators begin the annotation process. To collect large numbers of labels from human annotators, creators frequently rely on dedicated annotation platforms, which can connect them to teams of human annotators. When using these platforms, creators may have little insight into annotators' backgrounds and levels of experience with topics of interest. However, some platforms offer access to annotators with specific expertise (e.g. doctors).
+Ensuring Label Quality: There is no guarantee that the data labels are actually correct. It is possible that despite the best instructions being given to labelers, they still mislabel some images (Northcutt, Athalye, and Mueller (2021)). Strategies like quality checks, training annotators, and collecting multiple labels per datapoint can help ensure label quality. For ambiguous tasks, multiple annotators can help identify controversial datapoints and quantify disagreement levels.
+
+Northcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021. "Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks." arXiv, March. https://doi.org/10.48550/arXiv.2103.14749.
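+As a small illustration of collecting multiple labels per datapoint, the sketch below takes a majority vote across annotators and flags items with disagreement for review; the images and labels are made up.
+```python
+from collections import Counter
+
+# Hypothetical labels from three annotators for five images ("cat" vs "dog").
+annotations = {
+    "img_1": ["cat", "cat", "cat"],
+    "img_2": ["dog", "dog", "cat"],
+    "img_3": ["cat", "dog", "dog"],
+    "img_4": ["dog", "dog", "dog"],
+    "img_5": ["cat", "dog", "cat"],
+}
+
+for image, labels in annotations.items():
+    label, votes = Counter(labels).most_common(1)[0]
+    agreement = votes / len(labels)
+    flag = "  <- review" if agreement < 1.0 else ""
+    print(f"{image}: majority={label}, agreement={agreement:.2f}{flag}")
+```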
+
+When working with human annotators, it is important to offer fair compensation and otherwise prioritize ethical treatment, as annotators can be exploited or otherwise harmed during the labeling process (Perrigo, 2023). For example, if a dataset is likely to contain disturbing content, annotators may benefit from having the option to view images in grayscale (Google (n.d.)).
+
+Google. n.d. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/.
+AI-Assisted Annotation: ML has an insatiable demand for data, and no amount of labeled data ever seems sufficient. This raises the question of how we can get more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. This can be done in various ways, such as the following:
-- Importance of Feature Engineering
-- Techniques of Feature Selection
-- Feature Transformation for Embedded Systems
-- Embeddings
-- Real-time Feature Engineering in Embedded Systems
+- Pre-annotation: AI models can generate preliminary labels for a dataset using methods such as semi-supervised learning (Chapelle, Scholkopf, and Zien (2009)), which humans can then review and correct. This can save a significant amount of time, especially for large datasets.
+- Active learning: AI models can identify the most informative data points in a dataset, which can then be prioritized for human annotation (a brief uncertainty-sampling sketch appears after this list). This can help improve the labeled dataset's quality while reducing the overall annotation time.
+- Quality control: AI models can be used to identify and flag potential errors in human annotations. This can help to ensure the accuracy and consistency of the labeled dataset.
-
-
-6.8 Data Labeling
-Explanation: Labeling is an essential part of preparing data for supervised learning. This section focuses on various strategies and tools available for data labeling, a vital process in the data preparation phase.
+
+Chapelle, O., B. Scholkopf, and A. Zien, Eds. 2009. "Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book Reviews]." IEEE Transactions on Neural Networks 20 (3): 542-42. https://doi.org/10.1109/tnn.2009.2015974.
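+Below is a minimal sketch of the active learning idea using uncertainty sampling with scikit-learn; the synthetic data, the logistic regression model, and the batch size of ten points are illustrative assumptions.
+```python
+import numpy as np
+from sklearn.linear_model import LogisticRegression
+
+rng = np.random.default_rng(0)
+
+# Hypothetical pool: a small labeled seed set plus many unlabeled points.
+X_labeled = rng.normal(size=(20, 5))
+y_labeled = (X_labeled[:, 0] > 0).astype(int)
+X_pool = rng.normal(size=(500, 5))
+
+# Train on the labeled seed set, then score the unlabeled pool.
+model = LogisticRegression().fit(X_labeled, y_labeled)
+probs = model.predict_proba(X_pool)
+
+# Uncertainty sampling: the points whose top-class probability is lowest are
+# the most informative ones to send to human annotators next.
+uncertainty = 1.0 - probs.max(axis=1)
+query_indices = np.argsort(uncertainty)[-10:]
+print("Indices to label next:", query_indices)
+```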
+Here are some examples of how AI-assisted annotation has been proposed to be useful:
-- Manual Data Labeling
-- Ethical Considerations (e.g. OpenAI issues)
-- Automated Data Labeling
-- Labeling Tools
+- Medical imaging: AI-assisted annotation is being used to label medical images, such as MRI scans and X-rays (Krishnan, Rajpurkar, and Topol (2022)). Carefully annotating medical datasets is extremely challenging, especially at scale, since domain experts are scarce and their time is costly. This can help to train AI models to diagnose diseases and other medical conditions more accurately and efficiently.
+
+- Self-driving cars: AI-assisted annotation is being used to label images and videos from self-driving cars. This can help to train AI models to identify objects on the road, such as other vehicles, pedestrians, and traffic signs.
+- Social media: AI-assisted annotation is being used to label social media posts, such as images and videos. This can help to train AI models to identify and classify different types of content, such as news, advertising, and personal posts.
+
+Krishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022. "Self-Supervised Learning in Medicine and Healthcare." Nature Biomedical Engineering 6 (12): 1346-52. https://doi.org/10.1038/s41551-022-00914-1.
+
-
-6.9 Data Version Control
-Explanation: Version control is critical for managing changes and tracking versions of datasets during the development of AI models, facilitating reproducibility and collaboration.
+
+6.7 Data Version Control
+Production systems are perpetually inundated with fluctuating and escalating volumes of data, prompting the rapid emergence of numerous data replicas. This proliferating data serves as the foundation for training machine learning models. For instance, a global sales company engaged in sales forecasting continuously receives consumer behavior data. Similarly, healthcare systems formulating predictive models for disease diagnosis are consistently acquiring new patient data. TinyML applications, such as keyword spotting, are also highly data-hungry and generate substantial volumes of data. Consequently, meticulous tracking of data versions and the corresponding model performance is imperative.
+Data Version Control offers a structured methodology to handle alterations and versions of datasets efficiently. It facilitates the monitoring of modifications, preserves multiple versions, and guarantees reproducibility and traceability in data-centric projects. Furthermore, data version control provides the versatility to review and utilize specific versions as needed, ensuring that each stage of the data processing and model development can be revisited and audited with precision and ease. It has a variety of practical uses:
+Risk Management: Data version control allows transparency and accountability by tracking versions of the dataset.
+Collaboration and Efficiency: Easy access to different versions of the dataset in one place can improve data sharing of specific checkpoints, and enable efficient collaboration.
+Reproducibility: Data version control allows for tracking the performance of models with respect to different versions of the data, thereby enabling reproducibility.
+Key Concepts
-- Version Control Systems
-- Metadata
+Commits: A commit is an immutable snapshot of the data at a specific point in time, representing a unique version. Every commit is associated with a unique identifier, allowing that version to be referenced and retrieved later.
+Branches: Branching allows developers and data scientists to diverge from the main line of development and continue to work independently without affecting other branches. This is especially useful when experimenting with new features or models, enabling parallel development and experimentation without the risk of corrupting the stable, main branch.
+Merges: Merges help to integrate changes from different branches while maintaining the integrity of the data.
+Popular Data Version Control Systems
+DVC: Short for Data Version Control, DVC is an open-source, lightweight tool that works on top of Git and supports all kinds of data formats. It integrates seamlessly into the Git workflow if Git is being used for managing code. It captures the versions of data and models in Git commits, while storing the data and models themselves on premises or in the cloud (e.g. AWS, Google Cloud, Azure). These data and models (i.e., ML artifacts) are defined in metadata files, which get updated in every commit. It also allows tracking model metrics across different versions of the data (a brief sketch of retrieving a pinned data version follows this list).
+lakeFS: An open-source tool that supports data version control on data lakes. It supports many Git-like operations, such as branching and merging of data, as well as reverting to previous versions of the data. It also has a UI that makes exploration and management of data much easier.
+Git LFS: Useful for data version control on smaller-sized datasets. It uses Git's inbuilt branching and merging features, but is limited in terms of tracking metrics, reverting to previous versions, or integrating with data lakes.
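+As one hedged illustration of how a pinned data version might be consumed, the sketch below uses DVC's Python API to read a file at a specific Git revision; the repository URL, file path, and tag are hypothetical, and this assumes the dvc package is installed and that dvc.api.read behaves as documented.
+```python
+import dvc.api
+
+# Hypothetical repository, file path, and Git tag pinning a dataset version.
+data = dvc.api.read(
+    "data/keywords.csv",
+    repo="https://github.com/example-org/kws-project",
+    rev="v1.0",  # a Git tag, branch, or commit that identifies the data version
+)
+print(data[:200])
+```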
-
-6.10 Optimizing Data for Embedded AI
-Explanation: This section concentrates on optimization techniques specifically suited for embedded systems, focusing on strategies to reduce data volume and enhance storage and retrieval efficiency, crucial for resource-constrained embedded environments.
+
+6.8 Optimizing Data for Embedded AI
+Creators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for unusually specific use cases, requiring heavy filtering of datasets. While other natural language models may be capable of turning any speech to text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data because they do not address the task of interest. Additionally, an embedded AI system may be tied to specific hardware devices or environments. For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, show the wrong type of scenery, or were taken from the wrong height or angle.
+On the other hand, embedded AI systems are often expected to provide especially accurate performance in unpredictable real-world settings. As a result, creators may design datasets specifically to represent variations in potential inputs and promote model robustness. They may define a narrow scope for their project but then aim for deep coverage within those bounds. For example, creators of the doorbell model mentioned above might try to cover variations in data arising from:
-- Low-Resource Data Challenges
-- Data Reduction Techniques
-- Optimizing Data Storage and Retrieval
+- Geographically, socially and architecturally diverse neighborhoods
+- Different types of artificial and natural lighting
+- Different seasons and weather conditions
+- Obstructions (e.g. raindrops or delivery boxes obscuring the camera's view)
+As described above, creators may consider crowdsourcing or synthetically generating data to include these different kinds of variations.
-
-6.11 Challenges in Data Engineering
-Explanation: Understanding potential challenges can help in devising strategies to mitigate them. This section discusses common challenges encountered in data engineering, particularly focusing on embedded systems.
+
+6.9 Data Transparency
+By providing clear, detailed documentation, creators can help developers understand how best to use their datasets. Several groups have suggested standardized documentation formats for datasets, such as Data Cards (Pushkarna, Zaldivar, and Kjartansson 2022), datasheets (Gebru et al. 2021), data statements (Bender and Friedman 2018), or Data Nutrition Labels (Holland et al. 2020). When releasing a dataset, creators may describe what kinds of data they collected, how they collected and labeled it, and what kinds of use cases may be a good or poor fit for the dataset. Quantitatively, it may be appropriate to provide a breakdown of how well the dataset represents different groups (e.g. different gender groups, different cameras).
+
+Pushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022. "Data Cards: Purposeful and Transparent Dataset Documentation for Responsible AI." 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3533231.
+
+Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021. "Datasheets for Datasets." Communications of the ACM 64 (12): 86-92. https://doi.org/10.1145/3458723.
+
+Bender, Emily M., and Batya Friedman. 2018. "Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science." Transactions of the Association for Computational Linguistics 6: 587-604. https://doi.org/10.1162/tacl_a_00041.
+
+Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2020. "The Dataset Nutrition Label." Data Protection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.
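+One lightweight way to operationalize such documentation is to ship a small, machine-readable record alongside the data. The sketch below is loosely inspired by the Data Cards and datasheets formats cited above; the field names and values are illustrative assumptions rather than a standard schema.
+```python
+# Sketch: a minimal machine-readable "data card" stored next to the dataset.
+# Field names and values are illustrative only.
+import json
+
+data_card = {
+    "name": "doorbell-scenes-v1",                # hypothetical dataset name
+    "collection": "Frames sampled from consenting beta users, 2023",
+    "labeling": "Two annotators per frame; disagreements adjudicated by a third",
+    "intended_uses": ["person and package detection on doorbell cameras"],
+    "out_of_scope_uses": ["face identification", "surveillance analytics"],
+    "representation": {"night_frames": 0.31, "rain_or_snow": 0.12},
+    "license": "research-only; see LICENSE.txt",
+}
+
+with open("data_card.json", "w") as f:
+    json.dump(data_card, f, indent=2)
+```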
+Keeping track of data provenance (essentially the origins and the journey of each data point through the data pipeline) is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if an ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue lies with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency doesn't just help in debugging the system; it also plays a crucial role in enhancing overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model's performance and its acceptability among end-users.
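+In practice, provenance can be captured as structured metadata attached to each example as it moves through the pipeline. The sketch below shows one minimal way to record it; the fields and example values are assumptions for illustration rather than a prescribed format.
+```python
+# Sketch: a minimal per-example provenance record carried through the pipeline.
+# Field names and example values are illustrative assumptions.
+from dataclasses import dataclass, field
+from typing import List
+
+@dataclass
+class ProvenanceRecord:
+    example_id: str
+    source: str                    # e.g. device ID, URL, or study site
+    collected_at: str              # ISO 8601 timestamp
+    consent_reference: str         # link to the consent/licensing terms
+    processing_steps: List[str] = field(default_factory=list)
+
+rec = ProvenanceRecord(
+    example_id="clip_000123",
+    source="clinic_site_B/recorder_07",
+    collected_at="2023-05-14T09:32:00Z",
+    consent_reference="consent_form_v3",
+)
+rec.processing_steps.append("resampled to 16 kHz")
+rec.processing_steps.append("silence trimmed")
+print(rec)
+```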
+When producing documentation, creators should also clearly specify how users can access the dataset and how the dataset will be maintained over time. For example, users may need to undergo training or receive special permission from the creators before accessing a dataset containing protected information, as is the case with many medical datasets. In some cases, users may not be permitted to access the data directly and must instead submit their model to be trained on the dataset creators' hardware, following a federated learning setup (Aledhari et al. 2020). Creators may also describe how long the dataset will remain accessible, how users can submit feedback on any errors that they discover, and whether there are plans to update the dataset.
+
+Aledhari, Mohammed, Rehma Razzak, Reza M. Parizi, and Fahad Saeed. 2020. "Federated Learning: A Survey on Enabling Technologies, Protocols, and Applications." IEEE Access 8: 140699-725. https://doi.org/10.1109/access.2020.3013541.
+Some laws and regulations also promote data transparency through new requirements for organizations:
-- Scalability
-- Data Security and Privacy
-- Data Bias and Representativity
+- General Data Protection Regulation (GDPR) in the European Union: It establishes strict requirements for processing and protecting the personal data of EU citizens. It mandates plain-language privacy policies that clearly explain what data is collected, why it is used, how long it is stored, and with whom it is shared. GDPR also requires that privacy notices include details on the legal basis for processing, data transfers, retention periods, rights to access and deletion, and contact information for data controllers.
+- California's Consumer Privacy Act (CCPA): CCPA requires clear privacy policies and opt-out rights for the sale of personal data. Significantly, it also establishes rights for consumers to request their specific data be disclosed. Businesses must provide copies of collected personal information along with details on what it is used for, what categories are collected, and what third parties receive it. Consumers can identify data points they believe are inaccurate. The law represents a major step forward in empowering personal data access.
+There are several challenges in ensuring data transparency, especially because it requires significant time and financial resources. Data systems are also complex, and full transparency can be difficult to achieve in these cases. Too much transparency may also overwhelm consumers with excessive detail. Finally, it is important to balance the tradeoff between transparency and privacy.
-
-6.12 Promoting Transparency
-Explanation: We explain that as we increasingly use these systems built on the foundation of data, we need to have more transparency in the ecosystem.
+
+6.10 Licensing
+Many high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.
+ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without explicit permission, sparking ethical concerns (Prabhu and Birhane, 2020)\(^{7}\). For corporations, accessing the ImageNet dataset requires registration and adherence to its terms of use, which restrict commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.
7 Birhane, Abeba, and Vinay Uday Prabhu. "Large image datasets: A pyrrhic win for computer vision?." 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2021.
+The legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.
+New Data Regulations and Their Implications
+New data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU's Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:
+
+- Classifies AI systems by risk.
+- Mandates development and usage prerequisites.
+- Emphasizes data quality, transparency, human oversight, and accountability.
+
+Additionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance. Key elements include the prohibition of AI systems posing "unacceptable" risks, stringent conditions for high-risk systems, and minimal obligations for "limited risk" AI systems. The proposed European AI Board will oversee and ensure efficient regulation implementation.
+Challenges in Assembling ML Training Datasets
+Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing\(^{8}\) or public-private data collaborations could greatly accelerate industry progress and strengthen ethical standards.
8 Sonnenburg, Soren, et al. "The need for open source software in machine learning." (2007): 2443-2466.
+In some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or to protect sensitive information. For example, a dataset of user information may contain names, contact details, and other identifying data that must be removed, sometimes well after the dataset has already been sourced and used to train models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.
+Data collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.
+For instance, below is an example request from Common Voice users to remove their information:
+
+
+
+
+
+
+Thank you for downloading the Common Voice dataset. Account holders are free to request deletion of their voice clips at any time. We action this on our side for all future releases and are legally obligated to inform those who have downloaded a historic release so that they can also take action.
+You are receiving this message because one or more account holders have requested that their voice clips be deleted. Their clips are part of the dataset that you downloaded and are associated with the hashed IDs listed below. Please delete them from your downloads in order to fulfill your third party data privacy obligations.
+Thank you for your timely completion.
-- Definition and Importance of Transparency in Data Engineering
-- Transparency in Data Collection and Sourcing
-- Transparency in Data Processing and Analysis
-- Transparency in Model Building and Deployment
-- Transparency in Data Sharing and Usage
-- Tools and Techniques for Ensuring Transparency
-
+4497f1df0c6c4e647fa4354ad07a40075cc95a210dafce49ce0c35cd252 e4ec0fad1034e0cc3af869499e6f60ce315fe600ee2e9188722de906f909a21e0ee57
+97a8f0a1df086bd5f76343f5f4a511ae39ec98256a0ca48de5c54bc5771 d8c8e32283a11056147624903e9a3ac93416524f19ce0f9789ce7eef2262785cf3af7
+969ea94ac5e20bdd7a098747f5dc2f6d203f6b659c0c3b6257dc790dc34 d27ac3f2fafb3910f1ec8d7ebea38c120d4b51688047e352baa957cc35f0f5c69b112
+6b5460779f644ad39deffeab6edf939547f206596089d554984abff3d36 a4ecc06e66870958e62299221c09af8cd82864c626708371d72297eaea5955d8e46a9
+33275ff207a27708bd1187ff950888da592cac507e01e922c4b9a07d3f6 c2c3fe2ade429958c3702294f446bfbad8c4ebfefebc9e157d358ccc6fcf5275e7564
+
+
+
+
+
+
+
+
+
+The ability to remove data from a dataset enables its creators to uphold legal and ethical obligations around data usage and privacy. However, this ability has important limitations. Some models may already have been trained on the dataset, and there is no clear or known way to eliminate a particular data sample's effect from the trained network; there is no erase mechanism. This raises the question: should the model be re-trained from scratch each time a sample is removed? That is a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate\(^{9,10,11}\) its impact on the model's behavior. New research is needed on the effects of data removal on already-trained models and on whether full retraining is necessary to avoid retaining artifacts of deleted data. This is an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.
9 Ginart, Antonio, et al. "Making ai forget you: Data deletion in machine learning." Advances in neural information processing systems 32 (2019).
10 Sekhari, Ayush, et al. "Remember what you want to forget: Algorithms for machine unlearning." Advances in Neural Information Processing Systems 34 (2021): 18075-18086.
11 Guo, Chuan, et al. "Certified data removal from machine learning models." arXiv preprint arXiv:1911.03030 (2019).
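+As a purely mechanical first step, downstream users can at least drop the flagged clips from their local copy of the dataset, independent of any retraining or unlearning decision. The sketch below assumes a metadata table with a hashed client_id column, as in the Common Voice message above; the file and column names are illustrative assumptions.
+```python
+# Sketch: remove clips whose hashed IDs appear in a deletion request.
+# File names and the "client_id" column are illustrative assumptions.
+import pandas as pd
+
+with open("deletion_request.txt") as f:
+    ids_to_delete = {line.strip() for line in f if line.strip()}
+
+clips = pd.read_csv("clips_metadata.csv")      # one row per downloaded clip
+remaining = clips[~clips["client_id"].isin(ids_to_delete)]
+remaining.to_csv("clips_metadata.csv", index=False)
+
+print(f"Removed {len(clips) - len(remaining)} clips per the deletion request")
+```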
+Dataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone engaged in building datasets.
-
-6.13 Licensing
-Explanation: This section emphasizes why one must understand data licensing issues before they start using the data to train the models.
-
-- Metadata
-- Data Nutrition Project
-- Understanding Licensing
-
+
+6.11 Conclusion
+Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing, and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means including existing datasets, web scraping, crowdsourcing, and synthetic data generation. Each approach involves tradeoffs between factors like cost, speed, privacy, and specificity.
+Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use.
+Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems, including for embedded and tinyML applications.
-
-6.14 Conclusion
-Explanation: Close up the chapter with a summary of the key topics that we have covered in this section.
-
-- The Future of Data Engineering in Embedded AI
-- Key Takeaways
-
+
+6.12 Helpful References
+1. [3 big problems with datasets in AI and machine learning](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/)
+2. [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)
+3. [Data Engineering for Everyone](https://arxiv.org/abs/2102.11447)
+4. [DataPerf: Benchmarks for Data-Centric AI Development](https://arxiv.org/abs/2207.10062)
+5. [Deep Spoken Keyword Spotting: An Overview](https://arxiv.org/abs/2111.10592)
+6. ["Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)
+7. [Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)](https://arxiv.org/abs/2003.12206)
+8. [LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf)
+9. [Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)
+10. [Multilingual Spoken Words Corpus](https://openreview.net/pdf?id=c20jiJ5K2H)
+11. [OpenImages](https://storage.googleapis.com/openimages/web/index.html)
+12. [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749)
+13. [Small-footprint keyword spotting using deep neural networks](https://ieeexplore.ieee.org/abstract/document/6854370?casa_token=XD6SL8Um1Y0AAAAA:ZxqFThJWLlwDrl1IA374t_YzEvwHNNR-pTWiWV9pyr85rsl-ZZ5BpkElyHo91d3_l8yU0IVIgg)
+14. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+14 Embedded MLOps
+
+
+
+
+
+
+
+
+
+
+14.1 Introduction
+Explanation: This subsection sets the groundwork for the discussions to follow, elucidating the fundamental concept of MLOps and its critical role in enhancing the efficiency, reliability, and scalability of embedded AI systems. It outlines the unique characteristics of implementing MLOps in an embedded context, emphasizing its significance in the streamlined deployment and management of machine learning models.
+
+- Overview of MLOps
+- The importance of MLOps in the embedded domain
+- Unique challenges and opportunities in embedded MLOps
+
+
+
+14.2 Deployment Environments
+Explanation: This section focuses on different environments where embedded AI systems can be deployed. It will delve into aspects like edge devices, cloud platforms, and hybrid environments, offering insights into the unique characteristics and considerations of each.
+
+- Cloud-based deployment: Features and benefits
+- Edge computing: Characteristics and applications
+- Hybrid environments: Combining the best of edge and cloud computing
+- Considerations for selecting an appropriate deployment environment
+
+
+
+14.3 Deployment Strategies
+Explanation: Here, readers will be introduced to various deployment strategies that facilitate a smooth transition from development to production. It discusses approaches such as blue-green deployments, canary releases, and rolling deployments, which can help in maintaining system stability and minimizing downtime during updates.
+
+- Overview of different deployment strategies
+- Blue-green deployments: Definition and benefits
+- Canary releases: Phased rollouts and monitoring
+- Rolling deployments: Ensuring continuous service availability
+- Strategy selection: Factors to consider
+
+
+
+14.4 Workflow Automation
+Explanation: Automation is at the heart of MLOps, helping to streamline workflows and enhance efficiency. This subsection highlights the significance of workflow automation in embedded MLOps, discussing various strategies and techniques for automating tasks such as testing, deployment, and monitoring, fostering a faster and error-free development lifecycle.
+
+- Automated testing: unit tests, integration tests
+- Automated deployment: scripting, configuration management
+- Continuous monitoring: setting up automated alerts and dashboards
+- Benefits of workflow automation: speed, reliability, repeatability
+
+
+
+14.5 Model Versioning
+Explanation: Model versioning is a pivotal aspect of MLOps, facilitating the tracking and management of different versions of machine learning models throughout their lifecycle. This subsection emphasizes the importance of model versioning in embedded systems, where memory and computational resources are limited, offering strategies for effective version management and rollback.
+
+- Importance of versioning in machine learning pipelines
+- Tools for model versioning: DVC, MLflow
+- Strategies for version control: naming conventions, metadata tagging
+- Rollback strategies: handling model regressions and rollbacks
+
+
+
+14.6 Model Monitoring and Maintenance
+Explanation: The process of monitoring and maintaining deployed models is crucial to ensure their long-term performance and reliability. This subsection underscores the significance of proactive monitoring and maintenance in embedded systems, discussing methodologies for monitoring model health, performance metrics, and implementing routine maintenance tasks to ensure optimal functionality.
+
+- The importance of monitoring deployed AI models
+- Setting up monitoring systems: tools and techniques
+- Tracking model performance: accuracy, latency, resource usage
+- Maintenance strategies: periodic updates, fine-tuning
+- Alerts and notifications: Setting up mechanisms for timely responses to issues
+- Over the air updates
+- Responding to anomalies: troubleshooting and resolution strategies
+
+
+
+14.7 Security and Compliance
+Explanation: Security and compliance are paramount in MLOps, safeguarding sensitive data and ensuring adherence to regulatory requirements. This subsection illuminates the critical role of implementing security measures and ensuring compliance in embedded MLOps, offering insights into best practices for data protection, access control, and regulatory adherence.
+
+- Security considerations in embedded MLOps: data encryption, secure communications
+- Compliance requirements: GDPR, HIPAA, and other regulations
+- Strategies for ensuring compliance: documentation, audits, training
+- Tools for security and compliance management: SIEM systems, compliance management platforms
+
+
+
+14.8 Conclusion
+Explanation: As we wrap up this chapter, we consolidate the key takeaways regarding the implementation of MLOps in the embedded domain. This final section seeks to furnish readers with a holistic view of the principles and practices of embedded MLOps, encouraging a thoughtful approach to adopting MLOps strategies in their projects, with a glimpse into the potential future trends in this dynamic field.
+
+- Recap of key concepts and best practices in embedded MLOps
+- Challenges and opportunities in implementing MLOps in embedded systems
+- Future directions: emerging trends and technologies in embedded MLOps
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/mlworkflow.html b/mlworkflow.html
new file mode 100644
index 000000000..a8d574c84
--- /dev/null
+++ b/mlworkflow.html
@@ -0,0 +1,757 @@
+
+
+
+
+
+
+
+
+
+
+Embedded AI: Principles, Algorithms, and Applications - 5 ML Workflow
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+5 ML Workflow
+
+
+
+
+
+
+
+
+
+In this chapter, we're going to learn about the machine learning workflow. It will set the stage for the later chapters that dive into the details. But to prevent ourselves from missing the forest for the trees, this chapter gives a high-level overview of the steps involved in the ML workflow.
+The ML workflow is a systematic and structured approach that guides professionals and researchers in developing, deploying, and maintaining ML models. This workflow is generally delineated into several critical stages, each contributing towards the effective development of intelligent systems.
+Here's a broad outline of the stages involved:
+
+5.1 Overview
+A machine learning (ML) workflow is the process of developing, deploying, and maintaining ML models. It typically consists of the following steps, illustrated with a brief code sketch after the list:
+
+- Define the problem. What are you trying to achieve with your ML model? Do you want to classify images, predict customer churn, or generate text? Once you have a clear understanding of the problem, you can start to collect data and choose a suitable ML algorithm.
+- Collect and prepare data. ML models are trained on data, so it's important to collect a high-quality dataset that is representative of the real-world problem you're trying to solve. Once you have your data, you need to clean it and prepare it for training. This may involve tasks such as removing outliers, imputing missing values, and scaling features.
+- Choose an ML algorithm. There are many different ML algorithms available, each with its own strengths and weaknesses. The best algorithm for your project will depend on the type of data you have and the problem you're trying to solve.
+- Train the model. Once you have chosen an ML algorithm, you need to train the model on your prepared data. This process can take some time, depending on the size and complexity of your dataset.
+- Evaluate the model. Once the model is trained, you need to evaluate its performance on a held-out test set. This will give you an idea of how well the model will generalize to new data.
+- Deploy the model. Once you're satisfied with the performance of the model, you can deploy it to production. This may involve integrating the model into a software application or making it available as a web service.
+- Monitor and maintain the model. Once the model is deployed, you need to monitor its performance and make updates as needed. This is because the real world is constantly changing, and your model may need to be updated to reflect these changes.
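+To ground these steps, here is a minimal, hedged sketch of the core loop (prepare data, train, evaluate, persist) using scikit-learn on a toy dataset; a real project would substitute its own data pipeline, model selection, and deployment and monitoring infrastructure.
+```python
+# Minimal sketch of the core ML workflow: data prep, training, evaluation,
+# and persisting the model for deployment. Uses a toy dataset for illustration.
+import joblib
+from sklearn.datasets import load_iris
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import accuracy_score
+from sklearn.model_selection import train_test_split
+from sklearn.preprocessing import StandardScaler
+
+# Steps 1-2: define the problem and prepare the data (toy stand-in here)
+X, y = load_iris(return_X_y=True)
+X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
+scaler = StandardScaler().fit(X_train)
+
+# Steps 3-4: choose an algorithm and train it
+model = LogisticRegression(max_iter=1000)
+model.fit(scaler.transform(X_train), y_train)
+
+# Step 5: evaluate on held-out data
+preds = model.predict(scaler.transform(X_test))
+print("Test accuracy:", accuracy_score(y_test, preds))
+
+# Steps 6-7: persist for deployment; monitoring happens once in production
+joblib.dump({"model": model, "scaler": scaler}, "model.joblib")
+```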
+
+The ML workflow is an iterative process. Once you have deployed a model, you may find that it needs to be retrained on new data or that the algorithm needs to be adjusted. It's important to monitor the performance of your model closely and make changes as needed to ensure that it is still meeting your needs. In addition to the above steps, there are a number of other important considerations for ML workflows, such as:
+
+- Version control: It's important to track changes to your code and data so that you can easily reproduce your results and revert to previous versions if necessary.
+- Documentation: It's important to document your ML workflow so that others can understand and reproduce your work.
+- Testing: It's important to test your ML workflow thoroughly to ensure that it is working as expected.
+- Security: It's important to consider the security of your ML workflow and data, especially if you are deploying your model to production.
+
+
+
+5.2 General vs. Embedded AI
+The ML workflow delineated above serves as a comprehensive guide applicable broadly across various platforms and ecosystems, encompassing cloud-based solutions, edge computing, and tinyML. However, when we examine the nuances of the general ML workflow and contrast it with the workflow in Embedded AI environments, we encounter a series of intricate differences and complexities. These nuances not only make the embedded AI workflow a challenging and captivating domain but also open avenues for remarkable innovations and advancements.
+Now, let's explore these differences in detail:
+
+- Resource Optimization:
+
+- General ML Workflow: Generally has the luxury of substantial computational resources available in cloud or data center environments. It focuses more on model accuracy and performance.
+- Embedded AI Workflow: Needs meticulous planning and execution to optimize the model's size and computational demands, as models have to operate within the limited resources available in embedded systems. Techniques like model quantization and pruning become essential (see the quantization sketch after this list).
+
+- Real-time Processing:
+
+- General ML Workflow: The emphasis on real-time processing is usually less, and batch processing of data is quite common.
+- Embedded AI Workflow: Focuses heavily on real-time data processing, necessitating a workflow where low latency and rapid execution are a priority, especially in applications like autonomous driving and industrial automation.
+
+- Data Management and Privacy:
+
+- General ML Workflow: Data is typically processed in centralized locations, sometimes requiring extensive data transfer, with a focus on securing data during transit and storage.
+- Embedded AI Workflow: Promotes edge computing, which facilitates data processing closer to the source, reducing data transmission needs and enhancing privacy by keeping sensitive data localized.
+
+- Hardware-Software Integration:
+
+- General ML Workflow: Often operates on general-purpose hardware platforms with software development happening somewhat independently.
+- Embedded AI Workflow: Involves a tighter hardware-software co-design where both are developed in tandem to achieve optimal performance and efficiency, integrating custom chips or utilizing hardware accelerators.
+
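+For example, here is a hedged sketch of post-training (dynamic-range) quantization with the TensorFlow Lite converter; the SavedModel directory is a placeholder for a model you have already trained, and a production flow would typically add a representative dataset for full integer quantization.
+```python
+# Sketch: shrink a trained TensorFlow model for an embedded target using
+# post-training dynamic-range quantization. The SavedModel path is a placeholder.
+import tensorflow as tf
+
+converter = tf.lite.TFLiteConverter.from_saved_model("kws_saved_model")
+converter.optimizations = [tf.lite.Optimize.DEFAULT]  # enable quantization
+tflite_model = converter.convert()
+
+with open("kws_model_quant.tflite", "wb") as f:
+    f.write(tflite_model)
+
+print(f"Quantized model size: {len(tflite_model) / 1024:.1f} KB")
+```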
+
+
+
+5.3 Roles & Responsibilities
+As we work through the various tasks at hand, you will realize that there is a lot of complexity. Creating a machine learning solution, particularly for embedded AI systems, is a multidisciplinary endeavor involving various experts and specialists. Here is a list of personnel that are typically involved in the process, along with brief descriptions of their roles:
+Project Manager:
+
+- Coordinates and manages the overall project.
+- Ensures all team members are working synergistically.
+- Responsible for project timelines and milestones.
+
+Domain Experts:
+
+- Provide insights into the specific domain where the AI system will be implemented.
+- Help in defining project requirements and constraints based on domain-specific knowledge.
+
+Data Scientists:
+
+- Specialize in analyzing data to develop machine learning models.
+- Responsible for data cleaning, exploration, and feature engineering.
+
+Machine Learning Engineers:
+
+- Focus on the development and deployment of machine learning models.
+- Collaborate with data scientists to optimize models for embedded systems.
+
+Data Engineers:
+
+- Responsible for managing and optimizing data pipelines.
+- Work on the storage and retrieval of data used for machine learning model training.
+
+Embedded Systems Engineers:
+
+- Focus on integrating machine learning models into embedded systems.
+- Optimize system resources for running AI applications.
+
+Software Developers:
+
+- Develop software components that interface with the machine learning models.
+- Responsible for implementing APIs and other integration points for the AI system.
+
+Hardware Engineers:
+
+- Involved in designing and optimizing the hardware that hosts the embedded AI system.
+- Collaborate with embedded systems engineers to ensure compatibility.
+
+UI/UX Designers:
+
+- Design the user interface and experience for interacting with the AI system.
+- Focus on user-centric design and ensuring usability.
+
+Quality Assurance (QA) Engineers:
+
+- Responsible for testing the overall system to ensure it meets quality standards.
+- Work on identifying bugs and issues before the system is deployed.
+
+Ethicists and Legal Advisors:
+
+- Consult on the ethical implications of the AI system.
+- Ensure compliance with legal and regulatory requirements related to AI.
+
+Operations and Maintenance Personnel:
+
+- Responsible for monitoring the system after deployment.
+- Work on maintaining and upgrading the system as needed.
+
+Security Specialists:
+
+- Focus on ensuring the security of the AI system.
+- Work on identifying and mitigating potential security vulnerabilities.
+
+Don't worry! You don't have to be a one-stop ninja.
+Understanding the diversified roles and responsibilities is paramount in the journey to building a successful machine learning project. As we traverse the upcoming chapters, we will wear the different hats, embracing the essence and expertise of each role described herein. This immersive method nurtures a deep-seated appreciation for the inherent complexities, thereby facilitating an encompassing grasp of the multifaceted dynamics of embedded AI projects.
+Moreover, this well-rounded insight promotes not only seamless collaboration and unified efforts but also fosters an environment ripe for innovation. It enables us to identify areas where cross-disciplinary insights might foster novel thoughts, nurturing ideas and ushering in breakthroughs in the field. Additionally, being aware of the intricacies of each role allows us to anticipate potential obstacles and strategize effectively, guiding the project towards triumph with foresight and detailed understanding.
+As we advance, we encourage you to hold a deep appreciation for the amalgamation of expertise that contributes to the fruition of a successful machine learning initiative. In later discussions, particularly when we delve into MLOps, we will examine these different facets or personas in greater detail. It's worth noting at this point that the range of topics touched upon might seem overwhelming. This endeavor aims to provide you with a comprehensive view of the intricacies involved in constructing an embedded AI system, without the expectation of mastering every detail personally.
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
\ No newline at end of file
diff --git a/references.html b/references.html
index 2f80a437a..16c1c771a 100644
--- a/references.html
+++ b/references.html
@@ -441,6 +441,18 @@ References
+
+Aledhari, Mohammed, Rehma Razzak, Reza M. Parizi, and Fahad Saeed. 2020.
+"Federated Learning: A Survey on Enabling Technologies, Protocols,
+and Applications." IEEE Access 8: 140699-725. https://doi.org/10.1109/access.2020.3013541.
+
+
+Ardila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael
+Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers,
+and Gregor Weber. 2020. "Common Voice: A Massively-Multilingual
+Speech Corpus." Proceedings of the 12th Conference on
+Language Resources and Evaluation, May, 4218-22.
+
ARM.com. âThe Future Is Being Built on Arm: Market Diversification
Continues to Drive Strong Royalty and Licensing Growth as Ecosystem
@@ -456,12 +468,33 @@ References
The Datacenter as a Computer: Designing Warehouse-Scale
Machines. Springer Nature.
+
+Bender, Emily M., and Batya Friedman. 2018. "Data Statements for
+Natural Language Processing: Toward Mitigating System Bias and Enabling
+Better Science." Transactions of the Association for
+Computational Linguistics 6: 587-604. https://doi.org/10.1162/tacl_a_00041.
+
+
+Chapelle, O., B. Scholkopf, and A. Zien Eds. 2009.
+"Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book
+Reviews]." IEEE Transactions on Neural Networks 20 (3):
+542-42. https://doi.org/10.1109/tnn.2009.2015974.
+
+
+Gebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman
+Vaughan, Hanna Wallach, Hal Daumé III, and Kate Crawford. 2021.
+"Datasheets for Datasets." Communications of the
+ACM 64 (12): 86-92. https://doi.org/10.1145/3458723.
+
Goodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020.
âGenerative Adversarial Networks.â Communications of
the ACM 63 (11): 139â44.
+
+Google. n.d. Google. Google. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/.
+
Han, Song, Huizi Mao, and William J. Dally. 2016. âDeep
Compression: Compressing Deep Neural Networks with Pruning, Trained
@@ -473,6 +506,11 @@ References
Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, 770â78.
+
+Holland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia
+Chmielinski. 2020. "The Dataset Nutrition Label." Data
+Protection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.
+
Howard, Andrew G, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun
Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017.
@@ -485,6 +523,13 @@ References
AlexNet-Level Accuracy with 50x Fewer Parameters and< 0.5 MB Model
Size.â arXiv Preprint arXiv:1602.07360.
+
+Johnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur
+Sridhar, Karl Rosaen, and Ram Vasudevan. 2017. "Driving in the
+Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real
+World Tasks?" 2017 IEEE International Conference on Robotics
+and Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.
+
Jouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav
Agrawal, Raminder Bajwa, Sarah Bates, et al. 2017. âIn-Datacenter
@@ -492,6 +537,11 @@ References
Proceedings of the 44th Annual International Symposium on Computer
Architecture, 1â12.
+
+Krishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022.
+"Self-Supervised Learning in Medicine and Healthcare."
+Nature Biomedical Engineering 6 (12): 1346-52. https://doi.org/10.1038/s41551-022-00914-1.
+
Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.
âImagenet Classification with Deep Convolutional Neural
@@ -509,6 +559,25 @@ References
Computing.â IEEE Transactions on Wireless Communications
19 (1): 447â57.
+
+Northcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021.
+"Pervasive Label Errors in Test Sets Destabilize Machine Learning
+Benchmarks." arXiv, March. https://doi.org/10.48550/arXiv.2103.14749.
+
+
+Pushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022.
+"Data Cards: Purposeful and Transparent Dataset Documentation for
+Responsible AI." 2022 ACM Conference on Fairness,
+Accountability, and Transparency. https://doi.org/10.1145/3531146.3533231.
+
+
+Ratner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and
+Christopher Ré. 2018. "Snorkel Metal: Weak Supervision for
+Multi-Task Learning." Proceedings of the Second Workshop on
+Data Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.
+
Rosenblatt, Frank. 1957. The Perceptron, a Perceiving and
Recognizing Automaton Project Para. Cornell Aeronautical
@@ -519,6 +588,12 @@ References
âLearning Representations by Back-Propagating Errors.â
Nature 323 (6088): 533â36.
+
+Sheng, Victor S., and Jing Zhang. 2019. "Machine Learning with
+Crowdsourcing: A Brief Summary of the Past Research and Future
+Directions." Proceedings of the AAAI Conference on Artificial
+Intelligence 33 (01): 9837-43. https://doi.org/10.1609/aaai.v33i01.33019837.
+
Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion
Jones, Aidan N Gomez, Ćukasz Kaiser, and Illia Polosukhin. 2017.
diff --git a/search.json b/search.json
index 2d82000d4..d33755f17 100644
--- a/search.json
+++ b/search.json
@@ -25,7 +25,7 @@
"href": "contributors.html",
"title": "Contributors",
"section": "",
- "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, and time to enhance both the content and codebase of this project. Below you will find a list of all contributors. If you would like to contribute to this project, please see our GitHub page.\n\n\n\n\n\n\n\n\nJessica Quayeđ\n\n\nMarcelo Rovaiđ\n\n\nVijay Janapa Reddiđ\n\n\nShvetank Prakashđ\n\n\nMatthew Stewartđ\n\n\nIkechukwu Uchenduđ"
+ "text": "We extend our sincere thanks to the diverse group of individuals who have generously contributed their expertise, insights, and time to enhance both the content and codebase of this project. Below you will find a list of all contributors. If you would like to contribute to this project, please see our GitHub page.\n\n\n\n\n\n\n\n\nsjohri20đ\n\n\nishapirađ\n\n\nJessica Quayeđ\n\n\nMatthew Stewartđ\n\n\nIkechukwu Uchenduđ\n\n\nVijay Janapa Reddiđ\n\n\noishibđ\n\n\n\n\nShvetank Prakashđ\n\n\nMarcelo Rovaiđ"
},
{
"objectID": "copyright.html",
@@ -284,98 +284,84 @@
"href": "data_engineering.html#introduction",
"title": "6Â Data Engineering",
"section": "6.1 Introduction",
- "text": "6.1 Introduction\nExplanation: This section establishes the groundwork, defining data engineering and explaining its importance and role in Embedded AI. A well-rounded introduction will help in establishing the foundation for the readers.\n\nDefinition and Importance of Data Engineering in AI\nRole of Data Engineering in Embedded AI\nSynergy with Machine Learning and Deep Learning"
+ "text": "6.1 Introduction\nData is the lifeblood of AI systems. Without good data, even the most advanced machine learning algorithms will fail. In this section, we will dive into the intricacies of building high-quality datasets to fuel our AI models. Data engineering encompasses the processes of collecting, storing, processing, and managing data for training machine learning models.\nDataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy\\(^{1}\\), aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case.\nLooking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.1 It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such higher-order gaps are not immediately obvious but can critically impact model performance.1Â Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. \"Mel Frequency Cepstral Coefficient and its applications: A Review.\" IEEE Access (2022).\nCreating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort.\nWe begin by discussing data collection: Where do we source data, and how do we gather it? Options range from scraping the web, accessing APIs, utilizing sensors and IoT devices, to conducting surveys and gathering user input. These methods reflect real-world practices. Next, we delve into data labeling, including considerations for human involvement. Weâll discuss the trade-offs and limitations of human labeling and explore emerging methods for automated labeling. Following that, weâll address data cleaning and preprocessing, a crucial yet frequently undervalued step in preparing raw data for AI model training. Data augmentation comes next, a strategy for enhancing limited datasets by generating synthetic samples. This is particularly pertinent for embedded systems, as many use cases donât have extensive data repositories readily available for curation. Synthetic data generation emerges as a viable alternative, though it comes with its own set of advantages and disadvantages. Weâll also touch upon dataset versioning, emphasizing the importance of tracking data modifications over time. Data is ever-evolving; hence, itâs imperative to devise strategies for managing and storing expansive datasets. 
By the end of this section, youâll possess a comprehensive understanding of the entire data pipeline, from collection to storage, essential for operationalizing AI systems. Letâs embark on this journey!"
},
{
- "objectID": "data_engineering.html#problem",
- "href": "data_engineering.html#problem",
+ "objectID": "data_engineering.html#problem-definition",
+ "href": "data_engineering.html#problem-definition",
"title": "6Â Data Engineering",
- "section": "6.2 Problem",
- "text": "6.2 Problem\nExplanation: This section is a crucial starting point in any data engineering project, as it lays the groundwork for the projectâs trajectory and ultimate success. Hereâs a brief explanation of why each subsection within the âProblem Definitionâ is important:\n\nIdentifying the Problem\nSetting Clear Objectives\nBenchmarks for Success\nStakeholder Engagement and Understanding\nUnderstanding the Constraints and Limitations of Embedded Systems"
+ "section": "6.2 Problem Definition",
+ "text": "6.2 Problem Definition\nIn many domains of machine learning, while sophisticated algorithms take center stage, the fundamental importance of data quality is often overlooked. This neglect gives rise to âData Cascadesâ â events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities.\n\n\n\nA visual representation of the stages in the machine learning pipeline and the potential pitfalls, illustrating how data quality lapses can lead to cascading negative consequences throughout the process.\n\n\nDespite many ML professionals recognizing the importance of data, numerous practitioners report facing these cascades. This highlights a systemic issue: while the allure of developing advanced models remains, data is often underappreciated.\nTake, for example, Keyword Spotting (KWS). KWS serves as a prime example of TinyML in action and is a critical technology behind voice-enabled interfaces on endpoint devices such as smartphones. Typically functioning as lightweight wake-word engines, these systems are consistently active, listening for a specific phrase to trigger further actions. When we say the phrases âOk Googleâ or âAlexa,â this initiates a process on a microcontroller embedded within the device. Despite their limited resources, these microcontrollers play a pivotal role in enabling seamless voice interactions with devices, often operating in environments with high levels of ambient noise. The uniqueness of the wake-word helps minimize false positives, ensuring that the system is not triggered inadvertently.\nIt is important to appreciate that these keyword spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as the breaking of glass. This evolution is geared towards creating intelligent devices capable of understanding and responding to a myriad of vocal commands, heralding a future where even household appliances can be controlled through voice interactions.\n\n\n\nThe seamless integration of Keyword Spotting technology allows users to command their devices with simple voice prompts, even in ambient noise environments\n\n\nBuilding a reliable KWS model is not a straightforward task. It demands a deep understanding of the deployment scenario, encompassing where and how these devices will operate. For instance, a KWS modelâs effectiveness is not just about recognizing a word; itâs about discerning it among various accents and background noises, whether in a bustling cafe or amid the blaring sound of a television in a living room or a kitchen where these devices are commonly found. Itâs about ensuring that a whispered âAlexaâ in the dead of night or a shouted âOk Googleâ in a noisy marketplace are both recognized with equal precision.\nMoreover, many of the current KWS voice assistants support a limited number of languages, leaving a substantial portion of the worldâs linguistic diversity unrepresented. This limitation is partly due to the difficulty in gathering and monetizing data for languages spoken by smaller populations. 
The long-tail distribution of languages implies that many languages have limited data available, making the development of supportive technologies challenging.\nThis level of accuracy and robustness hinges on the availability of data, quality of data, ability to label the data correctly, and ensuring transparency of the data for the end userâall before the data is used to train the model. But it all begins with a clear understanding of the problem statement or definition.\nGenerally, in ML, problem definition has a few key steps:\n\nIdentifying the problem definition clearly\nSetting clear objectives\nEstablishing success benchmark\nUnderstanding end-user engagement/use\nUnderstanding the constraints and limitations of deployment\nFollowed by finally doing the data collection.\n\nLaying a solid foundation for a project is essential for its trajectory and eventual success. Central to this foundation is first identifying a clear problem, such as ensuring that voice commands in voice assistance systems are recognized consistently across varying environments. Clear objectives, like creating representative datasets for diverse scenarios, provide a unified direction. Benchmarks, such as system accuracy in keyword detection, offer measurable outcomes to gauge progress. Engaging with stakeholders, from end-users to investors, provides invaluable insights and ensures alignment with market needs. Additionally, when delving into areas like voice assistance, understanding platform constraints is pivotal. Embedded systems, such as microcontrollers, come with inherent limitations in processing power, memory, and energy efficiency. Recognizing these limitations ensures that functionalities, like keyword detection, are tailored to operate optimally, balancing performance with resource conservation.\nIn this context, using KWS as an example, we can break each of the steps out as follows:\n\nIdentifying the Problem: At its core, KWS aims to detect specific keywords amidst a sea of ambient sounds and other spoken words. The primary problem is to design a system that can recognize these keywords with high accuracy, low latency, and minimal false positives or negatives, especially when deployed on devices with limited computational resources.\nSetting Clear Objectives: The objectives for a KWS system might include:\n\nAchieving a specific accuracy rate (e.g., 98% accuracy in keyword detection).\nEnsuring low latency (e.g., keyword detection and response within 200 milliseconds).\nMinimizing power consumption to extend battery life on embedded devices.\nEnsuring the modelâs size is optimized for the available memory on the device.\n\nBenchmarks for Success: Establish clear metrics to measure the success of the KWS system. This could include:\n\nTrue Positive Rate: The percentage of correctly identified keywords.\nFalse Positive Rate: The percentage of non-keywords incorrectly identified as keywords.\nResponse Time: The time taken from keyword utterance to system response.\nPower Consumption: Average power used during keyword detection.\n\nStakeholder Engagement and Understanding: Engage with stakeholders, which might include device manufacturers, hardware and software developers, and end-users. Understand their needs, capabilities, and constraints. 
For instance:\n\nDevice manufacturers might prioritize low power consumption.\nSoftware developers might emphasize ease of integration.\nEnd-users would prioritize accuracy and responsiveness.\n\nUnderstanding the Constraints and Limitations of Embedded Systems: Embedded devices come with their own set of challenges:\n\nMemory Limitations: KWS models need to be lightweight to fit within the memory constraints of embedded devices. Typically, KWS models might need to be as small as 16KB to fit in the always-on island of the SoC. Moreover, this is just the model size. Additional application code for pre-processing may also need to fit within the memory constraints.\nProcessing Power: The computational capabilities of embedded devices are limited (few hundred MHz of clock speed), so the KWS model must be optimized for efficiency.\nPower Consumption: Since many embedded devices are battery-powered, the KWS system must be power-efficient.\nEnvironmental Challenges: Devices might be deployed in various environments, from quiet bedrooms to noisy industrial settings. The KWS system must be robust enough to function effectively across these scenarios.\n\nData Collection and Analysis: For a KWS system, the quality and diversity of data are paramount. Considerations might include:\n\nVariety of Accents: Collect data from speakers with various accents to ensure wide-ranging recognition.\nBackground Noises: Include data samples with different ambient noises to train the model for real-world scenarios.\nKeyword Variations: People might either pronounce keywords differently or have slight variations in the wake word itself. Ensure the dataset captures these nuances.\n\nIterative Feedback and Refinement: Once a prototype KWS system is developed, itâs crucial to test it in real-world scenarios, gather feedback, and iteratively refine the model. This ensures that the system remains aligned with the defined problem and objectives. This is important because the deployment scenarios change over time as things evolve."
},
{
"objectID": "data_engineering.html#data-sourcing",
"href": "data_engineering.html#data-sourcing",
"title": "6Â Data Engineering",
"section": "6.3 Data Sourcing",
- "text": "6.3 Data Sourcing\nExplanation: This section delves into the first step in data engineering - gathering data. Understanding various data types and sources is vital for developing robust AI systems, especially in the context of embedded systems where resources might be limited.\n\nData Sources: crowdsourcing, pre-existing datasets etc.\nData Types: Structured, Semi-Structured, and Unstructured\nReal-time Data Processing in Embedded Systems"
+ "text": "6.3 Data Sourcing\nThe quality and diversity of data gathered is important for developing accurate and robust AI systems. Sourcing high-quality training data requires careful consideration of the objectives, resources, and ethical implications. Data can be obtained from various sources depending on the needs of the project:\n\n6.3.1 Pre-existing datasets\nPlatforms like Kaggle and UCI Machine Learning Repository provide a convenient starting point. Pre-existing datasets are a valuable resource for researchers, developers, and businesses alike. One of their primary advantages is cost-efficiency. Creating a dataset from scratch can be both time-consuming and expensive, so having access to ready-made data can save significant resources. Moreover, many of these datasets, like ImageNet, have become standard benchmarks in the machine learning community, allowing for consistent performance comparisons across different models and algorithms. This availability of data means that experiments can be started immediately without any delays associated with data collection and preprocessing. In a fast moving field like ML, this expediency is important.\nThe quality assurance that comes with popular pre-existing datasets is important to consider because several datasets have errors in them. For instance, the ImageNet dataset was found to have over 6.4% errors. Given their widespread use, any errors or biases in these datasets are often identified and rectified by the community. This assurance is especially beneficial for students and newcomers to the field, as they can focus on learning and experimentation without worrying about data integrity. Supporting documentation that often accompanies existing datasets is invaluable, though this generally applies only to widely used datasets. Good documentation provides insights into the data collection process, variable definitions, and sometimes even offers baseline model performances. This information not only aids understanding but also promotes reproducibility in research, a cornerstone of scientific integrity; currently there is a crisis around improving reproducibility in machine learning systems. When other researchers have access to the same data, they can validate findings, test new hypotheses, or apply different methodologies, thus allowing us to build on each otherâs work more rapidly.\nWhile platforms like Kaggle and UCI Machine Learning Repository are invaluable resources, itâs essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes these datasets do not reflect the real-world data.\nIn addition, bias, validity, and reproducibility issues may exist in these datasets and in recent years there is a growing awareness of these issues.\n\n\n6.3.2 Web Scraping\nWeb scraping refers to automated techniques for extracting data from websites. It typically involves sending HTTP requests to web servers, retrieving HTML content, and parsing that content to extract relevant information. Popular tools and frameworks for web scraping include Beautiful Soup, Scrapy, and Selenium. 
These tools offer different functionalities, from parsing HTML content to automating web browser interactions, especially for websites that load content dynamically using JavaScript.\nWeb scraping can be an effective way to gather large datasets for training machine learning models, particularly when human-labeled data is scarce. For computer vision research, web scraping enables the collection of massive volumes of images and videos. Researchers have used this technique to build influential datasets like ImageNet and OpenImages. For example, one could scrape e-commerce sites to amass product photos for object recognition, or social media platforms to collect user uploads for facial analysis. Even before ImageNet, MIT's LabelMe project scraped Flickr for over 63,000 annotated images covering hundreds of object categories.\nBeyond computer vision, web scraping supports the gathering of textual data for natural language tasks. Researchers can scrape news sites for sentiment analysis data, forums and review sites for dialogue systems research, or social media for topic modeling. For example, the training data for the chatbot ChatGPT was obtained by scraping much of the public internet. GitHub repositories were scraped to train GitHub's Copilot AI coding assistant.\nWeb scraping can also collect structured data like stock prices, weather data, or product information for analytical applications. Once data is scraped, it is essential to store it in a structured manner, often using databases or data warehouses. Proper data management ensures the usability of the scraped data for future analysis and applications.\nHowever, while web scraping offers numerous advantages, there are significant limitations and ethical considerations to bear in mind. Not all websites permit scraping, and violating these restrictions can lead to legal repercussions. It is also unethical and potentially illegal to scrape copyrighted material or private communications. Ethical web scraping mandates adherence to a website's \"robots.txt\" file, which outlines the sections of the site that can be accessed and scraped by automated bots. To deter automated scraping, many websites implement rate limits. If a bot sends too many requests in a short period, it might be temporarily blocked, restricting the speed of data access. Additionally, the dynamic nature of web content means that data scraped at different intervals might lack consistency, posing challenges for longitudinal studies. That said, emerging techniques in automated web navigation, where machine learning agents navigate websites on their own, can help access such dynamic content.\nFor niche subjects, the volume of pertinent data available for scraping might be limited. For example, while scraping for common topics like images of cats and dogs might yield abundant data, searching for rare medical conditions might not be as fruitful. Moreover, the data obtained through scraping is often unstructured and noisy, necessitating thorough preprocessing and cleaning. It is crucial to understand that not all scraped data will be of high quality or accuracy. Employing verification methods, such as cross-referencing with alternate data sources, can enhance data reliability.\nPrivacy concerns arise when scraping personal data, emphasizing the need for anonymization. 
Therefore, it is paramount to adhere to a website's Terms of Service, confine data collection to public domains, and ensure the anonymity of any personal data acquired.\nWhile web scraping can be a scalable method to amass large training datasets for AI systems, its applicability is confined to specific data types. For example, sourcing data from Inertial Measurement Units (IMUs) for gesture recognition is not straightforward through web scraping. At most, one might be able to scrape an existing dataset.\n\n\n6.3.3 Crowdsourcing\nCrowdsourcing for datasets is the practice of obtaining data by using the services of a large number of people, either from a specific community or the general public, typically via the internet. Instead of relying on a small team or specific organization to collect or label data, crowdsourcing leverages the collective effort of a vast, distributed group of participants. Services like Amazon Mechanical Turk enable the distribution of annotation tasks to a large, diverse workforce. This facilitates the collection of labels for complex tasks like sentiment analysis or image recognition that specifically require human judgment.\nCrowdsourcing has emerged as an effective approach for many data collection and problem-solving needs. One major advantage of crowdsourcing is scalability: by distributing tasks to a large, global pool of contributors on digital platforms, projects can process huge volumes of data in a short timeframe. This makes crowdsourcing ideal for large-scale data labeling, collection, and analysis.\nIn addition, crowdsourcing taps into a diverse group of participants, bringing a wide range of perspectives, cultural insights, and language abilities that can enrich data and enhance creative problem-solving in ways that a more homogenous group may not. Because crowdsourcing draws from a large audience beyond traditional channels, it also tends to be more cost-effective than conventional methods, especially for simpler microtasks.\nCrowdsourcing platforms also allow for great flexibility, as task parameters can be adjusted in real time based on initial results. This creates a feedback loop for iterative improvements to the data collection process. Complex jobs can be broken down into microtasks and distributed to multiple people, with cross-validation of results by assigning redundant versions of the same task. Ultimately, when thoughtfully managed, crowdsourcing enables community engagement around a collaborative project, where participants find reward in contributing.\nHowever, while crowdsourcing offers numerous advantages, it's essential to approach it with a clear strategy. While it provides access to a diverse set of annotators, it also introduces variability in the quality of annotations. Additionally, platforms like Mechanical Turk might not always capture a complete demographic spectrum; often tech-savvy individuals are overrepresented, while children and the elderly may be underrepresented. It's crucial to provide clear instructions and possibly even training for the annotators. Periodic checks and validations of the labeled data can help maintain quality. This ties back to the topic of clear Problem Definition that we discussed earlier. Crowdsourcing for datasets also requires careful attention to ethical considerations. It's crucial to ensure that participants are informed about how their data will be used and that their privacy is protected. 
Quality control through detailed protocols, transparency in sourcing, and auditing is essential to ensure reliable outcomes.\nFor TinyML, crowdsourcing can pose some unique challenges. TinyML devices are highly specialized for particular tasks within tight constraints. As a result, the data they require tends to be very specific. It may be difficult to obtain such specialized data from a general audience through crowdsourcing. For example, TinyML applications often rely on data collected from certain sensors or hardware. Crowdsourcing would require participants to have access to very specific and consistent devices - like microphones with the same sampling rates. Even for simple audio tasks like keyword spotting, these hardware nuances present obstacles.\nBeyond hardware, the data itself needs high granularity and quality, given the limitations of TinyML. It can be hard to ensure this when crowdsourcing from those unfamiliar with the application's context and requirements. There are also potential issues around privacy, real-time collection, standardization, and technical expertise. Moreover, the narrow nature of many TinyML tasks makes accurate data labeling difficult without the proper understanding. Participants may struggle to provide reliable annotations without full context.\nThus, while crowdsourcing can work well in many cases, the specialized needs of TinyML introduce unique data challenges. Careful planning is required for guidelines, targeting, and quality control. For some applications, crowdsourcing may be feasible, but others may require more focused data collection efforts to obtain relevant, high-quality training data.\n\n\n6.3.4 Synthetic Data\nSynthetic data generation can be useful for addressing some of the limitations of data collection. It involves creating data that wasn't originally captured or observed, but is generated using algorithms, simulations, or other techniques to resemble real-world data. It has become a valuable tool in various fields, particularly in scenarios where real-world data is scarce, expensive, or ethically challenging to obtain (e.g., TinyML). Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data that is almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable.\nIn many domains, especially emerging ones, there may not be enough real-world data available for analysis or training machine learning models. Synthetic data can fill this gap by producing large volumes of data that mimic real-world scenarios. For instance, detecting the sound of breaking glass might be challenging in security applications where a TinyML device is trying to identify break-ins. Collecting real-world data would require breaking numerous windows, which is impractical and costly.\nMoreover, in machine learning, especially in deep learning, having a diverse dataset is crucial. Synthetic data can augment existing datasets by introducing variations, thereby enhancing the robustness of models. For example, SpecAugment is an excellent data augmentation technique for Automatic Speech Recognition (ASR) systems.\nPrivacy and confidentiality are also major concerns. Datasets containing sensitive or personal information pose privacy concerns when shared or used. 
Synthetic data, being artificially generated, doesn't have these direct ties to real individuals, allowing for safer use while preserving essential statistical properties.\nGenerating synthetic data, especially once the generation mechanisms have been established, can be a more cost-effective alternative. In the aforementioned security application scenario, synthetic data eliminates the need to break multiple windows to gather relevant data.\nMany embedded use cases deal with unique situations, such as manufacturing plants, that are difficult to simulate. Synthetic data allows researchers complete control over the data generation process, enabling the creation of specific scenarios or conditions that are challenging to capture in real life.\nWhile synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases."
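To make the web-scraping workflow described in 6.3.2 concrete, here is a minimal, hedged sketch in Python. The site URL, listing page, and user agent are hypothetical placeholders, not a real pipeline; the sketch simply checks robots.txt before fetching anything, in line with the ethical guidance above, and then collects candidate image URLs with Beautiful Soup.
```python
# Minimal ethical-scraping sketch: consult robots.txt, then fetch one page
# and collect image URLs for later download and labeling. All URLs are
# hypothetical examples.
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"          # hypothetical site
PAGE_URL = urljoin(BASE_URL, "/gallery")  # hypothetical listing page
USER_AGENT = "dataset-builder-bot/0.1"

# Respect the site's robots.txt before sending any scraping requests.
robots = RobotFileParser()
robots.set_url(urljoin(BASE_URL, "/robots.txt"))
robots.read()

if robots.can_fetch(USER_AGENT, PAGE_URL):
    response = requests.get(PAGE_URL, headers={"User-Agent": USER_AGENT}, timeout=10)
    response.raise_for_status()

    # Parse the HTML and collect candidate image URLs.
    soup = BeautifulSoup(response.text, "html.parser")
    image_urls = [urljoin(BASE_URL, img["src"])
                  for img in soup.find_all("img") if img.get("src")]
    print(f"Found {len(image_urls)} candidate images")
else:
    print("robots.txt disallows scraping this page; stop here.")
```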
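Similarly, to illustrate how synthetic variation can be injected into scarce audio data (6.3.4), below is a simplified, SpecAugment-style sketch that masks a random frequency band and a random time span in a spectrogram. It is an illustrative approximation, not the reference SpecAugment implementation, and the input here is random data standing in for real log-mel features.
```python
# Simplified SpecAugment-style masking on a (freq, time) spectrogram.
import numpy as np

def augment_spectrogram(spec, max_freq_mask=8, max_time_mask=20, rng=None):
    """Return a copy of `spec` with one frequency mask and one time mask applied."""
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_freq, n_time = spec.shape

    # Frequency masking: zero out a random band of consecutive frequency bins.
    f = rng.integers(0, max_freq_mask + 1)
    f0 = rng.integers(0, max(1, n_freq - f))
    spec[f0:f0 + f, :] = 0.0

    # Time masking: zero out a random span of consecutive time frames.
    t = rng.integers(0, max_time_mask + 1)
    t0 = rng.integers(0, max(1, n_time - t))
    spec[:, t0:t0 + t] = 0.0
    return spec

# Example: augment a fake 40 x 100 log-mel spectrogram.
fake_spec = np.random.randn(40, 100)
augmented = augment_spectrogram(fake_spec)
```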
},
{
"objectID": "data_engineering.html#data-storage",
"href": "data_engineering.html#data-storage",
"title": "6Â Data Engineering",
"section": "6.4 Data Storage",
- "text": "6.4 Data Storage\nExplanation: Data must be stored and managed efficiently to facilitate easy access and processing. This section will provide insights into different data storage options and their respective advantages and challenges in embedded systems.\n\nData Warehousing\nData Lakes\nMetadata Management\nData Governance"
+ "text": "6.4 Data Storage\nData sourcing and data storage go hand-in-hand and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.\n\n\n\n\n\n\n\n\n\n\nDatabase\nData Warehouse\nData Lake\n\n\n\n\nPurpose\nOperational and transactional\nAnalytical\nAnalytical\n\n\nData type\nStructured\nStructured\nStructured, semi-structured and/or unstructured\n\n\nScale\nSmall to large volumes of data\nLarge volumes of integrated data\nLarge volumes of diverse data\n\n\nExamples\nMySQL\nGoogle BigQuery, Amazon Redshift, Microsoft Azure Synapse.\nGoogle Cloud Storage, AWS S3, Azure Data Lake Storage\n\n\n\nThe stored data is often accompanied by metadata, which is defined as âdata about dataâ. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license etc. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results.\nData Governance2: With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.2 Janssen, Marijn, et al. \"Data governance: Organizing data for trustworthy Artificial Intelligence.\" Government Information Quarterly 37.3 (2020): 101493.\nData governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. 
The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.\n\nFigure: Data governance (source: https://www.databricks.com/discover/data-governance)\n\nSome examples of data governance across different sectors include:\n\nMedicine: Health Information Exchanges (HIEs) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information.\nFinance: The Basel III Framework is an international regulatory framework for banks. It ensures that banks establish clear policies, practices, and responsibilities for data management, ensuring data accuracy, completeness, and timeliness. Not only does it enable banks to meet regulatory requirements, it also helps prevent financial crises through more effective risk management.\nGovernment: Government agencies managing citizen data, public records, and administrative information implement data governance to manage data transparently and securely. The Social Security system in the US and the Aadhaar system in India are good examples of such governance systems.\n\nSpecial data storage considerations for tinyML\nEfficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio (a short code sketch at the end of this section illustrates the idea). Such a workflow would involve:\n\nExtracting acoustic features - Mel-frequency cepstral coefficients (MFCCs)3 are commonly used to represent important audio characteristics.\nCreating embeddings - Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that's more manageable and efficient for computation and storage.\nVector quantization4 - This technique represents data vectors, like embeddings, with a small set of representative code vectors (a codebook), reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.\nSequential storage - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.\n\n3 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. \"Mel Frequency Cepstral Coefficient and its applications: A Review.\" IEEE Access (2022).4 Vasuki, A., and P. T. Vanathi. 
\"A review of vector quantization techniques.\" IEEE Potentials 25.4 (2006): 39-47.This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.\nSelective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 20185), the authors discuss adaptation of the modelâs intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications.5 Rybakov, Oleg, et al. \"Streaming keyword spotting on mobile devices.\" arXiv preprint arXiv:2005.06720 (2020)."
},
{
"objectID": "data_engineering.html#data-processing",
"href": "data_engineering.html#data-processing",
"title": "6Â Data Engineering",
"section": "6.5 Data Processing",
- "text": "6.5 Data Processing\nExplanation: Data processing is a pivotal step in transforming raw data into a usable format. This section provides a deep dive into the necessary processes, which include cleaning, integration, and establishing data pipelines, all crucial for streamlining operations in embedded AI systems.\n\nData Cleaning and Transformation\nData Pipelines\nBatch vs. Stream Processing"
- },
- {
- "objectID": "data_engineering.html#data-quality",
- "href": "data_engineering.html#data-quality",
- "title": "6Â Data Engineering",
- "section": "6.6 Data Quality",
- "text": "6.6 Data Quality\nExplanation: Ensuring data quality is critical to developing reliable AI models. This section outlines various strategies to assure and evaluate data quality.\n\nData Validation\nHandling Missing Values\nOutlier Detection\nData Provenance"
- },
- {
- "objectID": "data_engineering.html#feature-engineering",
- "href": "data_engineering.html#feature-engineering",
- "title": "6Â Data Engineering",
- "section": "6.7 Feature Engineering",
- "text": "6.7 Feature Engineering\nExplanation: Feature engineering involves selecting and transforming variables to improve the performance of AI models. Itâs vital in embedded AI systems where computational resources are limited, and optimized feature sets can significantly improve performance.\n\nImportance of Feature Engineering\nTechniques of Feature Selection\nFeature Transformation for Embedded Systems\nEmbeddings\nReal-time Feature Engineering in Embedded Systems"
+ "text": "6.5 Data Processing\nData processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. âData preparation accounts for about 60-80% of the work of a data scientist.â\n\n\n\nA breakdown of tasks that data scientists allocate their time to, highlighting the significant portion spent on data cleaning and organizing.\n\n\nProper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty - it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.\nData often comes from diverse sources and can be unstructured or semi-structured. Thus, itâs essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include:\n\nNormalizing numerical variables\nEncoding categorical variables\nUsing techniques like dimensionality reduction\n\nData validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation6, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.6Â Vasuki, A., and P. T. Vanathi. \"A review of vector quantization techniques.\" IEEE Potentials 25.4 (2006): 39-47.\n\n\n\nA detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training\n\n\nLetâs take a look at an example of a data processing pipeline. In the context of tinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of data processing pipelinesâsystematic and automated workflows for data transformation, storage, and processing. By streamlining the data flow, from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. 
It is openly licensed under Creative Commons Attribution 4.0 for broad usage.\nThe MSWC used a forced alignment method to automatically extract individual word recordings to train keyword-spotting models from the Common Voice project, which features crowdsourced sentence-level recordings. Forced alignment refers to a group of long-standing methods in speech processing that are used to predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowdsourced recordings often feature background noises, such as static and wind. Depending on the model's requirements, these noises can be removed or intentionally retained.\nMaintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.\nData processing pipelines are proliferating and are commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud, which provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management, with several mechanisms that focus specifically on data processing, an integral part of these systems."
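As referenced above, here is a minimal sketch of the validation, imputation, normalization, and encoding steps using pandas and scikit-learn. The column names and the temperature range check are hypothetical, chosen only to illustrate the pattern.
```python
# Toy cleaning pipeline: range validation, mean imputation, normalization,
# and one-hot encoding of a categorical column.
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical sensor log: one impossible reading (-300 C) and one missing value.
df = pd.DataFrame({
    "temperature_c": [21.4, None, -300.0, 23.1],
    "device_type": ["doorbell", "thermostat", "doorbell", "thermostat"],
})

# 1. Validation: treat readings below absolute zero as missing.
df.loc[df["temperature_c"] < -273.15, "temperature_c"] = None

# 2. Handle missing values with mean imputation.
df[["temperature_c"]] = SimpleImputer(strategy="mean").fit_transform(df[["temperature_c"]])

# 3. Normalize numerical variables and encode categorical ones.
df[["temperature_c"]] = StandardScaler().fit_transform(df[["temperature_c"]])
df = pd.concat([df.drop(columns="device_type"),
                pd.get_dummies(df["device_type"], prefix="device")], axis=1)

print(df)
```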
},
{
"objectID": "data_engineering.html#data-labeling",
"href": "data_engineering.html#data-labeling",
"title": "6Â Data Engineering",
- "section": "6.8 Data Labeling",
- "text": "6.8 Data Labeling\nExplanation: Labeling is an essential part of preparing data for supervised learning. This section focuses on various strategies and tools available for data labeling, a vital process in the data preparation phase.\n\nManual Data Labeling\nEthical Considerations (e.g. OpenAI issues)\nAutomated Data Labeling\nLabeling Tools"
+ "section": "6.6 Data Labeling",
+ "text": "6.6 Data Labeling\nData labeling is an important step in creating high-quality training datasets for machine learning models. Labels provide the ground truth information that allows models to learn relationships between inputs and desired outputs. This section covers key considerations around selecting label types, formats, and content to capture the necessary information for given tasks. It discusses common annotation approaches, from manual labeling to crowdsourcing to AI-assisted methods, and best practices for ensuring label quality through training, guidelines, and quality checks. Ethical treatment of human annotators is also something we emphasize. The integration of AI to accelerate and augment human annotation is also explored. Understanding labeling needs, challenges, and strategies is essential for constructing reliable, useful datasets that can train performant, trustworthy machine learning systems.\nLabel Types Labels capture information about key tasks or concepts. Common label types include binary classification, bounding boxes, segmentation masks, transcripts, captions, etc. The choice of label format depends on the use case and resource constraints, as more detailed labels require greater effort to collect (Johnson-Roberson et al. (2017)).\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur Sridhar, Karl Rosaen, and Ram Vasudevan. 2017. âDriving in the Matrix: Can Virtual Worlds Replace Human-Generated Annotations for Real World Tasks?â 2017 IEEE International Conference on Robotics and Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\nUnless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels, given their unique resource constraints. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for in the future, such as pedestrian detection.\nAdditionally, annotators can potentially provide metadata that provides insight into how the dataset represents different characteristics of interest (see: Data Transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented (Ardila et al. (2020)). They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see the breakdown of who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. Additionally, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages.\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor Weber. 2020. 
\"Common Voice: A Massively-Multilingual Speech Corpus.\" Proceedings of the 12th Conference on Language Resources and Evaluation, May, 4218-22.\nNext, creators must determine the format of those labels. For example, a creator interested in car detection might choose between binary classification labels that say whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format may depend both on their use case and their resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire.\nAnnotation Methods: Common annotation approaches include manual labeling, crowdsourcing, and semi-automated techniques. Manual labeling by experts yields high quality but lacks scalability. Crowdsourcing enables distributed annotation by non-experts, often through dedicated platforms (Sheng and Zhang (2019)). Weakly supervised and programmatic methods can reduce manual effort by heuristically or automatically generating labels (Ratner et al. (2018)).\n\nSheng, Victor S., and Jing Zhang. 2019. \"Machine Learning with Crowdsourcing: A Brief Summary of the Past Research and Future Directions.\" Proceedings of the AAAI Conference on Artificial Intelligence 33 (01): 9837-43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and Christopher Ré. 2018. \"Snorkel Metal: Weak Supervision for Multi-Task Learning.\" Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\nAfter deciding on their labels' desired content and format, creators begin the annotation process. To collect large numbers of labels from human annotators, creators frequently rely on dedicated annotation platforms, which can connect them to teams of human annotators. When using these platforms, creators may have little insight into annotators' backgrounds and levels of experience with topics of interest. However, some platforms offer access to annotators with specific expertise (e.g. doctors).\nEnsuring Label Quality: There is no guarantee that the data labels are actually correct. It is possible that despite the best instructions being given to labelers, they still mislabel some images (Northcutt, Athalye, and Mueller (2021)). Strategies like quality checks, training annotators, and collecting multiple labels per datapoint can help ensure label quality. For ambiguous tasks, multiple annotators can help identify controversial datapoints and quantify disagreement levels (a small sketch at the end of this section illustrates majority voting and disagreement flagging).\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021. \"Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks.\" arXiv, March. https://doi.org/10.48550/arXiv.2103.14749.\n\nWhen working with human annotators, it is important to offer fair compensation and otherwise prioritize ethical treatment, as annotators can be exploited or otherwise harmed during the labeling process (Perrigo, 2023). For example, if a dataset is likely to contain disturbing content, annotators may benefit from having the option to view images in grayscale (Google (n.d.)).\n\nGoogle. n.d. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/.\nAI-Assisted Annotation: ML has an insatiable demand for data; no amount of labeled data ever seems sufficient. 
This raises the question of how we can get more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. This can be done in various ways, such as the following:\n\nPre-annotation: AI models can generate preliminary labels for a dataset using methods such as semi-supervised learning (Chapelle, Scholkopf, and Zien (2009)), which humans can then review and correct. This can save a significant amount of time, especially for large datasets.\nActive learning: AI models can identify the most informative data points in a dataset, which can then be prioritized for human annotation. This can help improve the labeled dataset's quality while reducing the overall annotation time.\nQuality control: AI models can be used to identify and flag potential errors in human annotations. This can help to ensure the accuracy and consistency of the labeled dataset.\n\n\nChapelle, O., B. Scholkopf, and A. Zien, Eds. 2009. \"Semi-Supervised Learning (Chapelle, O. et al., Eds.; 2006) [Book Reviews].\" IEEE Transactions on Neural Networks 20 (3): 542-42. https://doi.org/10.1109/tnn.2009.2015974.\nHere are some examples of how AI-assisted annotation has been proposed to be useful:\n\nMedical imaging: AI-assisted annotation is being used to label medical images, such as MRI scans and X-rays (Krishnan, Rajpurkar, and Topol (2022)). Carefully annotating medical datasets is extremely challenging, especially at scale, since domain experts are scarce and their time is costly. AI assistance can help train models to diagnose diseases and other medical conditions more accurately and efficiently.\n\nSelf-driving cars: AI-assisted annotation is being used to label images and videos from self-driving cars. This can help to train AI models to identify objects on the road, such as other vehicles, pedestrians, and traffic signs.\nSocial media: AI-assisted annotation is being used to label social media posts, such as images and videos. This can help to train AI models to identify and classify different types of content, such as news, advertising, and personal posts.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022. \"Self-Supervised Learning in Medicine and Healthcare.\" Nature Biomedical Engineering 6 (12): 1346-52. https://doi.org/10.1038/s41551-022-00914-1."
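As mentioned above, one common quality-control practice is to collect several labels per datapoint, keep the majority vote, and flag high-disagreement items for expert review. The sketch below illustrates this with a made-up annotations dictionary; the two-thirds agreement threshold is an arbitrary illustrative choice.
```python
# Majority-vote label aggregation with a simple disagreement check.
from collections import Counter

annotations = {
    "img_001": ["car", "car", "truck"],
    "img_002": ["car", "car", "car"],
    "img_003": ["truck", "bus", "car"],
}

AGREEMENT_THRESHOLD = 2 / 3  # require at least two-thirds of annotators to agree

for item_id, labels in annotations.items():
    label, count = Counter(labels).most_common(1)[0]
    agreement = count / len(labels)
    if agreement >= AGREEMENT_THRESHOLD:
        print(f"{item_id}: keep majority label '{label}' (agreement {agreement:.0%})")
    else:
        print(f"{item_id}: flag for expert review (agreement only {agreement:.0%})")
```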
},
{
"objectID": "data_engineering.html#data-version-control",
"href": "data_engineering.html#data-version-control",
"title": "6Â Data Engineering",
- "section": "6.9 Data Version Control",
- "text": "6.9 Data Version Control\nExplanation: Version control is critical for managing changes and tracking versions of datasets during the development of AI models, facilitating reproducibility and collaboration.\n\nVersion Control Systems\nMetadata"
+ "section": "6.7 Data Version Control",
+ "text": "6.7 Data Version Control\nProduction systems are perpetually inundated with fluctuating and escalating volumes of data, prompting the rapid emergence of numerous data replicas. This proliferating data serves as the foundation for training machine learning models. For instance, a global sales company engaged in sales forecasting continuously receives consumer behavior data. Similarly, healthcare systems formulating predictive models for disease diagnosis are consistently acquiring new patient data. TinyML applications, such as keyword spotting, are highly data hungry in terms of the amount of data generated. Consequently, meticulous tracking of data versions and the corresponding model performance is imperative.\nData Version Control offers a structured methodology to handle alterations and versions of datasets efficiently. It facilitates the monitoring of modifications, preserves multiple versions, and guarantees reproducibility and traceability in data-centric projects. Furthermore, data version control provides the versatility to review and utilize specific versions as needed, ensuring that each stage of the data processing and model development can be revisited and audited with precision and ease. It has a variety of practical uses -\nRisk Management: Data version control allows transparency and accountability by tracking versions of the dataset.\nCollaboration and Efficiency: Easy access to different versions of the dataset in one place can improve data sharing of specific checkpoints, and enable efficient collaboration.\nReproducibility: Data version control allows for tracking the performance of models with respect to different versions of the data, and therefore enabling reproducibility.\nKey Concepts\n\nCommits: It is an immutable snapshot of the data at a specific point in time, representing a unique version. Every commit is associated with a unique identifier to allow\nBranches: Branching allows developers and data scientists to diverge from the main line of development and continue to work independently without affecting other branches. This is especially useful when experimenting with new features or models, enabling parallel development and experimentation without the risk of corrupting the stable, main branch.\nMerges: Merges help to integrate changes from different branches while maintaining the integrity of the data.\n\nPopular Data Version Control Systems\nDVC: It stands for Data Version Control in short, and is an open-source, lightweight tool that works on top of github and supports all kinds of data format. It can seamlessly integrate into the Git workflow, if Git is being used for managing code. It captures the versions of data and models in the Git commits, while storing them on premises or on cloud (e.g. AWS, Google Cloud, Azure). These data and models (e.g. ML artifacts) are defined in the metadata files, which get updated in every commit. It can allow metrics tracking of models on different versions of the data.\nlakeFS: It is an open-source tool that supports the data version control on data lakes. It supports many git-like operations such as branching and merging of data, as well as reverting to previous versions of the data. It also has a unique UI feature which allows exploration and management of data much easier.\nGit LFS: It is useful for data version control on smaller sized datasets. It uses Gitâs inbuilt branching and merging features, but is limited in terms of tracking metrics, reverting to previous versions or integration with data lakes."
},
{
"objectID": "data_engineering.html#optimizing-data-for-embedded-ai",
"href": "data_engineering.html#optimizing-data-for-embedded-ai",
"title": "6Â Data Engineering",
- "section": "6.10 Optimizing Data for Embedded AI",
- "text": "6.10 Optimizing Data for Embedded AI\nExplanation: This section concentrates on optimization techniques specifically suited for embedded systems, focusing on strategies to reduce data volume and enhance storage and retrieval efficiency, crucial for resource-constrained embedded environments.\n\nLow-Resource Data Challenges\nData Reduction Techniques\nOptimizing Data Storage and Retrieval"
+ "section": "6.8 Optimizing Data for Embedded AI",
+ "text": "6.8 Optimizing Data for Embedded AI\nCreators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for unusually specific use cases, requiring heavy filtering of datasets. While other natural language models may be capable of turning any speech to text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data because they do not address the task of interest. Additionally, an embedded AI system may be tied to specific hardware devices or environments. For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, show the wrong type of scenery, or were taken from the wrong height or angle.\nOn the other hand, embedded AI systems are often expected to provide especially accurate performance in unpredictable real-world settings. As a result, creators may design datasets specifically to represent variations in potential inputs and promote model robustness. As a result, they may define a narrow scope for their project but then aim for deep coverage within those bounds. For example, creators of the doorbell model mentioned above might try to cover variations in data arising from:\n\nGeographically, socially and architecturally diverse neighborhoods\nDifferent types of artificial and natural lighting\nDifferent seasons and weather conditions\nObstructions (e.g. raindrops or delivery boxes obscuring the cameraâs view)\n\nAs described above, creators may consider crowdsourcing or synthetically generating data to include these different kinds of variations."
},
{
- "objectID": "data_engineering.html#challenges-in-data-engineering",
- "href": "data_engineering.html#challenges-in-data-engineering",
+ "objectID": "data_engineering.html#data-transparency",
+ "href": "data_engineering.html#data-transparency",
"title": "6Â Data Engineering",
- "section": "6.11 Challenges in Data Engineering",
- "text": "6.11 Challenges in Data Engineering\nExplanation: Understanding potential challenges can help in devising strategies to mitigate them. This section discusses common challenges encountered in data engineering, particularly focusing on embedded systems.\n\nScalability\nData Security and Privacy\nData Bias and Representativity"
- },
- {
- "objectID": "data_engineering.html#promoting-transparency",
- "href": "data_engineering.html#promoting-transparency",
- "title": "6Â Data Engineering",
- "section": "6.12 Promoting Transparency",
- "text": "6.12 Promoting Transparency\nExplanation: We explain that as we increasingly use these systems built on the foundation of data, we need to have more transparency in the ecosystem.\n\nDefinition and Importance of Transparency in Data Engineering\nTransparency in Data Collection and Sourcing\nTransparency in Data Processing and Analysis\nTransparency in Model Building and Deployment\nTransparency in Data Sharing and Usage\nTools and Techniques for Ensuring Transparency"
+ "section": "6.9 Data Transparency",
+ "text": "6.9 Data Transparency\nBy providing clear, detailed documentation, creators can help developers understand how best to use their datasets. Several groups have suggested standardized documentation formats for datasets, such as Data Cards (Pushkarna, Zaldivar, and Kjartansson (2022)), datasheets (Gebru et al. (2021)), data statements (Bender and Friedman (2018)), or Data Nutrition Labels (Holland et al. (2020)). When releasing a dataset, creators may describe what kinds of data they collected, how they collected and labeled it, and what kinds of use cases may be a good or poor fit for the dataset. Quantitatively, it may be appropriate to provide a breakdown of how well the dataset represents different groups (e.g. different gender groups, different cameras).\n\nPushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022. âData Cards: Purposeful and Transparent Dataset Documentation for Responsible Ai.â 2022 ACM Conference on Fairness, Accountability, and Transparency. https://doi.org/10.1145/3531146.3533231.\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal DaumĂ© III, and Kate Crawford. 2021. âDatasheets for Datasets.â Communications of the ACM 64 (12): 86â92. https://doi.org/10.1145/3458723.\n\nBender, Emily M., and Batya Friedman. 2018. âData Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science.â Transactions of the Association for Computational Linguistics 6: 587â604. https://doi.org/10.1162/tacl_a_00041.\n\nHolland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia Chmielinski. 2020. âThe Dataset Nutrition Label.â Data Protection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.\nKeeping track of data provenanceâessentially the origins and the journey of each data point through the data pipelineâis not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if a ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue is with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency doesnât just help in debugging the system but also plays a crucial role in enhancing the overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the modelâs performance and its acceptability among end-users.\nWhen producing documentation, creators should also clearly specify how users can access the dataset and how the dataset will be maintained over time. For example, users may need to undergo training or receive special permission from the creators before accessing a dataset containing protected information, as is the case with many medical datasets. In some cases, users may not be permitted to directly access the data and must instead submit their model to be trained on the dataset creatorsâ hardware, following a federated learning setup (Aledhari et al. (2020)). Creators may also describe how long the dataset will remain accessible, how the users can submit feedback on any errors that they discover, and whether there are plans to update the dataset.\n\nAledhari, Mohammed, Rehma Razzak, Reza M. 
Parizi, and Fahad Saeed. 2020. \"Federated Learning: A Survey on Enabling Technologies, Protocols, and Applications.\" IEEE Access 8: 140699-725. https://doi.org/10.1109/access.2020.3013541.\nSome laws and regulations also promote data transparency through new requirements for organizations:\n\nGeneral Data Protection Regulation (GDPR) in the European Union: It establishes strict requirements for processing and protecting the personal data of EU citizens. It mandates plain-language privacy policies that clearly explain what data is collected, why it is used, how long it is stored, and with whom it is shared. GDPR also requires that privacy notices include details on the legal basis for processing, data transfers, retention periods, rights to access and deletion, and contact information for data controllers.\nCalifornia's Consumer Privacy Act (CCPA): CCPA requires clear privacy policies and opt-out rights for the sale of personal data. Significantly, it also establishes rights for consumers to request that their specific data be disclosed. Businesses must provide copies of collected personal information along with details on what it is used for, what categories are collected, and what third parties receive it. Consumers can identify data points they believe are inaccurate. The law represents a major step forward in empowering personal data access.\n\nThere are several current challenges in ensuring data transparency, especially because it requires significant time and financial resources. Data systems are also quite complex, and full transparency can be difficult to achieve in these cases. Full transparency may also overwhelm consumers with too much detail. Finally, it is also important to balance the tradeoff between transparency and privacy."
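To make the documentation idea above concrete, below is a minimal sketch of machine-readable dataset documentation in the spirit of Data Cards and datasheets. The fields and values are illustrative only and do not follow any standardized schema.
```python
# Illustrative, non-standard "dataset card" captured as a Python dataclass.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class DatasetCard:
    name: str
    description: str
    collection_method: str
    license: str
    known_limitations: list = field(default_factory=list)
    demographic_breakdown: dict = field(default_factory=dict)

card = DatasetCard(
    name="doorbell-keywords-v1",  # hypothetical dataset
    description="Short audio clips of spoken wake words recorded on doorbell hardware.",
    collection_method="Crowdsourced recordings, each reviewed by two annotators.",
    license="CC BY 4.0",
    known_limitations=["Few recordings from speakers over 65", "Indoor recordings only"],
    demographic_breakdown={"female": 0.48, "male": 0.47, "unspecified": 0.05},
)

print(json.dumps(asdict(card), indent=2))
```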
},
{
"objectID": "data_engineering.html#licensing",
"href": "data_engineering.html#licensing",
"title": "6Â Data Engineering",
- "section": "6.13 Licensing",
- "text": "6.13 Licensing\nExplanation: This section emphasizes why one must understand data licensing issues before they start using the data to train the models.\n\nMetadata\nData Nutrition Project\nUnderstanding Licensing"
+ "section": "6.10 Licensing",
+ "text": "6.10 Licensing\nMany high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.\nFor instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 20207). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.7 Birhane, Abeba, and Vinay Uday Prabhu. \"Large image datasets: A pyrrhic win for computer vision?.\" 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2021.\nThe legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.\nNew Data Regulations and Their Implications\nNew data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EUâs Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:\n\nClassifies AI systems by risk.\nMandates development and usage prerequisites.\nEmphasizes data quality, transparency, human oversight, and accountability.\n\nAdditionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance. Key elements include the prohibition of AI systems posing \"unacceptable\" risks, stringent conditions for high-risk systems, and minimal obligations for \"limited risk\" AI systems. The proposed European AI Board will oversee and ensure efficient regulation implementation.\nChallenges in Assembling ML Training Datasets\nComplex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing8 or public-private data collaborations could greatly accelerate industry progress and ethical standards.8 Sonnenburg, Soren, et al. 
\"The need for open source software in machine learning.\" (2007): 2443-2466.\nIn some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may have names, contact details, and other identifying data that may need to be removed from the dataset, this is well after the dataset has already been actively sourced and used for training models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.\nData collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed.\nFor instance, below is an example request from Common Voice users to remove their information:\n\n\n\n\n\n\nThank you for downloading the Common Voice dataset. Account holders are free to request deletion of their voice clips at any time. We action this on our side for all future releases and are legally obligated to inform those who have downloaded a historic release so that they can also take action.\nYou are receiving this message because one or more account holders have requested that their voice clips be deleted. Their clips are part of the dataset that you downloaded and are associated with the hashed IDs listed below. Please delete them from your downloads in order to fulfill your third party data privacy obligations.\nThank you for your timely completion.\n\n4497f1df0c6c4e647fa4354ad07a40075cc95a210dafce49ce0c35cd252 e4ec0fad1034e0cc3af869499e6f60ce315fe600ee2e9188722de906f909a21e0ee57\n97a8f0a1df086bd5f76343f5f4a511ae39ec98256a0ca48de5c54bc5771 d8c8e32283a11056147624903e9a3ac93416524f19ce0f9789ce7eef2262785cf3af7\n969ea94ac5e20bdd7a098747f5dc2f6d203f6b659c0c3b6257dc790dc34 d27ac3f2fafb3910f1ec8d7ebea38c120d4b51688047e352baa957cc35f0f5c69b112\n6b5460779f644ad39deffeab6edf939547f206596089d554984abff3d36 a4ecc06e66870958e62299221c09af8cd82864c626708371d72297eaea5955d8e46a9\n33275ff207a27708bd1187ff950888da592cac507e01e922c4b9a07d3f6 c2c3fe2ade429958c3702294f446bfbad8c4ebfefebc9e157d358ccc6fcf5275e7564\n\n\n\n\n\n\n\n\n\nHaving the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample's effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That's a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate9,10,11 its impact on the model's behavior. 
New research is needed on the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.9 Ginart, Antonio, et al. \"Making AI forget you: Data deletion in machine learning.\" Advances in Neural Information Processing Systems 32 (2019).10 Sekhari, Ayush, et al. \"Remember what you want to forget: Algorithms for machine unlearning.\" Advances in Neural Information Processing Systems 34 (2021): 18075-18086.11 Guo, Chuan, et al. \"Certified data removal from machine learning models.\" arXiv preprint arXiv:1911.03030 (2019).\nDataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets as part of data engineering."
},
{
"objectID": "data_engineering.html#conclusion",
"href": "data_engineering.html#conclusion",
"title": "6Â Data Engineering",
- "section": "6.14 Conclusion",
- "text": "6.14 Conclusion\nExplanation: Close up the chapter with a summary of the key topics that we have covered in this section.\n\nThe Future of Data Engineering in Embedded AI\nKey Takeaways"
+ "section": "6.11 Conclusion",
+ "text": "6.11 Conclusion\nData is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection. Data can be sourced from diverse means including existing datasets, web scraping, crowdsourcing and synthetic data generation. Each approach involves tradeoffs between factors like cost, speed, privacy and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development. Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust and responsible AI systems, including for embedded and tinyML applications."
+ },
+ {
+ "objectID": "data_engineering.html#helpful-references",
+ "href": "data_engineering.html#helpful-references",
+ "title": "6Â Data Engineering",
+ "section": "6.12 Helpful References",
+ "text": "6.12 Helpful References\n1. [3 big problems with datasets in AI and machine learning](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/)\n2. [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)\n3. [Data Engineering for Everyone](https://arxiv.org/abs/2102.11447)\n4. [DataPerf: Benchmarks for Data-Centric AI Development](https://arxiv.org/abs/2207.10062)\n5. [Deep Spoken Keyword Spotting: An Overview](https://arxiv.org/abs/2111.10592)\n6. [âEveryone wants to do the model work, not the data workâ: Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)\n7. [Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)](https://arxiv.org/abs/2003.12206)\n8. [LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf)\n9. [Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)\n10. [Multilingual Spoken Words Corpus](https://openreview.net/pdf?id=c20jiJ5K2H)\n11. [OpenImages](https://storage.googleapis.com/openimages/web/index.html)\n12. [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749)\n13. [Small-footprint keyword spotting using deep neural networks](https://ieeexplore.ieee.org/abstract/document/6854370?casa_token=XD6SL8Um1Y0AAAAA:ZxqFThJWLlwDrl1IA374t_YzEvwHNNR-pTWiWV9pyr85rsl-ZZ5BpkElyHo91d3_l8yU0IVIgg)\n14. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)"
},
{
"objectID": "frameworks.html#introduction",
@@ -1362,7 +1348,7 @@
"href": "references.html",
"title": "References",
"section": "",
- "text": "ARM.com. âThe Future Is Being Built on Arm: Market Diversification\nContinues to Drive Strong Royalty and Licensing Growth as Ecosystem\nReaches Quarter of a Trillion Chips Milestone â ArmÂź.â https://www.arm.com/company/news/2023/02/arm-announces-q3-fy22-results.\n\n\nBank, Dor, Noam Koenigstein, and Raja Giryes. 2023.\nâAutoencoders.â Machine Learning for Data Science\nHandbook: Data Mining and Knowledge Discovery Handbook, 353â74.\n\n\nBarroso, Luiz AndrĂ©, Urs Hölzle, and Parthasarathy Ranganathan. 2019.\nThe Datacenter as a Computer: Designing Warehouse-Scale\nMachines. Springer Nature.\n\n\nGoodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David\nWarde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020.\nâGenerative Adversarial Networks.â Communications of\nthe ACM 63 (11): 139â44.\n\n\nHan, Song, Huizi Mao, and William J. Dally. 2016. âDeep\nCompression: Compressing Deep Neural Networks with Pruning, Trained\nQuantization and Huffman Coding.â https://arxiv.org/abs/1510.00149.\n\n\nHe, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.\nâDeep Residual Learning for Image Recognition.â In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 770â78.\n\n\nHoward, Andrew G, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun\nWang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017.\nâMobilenets: Efficient Convolutional Neural Networks for Mobile\nVision Applications.â arXiv Preprint arXiv:1704.04861.\n\n\nIandola, Forrest N, Song Han, Matthew W Moskewicz, Khalid Ashraf,\nWilliam J Dally, and Kurt Keutzer. 2016. âSqueezeNet:\nAlexNet-Level Accuracy with 50x Fewer Parameters and< 0.5 MB Model\nSize.â arXiv Preprint arXiv:1602.07360.\n\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017. âIn-Datacenter\nPerformance Analysis of a Tensor Processing Unit.â In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1â12.\n\n\nKrizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.\nâImagenet Classification with Deep Convolutional Neural\nNetworks.â Advances in Neural Information Processing\nSystems 25.\n\n\nLeCun, Yann, John Denker, and Sara Solla. 1989. âOptimal Brain\nDamage.â Advances in Neural Information Processing\nSystems 2.\n\n\nLi, En, Liekang Zeng, Zhi Zhou, and Xu Chen. 2019. âEdge AI:\nOn-Demand Accelerating Deep Neural Network Inference via Edge\nComputing.â IEEE Transactions on Wireless Communications\n19 (1): 447â57.\n\n\nRosenblatt, Frank. 1957. The Perceptron, a Perceiving and\nRecognizing Automaton Project Para. Cornell Aeronautical\nLaboratory.\n\n\nRumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.\nâLearning Representations by Back-Propagating Errors.â\nNature 323 (6088): 533â36.\n\n\nVaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\nJones, Aidan N Gomez, Ćukasz Kaiser, and Illia Polosukhin. 2017.\nâAttention Is All You Need.â Advances in Neural\nInformation Processing Systems 30.\n\n\nWarden, Pete, and Daniel Situnayake. 2019. Tinyml: Machine Learning\nwith Tensorflow Lite on Arduino and Ultra-Low-Power\nMicrocontrollers. OâReilly Media."
+ "text": "Aledhari, Mohammed, Rehma Razzak, Reza M. Parizi, and Fahad Saeed. 2020.\nâFederated Learning: A Survey on Enabling Technologies, Protocols,\nand Applications.â IEEE Access 8: 140699â725. https://doi.org/10.1109/access.2020.3013541.\n\n\nArdila, Rosana, Megan Branson, Kelly Davis, Michael Henretty, Michael\nKohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers,\nand Gregor Weber. 2020. âCommon Voice: A Massively-Multilingual\nSpeech Corpus.â Proceedings of the 12th Conference on\nLanguage Resources and Evaluation, May, 4218â22.\n\n\nARM.com. âThe Future Is Being Built on Arm: Market Diversification\nContinues to Drive Strong Royalty and Licensing Growth as Ecosystem\nReaches Quarter of a Trillion Chips Milestone â ArmÂź.â https://www.arm.com/company/news/2023/02/arm-announces-q3-fy22-results.\n\n\nBank, Dor, Noam Koenigstein, and Raja Giryes. 2023.\nâAutoencoders.â Machine Learning for Data Science\nHandbook: Data Mining and Knowledge Discovery Handbook, 353â74.\n\n\nBarroso, Luiz AndrĂ©, Urs Hölzle, and Parthasarathy Ranganathan. 2019.\nThe Datacenter as a Computer: Designing Warehouse-Scale\nMachines. Springer Nature.\n\n\nBender, Emily M., and Batya Friedman. 2018. âData Statements for\nNatural Language Processing: Toward Mitigating System Bias and Enabling\nBetter Science.â Transactions of the Association for\nComputational Linguistics 6: 587â604. https://doi.org/10.1162/tacl_a_00041.\n\n\nChapelle, O., B. Scholkopf, and A. Zien Eds. 2009.\nâSemi-Supervised Learning (Chapelle, o. Et Al., Eds.; 2006) [Book\nReviews].â IEEE Transactions on Neural Networks 20 (3):\n542â42. https://doi.org/10.1109/tnn.2009.2015974.\n\n\nGebru, Timnit, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman\nVaughan, Hanna Wallach, Hal DaumĂ© III, and Kate Crawford. 2021.\nâDatasheets for Datasets.â Communications of the\nACM 64 (12): 86â92. https://doi.org/10.1145/3458723.\n\n\nGoodfellow, Ian, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David\nWarde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2020.\nâGenerative Adversarial Networks.â Communications of\nthe ACM 63 (11): 139â44.\n\n\nGoogle. n.d. Google. Google. https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/.\n\n\nHan, Song, Huizi Mao, and William J. Dally. 2016. âDeep\nCompression: Compressing Deep Neural Networks with Pruning, Trained\nQuantization and Huffman Coding.â https://arxiv.org/abs/1510.00149.\n\n\nHe, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016.\nâDeep Residual Learning for Image Recognition.â In\nProceedings of the IEEE Conference on Computer Vision and Pattern\nRecognition, 770â78.\n\n\nHolland, Sarah, Ahmed Hosny, Sarah Newman, Joshua Joseph, and Kasia\nChmielinski. 2020. âThe Dataset Nutrition Label.â Data\nProtection and Privacy. https://doi.org/10.5040/9781509932771.ch-001.\n\n\nHoward, Andrew G, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun\nWang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017.\nâMobilenets: Efficient Convolutional Neural Networks for Mobile\nVision Applications.â arXiv Preprint arXiv:1704.04861.\n\n\nIandola, Forrest N, Song Han, Matthew W Moskewicz, Khalid Ashraf,\nWilliam J Dally, and Kurt Keutzer. 2016. âSqueezeNet:\nAlexNet-Level Accuracy with 50x Fewer Parameters and< 0.5 MB Model\nSize.â arXiv Preprint arXiv:1602.07360.\n\n\nJohnson-Roberson, Matthew, Charles Barto, Rounak Mehta, Sharath Nittur\nSridhar, Karl Rosaen, and Ram Vasudevan. 2017. 
âDriving in the\nMatrix: Can Virtual Worlds Replace Human-Generated Annotations for Real\nWorld Tasks?â 2017 IEEE International Conference on Robotics\nand Automation (ICRA). https://doi.org/10.1109/icra.2017.7989092.\n\n\nJouppi, Norman P, Cliff Young, Nishant Patil, David Patterson, Gaurav\nAgrawal, Raminder Bajwa, Sarah Bates, et al. 2017. âIn-Datacenter\nPerformance Analysis of a Tensor Processing Unit.â In\nProceedings of the 44th Annual International Symposium on Computer\nArchitecture, 1â12.\n\n\nKrishnan, Rayan, Pranav Rajpurkar, and Eric J. Topol. 2022.\nâSelf-Supervised Learning in Medicine and Healthcare.â\nNature Biomedical Engineering 6 (12): 1346â52. https://doi.org/10.1038/s41551-022-00914-1.\n\n\nKrizhevsky, Alex, Ilya Sutskever, and Geoffrey E Hinton. 2012.\nâImagenet Classification with Deep Convolutional Neural\nNetworks.â Advances in Neural Information Processing\nSystems 25.\n\n\nLeCun, Yann, John Denker, and Sara Solla. 1989. âOptimal Brain\nDamage.â Advances in Neural Information Processing\nSystems 2.\n\n\nLi, En, Liekang Zeng, Zhi Zhou, and Xu Chen. 2019. âEdge AI:\nOn-Demand Accelerating Deep Neural Network Inference via Edge\nComputing.â IEEE Transactions on Wireless Communications\n19 (1): 447â57.\n\n\nNorthcutt, Curtis G, Anish Athalye, and Jonas Mueller. 2021.\nâPervasive Label Errors in Test Sets Destabilize Machine Learning\nBenchmarks.â arXiv, March. https://doi.org/ \nhttps://doi.org/10.48550/arXiv.2103.14749 arXiv-issued DOI via\nDataCite.\n\n\nPushkarna, Mahima, Andrew Zaldivar, and Oddur Kjartansson. 2022.\nâData Cards: Purposeful and Transparent Dataset Documentation for\nResponsible Ai.â 2022 ACM Conference on Fairness,\nAccountability, and Transparency. https://doi.org/10.1145/3531146.3533231.\n\n\nRatner, Alex, Braden Hancock, Jared Dunnmon, Roger Goldman, and\nChristopher RĂ©. 2018. âSnorkel Metal: Weak Supervision for\nMulti-Task Learning.â Proceedings of the Second Workshop on\nData Management for End-To-End Machine Learning. https://doi.org/10.1145/3209889.3209898.\n\n\nRosenblatt, Frank. 1957. The Perceptron, a Perceiving and\nRecognizing Automaton Project Para. Cornell Aeronautical\nLaboratory.\n\n\nRumelhart, David E, Geoffrey E Hinton, and Ronald J Williams. 1986.\nâLearning Representations by Back-Propagating Errors.â\nNature 323 (6088): 533â36.\n\n\nSheng, Victor S., and Jing Zhang. 2019. âMachine Learning with\nCrowdsourcing: A Brief Summary of the Past Research and Future\nDirections.â Proceedings of the AAAI Conference on Artificial\nIntelligence 33 (01): 9837â43. https://doi.org/10.1609/aaai.v33i01.33019837.\n\n\nVaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion\nJones, Aidan N Gomez, Ćukasz Kaiser, and Illia Polosukhin. 2017.\nâAttention Is All You Need.â Advances in Neural\nInformation Processing Systems 30.\n\n\nWarden, Pete, and Daniel Situnayake. 2019. Tinyml: Machine Learning\nwith Tensorflow Lite on Arduino and Ultra-Low-Power\nMicrocontrollers. OâReilly Media."
},
{
"objectID": "tools.html#hardware-kits",
6.2 Problem
-Explanation: This section is a crucial starting point in any data engineering project, as it lays the groundwork for the project’s trajectory and ultimate success. Here’s a brief explanation of why each subsection within the “Problem Definition” is important:
-
-- Identifying the Problem
-- Setting Clear Objectives
-- Benchmarks for Success
-- Stakeholder Engagement and Understanding
-- Understanding the Constraints and Limitations of Embedded Systems
-
+6.2 Problem Definition
+In many domains of machine learning, while sophisticated algorithms take center stage, the fundamental importance of data quality is often overlooked. This neglect gives rise to “Data Cascades”: events where lapses in data quality compound, leading to negative downstream consequences such as flawed predictions, project terminations, and even potential harm to communities.
+Despite many ML professionals recognizing the importance of data, numerous practitioners report facing these cascades. This highlights a systemic issue: while the allure of developing advanced models persists, the data itself is often underappreciated.
+Take, for example, Keyword Spotting (KWS). KWS serves as a prime example of TinyML in action and is a critical technology behind voice-enabled interfaces on endpoint devices such as smartphones. Typically functioning as lightweight wake-word engines, these systems are consistently active, listening for a specific phrase to trigger further actions. When we say the phrases “Ok Google” or “Alexa,” this initiates a process on a microcontroller embedded within the device. Despite their limited resources, these microcontrollers play a pivotal role in enabling seamless voice interactions with devices, often operating in environments with high levels of ambient noise. The uniqueness of the wake-word helps minimize false positives, ensuring that the system is not triggered inadvertently.
+It is important to appreciate that these keyword spotting technologies are not isolated; they integrate seamlessly into larger systems, processing signals continuously while managing low power consumption. These systems extend beyond simple keyword recognition, evolving to facilitate diverse sound detections, such as the breaking of glass. This evolution is geared towards creating intelligent devices capable of understanding and responding to a myriad of vocal commands, heralding a future where even household appliances can be controlled through voice interactions.
+Building a reliable KWS model is not a straightforward task. It demands a deep understanding of the deployment scenario, encompassing where and how these devices will operate. For instance, a KWS model’s effectiveness is not just about recognizing a word; it’s about discerning it among various accents and background noises, whether in a bustling cafe or amid the blaring sound of a television in a living room or a kitchen where these devices are commonly found. It’s about ensuring that a whispered “Alexa” in the dead of night or a shouted “Ok Google” in a noisy marketplace are both recognized with equal precision.
+Moreover, many of the current KWS voice assistants support a limited number of languages, leaving a substantial portion of the world’s linguistic diversity unrepresented. This limitation is partly due to the difficulty of gathering and monetizing data for languages spoken by smaller populations. The long-tail distribution of languages implies that many languages have limited data available, making the development of supportive technologies challenging.
+This level of accuracy and robustness hinges on the availability of data, the quality of that data, the ability to label it correctly, and transparency of the data for the end user, all before the data is used to train the model. But it all begins with a clear understanding of the problem statement or definition.
+Generally, in ML, problem definition has a few key steps:
+1. Identifying the problem definition clearly
+2. Setting clear objectives
+3. Establishing success benchmarks
+4. Understanding end-user engagement/use
+5. Understanding the constraints and limitations of deployment
+6. Finally, performing the data collection
Laying a solid foundation for a project is essential for its trajectory and eventual success. Central to this foundation is first identifying a clear problem, such as ensuring that voice commands in voice assistance systems are recognized consistently across varying environments. Clear objectives, like creating representative datasets for diverse scenarios, provide a unified direction. Benchmarks, such as system accuracy in keyword detection, offer measurable outcomes to gauge progress. Engaging with stakeholders, from end-users to investors, provides invaluable insights and ensures alignment with market needs. Additionally, when delving into areas like voice assistance, understanding platform constraints is pivotal. Embedded systems, such as microcontrollers, come with inherent limitations in processing power, memory, and energy efficiency. Recognizing these limitations ensures that functionalities, like keyword detection, are tailored to operate optimally, balancing performance with resource conservation.
+In this context, using KWS as an example, we can break each of the steps out as follows:
+1. Identifying the Problem: At its core, KWS aims to detect specific keywords amidst a sea of ambient sounds and other spoken words. The primary problem is to design a system that can recognize these keywords with high accuracy, low latency, and minimal false positives or negatives, especially when deployed on devices with limited computational resources.
+2. Setting Clear Objectives: The objectives for a KWS system might include:
+   - Achieving a specific accuracy rate (e.g., 98% accuracy in keyword detection).
+   - Ensuring low latency (e.g., keyword detection and response within 200 milliseconds).
+   - Minimizing power consumption to extend battery life on embedded devices.
+   - Ensuring the model’s size is optimized for the available memory on the device.
+3. Benchmarks for Success: Establish clear metrics to measure the success of the KWS system (a short example of computing some of these appears after this list). These could include:
+   - True Positive Rate: The percentage of correctly identified keywords.
+   - False Positive Rate: The percentage of non-keywords incorrectly identified as keywords.
+   - Response Time: The time taken from keyword utterance to system response.
+   - Power Consumption: Average power used during keyword detection.
+4. Stakeholder Engagement and Understanding: Engage with stakeholders, which might include device manufacturers, hardware and software developers, and end-users. Understand their needs, capabilities, and constraints. For instance:
+   - Device manufacturers might prioritize low power consumption.
+   - Software developers might emphasize ease of integration.
+   - End-users would prioritize accuracy and responsiveness.
+5. Understanding the Constraints and Limitations of Embedded Systems: Embedded devices come with their own set of challenges:
+   - Memory Limitations: KWS models need to be lightweight to fit within the memory constraints of embedded devices. Typically, a KWS model might need to be as small as 16 KB to fit in the always-on island of the SoC, and that is just the model size; additional application code for preprocessing may also need to fit within the memory constraints.
+   - Processing Power: The computational capabilities of embedded devices are limited (a few hundred MHz of clock speed), so the KWS model must be optimized for efficiency.
+   - Power Consumption: Since many embedded devices are battery-powered, the KWS system must be power-efficient.
+   - Environmental Challenges: Devices might be deployed in various environments, from quiet bedrooms to noisy industrial settings. The KWS system must be robust enough to function effectively across these scenarios.
+6. Data Collection and Analysis: For a KWS system, the quality and diversity of data are paramount. Considerations might include:
+   - Variety of Accents: Collect data from speakers with various accents to ensure wide-ranging recognition.
+   - Background Noises: Include data samples with different ambient noises to train the model for real-world scenarios.
+   - Keyword Variations: People might pronounce keywords differently or use slight variations of the wake word itself. Ensure the dataset captures these nuances.
+7. Iterative Feedback and Refinement: Once a prototype KWS system is developed, it’s crucial to test it in real-world scenarios, gather feedback, and iteratively refine the model. This ensures that the system remains aligned with the defined problem and objectives. This is important because deployment scenarios change over time as things evolve.
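To make the benchmark discussion concrete, here is a minimal, hypothetical sketch of how the detection-quality metrics above could be computed from an evaluation log; the log entries and numbers below are invented purely for illustration.

```python
# Toy evaluation log for a keyword-spotting model (values are made up).
log = [
    # (keyword_was_spoken, model_triggered, latency_ms)
    (True, True, 142), (True, True, 180), (True, False, None),
    (False, False, None), (False, True, 95), (False, False, None),
]

tp = sum(1 for spoken, fired, _ in log if spoken and fired)
fn = sum(1 for spoken, fired, _ in log if spoken and not fired)
fp = sum(1 for spoken, fired, _ in log if not spoken and fired)
tn = sum(1 for spoken, fired, _ in log if not spoken and not fired)

tpr = tp / (tp + fn)              # True Positive Rate
fpr = fp / (fp + tn)              # False Positive Rate
latencies = [ms for spoken, fired, ms in log if spoken and fired]
mean_latency = sum(latencies) / len(latencies)   # proxy for Response Time

print(f"TPR={tpr:.2f}  FPR={fpr:.2f}  mean latency={mean_latency:.0f} ms")
```

Power consumption, by contrast, has to be measured on the target hardware rather than derived from logs like these.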
6.3 Data Sourcing
-Explanation: This section delves into the first step in data engineering: gathering data. Understanding various data types and sources is vital for developing robust AI systems, especially in the context of embedded systems where resources might be limited.
-
-- Data Sources: crowdsourcing, pre-existing datasets, etc.
-- Data Types: Structured, Semi-Structured, and Unstructured
-- Real-time Data Processing in Embedded Systems
-
The quality and diversity of the data gathered are important for developing accurate and robust AI systems. Sourcing high-quality training data requires careful consideration of the objectives, resources, and ethical implications. Data can be obtained from various sources depending on the needs of the project:
+6.3.1 Pre-existing Datasets
+Platforms like Kaggle and the UCI Machine Learning Repository provide a convenient starting point. Pre-existing datasets are a valuable resource for researchers, developers, and businesses alike. One of their primary advantages is cost-efficiency. Creating a dataset from scratch can be both time-consuming and expensive, so having access to ready-made data can save significant resources. Moreover, many of these datasets, like ImageNet, have become standard benchmarks in the machine learning community, allowing for consistent performance comparisons across different models and algorithms. This availability of data means that experiments can be started immediately without any delays associated with data collection and preprocessing. In a fast-moving field like ML, this expediency is important.
+The quality assurance that comes with popular pre-existing datasets is important to consider because many datasets contain errors; for instance, the ImageNet dataset was found to have over 6.4% label errors. Given their widespread use, errors or biases in these datasets are often identified and rectified by the community. This assurance is especially beneficial for students and newcomers to the field, as they can focus on learning and experimentation without worrying about data integrity. The supporting documentation that often accompanies existing datasets is invaluable, though this generally applies only to widely used datasets. Good documentation provides insights into the data collection process and variable definitions, and sometimes even offers baseline model performances. This information not only aids understanding but also promotes reproducibility in research, a cornerstone of scientific integrity and one that machine learning is currently struggling with. When other researchers have access to the same data, they can validate findings, test new hypotheses, or apply different methodologies, allowing us to build on each other’s work more rapidly.
+While platforms like Kaggle and the UCI Machine Learning Repository are invaluable resources, it’s essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes these datasets do not reflect real-world conditions.
+In addition, these datasets may suffer from bias, validity, and reproducibility issues, and awareness of such problems has grown in recent years.
+6.3.2 Web Scraping
+Web scraping refers to automated techniques for extracting data from websites. It typically involves sending HTTP requests to web servers, retrieving HTML content, and parsing that content to extract relevant information. Popular tools and frameworks for web scraping include Beautiful Soup, Scrapy, and Selenium. These tools offer different functionalities, from parsing HTML content to automating web browser interactions, especially for websites that load content dynamically using JavaScript.
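As a rough illustration of the kind of tooling involved, below is a minimal scraping sketch using the requests and Beautiful Soup libraries (assuming both are installed). The URL is a placeholder, and a real scraper must respect the target site’s Terms of Service and robots.txt, as discussed later in this section.

```python
# Minimal web-scraping sketch; "https://example.com" is a placeholder URL.
import urllib.robotparser

import requests
from bs4 import BeautifulSoup

BASE_URL = "https://example.com"

# Check robots.txt before fetching anything.
robots = urllib.robotparser.RobotFileParser(f"{BASE_URL}/robots.txt")
robots.read()

if robots.can_fetch("*", BASE_URL):
    response = requests.get(BASE_URL, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    # Collect image URLs and headline text as candidate raw data.
    image_urls = [img.get("src") for img in soup.find_all("img") if img.get("src")]
    headlines = [h.get_text(strip=True) for h in soup.find_all("h2")]
    print(f"Found {len(image_urls)} images and {len(headlines)} headlines.")
else:
    print("robots.txt disallows scraping this page.")
```

Frameworks like Scrapy add crawling, rate limiting, and storage on top of this basic request-parse loop, while Selenium drives a real browser for JavaScript-heavy pages.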
+Web scraping can be an effective way to gather large datasets for training machine learning models, particularly when human-labeled data is scarce. For computer vision research, web scraping enables the collection of massive volumes of images and videos. Researchers have used this technique to build influential datasets like ImageNet and OpenImages. For example, one could scrape e-commerce sites to amass product photos for object recognition, or social media platforms to collect user uploads for facial analysis. Even before ImageNet, MIT’s LabelMe project scraped Flickr for over 63,000 annotated images covering hundreds of object categories.
+Beyond computer vision, web scraping supports the gathering of textual data for natural language tasks. Researchers can scrape news sites for sentiment analysis data, forums and review sites for dialogue systems research, or social media for topic modeling. For example, the training data for the chatbot ChatGPT was obtained by scraping much of the public internet, and GitHub repositories were scraped to train GitHub’s Copilot AI coding assistant.
+Web scraping can also collect structured data like stock prices, weather data, or product information for analytical applications. Once data is scraped, it is essential to store it in a structured manner, often using databases or data warehouses. Proper data management ensures the usability of the scraped data for future analysis and applications.
+However, while web scraping offers numerous advantages, there are significant limitations and ethical considerations to bear in mind. Not all websites permit scraping, and violating these restrictions can lead to legal repercussions. It is also unethical and potentially illegal to scrape copyrighted material or private communications. Ethical web scraping mandates adherence to a website’s “robots.txt” file, which outlines the sections of the site that can be accessed and scraped by automated bots. To deter automated scraping, many websites implement rate limits. If a bot sends too many requests in a short period, it might be temporarily blocked, restricting the speed of data access. Additionally, the dynamic nature of web content means that data scraped at different intervals might lack consistency, posing challenges for longitudinal studies. That said, there are emerging trends like web navigation, where machine learning algorithms automatically navigate a website to access its dynamically loaded content.
+For niche subjects, the volume of pertinent data available for scraping might be limited. For example, while scraping for common topics like images of cats and dogs might yield abundant data, searching for rare medical conditions might not be as fruitful. Moreover, the data obtained through scraping is often unstructured and noisy, necessitating thorough preprocessing and cleaning. It is crucial to understand that not all scraped data will be of high quality or accuracy. Employing verification methods, such as cross-referencing with alternate data sources, can enhance data reliability.
+Privacy concerns arise when scraping personal data, emphasizing the need for anonymization. Therefore, it is paramount to adhere to a website’s Terms of Service, confine data collection to public domains, and ensure the anonymity of any personal data acquired.
+While web scraping can be a scalable method to amass large training datasets for AI systems, its applicability is confined to specific data types. For example, sourcing Inertial Measurement Unit (IMU) data for gesture recognition is not straightforward through web scraping. At most, one might be able to scrape an existing dataset.
+6.3.3 Crowdsourcing
+Crowdsourcing for datasets is the practice of obtaining data by using the services of a large number of people, either from a specific community or the general public, typically via the internet. Instead of relying on a small team or specific organization to collect or label data, crowdsourcing leverages the collective effort of a vast, distributed group of participants. Services like Amazon Mechanical Turk enable the distribution of annotation tasks to a large, diverse workforce. This facilitates the collection of labels for complex tasks like sentiment analysis or image recognition that specifically require human judgment.
+Crowdsourcing has emerged as an effective approach for many data collection and problem-solving needs. One major advantage of crowdsourcing is scalability: by distributing tasks to a large, global pool of contributors on digital platforms, projects can process huge volumes of data in a short timeframe. This makes crowdsourcing ideal for large-scale data labeling, collection, and analysis.
+In addition, crowdsourcing taps into a diverse group of participants, bringing a wide range of perspectives, cultural insights, and language abilities that can enrich data and enhance creative problem-solving in ways that a more homogenous group may not. Because crowdsourcing draws from a large audience beyond traditional channels, it also tends to be more cost-effective than conventional methods, especially for simpler microtasks.
+Crowdsourcing platforms also allow for great flexibility, as task parameters can be adjusted in real-time based on initial results. This creates a feedback loop for iterative improvements to the data collection process. Complex jobs can be broken down into microtasks and distributed to multiple people, with cross-validation of results by assigning redundant versions of the same task. Ultimately, when thoughtfully managed, crowdsourcing enables community engagement around a collaborative project, where participants find reward in contributing.
+However, while crowdsourcing offers numerous advantages, it’s essential to approach it with a clear strategy. While it provides access to a diverse set of annotators, it also introduces variability in the quality of annotations. Additionally, platforms like Mechanical Turk might not always capture a complete demographic spectrum; often tech-savvy individuals are overrepresented, while children and the elderly may be underrepresented. It’s crucial to provide clear instructions and possibly even training for the annotators. Periodic checks and validations of the labeled data can help maintain quality. This ties back to the topic of clear Problem Definition that we discussed earlier. Crowdsourcing for datasets also requires careful attention to ethical considerations. It’s crucial to ensure that participants are informed about how their data will be used and that their privacy is protected. Quality control through detailed protocols, transparency in sourcing, and auditing is essential to ensure reliable outcomes.
+For TinyML, crowdsourcing can pose some unique challenges. TinyML devices are highly specialized for particular tasks within tight constraints. As a result, the data they require tends to be very specific. It may be difficult to obtain such specialized data from a general audience through crowdsourcing. For example, TinyML applications often rely on data collected from certain sensors or hardware. Crowdsourcing would require participants to have access to very specific and consistent devices, such as microphones with the same sampling rates. Even for simple audio tasks like keyword spotting, these hardware nuances present obstacles.
+Beyond hardware, the data itself needs high granularity and quality, given the limitations of TinyML. It can be hard to ensure this when crowdsourcing from those unfamiliar with the application’s context and requirements. There are also potential issues around privacy, real-time collection, standardization, and technical expertise. Moreover, the narrow nature of many TinyML tasks makes accurate data labeling difficult without the proper understanding. Participants may struggle to provide reliable annotations without full context.
+Thus, while crowdsourcing can work well in many cases, the specialized needs of TinyML introduce unique data challenges. Careful planning is required for guidelines, targeting, and quality control. For some applications, crowdsourcing may be feasible, but others may require more focused data collection efforts to obtain relevant, high-quality training data.
+6.3.4 Synthetic Data
+Synthetic data generation can be useful for addressing some of the limitations of data collection. It involves creating data that wasn’t originally captured or observed, but is generated using algorithms, simulations, or other techniques to resemble real-world data. It has become a valuable tool in various fields, particularly in scenarios where real-world data is scarce, expensive, or ethically challenging to obtain (e.g., TinyML). Various techniques, such as Generative Adversarial Networks (GANs), can produce high-quality synthetic data that is almost indistinguishable from real data. These techniques have advanced significantly, making synthetic data generation increasingly realistic and reliable.
+In many domains, especially emerging ones, there may not be enough real-world data available for analysis or training machine learning models. Synthetic data can fill this gap by producing large volumes of data that mimic real-world scenarios. For instance, detecting the sound of breaking glass might be challenging in security applications where a TinyML device is trying to identify break-ins. Collecting real-world data would require breaking numerous windows, which is impractical and costly.
+Moreover, in machine learning, especially in deep learning, having a diverse dataset is crucial. Synthetic data can augment existing datasets by introducing variations, thereby enhancing the robustness of models. For example, SpecAugment is an excellent data augmentation technique for Automatic Speech Recognition (ASR) systems.
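As a simplified illustration of this style of augmentation, the sketch below applies SpecAugment-like frequency and time masking to a stand-in spectrogram using NumPy. It omits the time-warping step of the full method and uses random data rather than real audio features.

```python
# Simplified SpecAugment-style masking on a random stand-in spectrogram.
import numpy as np

rng = np.random.default_rng(0)
spectrogram = rng.random((80, 100))  # (mel bins, time frames), not real audio

def spec_augment(spec, freq_mask_width=8, time_mask_width=10):
    """Zero out one random frequency band and one random time band."""
    augmented = spec.copy()
    n_mels, n_frames = augmented.shape
    f0 = rng.integers(0, n_mels - freq_mask_width)
    t0 = rng.integers(0, n_frames - time_mask_width)
    augmented[f0:f0 + freq_mask_width, :] = 0.0   # frequency mask
    augmented[:, t0:t0 + time_mask_width] = 0.0   # time mask
    return augmented

augmented = spec_augment(spectrogram)
```

Each training pass can draw fresh masks, so one recording yields many slightly different training examples.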
+Privacy and confidentiality are also major concerns. Datasets containing sensitive or personal information pose privacy risks when shared or used. Synthetic data, being artificially generated, doesn’t have these direct ties to real individuals, allowing for safer use while preserving essential statistical properties.
+Generating synthetic data, especially once the generation mechanisms have been established, can be a more cost-effective alternative. In the aforementioned security application scenario, synthetic data eliminates the need for breaking multiple windows to gather relevant data.
+Many embedded use cases deal with unique situations, such as manufacturing plants, that are difficult to replicate or capture data from in practice. Synthetic data gives researchers complete control over the data generation process, enabling the creation of specific scenarios or conditions that are challenging to capture in real life.
+While synthetic data offers numerous advantages, it is essential to use it judiciously. Care must be taken to ensure that the generated data accurately represents the underlying real-world distributions and does not introduce unintended biases.
+6.4 Data Storage
-Explanation: Data must be stored and managed efficiently to facilitate easy access and processing. This section will provide insights into different data storage options and their respective advantages and challenges in embedded systems.
-
-- Data Warehousing
-- Data Lakes
-- Metadata Management
-- Data Governance
-
+Data sourcing and data storage go hand in hand, and it is necessary to store data in a format that facilitates easy access and processing. Depending on the use case, there are various kinds of data storage systems that can be used to store your datasets.
+|  | Database | Data Warehouse | Data Lake |
+|---|---|---|---|
+| Purpose | Operational and transactional | Analytical | Analytical |
+| Data type | Structured | Structured | Structured, semi-structured and/or unstructured |
+| Scale | Small to large volumes of data | Large volumes of integrated data | Large volumes of diverse data |
+| Examples | MySQL | Google BigQuery, Amazon Redshift, Microsoft Azure Synapse | Google Cloud Storage, AWS S3, Azure Data Lake Storage |
The stored data is often accompanied by metadata, which is defined as “data about data”. It provides detailed contextual information about the data, such as the means of data creation, the time of creation, the attached data-use license, and so on. For example, Hugging Face has Dataset Cards. To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset's contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates, or analytical results.
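To make this concrete, below is a small, hypothetical dataset-card-style metadata record expressed as a Python dictionary. The field names are only loosely inspired by Hugging Face Dataset Cards and are not an official schema; the dataset itself is invented.

```python
# Hypothetical dataset-card-style metadata for a keyword-spotting dataset.
# Field names and values are illustrative only, not an official schema.
dataset_card = {
    "name": "example-kws-clips",                 # made-up dataset name
    "created": "2023-05-01",
    "creator": "Example Research Group",
    "license": "CC-BY-4.0",
    "collection_method": "crowdsourced smartphone recordings",
    "languages": ["en", "es"],
    "size": {"clips": 120_000, "hours": 33.4},
    "known_biases": [
        "Younger speakers are overrepresented",
        "Few recordings with heavy background noise",
    ],
    "intended_use": "Training and evaluating on-device wake-word models",
}
```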
+Data Governance2: With large amounts of data in storage, it is also imperative to have policies and practices (i.e., data governance) that help manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim of upholding its quality, ensuring compliance, maintaining security, and deriving value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks.
2 Janssen, Marijn, et al. "Data governance: Organizing data for trustworthy Artificial Intelligence." Government Information Quarterly 37.3 (2020): 101493.
Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts.
+Figure source: https://www.databricks.com/discover/data-governance
+Some examples of data governance across different sectors include:
+- Medicine: Health Information Exchanges (HIEs) enable the sharing of health information across different healthcare providers to improve patient care. They implement strict data governance practices to maintain data accuracy, integrity, privacy, and security, complying with regulations such as the Health Insurance Portability and Accountability Act (HIPAA). Governance policies ensure that patient data is only shared with authorized entities and that patients can control access to their information.
+- Finance: The Basel III Framework is an international regulatory framework for banks. It ensures that banks establish clear policies, practices, and responsibilities for data management, ensuring data accuracy, completeness, and timeliness. Not only does it enable banks to meet regulatory compliance requirements, it also helps prevent financial crises through more effective risk management.
+- Government: Government agencies managing citizen data, public records, and administrative information implement data governance to manage data transparently and securely. The Social Security system in the US and the Aadhaar system in India are good examples of such governance systems.
6.5 Data Processing
-Explanation: Data processing is a pivotal step in transforming raw data into a usable format. This section provides a deep dive into the necessary processes, which include cleaning, integration, and establishing data pipelines, all crucial for streamlining operations in embedded AI systems.
-
-- Data Cleaning and Transformation
-- Data Pipelines
-- Batch vs. Stream Processing
-
+Special data storage considerations for tinyML
+Efficient Audio Storage Formats: Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve:
+1. Extracting acoustic features: Mel-frequency cepstral coefficients (MFCCs)3 are commonly used to represent important audio characteristics.
+2. Creating embeddings: Embeddings transform the extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that’s more manageable and efficient for computation and storage.
+3. Vector quantization4: This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information.
+4. Sequential storage: The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data.
3 Abdul, Zrar Kh, and Abdulbasit K. Al-Talabani. "Mel Frequency Cepstral Coefficient and its applications: A Review." IEEE Access (2022).
4 Vasuki, A., and P. T. Vanathi. "A review of vector quantization techniques." IEEE Potentials 25.4 (2006): 39-47.
This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio.
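A minimal sketch of the four-step workflow above is shown below, assuming the librosa and SciPy packages are available. The audio is synthetic noise standing in for a real recording, and the codebook size is an arbitrary choice for illustration.

```python
# Sketch: MFCC extraction followed by a small vector-quantization codebook.
import numpy as np
import librosa
from scipy.cluster.vq import kmeans, vq

sr = 16000
audio = np.random.default_rng(0).standard_normal(sr).astype(np.float32)  # 1 s of noise

# 1. Extract compact acoustic features (MFCCs), one vector per frame.
mfccs = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13)     # shape: (13, n_frames)
frames = mfccs.T.astype(np.float64)                         # (n_frames, 13)

# 2-3. Learn a small codebook and quantize each frame to its nearest code vector.
codebook, _ = kmeans(frames, 16)      # 16 code vectors learned from the frames
codes, _ = vq(frames, codebook)       # one small integer index per frame

# 4. Storing `codes` sequentially preserves temporal order at a fraction of the
#    size of the raw waveform or the full MFCC matrix.
print(frames.shape, codes.shape)
```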
+Selective Network Output Storage: Another technique for reducing storage is to discard the intermediate audio features stored during training but not required during inference. The network is run on the full audio during training; however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2020)5, the authors discuss adapting the model’s intermediate data storage structure to incorporate the nature of streaming models that are prevalent in TinyML applications.
5 Rybakov, Oleg, et al. "Streaming keyword spotting on mobile devices." arXiv preprint arXiv:2005.06720 (2020).
6.6 Data Quality
-Explanation: Ensuring data quality is critical to developing reliable AI models. This section outlines various strategies to assure and evaluate data quality.
-
-- Data Validation
-- Handling Missing Values
-- Outlier Detection
-- Data Provenance
-
+6.5 Data Processing
+Data processing refers to the steps involved in transforming raw data into a format that is suitable for feeding into machine learning algorithms. It is a crucial stage in any machine learning workflow, yet often overlooked. Without proper data processing, machine learning models are unlikely to achieve optimal performance. “Data preparation accounts for about 60-80% of the work of a data scientist.”
+Proper data cleaning is a crucial step that directly impacts model performance. Real-world data is often dirty: it contains errors, missing values, noise, anomalies, and inconsistencies. Data cleaning involves detecting and fixing these issues to prepare high-quality data for modeling. By carefully selecting appropriate techniques, data scientists can improve model accuracy, reduce overfitting, and enable algorithms to learn more robust patterns. Overall, thoughtful data processing allows machine learning systems to better uncover insights and make predictions from real-world data.
+Data often comes from diverse sources and can be unstructured or semi-structured. Thus, it’s essential to process and standardize it, ensuring it adheres to a uniform format. Such transformations may include the following (a brief sketch of these steps appears after the list):
+- Normalizing numerical variables
+- Encoding categorical variables
+- Using techniques like dimensionality reduction
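The sketch below illustrates these transformations with scikit-learn on a tiny, made-up table; the column names and values are purely illustrative, and real pipelines would be fit on far more data.

```python
# Sketch: normalize numeric columns, one-hot encode a categorical column,
# then apply a simple dimensionality reduction step.
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

raw = pd.DataFrame({
    "duration_s": [1.2, 0.8, 1.5, 1.1],
    "loudness_db": [-22.0, -30.5, -18.2, -25.0],
    "device": ["phone", "smart_speaker", "phone", "wearable"],
})

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), ["duration_s", "loudness_db"]),  # normalize numbers
    ("categorical", OneHotEncoder(), ["device"]),                  # encode categories
])

pipeline = Pipeline([
    ("preprocess", preprocess),
    ("reduce", PCA(n_components=2)),   # simple dimensionality reduction
])

features = pipeline.fit_transform(raw)
print(features.shape)  # (4, 2)
```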
Data validation serves a broader role than merely checking adherence to simple standards, such as ensuring that temperature values never fall below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings; such transients are not uncommon. Therefore, it is imperative to catch data errors early, before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation6, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them.
6 Vasuki, A., and P. T. Vanathi. "A review of vector quantization techniques." IEEE Potentials 25.4 (2006): 39-47.
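As a toy illustration of these validation ideas, the sketch below applies a physical range check and mean imputation to a small temperature series; the readings are invented, with -300 standing in for a faulty sensor transient.

```python
# Sketch: range-based validation plus mean imputation on a toy sensor series.
import numpy as np
import pandas as pd

readings = pd.Series([21.5, 22.0, np.nan, -300.0, 23.1, 21.9])  # °C

# 1. Validate: anything below absolute zero (-273.15 °C) must be a sensor fault.
readings[readings < -273.15] = np.nan

# 2. Impute: replace missing/invalid readings with the mean of the valid ones.
cleaned = readings.fillna(readings.mean())
print(cleaned.tolist())
```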
Let’s take a look at an example of a data processing pipeline. In the context of tinyML, the Multilingual Spoken Words Corpus (MSWC) is an example of a data processing pipeline: a systematic and automated workflow for data transformation, storage, and processing. By streamlining the data flow from raw data to usable datasets, data pipelines enhance productivity and facilitate the rapid development of machine learning models. The MSWC is an expansive and expanding collection of audio recordings of spoken words in 50 different languages, which are collectively used by over 5 billion people. This dataset is intended for academic study and business uses in areas like keyword identification and speech-based search. It is openly licensed under Creative Commons Attribution 4.0 for broad usage.
+The MSWC used a forced alignment method to automatically extract individual word recordings, for training keyword-spotting models, from the Common Voice project, which features crowdsourced sentence-level recordings. Forced alignment refers to a group of long-standing methods in speech processing that are used to predict when speech phenomena like syllables, words, or sentences start and end within an audio recording. In the MSWC data, crowdsourced recordings often feature background noises, such as static and wind. Depending on the model’s requirements, these noises can be removed or intentionally retained.
+Maintaining the integrity of the data infrastructure is a continuous endeavor. This encompasses data storage, security, error handling, and stringent version control. Periodic updates are crucial, especially in dynamic realms like keyword spotting, to adjust to evolving linguistic trends and device integrations.
+Data processing pipelines have proliferated and are commonly found in ML operations toolchains, which we will discuss in the MLOps chapter. Briefly, these include frameworks like MLOps by Google Cloud, which provides methods for automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management. Several of these mechanisms focus specifically on data processing, which is an integral part of these systems.
6.7 Feature Engineering
-Explanation: Feature engineering involves selecting and transforming variables to improve the performance of AI models. It’s vital in embedded AI systems where computational resources are limited, and optimized feature sets can significantly improve performance.
-
-- Importance of Feature Engineering
-- Techniques of Feature Selection
-- Feature Transformation for Embedded Systems
-- Embeddings
-- Real-time Feature Engineering in Embedded Systems
-
+6.6 Data Labeling
+Data labeling is an important step in creating high-quality training datasets for machine learning models. Labels provide the ground truth information that allows models to learn relationships between inputs and desired outputs. This section covers key considerations around selecting label types, formats, and content to capture the necessary information for given tasks. It discusses common annotation approaches, from manual labeling to crowdsourcing to AI-assisted methods, and best practices for ensuring label quality through training, guidelines, and quality checks. Ethical treatment of human annotators is also something we emphasize. The integration of AI to accelerate and augment human annotation is also explored. Understanding labeling needs, challenges, and strategies is essential for constructing reliable, useful datasets that can train performant, trustworthy machine learning systems.
+Label Types: Labels capture information about key tasks or concepts. Common label types include binary classification, bounding boxes, segmentation masks, transcripts, captions, etc. The choice of label format depends on the use case and resource constraints, as more detailed labels require greater effort to collect (Johnson-Roberson et al. (2017)).
+Unless focused on self-supervised learning, a dataset will likely provide labels addressing one or more tasks of interest. Dataset creators must consider what information labels should capture and how they can practically obtain the necessary labels, given their unique resource constraints. Creators must first decide what type(s) of content labels should capture. For example, a creator interested in car detection would want to label cars in their dataset. Still, they might also consider whether to simultaneously collect labels for other tasks that the dataset could potentially be used for in the future, such as pedestrian detection.
+Additionally, annotators can potentially provide metadata that provides insight into how the dataset represents different characteristics of interest (see: Data Transparency). The Common Voice dataset, for example, includes various types of metadata that provide information about the speakers, recordings, and dataset quality for each language represented (Ardila et al. (2020)). They include demographic splits showing the number of recordings by speaker age range and gender. This allows us to see the breakdown of who contributed recordings for each language. They also include statistics like average recording duration and total hours of validated recordings. These give insights into the nature and size of the datasets for each language. Additionally, quality control metrics like the percentage of recordings that have been validated are useful to know how complete and clean the datasets are. The metadata also includes normalized demographic splits scaled to 100% for comparison across languages. This highlights representation differences between higher and lower resource languages.
+Next, creators must determine the format of those labels. For example, a creator interested in car detection might choose between binary classification labels that say whether a car is present, bounding boxes that show the general locations of any cars, or pixel-wise segmentation labels that show the exact location of each car. Their choice of label format may depend both on their use case and their resource constraints, as finer-grained labels are typically more expensive and time-consuming to acquire.
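To illustrate the trade-off, the following sketch shows how the same hypothetical image might be labeled at the three levels of detail described above. The field names and image dimensions are made up for illustration.

```python
import numpy as np

# Illustrative only: three label formats for the same hypothetical image,
# at increasing levels of detail (and annotation cost).

# 1. Binary classification: is a car present at all?
classification_label = {"image_id": "frame_0042", "car_present": True}

# 2. Bounding box: approximate location as (x_min, y_min, x_max, y_max) in pixels.
bounding_box_label = {
    "image_id": "frame_0042",
    "boxes": [{"class": "car", "bbox": (112, 64, 340, 255)}],
}

# 3. Segmentation mask: a per-pixel class map (0 = background, 1 = car).
segmentation_label = {
    "image_id": "frame_0042",
    "mask": np.zeros((480, 640), dtype=np.uint8),  # filled in by the annotator
}
segmentation_label["mask"][64:255, 112:340] = 1  # crude example region
```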
Annotation Methods: Common annotation approaches include manual labeling, crowdsourcing, and semi-automated techniques. Manual labeling by experts yields high quality but lacks scalability. Crowdsourcing enables distributed annotation by non-experts, often through dedicated platforms (Sheng and Zhang (2019)). Weakly supervised and programmatic methods can reduce manual effort by heuristically or automatically generating labels (Ratner et al. (2018)).
After deciding on their labels' desired content and format, creators begin the annotation process. To collect large numbers of labels from human annotators, creators frequently rely on dedicated annotation platforms, which can connect them to teams of human annotators. When using these platforms, creators may have little insight into annotators' backgrounds and levels of experience with the topics of interest. However, some platforms offer access to annotators with specific expertise (e.g. doctors).
Ensuring Label Quality: There is no guarantee that the data labels are actually correct. Even with the best instructions, labelers may still mislabel some datapoints (Northcutt, Athalye, and Mueller (2021)). Strategies like quality checks, annotator training, and collecting multiple labels per datapoint can help ensure label quality. For ambiguous tasks, multiple annotators can help identify controversial datapoints and quantify disagreement levels.
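One common way to operationalize these quality checks is to collect several labels per datapoint, aggregate them by majority vote, and flag items with low agreement for expert review. The sketch below illustrates the idea; the toy votes and the agreement threshold are assumptions, not a recommended standard.

```python
from collections import Counter

# Sketch: aggregate multiple labels per datapoint by majority vote and
# flag datapoints whose annotators disagree for further review.
# The example votes and the 0.67 threshold are made up for illustration.

labels_per_item = {
    "img_001": ["car", "car", "car"],
    "img_002": ["car", "truck", "car"],
    "img_003": ["truck", "car", "bus"],
}

def aggregate(votes, agreement_threshold=0.67):
    counts = Counter(votes)
    label, count = counts.most_common(1)[0]
    agreement = count / len(votes)
    return label, agreement, agreement < agreement_threshold

for item_id, votes in labels_per_item.items():
    label, agreement, needs_review = aggregate(votes)
    print(item_id, label, f"agreement={agreement:.2f}",
          "REVIEW" if needs_review else "")
```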
When working with human annotators, it is important to offer fair compensation and otherwise prioritize ethical treatment, as annotators can be exploited or otherwise harmed during the labeling process (Perrigo, 2023). For example, if a dataset is likely to contain disturbing content, annotators may benefit from having the option to view images in grayscale (Google (n.d.)).
AI-Assisted Annotation: ML has an insatiable demand for data, and no dataset ever seems large enough. This raises the question of how we can obtain more labeled data. Rather than always generating and curating data manually, we can rely on existing AI models to help label datasets more quickly and cheaply, though often with lower quality than human annotation. This can be done in various ways, such as the following:
- Pre-annotation: AI models can generate preliminary labels for a dataset using methods such as semi-supervised learning (Chapelle, Scholkopf, and Zien (2009)), which humans can then review and correct. This can save a significant amount of time, especially for large datasets.
- Active learning: AI models can identify the most informative data points in a dataset, which can then be prioritized for human annotation. This can help improve the labeled dataset's quality while reducing the overall annotation time.
- Quality control: AI models can be used to identify and flag potential errors in human annotations. This can help to ensure the accuracy and consistency of the labeled dataset.
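As a concrete illustration of the active learning idea, the sketch below scores an unlabeled pool with a least-confidence heuristic and selects the items to send to annotators first. It uses scikit-learn on synthetic data purely for illustration; a real system would plug in its own model and data pool.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch of uncertainty-based active learning: train on a small labeled
# seed set, score the unlabeled pool, and prioritize the least-confident
# examples for human annotation. The data here is synthetic.

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(40, 5))
y_labeled = (X_labeled[:, 0] > 0).astype(int)   # toy labeling rule
X_pool = rng.normal(size=(1000, 5))             # unlabeled pool

model = LogisticRegression().fit(X_labeled, y_labeled)

proba = model.predict_proba(X_pool)             # shape (n_pool, n_classes)
uncertainty = 1.0 - proba.max(axis=1)           # least-confidence score
priority = np.argsort(-uncertainty)[:25]        # top 25 items to label next

print("Indices to send to annotators:", priority)
```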
Here are some examples of how AI-assisted annotation has been proposed to be useful:
- Medical imaging: AI-assisted annotation is being used to label medical images, such as MRI scans and X-rays (Krishnan, Rajpurkar, and Topol (2022)). Carefully annotating medical datasets is extremely challenging, especially at scale, since domain experts are scarce and their time is costly. AI assistance can help train models to diagnose diseases and other medical conditions more accurately and efficiently.
- Self-driving cars: AI-assisted annotation is being used to label images and videos from self-driving cars. This can help train AI models to identify objects on the road, such as other vehicles, pedestrians, and traffic signs.
- Social media: AI-assisted annotation is being used to label social media posts, such as images and videos. This can help train AI models to identify and classify different types of content, such as news, advertising, and personal posts.
6.7 Data Version Control
Production systems are perpetually inundated with fluctuating and escalating volumes of data, prompting the rapid emergence of numerous data replicas. This proliferating data serves as the foundation for training machine learning models. For instance, a global sales company engaged in sales forecasting continuously receives consumer behavior data, and healthcare systems formulating predictive models for disease diagnosis are consistently acquiring new patient data. TinyML applications, such as keyword spotting, are highly data-hungry and continually generate new data. Consequently, meticulous tracking of data versions and the corresponding model performance is imperative.
Data version control offers a structured methodology to handle alterations and versions of datasets efficiently. It facilitates the monitoring of modifications, preserves multiple versions, and guarantees reproducibility and traceability in data-centric projects. Furthermore, data version control provides the versatility to review and utilize specific versions as needed, ensuring that each stage of data processing and model development can be revisited and audited with precision and ease. It has a variety of practical uses:
- Risk Management: Data version control allows transparency and accountability by tracking versions of the dataset.
- Collaboration and Efficiency: Easy access to different versions of the dataset in one place improves data sharing of specific checkpoints and enables efficient collaboration.
- Reproducibility: Data version control allows the performance of models to be tracked against specific versions of the data, thereby enabling reproducibility.
Key Concepts
- Commits: A commit is an immutable snapshot of the data at a specific point in time, representing a unique version. Every commit is associated with a unique identifier, allowing that exact version of the data to be referenced and restored later.
- Branches: Branching allows developers and data scientists to diverge from the main line of development and continue to work independently without affecting other branches. This is especially useful when experimenting with new features or models, enabling parallel development and experimentation without the risk of corrupting the stable, main branch.
- Merges: Merges help integrate changes from different branches while maintaining the integrity of the data.
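To make the commit concept tangible, here is a minimal sketch that fingerprints every file in a dataset directory and appends the snapshot to a JSON manifest. This only illustrates the bookkeeping idea; dedicated tools like the ones below also handle remote storage, branching, and merging. The directory path and manifest format are assumptions.

```python
import hashlib
import json
import time
from pathlib import Path

# Sketch of the "commit" idea behind data version control: hash every file
# in a dataset directory and record the snapshot in a manifest. The layout
# is illustrative, not how any particular tool stores its metadata.

def file_hash(path: Path) -> str:
    return hashlib.sha256(path.read_bytes()).hexdigest()

def commit_dataset(data_dir: str, message: str,
                   manifest_path: str = "data_manifest.json") -> str:
    files = {str(p): file_hash(p)
             for p in sorted(Path(data_dir).rglob("*")) if p.is_file()}
    commit = {
        "id": hashlib.sha256(json.dumps(files, sort_keys=True).encode()).hexdigest()[:12],
        "timestamp": time.time(),
        "message": message,
        "files": files,
    }
    manifest = []
    if Path(manifest_path).exists():
        manifest = json.loads(Path(manifest_path).read_text())
    manifest.append(commit)
    Path(manifest_path).write_text(json.dumps(manifest, indent=2))
    return commit["id"]

# Hypothetical usage:
# commit_dataset("data/keyword_spotting", "Add 500 new wake-word recordings")
```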
Popular Data Version Control Systems
- DVC: Short for Data Version Control, DVC is an open-source, lightweight tool that works on top of Git and supports all kinds of data formats. It integrates seamlessly into the Git workflow if Git is being used to manage code. It captures the versions of data and models in Git commits while storing the data and models themselves on premises or in the cloud (e.g. AWS, Google Cloud, Azure). These data and models (i.e. ML artifacts) are described in metadata files, which are updated with every commit. DVC also allows model metrics to be tracked across different versions of the data.
- lakeFS: An open-source tool that supports data version control on data lakes. It supports many Git-like operations, such as branching and merging of data, as well as reverting to previous versions of the data. It also provides a UI that makes exploring and managing data much easier.
- Git LFS: Useful for data version control on smaller datasets. It uses Git's built-in branching and merging features but is limited in terms of tracking metrics, reverting to previous versions, and integrating with data lakes.
6.8 Optimizing Data for Embedded AI
Creators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for unusually specific use cases, requiring heavy filtering of datasets. While a general-purpose speech model may be capable of turning any speech into text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data because they do not address the task of interest. Additionally, an embedded AI system may be tied to specific hardware devices or environments. For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, show the wrong type of scenery, or were taken from the wrong height or angle.
On the other hand, embedded AI systems are often expected to provide especially accurate performance in unpredictable real-world settings. Creators may therefore design datasets specifically to represent variations in potential inputs and promote model robustness: they may define a narrow scope for their project but then aim for deep coverage within those bounds. For example, creators of the doorbell model mentioned above might try to cover variations in data arising from:
- Geographically, socially, and architecturally diverse neighborhoods
- Different types of artificial and natural lighting
- Different seasons and weather conditions
- Obstructions (e.g. raindrops or delivery boxes obscuring the camera's view)
As described above, creators may consider crowdsourcing or synthetically generating data to include these different kinds of variations.
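The sketch below illustrates this kind of aggressive, scope-driven filtering for the hypothetical doorbell camera. The metadata fields (camera_model, scene, mount_height_m) and their allowed values are invented for illustration; a real pipeline would read them from the dataset's own annotation files.

```python
# Sketch: aggressive filtering for a narrowly scoped embedded use case.
# All metadata fields and allowed values here are hypothetical.

ALLOWED_CAMERA = "doorbell_cam_v2"
ALLOWED_SCENES = {"residential_entrance"}
HEIGHT_RANGE_M = (1.0, 1.6)

def in_scope(sample: dict) -> bool:
    return (
        sample.get("camera_model") == ALLOWED_CAMERA
        and sample.get("scene") in ALLOWED_SCENES
        and HEIGHT_RANGE_M[0] <= sample.get("mount_height_m", -1.0) <= HEIGHT_RANGE_M[1]
    )

dataset = [
    {"id": 1, "camera_model": "doorbell_cam_v2",
     "scene": "residential_entrance", "mount_height_m": 1.3},
    {"id": 2, "camera_model": "dashcam_x",
     "scene": "highway", "mount_height_m": 1.2},
]
filtered = [s for s in dataset if in_scope(s)]
print([s["id"] for s in filtered])  # -> [1]
```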
6.9 Data Transparency
By providing clear, detailed documentation, creators can help developers understand how best to use their datasets. Several groups have suggested standardized documentation formats for datasets, such as Data Cards (Pushkarna, Zaldivar, and Kjartansson (2022)), datasheets (Gebru et al. (2021)), data statements (Bender and Friedman (2018)), or Data Nutrition Labels (Holland et al. (2020)). When releasing a dataset, creators may describe what kinds of data they collected, how they collected and labeled it, and what kinds of use cases may be a good or poor fit for the dataset. Quantitatively, it may be appropriate to provide a breakdown of how well the dataset represents different groups (e.g. different gender groups, different cameras).
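Documentation of this kind can also be kept in machine-readable form alongside the dataset. The sketch below shows one possible layout for such a record, including a simple representation breakdown computed from toy speaker metadata; the field names are illustrative, not a prescribed Data Card or datasheet schema.

```python
from collections import Counter

# Sketch of machine-readable dataset documentation in the spirit of Data
# Cards / datasheets. The fields and toy metadata are illustrative only.

speaker_metadata = [
    {"gender": "female", "age_range": "19-29"},
    {"gender": "male", "age_range": "30-39"},
    {"gender": "female", "age_range": "60-69"},
]

def breakdown(records, key):
    """Return the percentage share of each value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {k: round(100 * v / total, 1) for k, v in counts.items()}

data_card = {
    "name": "example-voice-corpus",          # hypothetical dataset name
    "collection": "crowdsourced recordings, validated by volunteer reviewers",
    "intended_uses": ["keyword spotting research"],
    "out_of_scope_uses": ["speaker identification"],
    "representation": {
        "gender_pct": breakdown(speaker_metadata, "gender"),
        "age_range_pct": breakdown(speaker_metadata, "age_range"),
    },
}
print(data_card["representation"])
```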
Keeping track of data provenance, essentially the origins and the journey of each data point through the data pipeline, is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if an ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue lies with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency not only helps in debugging the system but also plays a crucial role in enhancing the overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model's performance and its acceptability among end-users.
When producing documentation, creators should also clearly specify how users can access the dataset and how the dataset will be maintained over time. For example, users may need to undergo training or receive special permission from the creators before accessing a dataset containing protected information, as is the case with many medical datasets. In some cases, users may not be permitted to access the data directly and must instead submit their model to be trained on the dataset creators' hardware, following a federated learning setup (Aledhari et al. (2020)). Creators may also describe how long the dataset will remain accessible, how users can submit feedback on any errors that they discover, and whether there are plans to update the dataset.
Some laws and regulations also promote data transparency through new requirements for organizations:
- General Data Protection Regulation (GDPR) in the European Union: It establishes strict requirements for processing and protecting the personal data of EU citizens. It mandates plain-language privacy policies that clearly explain what data is collected, why it is used, how long it is stored, and with whom it is shared. GDPR also mandates that privacy notices include details on the legal basis for processing, data transfers, retention periods, rights to access and deletion, and contact information for data controllers.
- California's Consumer Privacy Act (CCPA): CCPA requires clear privacy policies and opt-out rights for the sale of personal data. Significantly, it also establishes rights for consumers to request that their specific data be disclosed. Businesses must provide copies of collected personal information along with details on what it is used for, what categories are collected, and which third parties receive it. Consumers can identify data points they believe are inaccurate. The law represents a major step forward in empowering personal data access.
There are several current challenges in ensuring data transparency, especially because it requires significant time and financial resources. Data systems are also quite complex, and full transparency can be difficult to achieve in these cases. Full transparency may also overwhelm consumers with too much detail. Finally, it is important to balance the tradeoff between transparency and privacy.
6.10 Licensing
Many high-quality datasets either come from proprietary sources or contain copyrighted information. This makes licensing a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences.
For instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without explicit permission, sparking ethical concerns (Prabhu and Birhane, 2020)\(^{7}\). For corporations, accessing the ImageNet dataset requires registration and adherence to its terms of use, which restrict commercial usage (ImageNet, 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets.
7 Birhane, Abeba, and Vinay Uday Prabhu. "Large image datasets: A pyrrhic win for computer vision?." 2021 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2021.
The legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google's book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google's favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds.
New Data Regulations and Their Implications
New data regulations also impact licensing practices. The legislative landscape is evolving with regulations like the EU's Artificial Intelligence Act, which is poised to regulate AI system development and use within the European Union (EU). This legislation:
- Classifies AI systems by risk.
- Mandates development and usage prerequisites.
- Emphasizes data quality, transparency, human oversight, and accountability.
Additionally, the EU Act addresses the ethical dimensions and operational challenges in sectors such as healthcare and finance. Key elements include the prohibition of AI systems posing "unacceptable" risks, stringent conditions for high-risk systems, and minimal obligations for "limited risk" AI systems. The proposed European AI Board will oversee and ensure efficient regulation implementation.
Challenges in Assembling ML Training Datasets
Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing\(^{8}\) or public-private data collaborations could greatly accelerate industry progress and ethical standards.
8 Sonnenburg, Soren, et al. "The need for open source software in machine learning." (2007): 2443-2466.
In some cases, certain portions of a dataset may need to be removed or obscured to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may contain names, contact details, and other identifying data that need to be removed, often well after the dataset has already been sourced and used for training models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information (APPI) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request.
Data collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, users may explicitly request that their data be removed.
For instance, below is an example request from Common Voice users to remove their information:
Thank you for downloading the Common Voice dataset. Account holders are free to request deletion of their voice clips at any time. We action this on our side for all future releases and are legally obligated to inform those who have downloaded a historic release so that they can also take action. You are receiving this message because one or more account holders have requested that their voice clips be deleted. Their clips are part of the dataset that you downloaded and are associated with the hashed IDs listed below. Please delete them from your downloads in order to fulfill your third party data privacy obligations. Thank you for your timely completion.
[The notice is followed by a table of hashed clip IDs identifying the recordings to be deleted.]
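A downstream user honoring such a request might script the deletion roughly as follows. The directory layout, file extension, and hashing scheme here are assumptions for illustration only; in practice, follow the identification scheme documented by the dataset maintainers.

```python
import hashlib
from pathlib import Path

# Sketch: remove local copies of clips whose hashed IDs appear in a
# deletion notice. The placeholder IDs, directory layout, and hashing
# scheme are hypothetical.

ids_to_delete = {
    "<hashed-id-1-from-notice>",
    "<hashed-id-2-from-notice>",
}

def purge_clips(dataset_dir: str, hashed_ids: set) -> list:
    removed = []
    for clip in Path(dataset_dir).glob("*.mp3"):
        clip_id = hashlib.sha256(clip.stem.encode()).hexdigest()
        if clip_id in hashed_ids:
            clip.unlink()              # delete the local copy
            removed.append(clip.name)
    return removed

# Hypothetical usage:
# purge_clips("common_voice/clips", ids_to_delete)
```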
Having the ability to update the dataset by removing data will enable dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. Some models may have already been trained on the dataset, and there is no clear or known way to eliminate a particular data sample's effect from the trained network; there is no erase mechanism. This raises the question: should the model be re-trained from scratch each time a sample is removed? That is a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate\(^{9,10,11}\) its impact on the model's behavior. New research is needed on the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system.
9 Ginart, Antonio, et al. "Making ai forget you: Data deletion in machine learning." Advances in neural information processing systems 32 (2019).
10 Sekhari, Ayush, et al. "Remember what you want to forget: Algorithms for machine unlearning." Advances in Neural Information Processing Systems 34 (2021): 18075-18086.
11 Guo, Chuan, et al. "Certified data removal from machine learning models." arXiv preprint arXiv:1911.03030 (2019).
Dataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering.
6.11 Conclusion
Data is the fundamental building block of AI systems. Without quality data, even the most advanced machine learning algorithms will fail. Data engineering encompasses the end-to-end process of collecting, storing, processing, and managing data to fuel the development of machine learning models. It begins with clearly defining the core problem and objectives, which guides effective data collection.
Data can be sourced from diverse means, including existing datasets, web scraping, crowdsourcing, and synthetic data generation. Each approach involves tradeoffs between factors like cost, speed, privacy, and specificity. Once data is collected, thoughtful labeling through manual or AI-assisted annotation enables the creation of high-quality training datasets. Proper storage in databases, warehouses, or lakes facilitates easy access and analysis. Metadata provides contextual details about the data. Data processing transforms raw data into a clean, consistent format ready for machine learning model development.
Throughout this pipeline, transparency through documentation and provenance tracking is crucial for ethics, auditability, and reproducibility. Data licensing protocols also govern legal data access and use. Key challenges in data engineering include privacy risks, representation gaps, legal restrictions around proprietary data, and the need to balance competing constraints like speed versus quality. By thoughtfully engineering high-quality training data, machine learning practitioners can develop accurate, robust, and responsible AI systems, including for embedded and tinyML applications.
6.12 Helpful References
1. [3 big problems with datasets in AI and machine learning](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/)
2. [Common Voice: A Massively-Multilingual Speech Corpus](https://arxiv.org/abs/1912.06670)
3. [Data Engineering for Everyone](https://arxiv.org/abs/2102.11447)
4. [DataPerf: Benchmarks for Data-Centric AI Development](https://arxiv.org/abs/2207.10062)
5. [Deep Spoken Keyword Spotting: An Overview](https://arxiv.org/abs/2111.10592)
6. ["Everyone wants to do the model work, not the data work": Data Cascades in High-Stakes AI](https://research.google/pubs/pub49953/)
7. [Improving Reproducibility in Machine Learning Research (A Report from the NeurIPS 2019 Reproducibility Program)](https://arxiv.org/abs/2003.12206)
8. [LabelMe](https://people.csail.mit.edu/torralba/publications/labelmeApplications.pdf)
9. [Model Cards for Model Reporting](https://arxiv.org/abs/1810.03993)
10. [Multilingual Spoken Words Corpus](https://openreview.net/pdf?id=c20jiJ5K2H)
11. [OpenImages](https://storage.googleapis.com/openimages/web/index.html)
12. [Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https://arxiv.org/abs/2103.14749)
13. [Small-footprint keyword spotting using deep neural networks](https://ieeexplore.ieee.org/abstract/document/6854370?casa_token=XD6SL8Um1Y0AAAAA:ZxqFThJWLlwDrl1IA374t_YzEvwHNNR-pTWiWV9pyr85rsl-ZZ5BpkElyHo91d3_l8yU0IVIgg)
14. [SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition](https://arxiv.org/abs/1904.08779)
14 Embedded MLOps
14.1 Introduction
Explanation: This subsection sets the groundwork for the discussions to follow, elucidating the fundamental concept of MLOps and its critical role in enhancing the efficiency, reliability, and scalability of embedded AI systems. It outlines the unique characteristics of implementing MLOps in an embedded context, emphasizing its significance in the streamlined deployment and management of machine learning models.
- Overview of MLOps
- The importance of MLOps in the embedded domain
- Unique challenges and opportunities in embedded MLOps
14.2 Deployment Environments
Explanation: This section focuses on different environments where embedded AI systems can be deployed. It will delve into aspects like edge devices, cloud platforms, and hybrid environments, offering insights into the unique characteristics and considerations of each.
- Cloud-based deployment: Features and benefits
- Edge computing: Characteristics and applications
- Hybrid environments: Combining the best of edge and cloud computing
- Considerations for selecting an appropriate deployment environment
14.3 Deployment Strategies
Explanation: Here, readers will be introduced to various deployment strategies that facilitate a smooth transition from development to production. It discusses approaches such as blue-green deployments, canary releases, and rolling deployments, which can help in maintaining system stability and minimizing downtime during updates.
- Overview of different deployment strategies
- Blue-green deployments: Definition and benefits
- Canary releases: Phased rollouts and monitoring
- Rolling deployments: Ensuring continuous service availability
- Strategy selection: Factors to consider
14.4 Workflow Automation
Explanation: Automation is at the heart of MLOps, helping to streamline workflows and enhance efficiency. This subsection highlights the significance of workflow automation in embedded MLOps, discussing various strategies and techniques for automating tasks such as testing, deployment, and monitoring, fostering a faster and error-free development lifecycle.
- Automated testing: unit tests, integration tests
- Automated deployment: scripting, configuration management
- Continuous monitoring: setting up automated alerts and dashboards
- Benefits of workflow automation: speed, reliability, repeatability
14.5 Model Versioning
Explanation: Model versioning is a pivotal aspect of MLOps, facilitating the tracking and management of different versions of machine learning models throughout their lifecycle. This subsection emphasizes the importance of model versioning in embedded systems, where memory and computational resources are limited, offering strategies for effective version management and rollback.
- Importance of versioning in machine learning pipelines
- Tools for model versioning: DVC, MLflow
- Strategies for version control: naming conventions, metadata tagging
- Rollback strategies: handling model regressions and rollbacks
14.6 Model Monitoring and Maintenance
Explanation: The process of monitoring and maintaining deployed models is crucial to ensure their long-term performance and reliability. This subsection underscores the significance of proactive monitoring and maintenance in embedded systems, discussing methodologies for monitoring model health, performance metrics, and implementing routine maintenance tasks to ensure optimal functionality.
- The importance of monitoring deployed AI models
- Setting up monitoring systems: tools and techniques
- Tracking model performance: accuracy, latency, resource usage
- Maintenance strategies: periodic updates, fine-tuning
- Alerts and notifications: setting up mechanisms for timely responses to issues
- Over-the-air updates
- Responding to anomalies: troubleshooting and resolution strategies
14.7 Security and Compliance
Explanation: Security and compliance are paramount in MLOps, safeguarding sensitive data and ensuring adherence to regulatory requirements. This subsection illuminates the critical role of implementing security measures and ensuring compliance in embedded MLOps, offering insights into best practices for data protection, access control, and regulatory adherence.
- Security considerations in embedded MLOps: data encryption, secure communications
- Compliance requirements: GDPR, HIPAA, and other regulations
- Strategies for ensuring compliance: documentation, audits, training
- Tools for security and compliance management: SIEM systems, compliance management platforms
14.8 Conclusion
Explanation: As we wrap up this chapter, we consolidate the key takeaways regarding the implementation of MLOps in the embedded domain. This final section seeks to furnish readers with a holistic view of the principles and practices of embedded MLOps, encouraging a thoughtful approach to adopting MLOps strategies in their projects, with a glimpse into the potential future trends in this dynamic field.
- Recap of key concepts and best practices in embedded MLOps
- Challenges and opportunities in implementing MLOps in embedded systems
- Future directions: emerging trends and technologies in embedded MLOps
5 ML Workflow
In this chapter, we're going to learn about the machine learning workflow. It sets the stage for the later chapters, which dive into the details. But to prevent ourselves from missing the forest for the trees, this chapter gives a high-level overview of the steps involved in the ML workflow.
The ML workflow is a systematic and structured approach that guides professionals and researchers in developing, deploying, and maintaining ML models. This workflow is generally delineated into several critical stages, each contributing towards the effective development of intelligent systems.
Here's a broad outline of the stages involved:
5.1 Overview
A machine learning (ML) workflow is the process of developing, deploying, and maintaining ML models. It typically consists of the following steps (a minimal code sketch of these steps appears after the list):
- Define the problem. What are you trying to achieve with your ML model? Do you want to classify images, predict customer churn, or generate text? Once you have a clear understanding of the problem, you can start to collect data and choose a suitable ML algorithm.
- Collect and prepare data. ML models are trained on data, so it's important to collect a high-quality dataset that is representative of the real-world problem you're trying to solve. Once you have your data, you need to clean it and prepare it for training. This may involve tasks such as removing outliers, imputing missing values, and scaling features.
- Choose an ML algorithm. There are many different ML algorithms available, each with its own strengths and weaknesses. The best algorithm for your project will depend on the type of data you have and the problem you're trying to solve.
- Train the model. Once you have chosen an ML algorithm, you need to train the model on your prepared data. This process can take some time, depending on the size and complexity of your dataset.
- Evaluate the model. Once the model is trained, you need to evaluate its performance on a held-out test set. This will give you an idea of how well the model will generalize to new data.
- Deploy the model. Once you're satisfied with the performance of the model, you can deploy it to production. This may involve integrating the model into a software application or making it available as a web service.
- Monitor and maintain the model. Once the model is deployed, you need to monitor its performance and make updates as needed. This is because the real world is constantly changing, and your model may need to be updated to reflect these changes.
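As promised above, here is a minimal end-to-end sketch of the first few steps (collect and prepare data, choose an algorithm, train, evaluate) on a toy problem using scikit-learn. Deployment and monitoring depend on the target system and are not shown.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Minimal workflow sketch on a bundled toy dataset; a real project would
# substitute its own data collection, preparation, and model choice.

X, y = load_digits(return_X_y=True)                   # "collect" data
X_train, X_test, y_train, y_test = train_test_split(  # hold out a test set
    X, y, test_size=0.2, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)  # choose an algorithm
model.fit(X_train, y_train)                            # train

predictions = model.predict(X_test)                    # evaluate on held-out data
print(f"Held-out accuracy: {accuracy_score(y_test, predictions):.3f}")
```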
The ML workflow is an iterative process. Once you have deployed a model, you may find that it needs to be retrained on new data or that the algorithm needs to be adjusted. It's important to monitor the performance of your model closely and make changes as needed to ensure that it is still meeting your needs. In addition to the above steps, there are a number of other important considerations for ML workflows, such as:
- Version control: It's important to track changes to your code and data so that you can easily reproduce your results and revert to previous versions if necessary.
- Documentation: It's important to document your ML workflow so that others can understand and reproduce your work.
- Testing: It's important to test your ML workflow thoroughly to ensure that it is working as expected.
- Security: It's important to consider the security of your ML workflow and data, especially if you are deploying your model to production.
5.2 General vs. Embedded AI
+The ML workflow delineated above serves as a comprehensive guide applicable broadly across various platforms and ecosystems, encompassing cloud-based solutions, edge computing, and tinyML. However, when we delineate the nuances of the general ML workflow and contrast it with the workflow in Embedded AI environments, we encounter a series of intricate differences and complexities. These nuances not only elevate the embedded AI workflow to a challenging and captivating domain but also open avenues for remarkable innovations and advancements.
+Now, letâs explore these differences in detail:
- Resource Optimization:
  - General ML Workflow: Generally has the luxury of substantial computational resources available in cloud or data center environments. It focuses more on model accuracy and performance.
  - Embedded AI Workflow: Needs meticulous planning and execution to optimize the model's size and computational demands, as models have to operate within the limited resources available in embedded systems. Techniques like model quantization and pruning become essential (a small quantization sketch follows this list).
- Real-time Processing:
  - General ML Workflow: The emphasis on real-time processing is usually less, and batch processing of data is quite common.
  - Embedded AI Workflow: Focuses heavily on real-time data processing, necessitating a workflow where low latency and rapid execution are priorities, especially in applications like autonomous driving and industrial automation.
- Data Management and Privacy:
  - General ML Workflow: Data is typically processed in centralized locations, sometimes requiring extensive data transfer, with a focus on securing data during transit and storage.
  - Embedded AI Workflow: Promotes edge computing, which facilitates data processing closer to the source, reducing data transmission needs and enhancing privacy by keeping sensitive data localized.
- Hardware-Software Integration:
  - General ML Workflow: Often operates on general-purpose hardware platforms, with software development happening somewhat independently.
  - Embedded AI Workflow: Involves a tighter hardware-software co-design, where both are developed in tandem to achieve optimal performance and efficiency, integrating custom chips or utilizing hardware accelerators.
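To ground the resource optimization point, here is a small sketch of 8-bit affine quantization, the arithmetic at the core of the model quantization step mentioned in the list above. It quantizes a random weight matrix and reports the reconstruction error; production toolchains (e.g. TensorFlow Lite) automate this per layer and may also quantize activations.

```python
import numpy as np

# Sketch of 8-bit affine quantization: map float32 weights to int8 plus a
# scale and zero-point. Illustrative only; not a full deployment pipeline.

def quantize_int8(weights: np.ndarray):
    w_min, w_max = float(weights.min()), float(weights.max())
    scale = (w_max - w_min) / 255.0 if w_max > w_min else 1.0
    zero_point = int(round(-w_min / scale)) - 128
    q = np.clip(np.round(weights / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

weights = np.random.default_rng(0).normal(size=(4, 4)).astype(np.float32)
q, scale, zp = quantize_int8(weights)
error = np.abs(weights - dequantize(q, scale, zp)).max()
print(f"Stored as int8; max reconstruction error = {error:.4f}")
```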
5.3 Roles & Responsibilities
As we work through the various tasks at hand, you will realize that there is a lot of complexity. Creating a machine learning solution, particularly for embedded AI systems, is a multidisciplinary endeavor involving various experts and specialists. Here is a list of the personnel typically involved in the process, along with brief descriptions of their roles:
Project Manager:
- Coordinates and manages the overall project.
- Ensures all team members are working synergistically.
- Responsible for project timelines and milestones.

Domain Experts:
- Provide insights into the specific domain where the AI system will be implemented.
- Help in defining project requirements and constraints based on domain-specific knowledge.

Data Scientists:
- Specialize in analyzing data to develop machine learning models.
- Responsible for data cleaning, exploration, and feature engineering.

Machine Learning Engineers:
- Focus on the development and deployment of machine learning models.
- Collaborate with data scientists to optimize models for embedded systems.

Data Engineers:
- Responsible for managing and optimizing data pipelines.
- Work on the storage and retrieval of data used for machine learning model training.

Embedded Systems Engineers:
- Focus on integrating machine learning models into embedded systems.
- Optimize system resources for running AI applications.

Software Developers:
- Develop software components that interface with the machine learning models.
- Responsible for implementing APIs and other integration points for the AI system.

Hardware Engineers:
- Involved in designing and optimizing the hardware that hosts the embedded AI system.
- Collaborate with embedded systems engineers to ensure compatibility.

UI/UX Designers:
- Design the user interface and experience for interacting with the AI system.
- Focus on user-centric design and ensuring usability.

Quality Assurance (QA) Engineers:
- Responsible for testing the overall system to ensure it meets quality standards.
- Work on identifying bugs and issues before the system is deployed.

Ethicists and Legal Advisors:
- Consult on the ethical implications of the AI system.
- Ensure compliance with legal and regulatory requirements related to AI.

Operations and Maintenance Personnel:
- Responsible for monitoring the system after deployment.
- Work on maintaining and upgrading the system as needed.

Security Specialists:
- Focus on ensuring the security of the AI system.
- Work on identifying and mitigating potential security vulnerabilities.
Don't worry! You don't have to be a one-stop ninja.
Understanding the diversified roles and responsibilities is paramount in the journey to building a successful machine learning project. As we traverse the upcoming chapters, we will wear these different hats, embracing the essence and expertise of each role described herein. This immersive method nurtures a deep-seated appreciation for the inherent complexities, thereby facilitating an encompassing grasp of the multifaceted dynamics of embedded AI projects.
Moreover, this well-rounded insight not only promotes seamless collaboration and unified effort but also fosters an environment ripe for innovation. It enables us to identify areas where cross-disciplinary insights might spark novel thoughts, nurturing ideas and ushering in breakthroughs in the field. Additionally, being aware of the intricacies of each role allows us to anticipate potential obstacles and strategize effectively, guiding the project towards success with foresight and detailed understanding.
As we advance, we encourage you to hold a deep appreciation for the amalgamation of expertise that contributes to the fruition of a successful machine learning initiative. In later discussions, particularly when we delve into MLOps, we will examine these different facets or personas in greater detail. It's worth noting at this point that the range of topics touched upon might seem overwhelming. This endeavor aims to provide you with a comprehensive view of the intricacies involved in constructing an embedded AI system, without the expectation of mastering every detail personally.