From 8d0c7b1a6391d18115bd6627fcc0652466e6428d Mon Sep 17 00:00:00 2001
From: oishib
Date: Mon, 2 Oct 2023 12:28:06 -0400
Subject: [PATCH] oishi: added optimizing data text

---
 data_engineering.qmd | 28 ++++++++--------------------
 1 file changed, 8 insertions(+), 20 deletions(-)

diff --git a/data_engineering.qmd b/data_engineering.qmd
index 157bda1c3..bca3380cf 100644
--- a/data_engineering.qmd
+++ b/data_engineering.qmd
@@ -294,17 +294,6 @@ Here are some examples of how AI-assisted annotation has been proposed to be use
 
 ![](images/AI_Labeling_approaches.png)
 
-
-## Feature Engineering
-
-Explanation: Feature engineering involves selecting and transforming variables to improve the performance of AI models. It's vital in embedded AI systems where computational resources are limited, and optimized feature sets can significantly improve performance.
-
-- Importance of Feature Engineering
-- Techniques of Feature Selection
-- Feature Transformation for Embedded Systems
-- Embeddings
-- Real-time Feature Engineering in Embedded Systems
-
 ## Data Version Control
 
 Production systems are perpetually inundated with fluctuating and escalating volumes of data, prompting the rapid emergence of numerous data replicas. This proliferating data serves as the foundation for training machine learning models. For instance, a global sales company engaged in sales forecasting continuously receives consumer behavior data. Similarly, healthcare systems formulating predictive models for disease diagnosis are consistently acquiring new patient data. TinyML applications, such as keyword spotting, are highly data hungry in terms of the amount of data generated. Consequently, meticulous tracking of data versions and the corresponding model performance is imperative.
@@ -337,19 +326,18 @@ and therefore enabling reproducibility.
 
 ## Optimizing Data for Embedded AI
 
-Explanation: This section concentrates on optimization techniques specifically suited for embedded systems, focusing on strategies to reduce data volume and enhance storage and retrieval efficiency, crucial for resource-constrained embedded environments.
 
-- Low-Resource Data Challenges
-- Data Reduction Techniques
-- Optimizing Data Storage and Retrieval
+Creators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for unusually specific use cases, requiring heavy filtering of datasets. While general-purpose speech recognition models may be capable of turning any speech to text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data that do not address the task of interest. Additionally, an embedded AI system may be tied to specific hardware devices or environments. For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, show the wrong type of scenery, or were taken from the wrong height or angle.
+
+On the other hand, embedded AI systems are often expected to provide especially accurate performance in unpredictable real-world settings. As a result, creators may design datasets specifically to represent variations in potential inputs and promote model robustness. In other words, they may define a narrow scope for their project but then aim for deep coverage within those bounds. For example, creators of the doorbell model mentioned above might try to cover variations in data arising from:
 
-## Challenges in Data Engineering
+- Geographically, socially and architecturally diverse neighborhoods
+- Different types of artificial and natural lighting
+- Different seasons and weather conditions
+- Obstructions (e.g., raindrops or delivery boxes obscuring the camera’s view)
 
-Explanation: Understanding potential challenges can help in devising strategies to mitigate them. This section discusses common challenges encountered in data engineering, particularly focusing on embedded systems.
+As described above, creators may consider crowdsourcing or synthetically generating data to include these different kinds of variations.
 
-- Scalability
-- Data Security and Privacy
-- Data Bias and Representativity
 
 ## Promoting Transparency
 