oishi: added optimizing data text
oishib authored and mpstewart1 committed Oct 10, 2023
1 parent bfd325d commit 8d0c7b1
Showing 1 changed file with 8 additions and 20 deletions.
28 changes: 8 additions & 20 deletions data_engineering.qmd
@@ -294,17 +294,6 @@ Here are some examples of how AI-assisted annotation has been proposed to be use

![](images/AI_Labeling_approaches.png)


## Feature Engineering

Explanation: Feature engineering involves selecting and transforming variables to improve the performance of AI models. It's vital in embedded AI systems where computational resources are limited, and optimized feature sets can significantly improve performance.

- Importance of Feature Engineering
- Techniques of Feature Selection
- Feature Transformation for Embedded Systems
- Embeddings
- Real-time Feature Engineering in Embedded Systems

## Data Version Control

Production systems continuously receive growing and fluctuating volumes of data, which quickly leads to many copies and versions of the same datasets. This data is the foundation for training machine learning models. For instance, a global sales company engaged in sales forecasting continuously receives consumer behavior data. Similarly, healthcare systems building predictive models for disease diagnosis consistently acquire new patient data. TinyML applications, such as keyword spotting, also generate large amounts of data. Consequently, careful tracking of data versions and the corresponding model performance is imperative.
@@ -337,19 +326,18 @@ and therefore enabling reproducibility.

## Optimizing Data for Embedded AI

Explanation: This section concentrates on optimization techniques specifically suited for embedded systems, focusing on strategies to reduce data volume and enhance storage and retrieval efficiency, crucial for resource-constrained embedded environments.

- Low-Resource Data Challenges
- Data Reduction Techniques
- Optimizing Data Storage and Retrieval
Creators working on embedded systems may have unusual priorities when cleaning their datasets. On the one hand, models may be developed for unusually specific use cases, requiring heavy filtering of datasets. While general-purpose speech models may be capable of turning any speech to text, a model for an embedded system may be focused on a single limited task, such as detecting a keyword. As a result, creators may aggressively filter out large amounts of data that do not address the task of interest. Additionally, an embedded AI system may be tied to specific hardware devices or environments. For example, a video model may need to process images from a single type of camera, which will only be mounted on doorbells in residential neighborhoods. In this scenario, creators may discard images if they came from a different kind of camera, show the wrong type of scenery, or were taken from the wrong height or angle.
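
The kind of deployment-specific filtering described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed pipeline: the `ImageRecord` fields, the camera name `doorbell-cam-v1`, the scene label, and the mounting-height range are all hypothetical values chosen for the doorbell example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    """Hypothetical metadata attached to each image in the raw dataset."""
    path: str
    camera_model: str
    scene: str
    mount_height_m: float


def matches_deployment(
    rec: ImageRecord,
    *,
    camera_model: str = "doorbell-cam-v1",          # assumed target camera
    allowed_scenes: frozenset = frozenset({"residential"}),
    height_range_m: tuple = (1.0, 1.5),              # assumed mounting height
) -> bool:
    """Return True if the record resembles the target deployment conditions."""
    low, high = height_range_m
    return (
        rec.camera_model == camera_model
        and rec.scene in allowed_scenes
        and low <= rec.mount_height_m <= high
    )


def filter_dataset(records):
    """Keep only the records that match the deployment conditions."""
    return [r for r in records if matches_deployment(r)]
```

Keeping the criteria as keyword arguments makes the filter easy to adjust if the target hardware or environment changes.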

On the other hand, embedded AI systems are often expected to provide especially accurate performance in unpredictable real-world settings. Creators may therefore design datasets specifically to represent variations in potential inputs and promote model robustness. In practice, they may define a narrow scope for their project but then aim for deep coverage within those bounds. For example, creators of the doorbell model mentioned above might try to cover variations in data arising from:

## Challenges in Data Engineering
- Geographically, socially and architecturally diverse neighborhoods
- Different types of artificial and natural lighting
- Different seasons and weather conditions
- Obstructions (e.g. raindrops or delivery boxes obscuring the camera’s view)

Explanation: Understanding potential challenges can help in devising strategies to mitigate them. This section discusses common challenges encountered in data engineering, particularly focusing on embedded systems.
As described above, creators may consider crowdsourcing or synthetically generating data to include these different kinds of variations.
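
One way to decide where additional crowdsourced or synthetic data is needed is to tally how many samples exist for each combination of conditions. The sketch below assumes each sample carries hypothetical `lighting` and `weather` labels; the label values are illustrative and would need to be replaced with the categories a real project tracks.

```python
from collections import Counter
from itertools import product

# Hypothetical condition labels for the doorbell example.
LIGHTING = ("daylight", "streetlight", "porch_light")
WEATHER = ("clear", "rain", "snow")


def coverage_gaps(samples, min_count=1):
    """Return (lighting, weather) combinations with fewer than min_count samples."""
    counts = Counter((s["lighting"], s["weather"]) for s in samples)
    return [
        combo
        for combo in product(LIGHTING, WEATHER)
        if counts[combo] < min_count
    ]
```

Combinations returned by `coverage_gaps` are candidates for targeted collection or synthetic generation.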

- Scalability
- Data Security and Privacy
- Data Bias and Representativity

## Promoting Transparency
