From 2f66a8a4510b60dbef642306a400fb692bbead32 Mon Sep 17 00:00:00 2001 From: mrdragonbear Date: Tue, 10 Oct 2023 16:26:21 -0400 Subject: [PATCH] Minor updates to references and formatting --- data_engineering.qmd | 20 ++++++++++---------- references.bib | 2 +- 2 files changed, 11 insertions(+), 11 deletions(-) diff --git a/data_engineering.qmd b/data_engineering.qmd index 2899245b..4bd3bb34 100644 --- a/data_engineering.qmd +++ b/data_engineering.qmd @@ -12,7 +12,7 @@ Data is the lifeblood of AI systems. Without good data, even the most advanced m Dataset creators face complex privacy and representation challenges when building high-quality training data, especially for sensitive domains like healthcare. Legally, creators may need to remove direct identifiers like names and ages. Even without legal obligations, removing such information can help build user trust. However, excessive anonymization can compromise dataset utility. Techniques like differential privacy$^{1}$, aggregation, and reducing detail provide alternatives to balance privacy and utility, but have downsides. Creators must strike a thoughtful balance based on use case. -Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases.[^2] It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such [higher-order gaps](https://blog.google/technology/health/healthcare-ai-systems-put-people-center/) are not immediately obvious but can critically impact model performance. +Looking beyond privacy, creators need to proactively assess and address representation gaps that could introduce model biases. It is crucial yet insufficient to ensure diversity across individual variables like gender, race, and accent. Combinations of characteristics also require assessment, as models can struggle when certain intersections are absent. For example, a medical dataset could have balanced gender, age, and diagnosis data individually, but lack enough cases capturing elderly women with a specific condition. Such [higher-order gaps](https://blog.google/technology/health/healthcare-ai-systems-put-people-center/) are not immediately obvious but can critically impact model performance. Creating useful, ethical training data requires holistic consideration of privacy risks and representation gaps. Perfect solutions are elusive. However, conscientious data engineering practices like anonymization, aggregation, undersampling overrepresented groups, and synthesized data generation can help balance competing needs. This facilitates models that are both accurate and socially responsible. Cross-functional collaboration and external audits can also strengthen training data. The challenges are multifaceted, but surmountable with thoughtful effort. @@ -192,7 +192,7 @@ Data sourcing and data storage go hand-in-hand and it is necessary to store data The stored data is often accompanied by metadata, which is defined as 'data about data'. It provides detailed contextual information about the data, such as means of data creation, time of creation, attached data use license etc. For example, [[Hugging Face]{.underline}](https://huggingface.co/) has [[Dataset Cards]{.underline}](https://huggingface.co/docs/hub/datasets-cards). To promote responsible data use, dataset creators should disclose potential biases through the dataset cards. These cards can educate users about a dataset\'s contents and limitations. The cards also give vital context on appropriate dataset usage by highlighting biases and other important details. Having this type of metadata can also allow fast retrieval if structured properly. Once the model is developed and deployed to edge devices, the storage systems can continue to store incoming data, model updates or analytical results. -**Data Governance**[^1]**:** With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks. +**Data Governance:** With a large amount of data storage, it is also imperative to have policies and practices (i.e., data governance) that helps manage data during its life cycle, from acquisition to disposal. Data governance frames the way data is managed and includes making pivotal decisions about data access and control. It involves exercising authority and making decisions concerning data, with the aim to uphold its quality, ensure compliance, maintain security, and derive value. Data governance is operationalized through the development of policies, incentives, and penalties, cultivating a culture that perceives data as a valuable asset. Specific procedures and assigned authorities are implemented to safeguard data quality and monitor its utilization and the related risks. Data governance utilizes three integrative approaches: planning and control, organizational, and risk-based. The planning and control approach, common in IT, aligns business and technology through annual cycles and continuous adjustments, focusing on policy-driven, auditable governance. The organizational approach emphasizes structure, establishing authoritative roles like Chief Data Officers, ensuring responsibility and accountability in governance. The risk-based approach, intensified by AI advancements, focuses on identifying and managing inherent risks in data and algorithms, especially addressing AI-specific issues through regular assessments and proactive risk management strategies, allowing for incidental and preventive actions to mitigate undesired algorithm impacts. @@ -213,17 +213,17 @@ Some examples of data governance across different sectors include: ***Efficient Audio Storage Formats:*** Keyword spotting systems need specialized audio storage formats to enable quick keyword searching in audio data. Traditional formats like WAV and MP3 store full audio waveforms, which require extensive processing to search through. Keyword spotting uses compressed storage optimized for snippet-based search. One approach is to store compact acoustic features instead of raw audio. Such a workflow would involve: -- *Extracting acoustic features* - Mel-frequency cepstral coefficients (MFCCs)[^2] are commonly used to represent important audio characteristics. +- *Extracting acoustic features* - Mel-frequency cepstral coefficients (MFCCs) are commonly used to represent important audio characteristics. - *Creating Embeddings*- Embeddings transform extracted acoustic features into continuous vector spaces, enabling more compact and representative data storage. This representation is essential in converting high-dimensional data, like audio, into a format that's more manageable and efficient for computation and storage. -- *Vector quantization*[^3] - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information. +- *Vector quantization* - This technique is used to represent high-dimensional data, like embeddings, with lower-dimensional vectors, reducing storage needs. Initially, a codebook is generated from the training data to define a set of code vectors representing the original data vectors. Subsequently, each data vector is matched to the nearest codeword according to the codebook, ensuring minimal loss of information. - *Sequential storage* - The audio is fragmented into short frames, and the quantized features (or embeddings) for each frame are stored sequentially to maintain the temporal order, preserving the coherence and context of the audio data. This format enables decoding the features frame-by-frame for keyword matching. Searching the features is faster than decompressing the full audio. -***Selective Network Output Storage:*** Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2018[^4]), the authors discuss adaptation of the model's intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications. +***Selective Network Output Storage:*** Another technique for reducing storage is to discard the intermediate audio features stored during training, but not required during inference. The network is run on the full audio during training, however, only the final outputs are stored during inference. In a recent study (Rybakov et al. 2018), the authors discuss adaptation of the model's intermediate data storage structure to incorporate the nature of streaming models that are prevalent in tinyML applications. ## Data Processing @@ -239,7 +239,7 @@ Data often comes from diverse sources and can be unstructured or semi-structured - Encoding categorical variables - Using techniques like dimensionality reduction -Data validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation[^3], contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them. +Data validation serves a broader role than just ensuring adherence to certain standards like preventing temperature values from falling below absolute zero. These types of issues arise in TinyML because sensors may malfunction or temporarily produce incorrect readings, such transients are not uncommon. Therefore, it is imperative to catch data errors early before they propagate through the data pipeline. Rigorous validation processes, including verifying the initial annotation practices, detecting outliers, and handling missing values through techniques like mean imputation, contribute directly to the quality of datasets. This, in turn, impacts the performance, fairness, and safety of the models trained on them. ![A detailed overview of the Multilingual Spoken Words Corpus (MSWC) data processing pipeline: from raw audio and text data input, through forced alignment for word boundary estimation, to keyword extraction and model training](images/data_engineering_kws2.png) @@ -359,9 +359,9 @@ There are several current challenges in ensuring data transparency, especially b Many high-quality datasets either come from proprietary sources or contain copyrighted information. This introduces licensing as a challenging legal domain. Companies eager to train ML systems must engage in negotiations to obtain licenses that grant legal access to these datasets. Furthermore, licensing terms can impose restrictions on data applications and sharing methods. Failure to comply with these licenses can have severe consequences. -For instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020[^6]). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage ([[ImageNet]{.underline}](https://www.image-net.org/#), 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets. +For instance, ImageNet, one of the most extensively utilized datasets for computer vision research, is a case in point. A majority of its images were procured from public online sources without obtaining explicit permissions, sparking ethical concerns (Prabhu and Birhane, 2020). Accessing the ImageNet dataset for corporations requires registration and adherence to its terms of use, which restricts commercial usage ([[ImageNet]{.underline}](https://www.image-net.org/#), 2021). Major players like Google and Microsoft invest significantly in licensing datasets to enhance their ML vision systems. However, the cost factor restricts accessibility for researchers from smaller companies with constrained budgets. -The legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is Authors Guild, Inc. v. Google, Inc. This 2005 lawsuit alleged that Google\'s book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google\'s favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds. +The legal domain of data licensing has seen major cases that help define parameters of fair use. A prominent example is *Authors Guild, Inc. v. Google, Inc.* This 2005 lawsuit alleged that Google\'s book scanning project infringed copyrights by displaying snippets without permission. However, the courts ultimately ruled in Google\'s favor, upholding fair use based on the transformative nature of creating a searchable index and showing limited text excerpts. This precedent provides some legal grounds for arguing fair use protections apply to indexing datasets and generating representative samples for machine learning. However, restrictions specified in licenses remain binding, so comprehensive analysis of licensing terms is critical. The case demonstrates why negotiations with data providers are important to enable legal usage within acceptable bounds. **New Data Regulations and Their Implications** @@ -377,13 +377,13 @@ Additionally, the EU Act addresses the ethical dimensions and operational challe **Challenges in Assembling ML Training Datasets** -Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing[^7] or public-private data collaborations could greatly accelerate industry progress and ethical standards. +Complex licensing issues around proprietary data, copyright law, and privacy regulations all constrain options for assembling ML training datasets. But expanding accessibility through more open licensing or public-private data collaborations could greatly accelerate industry progress and ethical standards. In some cases, certain portions of a dataset may need to be removed or obscured in order to comply with data usage agreements or protect sensitive information. For example, a dataset of user information may have names, contact details, and other identifying data that may need to be removed from the dataset, this is well after the dataset has already been actively sourced and used for training models. Similarly, a dataset that includes copyrighted content or trade secrets may need to have those portions filtered out before being distributed. Laws such as the General Data Protection Regulation (GDPR), the California Consumer Privacy Act (CCPA), and the Amended Act on the Protection of Personal Information ([[APPI]{.underline}](https://www.ppc.go.jp/files/pdf/280222_amendedlaw.pdf)) have been passed to guarantee the right to be forgotten. These regulations legally require model providers to erase user data upon request. Data collectors and providers need to be able to take appropriate measures to de-identify or filter out any proprietary, licensed, confidential, or regulated information as needed. In some cases, the users may explicitly request that their data be removed. -Having the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample\'s effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That\'s a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate[^8]^,^[^9]^,^[^10] its impact on the model\'s behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system. +Having the ability to update the dataset by removing data from the dataset will enable the dataset creators to uphold legal and ethical obligations around data usage and privacy. However, the ability to remove data has some important limitations. We need to think about the fact that some models may have already been trained on the dataset and there is no clear or known way to eliminate a particular data sample\'s effect from the trained network. There is no erase mechanism. Thus, this begs the question, should the model be re-trained from scratch each time a sample is removed? That\'s a costly option. Once data has been used to train a model, simply removing it from the original dataset may not fully eliminate its impact on the model\'s behavior. New research is needed around the effects of data removal on already-trained models and whether full retraining is necessary to avoid retaining artifacts of deleted data. This presents an important consideration when balancing data licensing obligations with efficiency and practicality in an evolving, deployed ML system. Dataset licensing is a multifaceted domain intersecting technology, ethics, and law. As the world around us evolves, understanding these intricacies becomes paramount for anyone building datasets during data engineering. diff --git a/references.bib b/references.bib index 68be11b0..07868fed 100644 --- a/references.bib +++ b/references.bib @@ -207,7 +207,7 @@ @article{Ratner_Hancock_Dunnmon_Goldman_Ré_2018 @article{Sheng_Zhang_2019, title={Machine learning with crowdsourcing: A brief summary of the past research and Future Directions}, volume={33}, DOI={10.1609/aaai.v33i01.33019837}, number={01}, journal={Proceedings of the AAAI Conference on Artificial Intelligence}, author={Sheng, Victor S. and Zhang, Jing}, year={2019}, pages={9837–9843}} -@misc{Google, url={https://blog.google/documents/83/information_quality_content_moderation_white_paper.pdf/}, author={Google}, journal={Google}, publisher={Google}} +@misc{Google, url={https://blog.google/documents/83/}, title={Information quality & content moderation}, author={Google}} @misc{Labelbox, url={https://labelbox.com/}, journal={Labelbox}}