
Commit

Merge branch 'dev'
profvjreddi committed Jan 11, 2025
2 parents d811848 + a60ab60 commit f73497b
Showing 2 changed files with 5 additions and 7 deletions.
3 changes: 2 additions & 1 deletion contents/core/data_engineering/data_engineering.bib
@@ -481,4 +481,5 @@ @article{pineau2021improving
number = {164},
pages = {1--20},
year = {2021},
}
}

9 changes: 3 additions & 6 deletions contents/core/data_engineering/data_engineering.qmd
@@ -160,12 +160,9 @@ Platforms like [Kaggle](https://www.kaggle.com/) and [UCI Machine Learning Repos

Many of these datasets, such as [ImageNet](https://www.image-net.org/), have become standard benchmarks in the machine learning community, enabling consistent performance comparisons across different models and architectures. For ML system developers, this standardization provides clear metrics for evaluating model improvements and system performance. The immediate availability of these datasets allows teams to begin experimentation and prototyping without delays in data collection and preprocessing.

- However, ML practitioners must carefully consider the quality assurance aspects of pre-existing datasets. For instance, the ImageNet dataset was found to have over 6.4% errors [@northcutt2021pervasive]. While popular datasets benefit from community scrutiny that helps identify and correct errors and biases, these issues can significantly impact system performance if not properly addressed. Moreover, as @gebru2018datasheets highlighted in her paper, simply providing a dataset without documentation can lead to misuse and misinterpretation, potentially amplifying biases present in the data."
+ However, ML practitioners must carefully consider the quality assurance aspects of pre-existing datasets. For instance, the ImageNet dataset was found to have over 6.4% errors [@northcutt2021pervasive]. While popular datasets benefit from community scrutiny that helps identify and correct errors and biases, these issues can significantly impact system performance if not properly addressed. Moreover, as [@gebru2018datasheets] highlighted in her paper, simply providing a dataset without documentation can lead to misuse and misinterpretation, potentially amplifying biases present in the data.

- Supporting documentation often accompanying existing datasets is invaluable, though this generally applies only to widely used datasets. Good documentation provides insights into the data collection process and variable definitions and sometimes even offers baseline model performances. This information not only aids understanding but also promotes reproducibility in research, a cornerstone of scientific integrity; currently, there is a crisis around improving reproducibility in machine learning systems [@pineau2020improving]. When other researchers have access to the same data, they can validate findings, test new hypotheses, or apply different methodologies, thus allowing us to build on each other's work more rapidly.
+ Supporting documentation often accompanying existing datasets is invaluable, though this generally applies only to widely used datasets. Good documentation provides insights into the data collection process and variable definitions and sometimes even offers baseline model performances. This information not only aids understanding but also promotes reproducibility in research, a cornerstone of scientific integrity; currently, there is a crisis around improving reproducibility in machine learning systems [@pineau2021improving]. When other researchers have access to the same data, they can validate findings, test new hypotheses, or apply different methodologies, thus allowing us to build on each other's work more rapidly.

While platforms like Kaggle and UCI Machine Learning Repository are invaluable resources, it's essential to understand the context in which the data was collected. Researchers should be wary of potential overfitting when using popular datasets, as multiple models might have been trained on them, leading to inflated performance metrics. Sometimes, these [datasets do not reflect the real-world data](https://venturebeat.com/uncategorized/3-big-problems-with-datasets-in-ai-and-machine-learning/).

@@ -605,7 +602,7 @@ While traditional storage systems provide a foundation for ML workflows, the uni

One of the primary challenges in ML storage is handling large model weights. Modern ML models, especially deep learning models, can have millions or even billions of parameters. For instance, GPT-3, a large language model, has 175 billion parameters, requiring approximately 350 GB of storage just for the model weights [@brown2020language]. Storage systems need to be capable of handling these large, often dense, numerical arrays efficiently, both in terms of storage capacity and access speed. This requirement goes beyond traditional data storage and enters the realm of high-performance computing storage solutions.
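The storage figure above can be sanity-checked with simple arithmetic: a minimal sketch, assuming dense weights stored as raw numeric arrays (2 bytes per parameter in fp16, 4 in fp32).

```python
# Back-of-the-envelope storage estimate for dense model weights.
# 175 billion parameters at 2 bytes each (fp16) gives the ~350 GB
# figure cited above; fp32 would double it.

def weight_storage_gb(num_params: int, bytes_per_param: int) -> float:
    """Storage in gigabytes (10^9 bytes) for a dense weight array."""
    return num_params * bytes_per_param / 1e9

gpt3_params = 175_000_000_000
print(weight_storage_gb(gpt3_params, 2))  # fp16: 350.0 GB
print(weight_storage_gb(gpt3_params, 4))  # fp32: 700.0 GB
```

In practice, checkpoints also carry optimizer state (e.g. momentum and variance terms), which can multiply these numbers several times over.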

- The iterative nature of ML development introduces another critical storage consideration: versioning for both datasets and models. Unlike traditional software version control, ML versioning needs to track large binary files efficiently. As data scientists experiment with different model architectures and hyperparameters, they generate numerous versions of models and datasets. Effective storage systems for ML must provide mechanisms to track these changes, revert to previous versions, and maintain reproducibility throughout the ML lifecycle [@dutta2020mlops]. This capability is essential not only for development efficiency but also for regulatory compliance and model auditing in production environments.
+ The iterative nature of ML development introduces another critical storage consideration: versioning for both datasets and models. Unlike traditional software version control, ML versioning needs to track large binary files efficiently. As data scientists experiment with different model architectures and hyperparameters, they generate numerous versions of models and datasets. Effective storage systems for ML must provide mechanisms to track these changes, revert to previous versions, and maintain reproducibility throughout the ML lifecycle. This capability is essential not only for development efficiency but also for regulatory compliance and model auditing in production environments.
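The core idea behind tools that version large binary artifacts (such as DVC or Git LFS) is content addressing: each snapshot is identified by a hash of its bytes, so identical versions deduplicate automatically. A minimal sketch, with a hypothetical in-memory registry standing in for remote storage:

```python
import hashlib
import time

def version_dataset(data: bytes, registry: dict) -> str:
    """Register a dataset snapshot under its content hash.

    Content addressing means re-registering identical bytes returns the
    same version id and stores nothing new -- the basis of deduplicated
    dataset/model versioning.
    """
    digest = hashlib.sha256(data).hexdigest()
    registry.setdefault(digest, {"created": time.time(), "size": len(data)})
    return digest

registry = {}
v1 = version_dataset(b"label,pixel\n0,13\n", registry)          # first snapshot
v2 = version_dataset(b"label,pixel\n0,13\n1,44\n", registry)    # new row -> new version
assert version_dataset(b"label,pixel\n0,13\n", registry) == v1  # dedup: same id
```

Real systems layer metadata (lineage, experiment parameters) on top of these content hashes, which is what makes reverting and auditing tractable.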

Distributed training, often necessary for large models or datasets, generates substantial intermediate data, including partial model updates, gradients, and checkpoints. Storage systems for ML need to handle frequent, possibly concurrent, read and write operations of these intermediate results. Moreover, they should provide low-latency access to support efficient synchronization between distributed workers. This requirement pushes storage systems to balance between high throughput for large data transfers and low latency for quick synchronization operations.
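One concrete requirement implied above is that checkpoint writes must tolerate concurrent readers and mid-write crashes. A minimal sketch of an atomic checkpoint write, using a plain JSON state dict as a stand-in for real tensor checkpoints:

```python
import json
import os
import tempfile

def save_checkpoint(state: dict, path: str) -> None:
    """Write a training checkpoint atomically.

    The state is written to a temporary file and then renamed into place,
    so a concurrent reader never observes a half-written checkpoint and a
    crash leaves the previous version intact.
    """
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)  # atomic rename on POSIX and Windows

save_checkpoint({"step": 1000, "loss": 0.42}, "ckpt.json")
with open("ckpt.json") as f:
    assert json.load(f)["step"] == 1000
```

Distributed training frameworks apply the same rename-into-place pattern, but against shared or object storage and with tensor-aware serialization formats.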

@@ -755,7 +752,7 @@ Data governance extends beyond traditional data management practices. It address

One of the primary aspects of data governance in ML systems is security and access control. This involves implementing measures to protect data from unauthorized access or breaches. In ML systems, this often means implementing fine-grained access controls and encrypting data both at rest and in transit. For instance, in a healthcare ML application, not all researchers may need access to all patient data. A well-designed governance framework would ensure that data scientists only have access to the specific data required for their work.
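The fine-grained access control described above can be sketched as role-based field redaction; the roles and field names here are hypothetical, chosen to mirror the healthcare example:

```python
# Minimal role-based redaction sketch: each role maps to the record
# fields it may see, and everything else is stripped before release.
# Roles and fields are illustrative, not from any real system.

ROLE_FIELDS = {
    "data_scientist": {"age", "diagnosis_code"},                  # de-identified view
    "clinician": {"age", "diagnosis_code", "patient_name"},       # full clinical view
}

def redact(record: dict, role: str) -> dict:
    """Return only the fields the given role is permitted to access."""
    allowed = ROLE_FIELDS.get(role, set())  # unknown roles see nothing
    return {k: v for k, v in record.items() if k in allowed}

record = {"patient_name": "J. Doe", "age": 54, "diagnosis_code": "E11"}
assert "patient_name" not in redact(record, "data_scientist")
assert redact(record, "clinician")["patient_name"] == "J. Doe"
```

Production governance frameworks enforce the same policy at the storage or query layer rather than in application code, but the principle of least-privilege views is identical.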

- Privacy protection is another element of data governance. ML models often require large amounts of data, which can conflict with individual privacy rights. Techniques such as differential privacy can be employed to protect individual privacy while maintaining the statistical utility of the data. This approach adds carefully calibrated noise to the data, making it difficult to identify individuals while preserving the overall patterns necessary for effective model training.
+ Privacy protection is another element of data governance. ML models often require large amounts of data, which can conflict with individual privacy rights. Techniques such as differential privacy can be employed to protect individual privacy while maintaining the statistical utility of the data. This approach adds carefully calibrated noise to the data, making it difficult to identify individuals while preserving the overall patterns necessary for effective model training [@dwork2008differential].
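The "carefully calibrated noise" mentioned above is often realized with the Laplace mechanism: noise scaled to a query's sensitivity divided by the privacy budget epsilon. A minimal sketch, with a hypothetical patient count as the statistic being released:

```python
import math
import random

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release a statistic with Laplace noise scaled to sensitivity/epsilon.

    A counting query has sensitivity 1: one person's data changes the
    count by at most 1. Smaller epsilon means more noise and stronger
    privacy guarantees.
    """
    scale = sensitivity / epsilon
    u = random.random() - 0.5  # uniform on [-0.5, 0.5)
    # Inverse-CDF sampling of Laplace(0, scale)
    return true_value - scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

random.seed(0)
true_count = 42  # hypothetical count of patients with some condition
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=1.0)
```

Each individual release is noisy, but aggregate patterns survive: averaging many independent releases recovers the true value, which is exactly the utility-privacy trade-off the text describes.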

Regulatory compliance forms a non-negotiable part of data governance in many industries. For example, the [General Data Protection Regulation (GDPR)](https://gdpr-info.eu/) in Europe and the [Health Insurance Portability and Accountability Act (HIPAA)](https://www.cdc.gov/phlp/php/resources/health-insurance-portability-and-accountability-act-of-1996-hipaa.html) in the U.S. healthcare sector impose strict requirements on data handling and usage. These regulations may dictate how data can be collected, used, stored, and deleted. Compliance with these regulations often necessitates features like the ability to provide individuals with copies of their data or to delete it upon request.

