diff --git a/data_engineering.qmd b/data_engineering.qmd index bca3380c..fc5a1717 100644 --- a/data_engineering.qmd +++ b/data_engineering.qmd @@ -339,16 +339,21 @@ On the other hand, embedded AI systems are often expected to provide especially As described above, creators may consider crowdsourcing or synthetically generating data to include these different kinds of variations. -## Promoting Transparency +## Data Transparency -Explanation: We explain that as we increasingly use these systems built on the foundation of data, we need to have more transparency in the ecosystem. -- Definition and Importance of Transparency in Data Engineering -- Transparency in Data Collection and Sourcing -- Transparency in Data Processing and Analysis -- Transparency in Model Building and Deployment -- Transparency in Data Sharing and Usage -- Tools and Techniques for Ensuring Transparency +By providing clear, detailed documentation, creators can help developers understand how best to use their datasets. Several groups have suggested standardized documentation formats for datasets, such as Data Cards (@Pushkarna_Zaldivar_Kjartansson_2022), datasheets (@Gebru_Morgenstern_Vecchione_Vaughan_Wallach_III_Crawford_2021), data statements (@Bender_Friedman_2018), or Data Nutrition Labels (@Holland_Hosny_Newman_Joseph_Chmielinski_2020). When releasing a dataset, creators may describe what kinds of data they collected, how they collected and labeled it, and what kinds of use cases may be a good or poor fit for the dataset. Quantitatively, it may be appropriate to provide a breakdown of how well the dataset represents different groups (e.g. different gender groups, different cameras). + +Keeping track of data provenance—essentially the origins and the journey of each data point through the data pipeline—is not merely a good practice but an essential requirement for data quality. Data provenance contributes significantly to the transparency of machine learning systems. Transparent systems make it easier to scrutinize data points, enabling better identification and rectification of errors, biases, or inconsistencies. For instance, if a ML model trained on medical data is underperforming in particular areas, tracing back the data provenance can help identify whether the issue is with the data collection methods, the demographic groups represented in the data, or other factors. This level of transparency doesn’t just help in debugging the system but also plays a crucial role in enhancing the overall data quality. By improving the reliability and credibility of the dataset, data provenance also enhances the model’s performance and its acceptability among end-users. + +When producing documentation, creators should also clearly specify how users can access the dataset and how the dataset will be maintained over time. For example, users may need to undergo training or receive special permission from the creators before accessing a dataset containing protected information, as is the case with many medical datasets. In some cases, users may not be permitted to directly access the data and must instead submit their model to be trained on the dataset creators’ hardware, following a federated learning setup (@Aledhari_Razzak_Parizi_Saeed_2020). Creators may also describe how long the dataset will remain accessible, how the users can submit feedback on any errors that they discover, and whether there are plans to update the dataset. + +Some laws and regulations promote also data transparency through new requirements for organizations: + +- General Data Protection Regulation (GDPR) in European Union: It establishes strict requirements for processing and protecting personal data of EU citizens. It mandates plain language privacy policies that clearly explain what data is collected, why it is used, how long it is stored, and with whom it is shared. GDPR also mandates that privacy notices must include details on legal basis for processing, data transfers, retention periods, rights to access and deletion, and contact info for data controllers. +- California's Consumer Privacy Act (CCPA): CCPA requires clear privacy policies and opt-out rights for the sale of personal data. Significantly, it also establishes rights for consumers to request their specific data be disclosed. Businesses must provide copies of collected personal information along with details on what it is used for, what categories are collected, and what third parties receive it. Consumers can identify data points they believe are inaccurate. The law represents a major step forward in empowering personal data access. + +There are several current challenges in ensuring data transparency, especially because it requires significant time and financial resources. Data systems are also quite complex, and full transparency can be difficult to achieve in these cases. Full transparency may also overwhelm the consumers with too much detail. And finally, it is also important to balance the tradeoff between transparency and privacy. ## Licensing