Updated img2dataset to pull from the Spawning-Inc fork #29
This PR replaces the `img2dataset` dependency with a fork that includes the datadiligence Python package, which makes it easy to respect opt-out requests. There is an unmerged PR in the main img2dataset repository that includes these changes.
How datadiligence can improve the quality of datacomp submissions:
Machine-readable opt-outs and EU TDM Article 4.
Spawning is working with the EU Copyright office to make it easy for researchers to comply with opt-out requirements for commercial models. Respecting these methods now increases the likelihood that submissions released with commercially-friendly licenses can be used in the future.
Ignoring the wishes of site owners can provoke antagonistic responses.
With over 12 billion images, the potential for new and noticeably large amounts of automated traffic to sites and services is high. This traffic comes with costs: it increases expenses for site owners over the lifetime of the dataset (see https://www.vice.com/en/article/dy3vmx/an-ai-scraping-tool-is-overwhelming-websites-with-traffic and https://github.com/rom1504/img2dataset/issues/293).
If left unchecked, this traffic will likely inspire defensive and antagonistic responses similar to DDoS mitigation. By respecting opted-out domains, we can lower the risk of provoking these responses, thus maintaining the quality of the datacomp dataset over a longer period of time.
The Spawning API does more than just filter opt-outs.
The Spawning API also filters URLs from unsafe domains (phishing, malware, spam, etc). The content on these domains is not only likely low-quality, but also dangerous.
The datadiligence package is easy to incorporate and requires no change to existing workflows. In fact, img2dataset already respects one opt-out method by default. Datadiligence includes that method along with other proposed standards (e.g., TDMRep), and will continue to incorporate new methods as they are adopted.
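To illustrate what a header-based opt-out check can look like, here is a minimal sketch. It assumes the default method mentioned above is an `X-Robots-Tag` response header carrying directives such as `noai` or `noimageai`; the description does not name the method, so treat the header and directive names as assumptions. The helper `is_opted_out` is hypothetical and is not part of the img2dataset or datadiligence API.

```python
# Hypothetical helper for illustration only; not the img2dataset or
# datadiligence API. Header and directive names are assumptions.
OPT_OUT_DIRECTIVES = {"noai", "noimageai"}


def is_opted_out(headers, user_agent=None):
    """Return True if an X-Robots-Tag header carries an opt-out directive.

    headers: mapping of response header names to values.
    user_agent: optional crawler token; directives scoped to a different
    agent ("agent: directive") are ignored.
    """
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        for part in value.split(","):
            part = part.strip().lower()
            if ":" in part:
                # Agent-scoped form, e.g. "somebot: noai"
                agent, directive = (p.strip() for p in part.split(":", 1))
                if user_agent is None or agent != user_agent.lower():
                    continue
                part = directive
            if part in OPT_OUT_DIRECTIVES:
                return True
    return False


if __name__ == "__main__":
    print(is_opted_out({"X-Robots-Tag": "noindex, noai"}))
    print(is_opted_out({"Content-Type": "image/png"}))
```

A downloader could call a check like this on each response and skip storing the image when it returns `True`. Supporting further standards would mean adding more checks of this kind, which is the sort of work the package is described as bundling.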
The goal of this competition is to curate a high-quality dataset, and we believe the datadiligence package can contribute.
This PR also includes a small fix for a recently introduced macOS bug. The recent change to the huggingface-hub dependency appears not to have been made in the `environment_osx.yml` file.