Updated img2dataset to pull from the Spawning-Inc fork #29

Open
wants to merge 1 commit into
base: main

Padge91 commented Jul 12, 2023

This PR replaces the img2dataset dependency with the Spawning-Inc fork, which includes the datadiligence Python package and makes it easy to respect opt-out requests. An unmerged PR in the main img2dataset repository includes the same changes.

How datadiligence can improve the quality of datacomp submissions:

  1. Machine-readable opt-outs and EU TDM Article 4.
    Spawning is working with the EU Copyright office to make it easy for researchers to comply with opt-out requirements for commercial models. Respecting these methods now increases the likelihood that submissions released with commercially friendly licenses can be used in the future.

  2. Ignoring the wishes of site owners can provoke antagonistic responses.
    With over 12 billion images, the dataset is likely to generate large amounts of new automated traffic to sites and services. This traffic comes with costs: it increases expenses for site owners over the lifetime of the dataset (see https://www.vice.com/en/article/dy3vmx/an-ai-scraping-tool-is-overwhelming-websites-with-traffic and https://github.com/rom1504/img2dataset/issues/293).

    If left unchecked, this traffic will likely inspire defensive and antagonistic responses similar to DDoS mitigation. By respecting opted-out domains, we can lower the risk of provoking these responses, thus maintaining the quality of the datacomp dataset over a longer period of time.

  3. The Spawning API does more than just filter opt-outs.
    The Spawning API also filters URLs from unsafe domains (phishing, malware, spam, etc.). Content on these domains is likely to be not only low quality but also dangerous.
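
To make the domain-level filtering in points 2 and 3 concrete, here is a minimal sketch of dropping URLs whose host belongs to a blocked domain. It is not the Spawning API or datadiligence code: `filter_urls` and the `blocked_domains` set are hypothetical names used only for illustration, and a real list of opted-out or unsafe domains would come from the service.

```python
from urllib.parse import urlsplit


def filter_urls(urls, blocked_domains):
    """Keep only URLs whose host is neither a blocked domain nor a subdomain of one.

    `blocked_domains` is a hypothetical, locally supplied set; in practice the
    list of opted-out or unsafe domains would come from an external service.
    """
    blocked = {domain.lower().lstrip(".") for domain in blocked_domains}
    kept = []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if not host:
            # Unparseable entries are dropped rather than downloaded blindly.
            continue
        # Build the host and each of its parent domains:
        # "a.example.com" -> {"a.example.com", "example.com", "com"}
        labels = host.split(".")
        candidates = {".".join(labels[i:]) for i in range(len(labels))}
        if not candidates & blocked:
            kept.append(url)
    return kept
```

Filtering before the download step means an opted-out host never receives a request at all, which is the traffic reduction point 2 argues for.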

The datadiligence package is easy to incorporate and requires no change to existing workflows. In fact, img2dataset already respects one opt-out method by default. datadiligence includes that method along with other proposed standards (e.g. TDMRep), and will continue to incorporate new methods as they are adopted.
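
As a rough illustration of what a header-based opt-out check can look like, the sketch below inspects a response's headers before an image is kept. It assumes the default method referred to above is an `X-Robots-Tag` directive check and that a TDMRep reservation is signalled with a `tdm-reservation: 1` header; both are assumptions on my part, not something this PR states. `is_download_allowed` and `DISALLOWED_DIRECTIVES` are hypothetical names, not the img2dataset or datadiligence API.

```python
# Hypothetical directive set for illustration; the actual defaults are
# defined by the downloader, not by this sketch.
DISALLOWED_DIRECTIVES = frozenset({"noai", "noimageai"})


def is_download_allowed(headers, user_agent_token="", disallowed=DISALLOWED_DIRECTIVES):
    """Return False if the response headers signal an opt-out, True otherwise."""
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {name.lower(): value for name, value in headers.items()}

    # Assumed TDMRep signal: "tdm-reservation: 1" means rights are reserved.
    if normalized.get("tdm-reservation", "").strip() == "1":
        return False

    # Simplified X-Robots-Tag parsing: each comma-separated token is either
    # "directive" (applies to every agent) or "agent: directive".
    for token in normalized.get("x-robots-tag", "").split(","):
        agent, separator, directive = token.partition(":")
        if not separator:
            directive = agent  # no agent prefix, so the rule is global
        elif agent.strip().lower() != user_agent_token.lower():
            continue  # rule is scoped to a different agent
        if directive.strip().lower() in disallowed:
            return False
    return True
```

For example, `is_download_allowed({"X-Robots-Tag": "noindex, noai"})` returns `False`, while a response with no opt-out headers returns `True`.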

The goal of this competition is to curate a high-quality dataset, and we believe the datadiligence package can contribute.

This PR also includes a small fix for a recently introduced macOS bug: the recent change to the huggingface-hub dependency appears not to have been applied to the environment_osx.yml file.

…e datadiligence package to remove opt-outs of various means. Fixed bug with environment_osx.yml not having its huggingface dependency updated correctly.