add CC BY 4.0 to terms of use #1508

Open
wants to merge 4 commits into
base: main

@wassname wassname commented Feb 12, 2023

A similar statement is used in Wikipedia's terms of use; it makes clear that user-contributed data is released under a CC license, with no room for disputes. Ideally it's added at the start, so it's good to add it now.

@wassname
Author

This follows a conversation with Huu Nguyen on Discord.

@github-actions

pre-commit failed.
Please run pre-commit run --all-files locally and commit the changes.
Find more information in the repository's CONTRIBUTING.md

@wannaphong
Contributor

wannaphong commented Feb 12, 2023

Why? The Open Assistant data is CC BY 4.0. https://projects.laion.ai/Open-Assistant/docs/faq#can-i-download-the-data

@wassname
Author

wassname commented Feb 12, 2023 via email

@olliestanley
Collaborator

The problem here is that you have written CC BY-SA in the ToS but this is not what has been discussed previously. CC BY is not the same as CC BY-SA.

@wannaphong
Contributor

The problem here is that you have written CC BY-SA in the ToS but this is not what has been discussed previously. CC BY is not the same as CC BY-SA.

Yes, CC BY-SA is not the same as CC BY.

@huu4ontocord
Collaborator

Hey hey! I think it should be CC-BY-4.0 as this is consistent with other LAION datasets. Sorry my bad if I miscommunicated. And thank you for doing this PR!

@andreaskoepf
Collaborator

andreaskoepf commented Feb 12, 2023

Short technical question: For >99% of our users we don't have a real name, only an e-mail address or a discord-id and of course a display name for the website (which is automatically generated during e-mail signup). What counts as "Attribution", i.e. where/how will we list the (currently) >22k users by name? I guess most users would prefer not to have their e-mail address published...

@bitplane
Collaborator

bitplane commented Feb 12, 2023

Short technical question: For >99% of our users we don't have a real name, only an e-mail address or a discord-id and of course a display name for the website (which is automatically generated during e-mail signup). What counts as "Attribution", i.e. where/how will we list the (currently) >22k users by name? I guess most users would prefer not to have their e-mail address published...

Oh, this is a good point. I had a look at Wikipedia and they don't allow attribution to Wikipedia itself; you link to the article, where the page history and list of contributors can be found. So this may not be viable. If you're gonna have attribution then you have to hold people's information perpetually. This will have GDPR implications: you'll have to honour "right to removal / be forgotten" requests as a data controller for as long as you're in control of it. Hugging Face, as data processor, will only be forced to do this if LAION doesn't or can't. In effect it's promising "I'll react to volumes of frivolous deletion requests within 28 days, or face hefty fines."

A more open license that doesn't require attribution would be preferable IMO. A few options:

  • Users giving LAION the data under CC0 and the data being released the same way seems the fairest way to do it.
  • Users giving the right for LAION to redistribute under a CC-BY with attribution to them. This seems a bit crappy - users have to attribute LAION, but they don't get the same back.
  • The worst would be a generic copyright grant (web2.0 data highwayman approach) where the data could be closed up at any time. Even if data was released under CC0 it feels uneven.

@wassname
Author

wassname commented Feb 12, 2023

Right, I get the point now. Yeah, it makes sense to make the output unencumbered by share-alike. CC BY-SA is problematic and, as bitplane pointed out, even CC-BY can be burdensome.

We could go CC-BY and just include our best-effort list of all user display names, and put it on the website and in the dataset. Perhaps even saying that if you don't provide your name you waive the right to attribution.

The MakeHuman project uses CC0 for its output, so perhaps that would be the best. In that case I should change the FAQ too.

What do people think?

@wannaphong
Contributor

I think you should look at the Common Voice project. Common Voice uses CC-0, and it's one of the best projects for speech datasets.

I'm not sure about CC-0 for a text corpus, but I think if the corpus can be CC-0, it will be the best corpus.

@Sobsz

Sobsz commented Feb 13, 2023

Changing the license now would be problematic, since all past contributors would need to be asked for permission. Granted, it's also not very clear to contributors that the current license is CC BY, so we may be in a pickle regardless.

@wassname
Author

wassname commented Feb 23, 2023

Personally I'm convinced that CC-0 is better, but if we can't get consensus we should merge this CC BY change right now. It's noncontroversial, puts us on a better legal footing, and can be changed to CC-0 later.

Then we can make a new issue to debate CC-0, and change it if we get consensus.

So reviewers, let's merge it!

Contributor

@wannaphong wannaphong left a comment

Change to CC BY

@wassname
Author

Sorry, I thought I had done that. Now it's CC BY 4.0.

@olliestanley changed the title from "add CC BY-SA 4.0 to terms of use" to "add CC BY 4.0 to terms of use" on Feb 24, 2023

@andreaskoepf
Collaborator

andreaskoepf commented Jun 8, 2023

I think we effectively ask users to provide inputs under CC-0. Maybe something like the following should be added to the terms to make this clearer:

"If the user's input constitutes a work protected by copyright, the user grants LAION a simple, temporally, spatially and factually unrestricted right to use the input. In particular, LAION is authorized to use the user's inputs for the development and improvement of large language models."

7 participants