add CC BY 4.0 to terms of use #1508
This follows a conversation with Huu Nguyen on Discord |
❌ pre-commit failed. |
Why? the data of Open Assistant is CC BY 4.0. https://projects.laion.ai/Open-Assistant/docs/faq#can-i-download-the-data |
Sure, that's the intention, but the user hasn't agreed that their
contribution is. For that we can include it in the terms of use. It's what
most websites do.
…On Sun, 12 Feb 2023, 4:51 pm Wannaphong Phatthiyaphaibun, < ***@***.***> wrote:
Why? Open Assistant is CC BY 4.0.
https://projects.laion.ai/Open-Assistant/docs/faq#can-i-download-the-data
|
The problem here is that you have written CC BY-SA in the ToS but this is not what has been discussed previously. CC BY is not the same as CC BY-SA. |
Yes, CC BY-SA is not the same as CC BY. |
Hey hey! I think it should be CC-BY-4.0 as this is consistent with other LAION datasets. Sorry my bad if I miscommunicated. And thank you for doing this PR! |
Short technical question: For >99% of our users we don't have a real name, only an e-mail address or a discord-id and of course a display name for the website (which is automatically generated during e-mail signup). What counts as "Attribution", i.e. where/how will we list the (currently) >22k users by name? I guess most users would prefer not to have their e-mail address published... |
Oh, this is a good point. I had a look at Wikipedia and they don't allow attribution to Wikipedia itself; you link to the article, where the page history and list of contributors can be found. So this may not be viable. If you're going to have attribution then you have to hold people's information perpetually. This will have GDPR implications: you'll have to honour "right to removal / be forgotten" requests as a data controller for as long as you're in control of it. Hugging Face, as data processor, will only be forced to do this if LAION doesn't or can't. It amounts to promising "I'll react to volumes of frivolous deletion requests, within 28 days, or face hefty fines". A more open license that doesn't require attribution would be preferable IMO. A few options:
|
Right, I get the point now. Yeah, it makes sense to make the output unencumbered by share-alike. CC BY-SA is problematic and, as bitplane pointed out, even CC-BY can be burdensome. We could go CC-BY and just include our best-effort list of all user display names and put it on the website and in the dataset. Perhaps even saying that if you don't provide your name you waive the right to attribution. The makehuman project uses CC0 for its output, so perhaps that would be the best. In that case I should change the FAQ too. What do people think? |
I think you should look at the Common Voice project. Common Voice uses CC-0, and it is one of the best projects for speech datasets. I'm not sure about CC-0 for a text corpus, but I think if the corpus can be CC-0, it will be the best corpus. |
Changing the license now would be problematic, since all past contributors would need to be asked for permission. Granted, it's also not very clear to contributors that the current license is CC BY, so we may be in a pickle regardless. |
Personally I'm convinced that CC-0 is better, but if we can't get consensus we should merge this CC BY change right now. It's noncontroversial, puts us on a better legal footing, and can be changed to CC-0 later. Then we can make a new issue to debate CC-0, and change it if we get consensus. So reviewers, let's merge it! |
Change to CC BY
Sorry, I thought I had done that. Now it's CC BY 4.0 |
I think we effectively ask users to provide inputs CC-0, maybe something like the following should be added to the terms to make this clearer: "If the user's input constitutes a work protected by copyright, the user grants LAION a simple, temporally, spatially and factually unrestricted right to use the input. In particular, LAION is authorized to use the user's inputs for the development and improvement of large language models." |
You can see that a similar statement is used in the Wikipedia terms of use; it makes sure that user-contributed data is clear to be released under a CC license with no disputes. Ideally it's added at the start, so it's good to add it now.