Versioning, naming, and distributing the cliqs dataset #10

rht opened this issue Mar 20, 2017 · 0 comments

rht commented Mar 20, 2017

By naming, I mean that we should create a DOI for each release so that the data can be cited on its own. This can be done either through Zenodo (see https://guides.github.com/activities/citable-code/) or through ResearchGate.

As for versioning, the goal is to ensure that changes are tracked, that releases are immutable, and that each citation is linked to a specific version. I think git still works fine with the current corpora, but I am undecided about which data versioning tool is best (git-annex? dat? git-lfs?). I downloaded the entire ud-treebanks-v1.4 and found that the *.conllu files total 3 GB. Gzipping each file individually brings that down to 480 MB (about the size of all the IETF RFCs combined!).
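For anyone who wants to reproduce the size comparison, here is a minimal sketch of the per-file gzip step. The function name `gzip_conllu_files` and the directory layout are my own assumptions for illustration, not part of cliqs or the UD release tooling; it only uses the Python standard library.

```python
import gzip
import shutil
from pathlib import Path


def gzip_conllu_files(root):
    """Gzip every *.conllu file under root, keeping the originals.

    Hypothetical helper: returns (total_raw_bytes, total_gzipped_bytes).
    """
    raw_total = 0
    gz_total = 0
    for src in sorted(Path(root).rglob("*.conllu")):
        # Write foo.conllu.gz next to foo.conllu.
        dst = src.with_name(src.name + ".gz")
        with open(src, "rb") as fin, gzip.open(dst, "wb") as fout:
            shutil.copyfileobj(fin, fout)
        raw_total += src.stat().st_size
        gz_total += dst.stat().st_size
    return raw_total, gz_total
```

Calling `gzip_conllu_files("ud-treebanks-v1.4")` (assuming the treebanks were unpacked into a directory of that name) returns the two byte totals to compare. One caveat for the immutability goal: gzip stores a timestamp in its header by default, so repeated runs may not give byte-identical archives.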

As for distribution, in addition to the FTP server, a torrent should suffice (torrents have been used by cern opendata and datagovuk).

Edit: add urls for better access
