By naming, I mean that we should create a DOI for each release so that the data can be cited on its own. This can be done either through https://guides.github.com/activities/citable-code/ (using Zenodo) or through ResearchGate.
As for versioning, the goal is to ensure that changes are tracked and immutable, and that each citation is linked to a specific version. I think git still works fine for the current corpora, but I am undecided about which data versioning tool is best (git-annex? dat? git-lfs?). I downloaded the entire ud-treebanks-v1.4 and found that the *.conllu files total 3 GB. Gzipping each file individually brings that down to 480 MB (about the size of the entire set of IETF RFCs!).
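For anyone who wants to reproduce the size numbers, here is a rough sketch of how they could be measured. It is not the exact procedure I used; the function name `conllu_sizes` and the directory layout (a root folder containing the treebank subfolders) are assumptions for illustration. It compresses each file in memory with Python's default gzip level, so the totals may differ slightly from the command-line `gzip`.

```python
import gzip
from pathlib import Path


def conllu_sizes(root):
    """Return (raw_bytes, gzipped_bytes) summed over all *.conllu files under root.

    `root` is assumed to be the unpacked release directory, e.g. "ud-treebanks-v1.4".
    """
    raw_total = 0
    gz_total = 0
    for path in sorted(Path(root).rglob("*.conllu")):
        data = path.read_bytes()
        raw_total += len(data)
        # Compress each file separately and in memory, so nothing on disk is modified.
        gz_total += len(gzip.compress(data))
    return raw_total, gz_total
```

Calling `conllu_sizes("ud-treebanks-v1.4")` and dividing both totals by `1e6` gives the sizes in MB. With the figures above (3 GB raw, 480 MB gzipped), the compressed data is roughly one sixth of the original size.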
As for distribution, in addition to the FTP server, a torrent would suffice (as used by CERN Open Data and data.gov.uk).
Edit: added URLs for better access