Use DataLad to improve access to the Donders Repository datasets #3
Comments
What is not clear to me is to what extent this is something that every user should do themselves (like organizing their data in BIDS), and to what extent it can be done centrally. But I certainly like the idea! |
If I were to publish this script in a more visible location, then anyone with access to the data could indeed do this themselves (assuming they can run bash scripts). My idea is that we could make access to these datasets easier and thereby expose them more. If we do it, then we could ask the datalad team to get our datasets also featured on https://www.datalad.org/datasets.html, and for users it would be as simple as

The idea was actually triggered yesterday in the OHBM-OSR emergent session where @amarquand and @saigerutherford presented their work, which made me think: "how can we support work like this by others using Donders datasets?" I don't mind giving it a try myself to make a prototype. |
Oh, and besides exposing our data better to others (the goal I mention above), I think it would also contribute to improved skills and more efficient data handling in the DCCN itself. |
This is a great idea! I agree that things could be improved in terms of providing access to Donders datasets. But do we really need to add two more layers (GitHub + datalad) for this? Maybe we could just streamline the access to Donders Repository and add meta-data to make it discoverable? |
I don't know yet how the aggregation of datasets on https://www.datalad.org/datasets.html works. I know that the OpenNeuro datasets are managed in http://github.com/openneurodatasets. The idea to "streamline the access to the DR" is great, but it probably involves some serious work by the ICT team responsible for the DR and the RDR. I know they are currently very busy with scaling up to http://data.ru.nl (that is the RDR), so I don't think this would be high on their priority list. But if we first implement a prototype ourselves, we can build the case for them to integrate it. |
Oh, and regarding metadata: the metadata of the collections in the DR (which is exposed on data.donders.ru.nl) is already shared with Narcis, Google Dataset Search, and others. For me it is mainly the |
So basically it would allow for git-like access to datasets with all files still kept at the Donders repository except for the readme/license/manifest? But then wouldn't people still have to register at DR to access the linked files? |
Yes and yes (partially). The readme/license/manifest all basically contain public metadata that you can also glean when visiting https://data.donders.ru.nl, so those can be shared elsewhere (e.g. on GitHub). These three files are also not added by the authors, but by the system upon collection publication. The data files themselves need to be downloaded from https://public.data.donders.ru.nl (which does not require authentication) or from https://webdav.data.donders.ru.nl (which does). For potentially identifiable data, people still have to sign up on the repository and agree to the Data Use Agreement (DUA) for the specific collection. After agreeing to the DUA they can authenticate on webdav, and datalad can download all files for them. You could consider datalad as an alternative to Cyberduck, but with all version control features, and with the ability to selectively download specific files or drop them again (i.e. remove them locally while keeping the pointer to the original online file). |
Have a look at https://github.com/Donders-Institute-Data. You can do
to get all files (takes a while), or
to get a specific file. The dcmn example is large (21GB), the other examples are small. They all get the data from the public webdav interface, i.e. all have a license that allows redistribution. The decision whether data on the repository can be accessed via the public webdav server depends on the license: if the data can be redistributed, the collection will appear on the public webdav server. Regardless of the DUA, all collections will also appear on the non-public webdav server. |
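For reference, a typical DataLad workflow of this kind could look like the sketch below. The dataset name `example-dataset` and the file path `sub-01/example.dat` are placeholders I made up, not actual repositories or files in the organization. The `run` helper is also just part of the sketch: it prints each command unless `RUN=1` is set, so the script can be read and tried without datalad installed.

```shell
# Sketch only. "example-dataset" and "sub-01/example.dat" are placeholders,
# not real names from the Donders-Institute-Data organization.
ORG_URL="https://github.com/Donders-Institute-Data"
DATASET="example-dataset"
ONE_FILE="sub-01/example.dat"

# Print each command by default; set RUN=1 to actually execute it
# (this needs datalad installed and network access).
run() {
  if [ "${RUN:-0}" = "1" ]; then
    "$@"
  else
    echo "+ $*"
  fi
}

demo() {
  run datalad clone "$ORG_URL/$DATASET" "$DATASET"  # small files and links only
  run cd "$DATASET"
  run datalad get "$ONE_FILE"                       # download one specific file
  run datalad drop "$ONE_FILE"                      # remove the local copy, keep the link
  run datalad get .                                 # download everything (can take a while)
}

demo
```

With `RUN` unset this only prints the five commands, each prefixed with `+ `.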
OK, I get the idea. To me, the biggest benefit of the proposed mechanism is then that we can potentially add exposure by mirroring the public datasets on GitHub and adding them to the Datalad collection. The command-line checkout by itself might be useful, but then the fact that you need to install datalad to use it kind of diminishes its value. I mean, you could also use curl or other similar tools. The downside is that we would also need to watch for updates on the datasets to make sure that the mirror is up to date. |
@saigerutherford had planned to give a FAM on large data management with git-annex (which underlies datalad), but that had to be canceled. I hope she will present some time soon; I think she would be better able to explain the benefits. |
Describe the issue
DataLad is a data management solution based on git and git-annex. In short, it has a command line interface and allows someone to check out a dataset without downloading all files at once. The large data files are technically symbolic links, and the actual data is only downloaded when needed. Just like a git repository, a DataLad dataset can contain files (like a README), but the large files can be downloaded from elsewhere. To make that possible, it stores the original URL of each large file.
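As a rough illustration of the symbolic-link idea, here is a toy shell sketch. It does not use DataLad or git-annex at all: a plain symlink and a local `store` directory stand in for the real mechanism, and the names `scan.dat`, `store` and `content_state` are made up for this example.

```shell
# Toy model only: plain symlinks and a local "store" directory stand in for
# what git-annex does internally; no DataLad command is involved.
demo_dir="$(mktemp -d)"
mkdir -p "$demo_dir/dataset" "$demo_dir/store"

# A small file such as the README lives in the dataset itself.
echo "example dataset" > "$demo_dir/dataset/README"

# A large file is only a symbolic link; its content is not present yet.
ln -s ../store/scan.dat "$demo_dir/dataset/scan.dat"

# Report whether the content behind a path is available locally.
content_state() {
  if [ -e "$1" ]; then echo "present"; else echo "absent"; fi
}

before="$(content_state "$demo_dir/dataset/scan.dat")"

# "Get": a real tool would download from the recorded URL;
# here we simply write placeholder bytes into the store.
echo "placeholder content" > "$demo_dir/store/scan.dat"
after_get="$(content_state "$demo_dir/dataset/scan.dat")"

# "Drop": remove the content again, but keep the link in the dataset.
rm "$demo_dir/store/scan.dat"
after_drop="$(content_state "$demo_dir/dataset/scan.dat")"

echo "before get: $before, after get: $after_get, after drop: $after_drop"
```

The link stays in the dataset the whole time; only the content behind it comes and goes, which is what makes selective download and dropping cheap.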
I propose that we use DataLad to improve the access to our published datasets. This will benefit people within the Donders, and will promote our data outside the Donders.
Describe yourself
Robert Oostenveld, Associate PI, megmethods
To Reproduce
This is what I have implemented so far
Expected behavior
The resulting datasets (with actual README, LICENSE and MANIFEST, and URL links to all data) could be uploaded to our GitHub organization, just like https://github.com/openneurodatasets. Subsequently we could add them to https://github.com/datalad/datasets.datalad.org and somehow have them appear on https://www.datalad.org/datasets.html.