Use DataLad to improve access to the Donders Repository datasets #3

robertoostenveld · 2021-06-22T10:49:36Z

Describe the issue

DataLad is a data management solution based on git and git-annex. In short, it has a command line interface and allows someone to check out a dataset without downloading all files at once. The large data files are technically symbolic links, and the actual data is only downloaded when needed. Just like a git repository, a DataLad dataset can contain regular files (like a README), while the large files are downloaded from elsewhere; for that, it stores the original URL of each large file.
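To make the "symbolic link" point concrete, here is a minimal sketch that reports whether a file in a checked-out dataset is a regular file, an annex pointer without content, or an annex pointer whose content has been fetched. The helper name `annex_status` is hypothetical, and the sketch assumes git-annex's default layout, in which annexed files are symlinks into `.git/annex/objects`; on file systems without symlink support the layout differs.

```shell
# annex_status is a hypothetical helper, not part of DataLad or git-annex.
# Assumption: annexed files are symlinks pointing into .git/annex/objects.
annex_status() {
  if [ -L "$1" ]; then
    case "$(readlink "$1")" in
      *.git/annex/objects/*)
        # -e follows the link, so it is true only once the content is local
        if [ -e "$1" ]; then echo "present"; else echo "pointer-only"; fi ;;
      *)
        echo "symlink" ;;
    esac
  elif [ -e "$1" ]; then
    echo "regular"
  else
    echo "missing"
  fi
}
```

Under these assumptions, a large file would report `pointer-only` right after cloning and `present` after `datalad get` has fetched it.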

I propose that we use DataLad to improve the access to our published datasets. This will benefit people within the Donders, and will promote our data outside the Donders.

Describe yourself

Robert Oostenveld, Associate PI, megmethods

To Reproduce

This is what I have implemented so far:

INST=di
OU=dcmn
DATASET=DSC_0004546_823_v1

IDENTIFIER="$INST.$OU.$DATASET"
mkdir "$IDENTIFIER"
cd "$IDENTIFIER"

PREFIX=https://public.data.donders.ru.nl/$OU/$DATASET

# these three should be present as files, not as links 
datalad download-url "$PREFIX"/MANIFEST.txt
datalad download-url "$PREFIX"/LICENSE.txt
datalad download-url "$PREFIX"/README.txt

# force is needed since the directory is not empty (any more)
datalad create --force

# each MANIFEST line holds a hash followed by the file path
while read -r HASH FILE ; do
  # skip lines without a path and files that are already present
  if [ -n "$FILE" ] && [ ! -e "$FILE" ]; then
    # dirname returns "." for top-level files, which mkdir -p accepts
    mkdir -p "$(dirname "$FILE")"
    datalad download-url "$PREFIX/$FILE" -O "$FILE"
  fi
done < MANIFEST.txt
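As a dry run before downloading anything, the manifest can first be turned into a table of URLs and target paths. This is only a sketch: `manifest_to_urls` is a hypothetical helper, and it assumes (as the loop above does) that each MANIFEST line is a hash followed by a relative path. It does not URL-encode special characters such as spaces.

```shell
# manifest_to_urls is a hypothetical helper, not part of DataLad.
# Assumption: each input line is "<hash> <relative path>".
# Reads a manifest on stdin and prints "<url><TAB><path>" per file.
manifest_to_urls() {
  local prefix="$1"
  local hash file
  while read -r hash file; do
    # skip blank or malformed lines that lack a path
    [ -n "$file" ] || continue
    printf '%s\t%s\n' "$prefix/$file" "$file"
  done
}
```

For example, `manifest_to_urls "$PREFIX" < MANIFEST.txt > urls.tsv` would write the table to a file for inspection. Such a table might also be a starting point for `datalad addurls`, which registers many URLs in one call; its documentation describes the input format it expects.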

Expected behavior

The resulting datasets (with the actual README, LICENSE and MANIFEST, and URL links to all data) could be uploaded to our GitHub organization, just like https://github.com/openneurodatasets. Subsequently we could add them to https://github.com/datalad/datasets.datalad.org and somehow have them appear on https://www.datalad.org/datasets.html.


marcelzwiers · 2021-06-23T05:30:27Z

What is not clear to me is to what extent this is something that every user should do themselves (like organizing their data in BIDS), and to what extent it can be done centrally. But I certainly like the idea!

robertoostenveld · 2021-06-23T08:24:04Z

If I were to publish this script in a more visible location, then anyone with access to the data could indeed do this themselves (assuming they can run bash scripts).

My idea is that we could make access to these datasets easier and thereby give them more exposure. If we do it, then we could ask the datalad team to get our datasets also featured on https://www.datalad.org/datasets.html, and for users it would be as simple as datalad clone <dataset> && cd <dataset> && datalad get .

The idea was actually triggered yesterday in the OHBM-OSR emergent session where @amarquand and @saigerutherford presented their work, which made me think: "how can we support work like this by others using Donders datasets".

I don't mind giving it a try myself to make a prototype.

robertoostenveld · 2021-06-23T08:25:14Z

Oh, and besides exposing our data better to others (the goal I mention above), I think it would also contribute to improved skills and more efficient data handling in the DCCN itself.

achetverikov · 2021-06-23T08:59:22Z

This is a great idea! I agree that things could be improved in terms of providing access to Donders datasets. But do we really need to add two more layers (GitHub + datalad) for this? Maybe we could just streamline the access to the Donders Repository and add metadata to make it discoverable?

robertoostenveld · 2021-06-23T09:40:30Z

I don't know yet how the aggregation of datasets on https://www.datalad.org/datasets.html works. I know that openneuro datasets are managed in http://github.com/openneurodatasets.

The idea of "streamline the access to the DR" is great but probably involves some serious work by the ICT team responsible for the DR and the RDR. I know they are currently very busy with scaling up to http://data.ru.nl (that is the RDR). I therefore don't think this would be high on their priority list. But if we first implement a prototype ourselves, we can build up the case for them to integrate it.

robertoostenveld · 2021-06-23T09:43:37Z

Oh, and regarding metadata: the metadata of the collections in the DR (which is exposed on data.donders.ru.nl) is already shared with Narcis, Google Dataset Search, and others. For me it is mainly the datalad mechanism that is attractive, not the behaviour in a general web browser and generic search engine.

achetverikov · 2021-06-23T10:23:34Z

So basically it would allow for git-like access to datasets with all files still kept at the Donders repository except for the readme/license/manifest? But then wouldn't people still have to register at DR to access the linked files?

robertoostenveld · 2021-06-23T10:53:54Z

yes and yes (partially).

The readme/license/manifest all basically contain public metadata that you can also glean when visiting https://data.donders.ru.nl, so those can be shared elsewhere (e.g. on github). These three files are also not added by the authors, but by the system upon collection publication.

The data files themselves need to be downloaded from https://public.data.donders.ru.nl (which does not require authentication) or from https://webdav.data.donders.ru.nl (which does). For potentially identifiable data people still have to sign up on the repository and agree with the Data Use Agreement for the specific collection.

After agreeing to the DUA they can authenticate on webdav, and datalad can download all files for them. You could consider datalad as an alternative to cyberduck, but with all version control features and the ability to selectively download specific files or drop them again (i.e. remove them locally while keeping the pointer to the original online file).

robertoostenveld · 2021-06-23T11:12:30Z

Have a look at https://github.com/Donders-Institute-Data

You can do

datalad clone https://github.com/Donders-Institute-Data/dcmn.DSC_0004546_823_v1.git
cd dcmn.DSC_0004546_823_v1
datalad get . -r

to get all files (takes a while), or

datalad get Simon_data/pp01_ready4analysis_withCSD.mat

to get a specific file. The dcmn example is large (21GB), the other examples are small. They all get the data from the public webdav interface, i.e. all have a license that allows redistribution.

The decision whether data on the repository can be accessed via the public webdav server depends on the license: if the data can be redistributed, the collection will appear on the public webdav server. Regardless of the DUA, all collections will also appear on the non-public webdav server.
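The choice between the two servers could be captured in a small helper that builds the download prefix for a collection. This is a sketch only: `dr_prefix` is a hypothetical name, the host names are the ones mentioned in this thread, and it is an assumption that the authenticated webdav server uses the same `/OU/DATASET` path layout as the public one.

```shell
# dr_prefix is a hypothetical helper; host names are taken from this thread.
# Assumption: both servers share the /<OU>/<DATASET> path layout.
dr_prefix() {
  local access="$1" ou="$2" dataset="$3" host
  case "$access" in
    public)     host="public.data.donders.ru.nl" ;;  # redistributable data
    restricted) host="webdav.data.donders.ru.nl" ;;  # requires authentication
    *) echo "usage: dr_prefix public|restricted OU DATASET" >&2; return 1 ;;
  esac
  printf 'https://%s/%s/%s\n' "$host" "$ou" "$dataset"
}
```

With this, `PREFIX=$(dr_prefix public dcmn DSC_0004546_823_v1)` yields the same prefix that the original script builds by hand.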

achetverikov · 2021-06-23T11:45:08Z

OK, I get the idea. To me, the biggest benefit of the proposed mechanism is then that we can potentially add exposure by mirroring the public datasets on GitHub and adding them to the Datalad collection. The command-line checkout by itself might be useful, but then the fact that you need to install datalad to use it kind of diminishes its value. I mean, you could also use curl or other similar tools. The downside is that we would also need to watch for updates on the datasets to make sure that the mirror is up to date.
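Watching for updates could, in principle, be reduced to comparing the manifest stored in the mirror with a freshly downloaded one. The sketch below assumes that each manifest line is a hash followed by a path, and that a changed or added file therefore shows up as a line that is absent from the old manifest. `manifest_changes` is a hypothetical helper, and it does not report files that were removed.

```shell
# manifest_changes is a hypothetical helper.
# Assumption: manifest lines look like "<hash> <relative path>".
# Usage: manifest_changes OLD_MANIFEST NEW_MANIFEST
# Prints the paths that are new or whose hash differs in NEW_MANIFEST.
manifest_changes() {
  # keep lines of the new manifest that do not occur verbatim in the old one,
  # then strip the leading hash field so only the path remains
  grep -Fxv -f "$1" "$2" | sed 's/^[^[:space:]]*[[:space:]]*//'
}
```

An empty result would suggest the mirror is still in sync with the published collection.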

robertoostenveld · 2021-06-23T11:50:48Z

@saigerutherford had planned to give a FAM on large data management with git-annex (which underlies datalad), but that had to be canceled. I hope she will present some time soon; I think she would be better able to explain the benefits.

robertoostenveld added the "data management" label (issue related to research data management) · Jul 8, 2021
