Use DataLad to improve access to the Donders Repository datasets #3

Open
robertoostenveld opened this issue Jun 22, 2021 · 11 comments
Labels
data management issue related to research data management

Comments

@robertoostenveld
Member

Describe the issue

DataLad is a data management solution based on git and git-annex. In short, it has a command-line interface and allows someone to check out a dataset without downloading all files at once. The large data files are technically symbolic links, and the actual data is only downloaded when needed. Just like a git repository, a DataLad dataset can contain regular files (like a README), while the large files are downloaded from elsewhere; for that, it stores the original URL of each large file.
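To make the "symbolic link" point concrete, here is a minimal sketch of how one could tell the three kinds of files apart in a checked-out dataset. It assumes the default symlink-based git-annex layout (filesystems without symlink support behave differently), and the function name `annex_state` is made up for this illustration.

```shell
# Report the state of a path inside a DataLad (git-annex) dataset.
# Assumption: default layout, where annexed files are symlinks into .git/annex.
annex_state() {
  local path=$1
  if [ -L "$path" ] && [ ! -e "$path" ]; then
    echo "pointer-only"      # symlink whose target has not been downloaded yet
  elif [ -L "$path" ]; then
    echo "content-present"   # symlink that resolves to downloaded content
  elif [ -e "$path" ]; then
    echo "regular-file"      # plain file kept in git itself, e.g. a README
  else
    echo "missing"           # no such path at all
  fi
}
```

After a fresh clone the large files would be expected to report `pointer-only`, and `content-present` once `datalad get` has fetched them.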

I propose that we use DataLad to improve the access to our published datasets. This will benefit people within the Donders, and will promote our data outside the Donders.

Describe yourself

Robert Oostenveld, Associate PI, megmethods

To Reproduce

This is what I have implemented so far

INST=di
OU=dcmn
DATASET=DSC_0004546_823_v1

IDENTIFIER=$INST.$OU.$DATASET
mkdir "$IDENTIFIER"
cd "$IDENTIFIER" || exit 1

PREFIX=https://public.data.donders.ru.nl/$OU/$DATASET

# these three should be present as files, not as links 
datalad download-url "$PREFIX"/MANIFEST.txt
datalad download-url "$PREFIX"/LICENSE.txt
datalad download-url "$PREFIX"/README.txt

# force is needed since the directory is not empty (any more)
datalad create --force

# each line of MANIFEST.txt holds a hash followed by the file name;
# "read -r HASH FILE" puts the first word in HASH and the rest of the
# line (including any spaces in the file name) in FILE
while read -r HASH FILE ; do
  # skip empty lines
  [ -n "$FILE" ] || continue
  if [ ! -e "$FILE" ]; then
    # dirname returns "." for top-level files, so mkdir -p is always safe
    mkdir -p "$(dirname "$FILE")"
    datalad download-url "$PREFIX/$FILE" -O "$FILE"
  fi
done < MANIFEST.txt
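Before downloading anything it can help to see which URLs the loop would request. The following is a small dry-run sketch, assuming (as the script above does) that each manifest line is a hash followed by a relative path. The function name `manifest_urls` is hypothetical, and URLs are printed as-is without percent-encoding of spaces.

```shell
# Print "<url><TAB><relative path>" for every entry in a manifest file.
# Assumption: each line is "<hash> <path>"; the path may contain spaces.
manifest_urls() {
  local prefix=$1 manifest=$2 hash file
  while read -r hash file; do
    [ -n "$file" ] || continue          # skip empty lines
    printf '%s/%s\t%s\n' "$prefix" "$file" "$file"
  done < "$manifest"
}
```

For example, `manifest_urls "$PREFIX" MANIFEST.txt | head` lists the first few URL and path pairs without touching the network.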

Expected behavior

The resulting datasets (with actual README, LICENSE and MANIFEST, and URL links to all data) could be uploaded to our GitHub organization, just like https://github.com/openneurodatasets. Subsequently we could add them to https://github.com/datalad/datasets.datalad.org and somehow have them appear on https://www.datalad.org/datasets.html.

@marcelzwiers
Collaborator

What is not clear to me is to what extent this is something that every user should do themselves (like organizing their data in BIDS), and to what extent it can be done centrally. But I certainly like the idea!

@robertoostenveld
Member Author

If I were to publish this script in a more visible location, then anyone with access to the data could indeed do this themselves (assuming they can run bash scripts).

My idea is that we could make access to these datasets easier and thereby give them more exposure. If we do it, we could ask the datalad team to also feature our datasets on https://www.datalad.org/datasets.html, and for users it would be as simple as datalad clone <dataset> && cd <dataset> && datalad get .

The idea was actually triggered yesterday in the OHBM-OSR emergent session where @amarquand and @saigerutherford presented their work, which made me think: "how can we support work like this by others using Donders datasets".

I don't mind giving it a try myself to make a prototype.

@robertoostenveld
Member Author

Oh, and besides exposing our data better to others (the goal I mention above), I think it would also contribute to improved skills and more efficient data handling in the DCCN itself.

@achetverikov

This is a great idea! I agree that things could be improved in terms of providing access to Donders datasets. But do we really need to add two more layers (GitHub + datalad) for this? Maybe we could just streamline the access to Donders Repository and add meta-data to make it discoverable?

@robertoostenveld
Member Author

I don't know yet how the aggregation of datasets on https://www.datalad.org/datasets.html works. I know that openneuro datasets are managed in http://github.com/openneurodatasets.

The idea to "streamline the access to the DR" is great, but it probably involves some serious work by the ICT team responsible for the DR and the RDR. I know they are currently very busy with scaling up to http://data.ru.nl (that is the RDR). I therefore don't think this would be high on their priority list. But if we first implement a prototype ourselves, we can build up the case for them integrating it.

@robertoostenveld
Member Author

Oh, and regarding metadata: the metadata of the collections in the DR (which is exposed on data.donders.ru.nl) is already shared with Narcis, Google Dataset Search, and others. For me it is mainly the datalad mechanism that is attractive, not the behaviour in a general web browser and generic search engine.

@achetverikov

So basically it would allow for git-like access to datasets with all files still kept at the Donders repository except for the readme/license/manifest? But then wouldn't people still have to register at DR to access the linked files?

@robertoostenveld
Member Author

yes and yes (partially).

The readme/license/manifest all basically contain public metadata that you can also glean when visiting https://data.donders.ru.nl, so those can be shared elsewhere (e.g. on github). These three files are also not added by the authors, but by the system upon collection publication.

The data files themselves need to be downloaded from https://public.data.donders.ru.nl (which does not require authentication) or from https://webdav.data.donders.ru.nl (which does). For potentially identifiable data people still have to sign up on the repository and agree with the Data Use Agreement for the specific collection.

After agreeing to the DUA they can authenticate on webdav, and datalad can download all files for them. You could consider datalad as an alternative to Cyberduck, but with all version control features, and with the ability to selectively download specific files or drop them again (i.e. remove them locally while keeping the pointer to the original online file).
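The selective download and drop cycle could look roughly like the sketch below. `datalad get` and `datalad drop` are the real commands; the helpers `run` and `with_file` and the `DRY_RUN` variable are made up for this illustration, and whether drop succeeds depends on git-annex still knowing a remote copy of the file.

```shell
# Run a command, or only print it when DRY_RUN is set (hypothetical helper).
run() {
  if [ -n "${DRY_RUN:-}" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Fetch one file, run a user-supplied command on it, then free the disk
# space again. The pointer to the online file stays in the dataset.
# If the command fails, the file is left in place rather than dropped.
with_file() {
  local file=$1
  shift
  run datalad get "$file" || return
  run "$@" "$file" || return
  run datalad drop "$file"
}
```

With `DRY_RUN=1`, calling `with_file Simon_data/pp01_ready4analysis_withCSD.mat ls -lL` only prints the three commands it would run, which is a cheap way to check the sequence before touching a 21GB dataset.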

@robertoostenveld
Member Author

robertoostenveld commented Jun 23, 2021

Have a look at https://github.com/Donders-Institute-Data

You can do

datalad clone https://github.com/Donders-Institute-Data/dcmn.DSC_0004546_823_v1.git
cd dcmn.DSC_0004546_823_v1
datalad get . -r

to get all files (takes a while), or

datalad get Simon_data/pp01_ready4analysis_withCSD.mat

to get a specific file. The dcmn example is large (21GB), the other examples are small. They all get the data from the public webdav interface, i.e. all have a license that allows redistribution.

The decision whether data on the repository can be accessed via the public webdav server depends on the license: if the data can be redistributed, the collection will appear on the public webdav server. Regardless of the DUA, all collections will also appear on the non-public webdav server.

@achetverikov

OK, I get the idea. To me, the biggest benefit of the proposed mechanism is then that we can potentially add exposure by mirroring the public datasets on GitHub and adding them to the Datalad collection. The command-line checkout by itself might be useful, but then the fact that you need to install datalad to use it kind of diminishes its value. I mean, you could also use curl or other similar tools. The downside is that we would also need to watch for updates on the datasets to make sure that the mirror is up to date.
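One possible way to watch for updates, sketched under the assumption that MANIFEST.txt lists a hash and path for every file in a collection: compare the manifest stored in the mirror with a freshly downloaded one. The function name `changed_entries` and the temporary file names are hypothetical.

```shell
# Print manifest lines that are new or changed in the second file.
# Assumption: each line is "<hash> <path>", so a changed file shows up
# as a line with a different hash.
changed_entries() {
  local old=$1 new=$2 tmp_old tmp_new
  tmp_old=$(mktemp) && tmp_new=$(mktemp) || return
  sort "$old" > "$tmp_old"
  sort "$new" > "$tmp_new"
  comm -13 "$tmp_old" "$tmp_new"   # lines that occur only in the new manifest
  rm -f "$tmp_old" "$tmp_new"
}

# Possible use (not executed here):
#   curl -fsSL "$PREFIX/MANIFEST.txt" -o /tmp/MANIFEST.remote
#   changed_entries MANIFEST.txt /tmp/MANIFEST.remote
```

Empty output would suggest the mirror is still up to date; any printed line names a file that would need to be re-registered.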

@robertoostenveld
Member Author

@saigerutherford had planned to give a FAM on large data management with git-annex (which underlies datalad), but that had to be canceled. I hope she will present some time soon; I think she would be better able to explain the benefits.

@robertoostenveld robertoostenveld added the data management issue related to research data management label Jul 8, 2021