[reproducibility] (Re-)generation of biosystems.txt and DisGeNET_diseases.txt #12

Open
cthoyt opened this issue Feb 8, 2021 · 0 comments

cthoyt commented Feb 8, 2021

The documentation says that this file was created from ChEMBL 24, PubChem, and DisGeNET. There have been several releases of these resources since then with more data, which could improve the quality and utility of your models.
However, it's not clear how these resource files were created. To assess the correctness of the work, it would also be necessary to show that the pipeline for getting the data is not only reproducible but also makes sense. Seeing the code gives insight into the special cases that might have been encountered and how they were handled. Without access to your code, somebody following your work as a guide could handle those cases differently and end up with different output.

This should also apply to the two resources that you ask the user to download.

Caveat: While ChEMBL has versioned downloads, PubChem's rolling release only allows downloading the most recent months/days. I'm not sure about DisGeNET. I know this might make it impossible to regenerate the exact datasets, which is why it's also good to have the dumps in this repo, so thanks for that.
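
Since exact regeneration may not be possible, one option is to record provenance next to the dumps. The following is only a minimal sketch, not the maintainers' pipeline: the function names, the manifest layout, and the source labels are all assumptions made for illustration. It writes a checksum and a free-text source label for each resource file so that a later regeneration can at least be compared against the committed version.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(files: dict, retrieved: str) -> dict:
    """Build a provenance record (hypothetical layout).

    `files` maps a file path to a free-text label describing the
    upstream source and version; `retrieved` is the download date.
    """
    return {
        "retrieved": retrieved,
        "files": [
            {
                "name": Path(path).name,
                "source": source,
                "sha256": sha256_of(Path(path)),
            }
            for path, source in sorted(files.items())
        ],
    }


if __name__ == "__main__":
    # The labels below are placeholders; the real sources and versions
    # would have to be filled in by whoever generated the files.
    candidates = {
        "biosystems.txt": "source and version to be filled in",
        "DisGeNET_diseases.txt": "source and version to be filled in",
    }
    present = {p: s for p, s in candidates.items() if Path(p).exists()}
    manifest = build_manifest(present, retrieved="2021-02-08")
    print(json.dumps(manifest, indent=2))
```

Committing the resulting JSON together with the generation script would let a reader check whether a freshly generated file matches the one in the repo, and see which upstream release it came from when it does not.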
