-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add disease ontology #473
base: master
Are you sure you want to change the base?
Add disease ontology #473
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test data and unittests, but this doesn't block data import.
|
||
- ### License | ||
|
||
This data is under a Creative Commons Public Domain Dedication [CC0 1.0 Universal license](https://disease-ontology.org/resources/do-resources). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing link to Disease Ontology website where it states that it's under the Creative Commons Public Domain license
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spiekos , I'm sorry I didn't quite understand that because https://disease-ontology.org/resources/do-resources directs the user to the license page on DO website.
|
||
### Schema Overview | ||
|
||
The schema representing reaction, metabolite and microbiome data from VMH is defined in [DO.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list.mcf) and [DO.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list_enum.mcf). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
schema will be stored in ChemicalComounds.mcf and ChemicalCompoundsEnum.mcf
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@spiekos , should it be chemical_compounds.mcf or disease.mcf?
icdoID: C:DiseaseOntology->ICDO | ||
meshID: C:DiseaseOntology->MESH | ||
nciID: C:DiseaseOntology->NCI | ||
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20200901 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these SNOMEDCTUS20200901 different columns in the file? Why are they stored under such odd column names?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the columns represent different version of the snow med, so SNOMEDCTUS20200901 refers to the 09/01/2020 version of the data
|
||
## About the import | ||
|
||
### Artifacts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing test data and unittests
|
||
|
||
def main(): | ||
file_input = sys.argv[1] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
too much is stored under main. Put into other functions and then call in main
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the late review, but just a few small things! Also, seems to be missing unit tests and test datasets.
|
||
### Overview | ||
|
||
The Disease Ontology database provides a standardized ontology for human diseases, for the purposes of consistency and reusability. It has contains extensive cross mapping of DO terms to other databases, namely, MeSH, ICD, NCI’s thesaurus, SNOMED and OMIM. More information on the database can be found [here](https://disease-ontology.org). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"has contains" -> "contains"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed!
query_str = """ | ||
SELECT DISTINCT ?id ?element_name | ||
WHERE {{ | ||
?element typeOf MeSHDescriptor . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if there is a typo or "MeSHDescriptor" is supposed to be cased like that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, its MeSHDescriptor - https://datacommons.org/browser/MeSHDescriptor
Add About the Import section
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please update the README.md to respond to the comments.
|
||
### Notes and Caveats | ||
|
||
The original format of the data was `.owl` and it was converted to a `.csv` file prior to ingestion into Data Commons. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please confirm that the only caveat of the dataset and data cleaning process is that it needs to be converted from an .owl file to a .csv file. Was there nothing acknowledged by Disease Ontology documentation itself or any strange things that you encountered in cleaning the data? E.g. here is where you should note that a node can have multiple parents.
|
||
### Download Data | ||
|
||
The human disease ontology data can be downloaded from their official github repository [here](https://www.vmh.life/#human/all). The data is in `.owl` format and had to be parsed into a `.csv` format (see [Notes and Caveats](#notes-and-caveats) for additional information on formatting). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please create a shell script, which downloads the data. If the data is converted from .owl to .csv outside of your format_disease_ontology.py
script, then also do that here.
|
||
### Artifacts | ||
|
||
#### Scripts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add short descriptions to all scripts and files. Internally link the scripts and files to itself in the directory.
|
||
`format_disease_ontology.py` | ||
|
||
##### Test Script |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update all of the script and file names to match what you added to the directory
To test format_refseq_chromosome_id_to_dcid.py run: | ||
|
||
``` | ||
python format_disease_ontology.py input_file.owl expected_output.csv |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update this command to reflect file names from the download file and what you want the final csv file to be named
|
||
##### Test File | ||
|
||
`input_file.txt` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missing small input file and expected output file that can be used to fun the script to test that it generates the expected output.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.tmcf is now updated. Please run json tests using the updated file.
@@ -0,0 +1,25 @@ | |||
ICD10_id,element,MESH_id,element_name,subClassOf,IAO_0000115,hasAlternativeId,hasExactSynonym,id,label,ICDO,MESH,NCI,SNOMEDCTUS20200901,UMLSCUI,ICD9CM,SNOMEDCTUS20200301,ICD10CM,SNOMEDCTUS20180301,GARD,OMIM,ORDO,EFO,MEDDRA,SNOMEDCTUS20190901 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file needs to be updated to the current expected output of the script. This is currently an old version.
df_do['ICD9CM'] = "ICD10/" + df_do['ICD9CM'].apply(str) | ||
return df_do | ||
|
||
def main(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please confirm that this is the most up-to-date version of the script that includes changes like how to handle lists of texts values that have commas within a single cell value and other changes that we've previously discussed.
alternativeDiseaseOntologyID : C:DiseaseOntology->hasAlternativeId | ||
diseaseSynonym: C:DiseaseOntology->hasExactSynonym | ||
internationalClassificationOfDiseaseID: C:DiseaseOntology->ICDO | ||
medicalSubjectHeadingDescriptorID: C:DiseaseOntology->MESH |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you please confirm that all of these IDs start with D. If not, then they aren't all MeSH Descriptors and we should switch to using the more general "medicalSubjectHeadingID" property here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The csv test file output is old. The Text values are not properly formatted. Please update the expected output file for the test and confirm that the current format_disease_ontology.py script is up to date. Please confirm that if you download the github repo and run the test script that it passes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add missing diseaseOntologyID property
@@ -0,0 +1,24 @@ | |||
Node: E:DiseaseOntology->E1 | |||
typeOf: dcs:Disease |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update script to include a column of text values with the disease ontology ids eg "DOID:0060329" then add to the tmcf the line diseaseOntologyID: C:DiseaseOntology->diseaseOntologyID
Disease Ontology data import is all set for a review !