Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add disease ontology #473

Open
wants to merge 31 commits into
base: master
Choose a base branch
from

Conversation

suhana13
Copy link

@suhana13 suhana13 commented Aug 3, 2021

Disease Ontology data import is all set for a review !

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test data and unittests, but this doesn't block data import.


- ### License

This data is under a Creative Commons Public Domain Dedication [CC0 1.0 Universal license](https://disease-ontology.org/resources/do-resources).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing link to Disease Ontology website where it states that it's under the Creative Commons Public Domain license

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spiekos , I'm sorry I didn't quite understand that because https://disease-ontology.org/resources/do-resources directs the user to the license page on DO website.


### Schema Overview

The schema representing reaction, metabolite and microbiome data from VMH is defined in [DO.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list.mcf) and [DO.mcf](https://raw.githubusercontent.com/suhana13/ISB-project/main/combined_list_enum.mcf).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

schema will be stored in ChemicalComounds.mcf and ChemicalCompoundsEnum.mcf

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@spiekos , should it be chemical_compounds.mcf or disease.mcf?

icdoID: C:DiseaseOntology->ICDO
meshID: C:DiseaseOntology->MESH
nciID: C:DiseaseOntology->NCI
snowmedctusID: C:DiseaseOntology->SNOMEDCTUS20200901
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

are these SNOMEDCTUS20200901 different columns in the file? Why are they stored under such odd column names?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the columns represent different version of the snow med, so SNOMEDCTUS20200901 refers to the 09/01/2020 version of the data


## About the import

### Artifacts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing test data and unittests



def main():
file_input = sys.argv[1]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

too much is stored under main. Put into other functions and then call in main

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

@chejennifer chejennifer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the late review, but just a few small things! Also, seems to be missing unit tests and test datasets.


### Overview

The Disease Ontology database provides a standardized ontology for human diseases, for the purposes of consistency and reusability. It has contains extensive cross mapping of DO terms to other databases, namely, MeSH, ICD, NCI’s thesaurus, SNOMED and OMIM. More information on the database can be found [here](https://disease-ontology.org).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"has contains" -> "contains"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed!

query_str = """
SELECT DISTINCT ?id ?element_name
WHERE {{
?element typeOf MeSHDescriptor .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if there is a typo or "MeSHDescriptor" is supposed to be cased like that?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please update the README.md to respond to the comments.


### Notes and Caveats

The original format of the data was `.owl` and it was converted to a `.csv` file prior to ingestion into Data Commons.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm that the only caveat of the dataset and data cleaning process is that it needs to be converted from an .owl file to a .csv file. Was there nothing acknowledged by Disease Ontology documentation itself or any strange things that you encountered in cleaning the data? E.g. here is where you should note that a node can have multiple parents.


### Download Data

The human disease ontology data can be downloaded from their official github repository [here](https://www.vmh.life/#human/all). The data is in `.owl` format and had to be parsed into a `.csv` format (see [Notes and Caveats](#notes-and-caveats) for additional information on formatting).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please create a shell script, which downloads the data. If the data is converted from .owl to .csv outside of your format_disease_ontology.py script, then also do that here.


### Artifacts

#### Scripts
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add short descriptions to all scripts and files. Internally link the scripts and files to itself in the directory.


`format_disease_ontology.py`

##### Test Script
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update all of the script and file names to match what you added to the directory

To test format_refseq_chromosome_id_to_dcid.py run:

```
python format_disease_ontology.py input_file.owl expected_output.csv
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update this command to reflect file names from the download file and what you want the final csv file to be named


##### Test File

`input_file.txt`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing small input file and expected output file that can be used to fun the script to test that it generates the expected output.

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.tmcf is now updated. Please run json tests using the updated file.

@@ -0,0 +1,25 @@
ICD10_id,element,MESH_id,element_name,subClassOf,IAO_0000115,hasAlternativeId,hasExactSynonym,id,label,ICDO,MESH,NCI,SNOMEDCTUS20200901,UMLSCUI,ICD9CM,SNOMEDCTUS20200301,ICD10CM,SNOMEDCTUS20180301,GARD,OMIM,ORDO,EFO,MEDDRA,SNOMEDCTUS20190901
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file needs to be updated to the current expected output of the script. This is currently an old version.

df_do['ICD9CM'] = "ICD10/" + df_do['ICD9CM'].apply(str)
return df_do

def main():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please confirm that this is the most up-to-date version of the script that includes changes like how to handle lists of texts values that have commas within a single cell value and other changes that we've previously discussed.

alternativeDiseaseOntologyID : C:DiseaseOntology->hasAlternativeId
diseaseSynonym: C:DiseaseOntology->hasExactSynonym
internationalClassificationOfDiseaseID: C:DiseaseOntology->ICDO
medicalSubjectHeadingDescriptorID: C:DiseaseOntology->MESH
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please confirm that all of these IDs start with D. If not, then they aren't all MeSH Descriptors and we should switch to using the more general "medicalSubjectHeadingID" property here.

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The csv test file output is old. The Text values are not properly formatted. Please update the expected output file for the test and confirm that the current format_disease_ontology.py script is up to date. Please confirm that if you download the github repo and run the test script that it passes.

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add missing diseaseOntologyID property

@@ -0,0 +1,24 @@
Node: E:DiseaseOntology->E1
typeOf: dcs:Disease
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update script to include a column of text values with the disease ontology ids eg "DOID:0060329" then add to the tmcf the line diseaseOntologyID: C:DiseaseOntology->diseaseOntologyID

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants