-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add MeSH data #507
Add MeSH data #507
Conversation
This is the documentation for the Medical Subject Headings (MeSH) import. |
from xml.etree.ElementTree import parse | ||
|
||
|
||
def format_mesh_xml(mesh_xml): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've not validated the logic of the Python code. I'm assuming @spiekos will do that :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The text values are not properly denoted by including them in quotes. Anytime there is a potential to be a comma in a text value please include the additional brackets of /"<my_text_value>/" around your value so that the commas in the value are not inappropriately parsed. The Descriptor output file contains a lot of duplicate info: please simplify and remove.
Updated mesh_record.tmcf and mesh_pubchem.tmcf to have correct mappings. There is no need to have descriptor_ID information in this file as we are only mapping to ChemicalCompounds to MeSHRecords.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tests missing for mesh_record.csv and mesh_pubchem.csv. Please add.
scripts/biomedical/mesh/README.md
Outdated
|
||
In order to run the script [`format_mesh.py`](format_mesh.py), the user requires the `mesh.xml` file, which spits out four different | ||
csv files, each relating to descriptor, concept, qualifier and term. | ||
In order to run the script [`format_mesh_record.py`](format_mesh_record.py), the user requires the `mesh_record.xml` file and the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please expand a little more about how you use the pubchem file to establish the mapping between the MeSHRecord and it's corresponding ChemicalCompound. A couple of sentences explicitly stating the goal and how it was accomplished is sufficient.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is close. Just. a few more comments about the tests that need to be resolved.
@@ -0,0 +1,18 @@ | |||
DescriptorID,ConceptID,ConceptName,ScopeNote,DateCreated,DateRevised,DateEstablished,Concept_dcid,Descriptor_dcid | |||
D000001,M0000001,"""Calcimycin""","""An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems.""",1974-11-19,2016-05-27,1984-01-01,bio/M0000001,bio/D000001 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of these test.csv files are formatted correctly - all of the quotes around the text values is incorrect. Please update your test files so they actually reflect the final output files. This needs to be done for all .csv test files.
def test_main(self): | ||
"""Test in the main function""" | ||
# Read in the expected output files into pandas dataframes | ||
df1_expected = pd.read_csv('unit-tests/mesh_record_test.csv') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a reason why these test files are so big? Can you please limit it to ~10 entities for each? They currently are not viewable on GitHub.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated mesh_pharma_concept.tmcf so that it's schema is mapping to the correct type of node. The node type that this data is referring to is a MeSH Supplementary Record.
Scripts and tests missing for the newly added 5 tmcf + csv pairs. Shell scripts and the README.md also need to be updated to reflect the addition of these data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated mesh_pharma_concept.tmcf so that it's schema is mapping to the correct type of node. The node type that this data is referring to is a MeSH Supplementary Record.
Scripts and tests missing for the newly added 5 tmcf + csv pairs. Shell scripts and the README.md also need to be updated to reflect the addition of these data.
This is now taken care of as part of #1000 |
No description provided.