Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add MeSH data #507

Closed
wants to merge 54 commits into from
Closed

Add MeSH data #507

wants to merge 54 commits into from

Conversation

suhana13
Copy link

No description provided.

@spiekos spiekos assigned pradh and shifucun and unassigned Spaceenter Jun 6, 2022
@spiekos spiekos requested review from pradh and shifucun June 6, 2022 18:30
@spiekos spiekos assigned spiekos and suhana13 and unassigned pradh and shifucun Jun 6, 2022
@spiekos spiekos requested a review from chejennifer June 6, 2022 18:31
@spiekos
Copy link
Contributor

spiekos commented Jun 6, 2022

This is the documentation for the Medical Subject Headings (MeSH) import.

scripts/biomedical/mesh/README.md Outdated Show resolved Hide resolved
scripts/biomedical/mesh/format_mesh.py Show resolved Hide resolved
scripts/biomedical/mesh/format_mesh.py Outdated Show resolved Hide resolved
from xml.etree.ElementTree import parse


def format_mesh_xml(mesh_xml):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've not validated the logic of the Python code. I'm assuming @spiekos will do that :)

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text values are not properly denoted by including them in quotes. Anytime there is a potential to be a comma in a text value please include the additional brackets of /"<my_text_value>/" around your value so that the commas in the value are not inappropriately parsed. The Descriptor output file contains a lot of duplicate info: please simplify and remove.

Updated mesh_record.tmcf and mesh_pubchem.tmcf to have correct mappings. There is no need to have descriptor_ID information in this file as we are only mapping to ChemicalCompounds to MeSHRecords.

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tests missing for mesh_record.csv and mesh_pubchem.csv. Please add.

scripts/biomedical/mesh/mesh_test.py Show resolved Hide resolved
scripts/biomedical/mesh/README.md Show resolved Hide resolved
scripts/biomedical/mesh/README.md Show resolved Hide resolved

In order to run the script [`format_mesh.py`](format_mesh.py), the user requires the `mesh.xml` file, which spits out four different
csv files, each relating to descriptor, concept, qualifier and term.
In order to run the script [`format_mesh_record.py`](format_mesh_record.py), the user requires the `mesh_record.xml` file and the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please expand a little more about how you use the pubchem file to establish the mapping between the MeSHRecord and it's corresponding ChemicalCompound. A couple of sentences explicitly stating the goal and how it was accomplished is sufficient.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is close. Just. a few more comments about the tests that need to be resolved.

@@ -0,0 +1,18 @@
DescriptorID,ConceptID,ConceptName,ScopeNote,DateCreated,DateRevised,DateEstablished,Concept_dcid,Descriptor_dcid
D000001,M0000001,"""Calcimycin""","""An ionophorous, polyether antibiotic from Streptomyces chartreusensis. It binds and transports CALCIUM and other divalent cations across membranes and uncouples oxidative phosphorylation while inhibiting ATPase of rat liver mitochondria. The substance is used mostly as a biochemical tool to study the role of divalent cations in various biological systems.""",1974-11-19,2016-05-27,1984-01-01,bio/M0000001,bio/D000001
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of these test.csv files are formatted correctly - all of the quotes around the text values is incorrect. Please update your test files so they actually reflect the final output files. This needs to be done for all .csv test files.

def test_main(self):
"""Test in the main function"""
# Read in the expected output files into pandas dataframes
df1_expected = pd.read_csv('unit-tests/mesh_record_test.csv')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why these test files are so big? Can you please limit it to ~10 entities for each? They currently are not viewable on GitHub.

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated mesh_pharma_concept.tmcf so that it's schema is mapping to the correct type of node. The node type that this data is referring to is a MeSH Supplementary Record.

Scripts and tests missing for the newly added 5 tmcf + csv pairs. Shell scripts and the README.md also need to be updated to reflect the addition of these data.

Copy link
Contributor

@spiekos spiekos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated mesh_pharma_concept.tmcf so that it's schema is mapping to the correct type of node. The node type that this data is referring to is a MeSH Supplementary Record.

Scripts and tests missing for the newly added 5 tmcf + csv pairs. Shell scripts and the README.md also need to be updated to reflect the addition of these data.

@google-cla google-cla bot added cla: no and removed cla: yes labels Mar 5, 2024
@spiekos spiekos closed this May 15, 2024
@spiekos
Copy link
Contributor

spiekos commented May 15, 2024

This is now taken care of as part of #1000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants