Add MeSH data #507
Closed
Changes from 36 commits
54 commits
1ccd2da
feat: add format_mesh.py
6a2bf2c
feat: add mesh tmcfs
fcadfbe
style: run linter
413c141
feat: add readme
4e2dd22
feat: add comments and fix dcids
7e02eb1
feat: add property
78ae772
Update README.md
spiekos 6ed4421
Add info about tMCFs
spiekos f8aa970
Update README.md
spiekos 1d06659
Update README.md
spiekos 2ffc455
Merge branch 'master' into add_mesh_data
spiekos 2a988e0
Create download.sh
spiekos 63a1a16
Update README.md
spiekos 9d39b9d
Update README.md
spiekos ed2b5d8
Update output file names
spiekos 593139e
Update README.md
spiekos 173fd8d
Update download link
spiekos 4c2c6fa
Update column name
spiekos 6c7f9b9
Update property names
spiekos 4ad5765
Update property names
spiekos e31dc71
Update property names
spiekos 235c149
Update property names
spiekos 6a44e48
Update mesh_descriptor.tmcf
spiekos 29303ff
Update mesh_descriptor.tmcf
spiekos 79250e2
Update README.md
spiekos c8a7855
feat: add more properties for mesh data
c5566b1
feat: add properties in tmcfs
4d549af
add unit tests
940b460
Update README.md
spiekos aed39b3
formatting: add double quotes
01bb180
fix indentation on readme
94be1f0
style: add comments and text-value quotes
36b27af
add pubchem-mesh mappings
db2d138
add mapping py script
60b22cb
update Readme
7af7fde
add property to MeSHRecord
spiekos 3377334
Update mesh_pubchem.tmcf
spiekos 3a265cc
update mesh py script
4a3d1a8
feat: add test data for mesh record and pubchem mapping
a34ce67
update test data for mesh
7de6c17
feat: add test file for mesh record
7f5996c
update readme
d35b536
Update mesh_record.tmcf
spiekos c799ce0
Update README.md
spiekos 0753ac2
Update mesh_pubchem.tmcf
spiekos 0a5c610
Update README.md
spiekos c1518df
feat: add pharmacological class script
ba70fe2
feat:add mesh qualifier and pharma scripts
a8b9c06
feat: add tmcfs for qualifier and pharma class
33415e5
Update tmcf
spiekos 49b8353
fix typo
spiekos 01c25af
update format of tmcf
spiekos 092876a
feat: add illegal char check
3b30247
Merge branch 'master' into add_mesh_data
spiekos
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,97 @@ | ||
# Importing Medical Subject Headings (MeSH) data from NCBI | ||
|
||
## Table of Contents | ||
|
||
- [Importing Medical Subject Headings (MeSH) data from NCBI](#importing-medical-subject-headings-mesh-data-from-ncbi) | ||
- [About the Dataset](#about-the-dataset) | ||
- [Download Data](#download-data) | ||
- [Overview](#overview) | ||
- [Notes and Caveats](#notes-and-caveats) | ||
- [License](#license) | ||
- [About the import](#about-the-import) | ||
- [Artifacts](#artifacts) | ||
- [Scripts](#scripts) | ||
- [tMCFs](#tmcfs) | ||
- [Schema](#schema) | ||
- [Examples](#examples) | ||
|
||
## About the Dataset | ||
|
||
“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information”. Data Commons includes the Concept, Descriptor, Qualifier, Record and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology Information (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/). | ||
PubChem is one of the largest collections of chemical compound information. Its compounds are mapped to many other medical ontologies, including MeSH. More information about compound IDs and other properties can be found on the official PubChem [website](https://pubchemdocs.ncbi.nlm.nih.gov/compounds). | ||
|
||
### Download Data | ||
|
||
All the terminology referenced in the MeSH data can be downloaded in various formats [here](https://www.nlm.nih.gov/databases/download/mesh.html). The current version of the XML files can also be downloaded by running [`download.sh`](download.sh). Two XML files are used to map the MeSH elements to each other: `desc2022.xml` and `supp2022.xml`. | ||
The file containing PubChem compound IDs and names, saved as `mesh-pubchem.csv`, is also downloaded by [`download.sh`](download.sh). | ||
|
||
### Overview | ||
|
||
This directory stores the scripts used to convert the XML files obtained from the NCBI webpage into five csv files, one each for records, concepts, terms, qualifiers and descriptors. Each csv file describes how its element relates to the others and includes a generated dcid for every entry. | ||
MeSH is the vocabulary thesaurus used for indexing articles for PubMed. In addition, the scripts map PubChem compound IDs to MeSH descriptor and record IDs by joining the PubChem file to the MeSH records on the MeSH record name listed for each PubChem compound ID. | ||
|
||
- For mapping the MeSH descriptor ID with the MeSH record ID, the [supplementary file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2022.xml) is used. | ||
- For mapping the MeSH descriptor ID with each of the three other IDs: concept ID, term ID, qualifier ID, the [descriptor file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2022.xml) is used. | ||
- For mapping the PubChem compound ID with the MeSH record and descriptor ID, the [PubChem file](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH) is used. | ||
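The PubChem join can be sketched roughly as follows. This is a minimal illustration and not the code in `format_mesh_record.py`: the function names, the tab-delimited layout of the PubChem file (a compound ID followed by one or more MeSH names per line), and the case-insensitive name match are all assumptions.

```python
def build_name_index(mesh_records):
    """Map a normalized MeSH record name to its record ID.

    `mesh_records` is an iterable of (record_id, record_name) pairs.
    The pair layout is hypothetical, chosen only for this sketch.
    """
    return {name.strip().lower(): record_id for record_id, name in mesh_records}


def map_cids_to_records(cid_mesh_lines, name_index):
    """Yield (compound_id, record_id) for every name found in the index.

    Assumes each line is tab-delimited: a PubChem compound ID followed
    by one or more MeSH names.
    """
    for line in cid_mesh_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        compound_id, names = fields[0], fields[1:]
        for name in names:
            record_id = name_index.get(name.strip().lower())
            if record_id is not None:
                yield compound_id, record_id


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    records = [("C000001", "Example Compound A"), ("C000002", "Example Compound B")]
    lines = ["11\tExample Compound A\n", "22\tUnknown Name\n", "33\texample compound b\n"]
    for pair in map_cids_to_records(lines, build_name_index(records)):
        print(pair)
```

Building the name index once and streaming the PubChem file keeps the join to a single pass over each input. Names that have no matching MeSH record are simply dropped in this sketch.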
|
||
### Notes and Caveats | ||
|
||
The main MeSH descriptor file and the MeSH supplementary file are both XML formatted and are roughly 300-600 MB in size. This is one of the major contributors to the extended run time of the scripts. Extracting the information from the XML tags and converting it into well-formatted csv involves many computationally heavy steps, and the run time depends on the RAM available on the user's system. | ||
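One common way to keep memory use bounded on XML files of this size is to stream them record by record instead of loading the whole tree. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library. It is not taken from `format_mesh.py`: the function name and the sample data are made up, and only the element names (`DescriptorRecord`, `DescriptorUI`, `DescriptorName/String`) follow the MeSH XML data elements documentation.

```python
import io
import xml.etree.ElementTree as ET

# Small hand-written sample with made-up IDs and names, for illustration only.
SAMPLE = b"""<DescriptorRecordSet>
  <DescriptorRecord>
    <DescriptorUI>DX00001</DescriptorUI>
    <DescriptorName><String>Example Heading One</String></DescriptorName>
  </DescriptorRecord>
  <DescriptorRecord>
    <DescriptorUI>DX00002</DescriptorUI>
    <DescriptorName><String>Example Heading Two</String></DescriptorName>
  </DescriptorRecord>
</DescriptorRecordSet>"""


def iter_descriptors(xml_source):
    """Yield (descriptor_ui, descriptor_name) one record at a time."""
    context = ET.iterparse(xml_source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "DescriptorRecord":
            yield elem.findtext("DescriptorUI"), elem.findtext("DescriptorName/String")
            # Drop the records parsed so far so the tree does not keep growing.
            root.clear()


if __name__ == "__main__":
    for ui, name in iter_descriptors(io.BytesIO(SAMPLE)):
        print(ui, name)
```

To try this on the downloaded data, `iter_descriptors` can be given a file path such as `scratch/mesh.xml` instead of the in-memory sample.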
|
||
To run [`format_mesh.py`](format_mesh.py), the user needs the `mesh.xml` file. The script writes four csv files, one each for descriptors, concepts, qualifiers and terms. | ||
To run [`format_mesh_record.py`](format_mesh_record.py), the user needs the `mesh-record.xml` file and the `mesh-pubchem.csv` file. The script maps each record to its descriptor and to its PubChem compound ID, and writes two csv files. | ||
|
||
### License | ||
|
||
Any works found on National Library of Medicine (NLM) Web sites may be freely used or reproduced without permission in the U.S. More information about the license can be found [here](https://www.nlm.nih.gov/web_policies.html). | ||
|
||
## About the import | ||
|
||
### Artifacts | ||
|
||
#### Scripts | ||
|
||
- [`format_mesh.py`](format_mesh.py) converts the original XML into four formatted csv files, each of which can be imported alongside its matching tMCF. | ||
- [`format_mesh_record.py`](format_mesh_record.py) converts the supplementary MeSH record file into a csv mapped to the MeSH descriptor ID. It also maps the MeSH records to PubChem compound IDs, which results in a second, separate csv. | ||
- [`download.sh`](download.sh) downloads all the files from the NCBI webpage and stores them in the `scratch/` directory. | ||
- [`mesh_run.sh`](mesh_run.sh) runs all the Python commands, generating six csv files in total. | ||
|
||
#### tMCFs | ||
|
||
The tMCF files that map each column in the corresponding CSV file to the appropriate property can be found [here](tmcf). They include: | ||
|
||
- [`mesh_concept.tmcf`](tmcf/mesh_concept.tmcf) | ||
- [`mesh_descriptor.tmcf`](tmcf/mesh_descriptor.tmcf) | ||
- [`mesh_qualifier.tmcf`](tmcf/mesh_qualifier.tmcf) | ||
- [`mesh_term.tmcf`](tmcf/mesh_term.tmcf) | ||
- [`mesh_pubchem.tmcf`](tmcf/mesh_pubchem.tmcf) | ||
- [`mesh_record.tmcf`](tmcf/mesh_record.tmcf) | ||
|
||
### Schema | ||
|
||
|
||
Each of the four csv files generated by [`format_mesh.py`](format_mesh.py), paired with its tMCF, imports the part of the MeSH ontology that corresponds to one of the following four entities: [MeSHConcept](https://datacommons.org/browser/MeSHConcept), [MeSHDescriptor](https://datacommons.org/browser/MeSHDescriptor), [MeSHQualifier](https://datacommons.org/browser/MeSHQualifier), or [MeSHTerm](https://datacommons.org/browser/MeSHTerm). | ||
|
||
## Examples | ||
|
||
|
||
To generate the formatted csv files from the XML files: | ||
|
||
1. Download the data to `scratch/`. | ||
|
||
``` | ||
bash download.sh | ||
``` | ||
|
||
2. Generate the cleaned csv files. | ||
|
||
``` | ||
bash mesh_run.sh | ||
``` |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
#!/bin/bash | ||
|
||
mkdir -p scratch && cd scratch | ||
# download the MeSH descriptor XML file; -f makes curl fail on an HTTP error instead of saving the error page | ||
curl -f -o mesh.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2022.xml | ||
# download the MeSH supplementary record XML file | ||
curl -f -o mesh-record.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2022.xml | ||
# download the PubChem compound ID and MeSH name file | ||
curl -f -o mesh-pubchem.csv https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH | ||
|
Please expand a little more on how you use the PubChem file to establish the mapping between the MeSHRecord and its corresponding ChemicalCompound. A couple of sentences explicitly stating the goal and how it was accomplished is sufficient.
Done!