Add MeSH data #507

Closed
wants to merge 54 commits into from
Changes from 38 commits
Commits
1ccd2da
feat: add format_mesh.py
Sep 13, 2021
6a2bf2c
feat: add mesh tmcfs
Sep 13, 2021
fcadfbe
style: run linter
Sep 17, 2021
413c141
feat: add readme
Sep 17, 2021
4e2dd22
feat: add comments and fix dcids
Sep 20, 2021
7e02eb1
feat: add property
Sep 20, 2021
78ae772
Update README.md
spiekos Jun 6, 2022
6ed4421
Add info about tMCFs
spiekos Jun 6, 2022
f8aa970
Update README.md
spiekos Jun 6, 2022
1d06659
Update README.md
spiekos Jun 6, 2022
2ffc455
Merge branch 'master' into add_mesh_data
spiekos Jun 6, 2022
2a988e0
Create download.sh
spiekos Jun 6, 2022
63a1a16
Update README.md
spiekos Jun 6, 2022
9d39b9d
Update README.md
spiekos Jun 6, 2022
ed2b5d8
Update output file names
spiekos Jun 6, 2022
593139e
Update README.md
spiekos Jun 6, 2022
173fd8d
Update download link
spiekos Jun 7, 2022
4c2c6fa
Update column name
spiekos Jun 7, 2022
6c7f9b9
Update property names
spiekos Jun 7, 2022
4ad5765
Update property names
spiekos Jun 13, 2022
e31dc71
Update property names
spiekos Jun 13, 2022
235c149
Update property names
spiekos Jun 13, 2022
6a44e48
Update mesh_descriptor.tmcf
spiekos Jun 13, 2022
29303ff
Update mesh_descriptor.tmcf
spiekos Jun 13, 2022
79250e2
Update README.md
spiekos Jun 14, 2022
c8a7855
feat: add more properties for mesh data
Jun 15, 2022
c5566b1
feat: add properties in tmcfs
Jun 15, 2022
4d549af
add unit tests
Jul 6, 2022
940b460
Update README.md
spiekos Aug 1, 2022
aed39b3
formatting: add double quotes
Aug 1, 2022
01bb180
fix indentation on readme
Aug 1, 2022
94be1f0
style: add comments and text-value quotes
Aug 2, 2022
36b27af
add pubchem-mesh mappings
Aug 3, 2022
db2d138
add mapping py script
Aug 3, 2022
60b22cb
update Readme
Aug 3, 2022
7af7fde
add property to MeSHRecord
spiekos Aug 23, 2022
3377334
Update mesh_pubchem.tmcf
spiekos Aug 23, 2022
3a265cc
update mesh py script
Aug 24, 2022
4a3d1a8
feat: add test data for mesh record and pubchem mapping
Aug 29, 2022
a34ce67
update test data for mesh
Aug 29, 2022
7de6c17
feat: add test file for mesh record
Aug 29, 2022
7f5996c
update readme
Aug 29, 2022
d35b536
Update mesh_record.tmcf
spiekos Sep 20, 2022
c799ce0
Update README.md
spiekos Sep 20, 2022
0753ac2
Update mesh_pubchem.tmcf
spiekos Sep 20, 2022
0a5c610
Update README.md
spiekos Sep 20, 2022
c1518df
feat: add pharmacological class script
Sep 26, 2022
ba70fe2
feat:add mesh qualifier and pharma scripts
Sep 27, 2022
a8b9c06
feat: add tmcfs for qualifier and pharma class
Sep 27, 2022
33415e5
Update tmcf
spiekos Nov 16, 2022
49b8353
fix typo
spiekos Dec 2, 2022
01c25af
update format of tmcf
spiekos Dec 2, 2022
092876a
feat: add illegal char check
Aug 14, 2023
3b30247
Merge branch 'master' into add_mesh_data
spiekos Mar 5, 2024
97 changes: 97 additions & 0 deletions scripts/biomedical/mesh/README.md
@@ -0,0 +1,97 @@
# Importing Medical Subject Headings (MeSH) data from NCBI

## Table of Contents

- [Importing Medical Subject Headings (MeSH) data from NCBI](#importing-medical-subject-headings-mesh-data-from-ncbi)
  - [About the Dataset](#about-the-dataset)
    - [Download Data](#download-data)
    - [Overview](#overview)
    - [Notes and Caveats](#notes-and-caveats)
    - [License](#license)
  - [About the import](#about-the-import)
    - [Artifacts](#artifacts)
      - [Scripts](#scripts)
      - [tMCFs](#tmcfs)
    - [Schema](#schema)
  - [Examples](#examples)

## About the Dataset

“The Medical Subject Headings (MeSH) thesaurus is a controlled and hierarchically-organized vocabulary produced by the National Library of Medicine. It is used for indexing, cataloging, and searching of biomedical and health-related information.” Data Commons includes the Concept, Descriptor, Qualifier, Record and Term elements of MeSH as described [here](https://www.nlm.nih.gov/mesh/xml_data_elements.html). More information about the dataset can be found on the official National Center for Biotechnology Information (NCBI) [website](https://www.ncbi.nlm.nih.gov/mesh/).
PubChem is one of the largest repositories of chemical compound information. It is mapped to many other medical ontologies, including MeSH. More information about compound IDs and other properties can be found on its official [website](https://pubchemdocs.ncbi.nlm.nih.gov/compounds).

### Download Data

All the terminology referenced in the MeSH data can be downloaded in various formats [here](https://www.nlm.nih.gov/databases/download/mesh.html). The current version of the xml files can also be downloaded by running [`download.sh`](download.sh). To map all MeSH terms to each other, two xml files are used: `desc2022.xml` and `supp2022.xml`.
The csv version of the file containing PubChem compound IDs and names can also be downloaded by running [`download.sh`](download.sh).

### Overview

This directory stores the scripts used to convert the xml obtained from the NCBI webpage into five different csv files, which together describe the relations between records, concepts, terms, qualifiers and descriptors, and to generate dcids for each.
The MeSH data stores the vocabulary thesaurus used for indexing articles for PubMed. In addition, the scripts map the PubChem compound IDs to the MeSH descriptor and record IDs, joining on MeSH record name/PubChem compound ID.

- To map the MeSH descriptor ID to the MeSH record ID, the [supplementary file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2022.xml) is used.
- To map the MeSH descriptor ID to each of the three other IDs (concept ID, term ID, qualifier ID), the [descriptor file](https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2022.xml) is used.
- To map the PubChem compound ID to the MeSH record and descriptor IDs, the [PubChem file](https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH) is used.
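
The PubChem join described above can be sketched roughly as follows. This is a minimal illustration, not the code in `format_mesh_record.py`: it assumes the PubChem file is tab-separated with a compound ID followed by one or more MeSH names per line, and that a lookup from MeSH record name to record and descriptor IDs has already been built from the supplementary file. The function names, column names and sample values are all hypothetical.

```python
import csv
import io


def parse_cid_mesh(lines):
    """Yield (cid, mesh_name) pairs from CID-MeSH style lines.

    Assumption: each line is tab-separated, holding a compound ID
    followed by one or more MeSH names.
    """
    for row in csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE):
        if len(row) < 2:
            continue  # skip blank or malformed lines
        cid, names = row[0], row[1:]
        for name in names:
            yield cid, name


def join_on_name(cid_name_pairs, records):
    """Attach record and descriptor IDs to each compound ID by record name.

    `records` maps a MeSH record name to (record_id, descriptor_id).
    Names without a matching record are dropped (an inner join).
    """
    joined = []
    for cid, name in cid_name_pairs:
        match = records.get(name)
        if match is None:
            continue
        record_id, descriptor_id = match
        joined.append({
            "CID": cid,
            "record_id": record_id,
            "descriptor_id": descriptor_id,
            "name": name,
        })
    return joined


# Placeholder data for illustration only; these are not real MeSH IDs.
sample_cid_mesh = io.StringIO("1\texample compound\n2\tunmatched name\n")
sample_records = {"example compound": ("C_EXAMPLE", "D_EXAMPLE")}
rows = join_on_name(parse_cid_mesh(sample_cid_mesh), sample_records)
print(rows)
```

An exact-match join on names is the simplest choice; whether the real script normalizes case or whitespace before joining is not stated here.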

### Notes and Caveats

The main MeSH file and the MeSH supplementary file are both XML formatted. In addition, they take up about 300-600 GB of storage. This is one of the major contributors to the extended run time of the scripts. Extracting the information from XML tags and converting it into well-formatted csv involves many computationally heavy steps, and performance depends on the RAM of the user's system.
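
One common way to limit memory use with large XML files is to stream them instead of loading the whole tree. The sketch below uses `xml.etree.ElementTree.iterparse` from the Python standard library and is offered as an illustration only; it is not necessarily how `format_mesh.py` parses the file. The element names follow the MeSH descriptor XML layout, and the function name is hypothetical.

```python
import io
import xml.etree.ElementTree as ET


def iter_descriptors(source):
    """Yield (descriptor_id, descriptor_name) one record at a time."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "DescriptorRecord":
            continue
        # Only direct children are matched, so nested DescriptorUI
        # elements inside the record are ignored.
        ui = elem.findtext("DescriptorUI")
        name = elem.findtext("DescriptorName/String")
        yield ui, name
        elem.clear()  # drop the children of the record just processed


# Tiny in-memory stand-in for the downloaded descriptor file.
sample = io.BytesIO(
    b"<DescriptorRecordSet>"
    b"<DescriptorRecord>"
    b"<DescriptorUI>D000001</DescriptorUI>"
    b"<DescriptorName><String>Calcimycin</String></DescriptorName>"
    b"</DescriptorRecord>"
    b"</DescriptorRecordSet>"
)
descriptors = list(iter_descriptors(sample))
print(descriptors)
```

To run this against the real data, pass the path to the downloaded file (for example `scratch/mesh.xml`) instead of the in-memory sample.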

To run [`format_mesh.py`](format_mesh.py), the user needs the `mesh.xml` file; the script writes four different csv files, one each for descriptors, concepts, qualifiers and terms.
To run [`format_mesh_record.py`](format_mesh_record.py), the user needs the `mesh_record.xml` file and the
Contributor

Please expand a little more about how you use the pubchem file to establish the mapping between the MeSHRecord and it's corresponding ChemicalCompound. A couple of sentences explicitly stating the goal and how it was accomplished is sufficient.

Author

Done!

`mesh-pubchem.csv` file, which maps each record to its descriptor and to the PubChem compound ID; the script then writes two csv files.
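
The record-to-descriptor part of this mapping might be extracted along the following lines. This is a sketch under stated assumptions rather than the contents of `format_mesh_record.py`: it assumes the supplementary file lists each record's descriptors under `HeadingMappedToList`, and that a descriptor ID there may carry a leading `*`, which is stripped. The function name and the sample IDs are placeholders.

```python
import io
import xml.etree.ElementTree as ET

# Assumed location of mapped descriptor IDs inside a SupplementalRecord.
MAPPED_PATH = (
    "HeadingMappedToList/HeadingMappedTo/DescriptorReferredTo/DescriptorUI"
)


def iter_record_descriptor_pairs(source):
    """Yield (record_id, descriptor_id) for every mapped heading."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "SupplementalRecord":
            continue
        record_id = elem.findtext("SupplementalRecordUI")
        for ui in elem.findall(MAPPED_PATH):
            # Strip a leading asterisk if one is present.
            yield record_id, (ui.text or "").lstrip("*")
        elem.clear()  # drop the children of the record just processed


# Placeholder record; the IDs are made up for illustration.
sample = io.BytesIO(
    b"<SupplementalRecordSet><SupplementalRecord>"
    b"<SupplementalRecordUI>C9999999</SupplementalRecordUI>"
    b"<HeadingMappedToList>"
    b"<HeadingMappedTo><DescriptorReferredTo>"
    b"<DescriptorUI>*D9999998</DescriptorUI>"
    b"</DescriptorReferredTo></HeadingMappedTo>"
    b"<HeadingMappedTo><DescriptorReferredTo>"
    b"<DescriptorUI>D9999997</DescriptorUI>"
    b"</DescriptorReferredTo></HeadingMappedTo>"
    b"</HeadingMappedToList>"
    b"</SupplementalRecord></SupplementalRecordSet>"
)
pairs = list(iter_record_descriptor_pairs(sample))
print(pairs)
```

A record mapped to several descriptors yields one pair per descriptor, which fits a csv with one row per record-descriptor relation.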

### License

Any works found on National Library of Medicine (NLM) Web sites may be freely used or reproduced without permission in the U.S. More information about the license can be found [here](https://www.nlm.nih.gov/web_policies.html).

## About the import

### Artifacts

#### Scripts

- [`format_mesh.py`](format_mesh.py) converts the original xml into four formatted csv files, each of which can be imported alongside its matching tMCF.
- [`format_mesh_record.py`](format_mesh_record.py) converts the supplementary MeSH record file into a csv mapped to MeSH descriptor ID, and it maps the MeSH records to PubChem compound IDs, resulting in a second, separate csv.
- [`download.sh`](download.sh) downloads all the files from the NCBI webpage and stores them in the scratch directory.
- [`mesh_run.sh`](mesh_run.sh) runs all the python commands, generating six csv files in total.

#### tMCFs

The tMCF files that map each column in the corresponding CSV file to the appropriate property can be found [here](tmcf). They include:

- [`mesh_concept.tmcf`](tmcf/mesh_concept.tmcf)
- [`mesh_descriptor.tmcf`](tmcf/mesh_descriptor.tmcf)
- [`mesh_qualifier.tmcf`](tmcf/mesh_qualifier.tmcf)
- [`mesh_term.tmcf`](tmcf/mesh_term.tmcf)
- [`mesh_pubchem.tmcf`](tmcf/mesh_pubchem.tmcf)
- [`mesh_record.tmcf`](tmcf/mesh_record.tmcf)

### Schema

Each of the four csv + tMCF pairs generated is an import of the MeSH ontology that maps to one of the following four entities: [MeSHConcept](https://datacommons.org/browser/MeSHConcept), [MeSHDescriptor](https://datacommons.org/browser/MeSHDescriptor), [MeSHQualifier](https://datacommons.org/browser/MeSHQualifier), or [MeSHTerm](https://datacommons.org/browser/MeSHTerm).

## Examples

To generate the formatted csv files from xml:

1. Download the data to `scratch/`.

```
bash download.sh
```

2. Generate cleaned CSV files.

```
bash mesh_run.sh
```
10 changes: 10 additions & 0 deletions scripts/biomedical/mesh/download.sh
@@ -0,0 +1,10 @@
#!/bin/bash

# create the scratch directory and work inside it; stop if either step fails
mkdir -p scratch && cd scratch || exit 1
# downloads the mesh xml file
curl -o mesh.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/desc2022.xml
# downloads the mesh record xml file
curl -o mesh-record.xml https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/supp2022.xml
# downloads the pubchem compound ID and name csv file
curl -o mesh-pubchem.csv https://ftp.ncbi.nlm.nih.gov/pubchem/Compound/Extras/CID-MeSH