- Edited by Wei-Hung Weng (MIT CSAIL)
- Please contact the author if you find any errors.
- ckbjimmy {AT} mit {DOT} edu
This repository contains the code for the method presented in the paper "Clinical Text Summarization with Syntax-Based Negation and Semantic Concept Identification".
We combined computational linguistics with a knowledge base curated by human experts to identify clinical concepts and their corresponding negation information in clinical narrative texts.
We used the medical knowledge base UMLS along with its Semantic Network, and took advantage of the hierarchical structure of language, as captured by the constituency tree, to identify clinically relevant concepts and their negation information, which is extremely important for summarization.
In this project, we used Stanford CoreNLP, the Apache clinical Text Analysis and Knowledge Extraction System (cTAKES), and the Unified Medical Language System (UMLS) with its Semantic Network to identify clinical concepts in narrative texts. We also performed negation detection on clinical sentences through sentence pruning, syntactic analysis, and parsing using Apache OpenNLP and Stanford Tregex/Tsurgeon. We then constructed itemized lists of clinically important concepts from the output of concept identification and negation detection.
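The final step described above, turning identified concepts and their negation flags into an itemized list, can be illustrated with a minimal sketch. This is not the repository's implementation: the `Concept` class, its fields, and the `itemize` and `render` helpers are hypothetical names introduced here, and the sketch assumes each concept already carries a semantic type and a negation flag.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Concept:
    """Hypothetical record for one identified concept (not the repository's data model)."""
    text: str            # surface form found in the note
    semantic_type: str   # UMLS semantic type name, e.g. "Sign or Symptom"
    negated: bool        # True if negation detection marked the concept as negated


def itemize(concepts):
    """Group concepts by semantic type, prefixing negated ones with 'no'."""
    grouped = defaultdict(list)
    for concept in concepts:
        label = f"no {concept.text}" if concept.negated else concept.text
        # Skip duplicates so each item appears once per semantic type.
        if label not in grouped[concept.semantic_type]:
            grouped[concept.semantic_type].append(label)
    return dict(grouped)


def render(grouped):
    """Format the grouped concepts as a plain-text itemized list."""
    lines = []
    for semantic_type, items in grouped.items():
        lines.append(f"{semantic_type}:")
        lines.extend(f"  - {item}" for item in items)
    return "\n".join(lines)
```

Calling `render(itemize(concepts))` on a list of `Concept` objects yields one heading per semantic type followed by its items. The "no" prefix is only one possible way to display negation.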
We provide setup.sh to download, install, and configure most of the dependencies.
The process takes about 5 minutes.
However, you still need to request access to UMLS yourself so that you can run semantic concept identification.
- The complete implementation with all dependencies configured is also available at https://www.dropbox.com/s/zw0uod64nt5wsf0/clneg.zip?dl=0 (without UMLS account and password).
- Request UMLS access (this step takes a few days while NIH/NLM reviews your application)
- Run sh setup.sh the first time
- Add your UMLS account/password to ./src/ctakes/bin/pipeline.sh after -Dctakes.umlsuser= and -Dctakes.umlspw=
- Run sh run_corenlp.sh and sh run_opennlp.sh to initialize the Stanford CoreNLP server and the Apache OpenNLP server (make sure they are running in the background at ports 9000 and 8080, respectively)
- Open another terminal and run python main.py [file_path], where [file_path] is one of:
  - ../data/dev.txt for the development set
  - ../data/test_ready.txt for the evaluation set
  - ../data/1.txt for a testing note (as well as 2.txt, 3.txt) (The notes are no longer available here. Please request the DUA of the MIMIC-III database to use the notes)
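Before running main.py, it can help to confirm that both servers are reachable on the ports listed above. The following is a minimal sketch, not part of the repository: `port_is_open`, `missing_servers`, and `constituency_parses` are hypothetical helper names. The last function shows one way to request a constituency parse from a running Stanford CoreNLP server over its HTTP interface. Whether main.py queries the server in the same way is an assumption.

```python
import json
import socket
import urllib.parse
import urllib.request

# Ports taken from the steps above; the host is assumed to be the local machine.
REQUIRED_SERVERS = {"Stanford CoreNLP": 9000, "Apache OpenNLP": 8080}


def port_is_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def missing_servers(host="localhost", servers=REQUIRED_SERVERS):
    """Return the names of servers that are not accepting connections."""
    return [name for name, port in servers.items() if not port_is_open(host, port)]


def constituency_parses(text, host="localhost", port=9000, timeout=30):
    """Ask a running CoreNLP server for one bracketed parse tree per sentence."""
    properties = {"annotators": "tokenize,ssplit,pos,parse", "outputFormat": "json"}
    query = urllib.parse.urlencode({"properties": json.dumps(properties)})
    request = urllib.request.Request(
        f"http://{host}:{port}/?{query}", data=text.encode("utf-8")
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        result = json.load(response)
    return [sentence["parse"] for sentence in result["sentences"]]
```

If `missing_servers()` returns an empty list, both servers are accepting connections. Otherwise the returned names indicate which of run_corenlp.sh or run_opennlp.sh still needs to be started.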
- We modularized some mutable components into files
  - To design and add more rules for Tregex/Tsurgeon, please edit tree_rules.py
  - To add more negation terms, please edit ./data/neg_list_complete.txt, which is a modified version of the original open-source multilingual_lexicon-en-de-fr-sv.csv with our annotations
  - To change the clinical semantic types for filtering, please edit concept_extraction.py
- The baseline can be obtained by running negex.py
- Please check the terminal screen for sentence parsing, and check ./data/final_output
- For the development process, please see the Jupyter notebook nlp_dev.ipynb under the src folder
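To give a sense of how a negation term list can be used, here is a simplified trigger-term sketch in the spirit of a NegEx-style baseline. It is not the repository's syntax-based method and not the code in negex.py. The one-term-per-line file format, the function names, and the five-token window are all assumptions made for illustration.

```python
import re

TOKEN_PATTERN = re.compile(r"[a-z0-9']+")


def load_negation_terms(lines):
    """Build a set of lowercase terms from an iterable of lines.

    Assumes one term per line, which may not match the real file format.
    """
    return {line.strip().lower() for line in lines if line.strip()}


def is_negated(sentence, concept, terms, window=5):
    """Return True if a negation term occurs within `window` tokens before the concept."""
    tokens = TOKEN_PATTERN.findall(sentence.lower())
    concept_tokens = TOKEN_PATTERN.findall(concept.lower())
    size = len(concept_tokens)
    if size == 0:
        return False
    for start in range(len(tokens) - size + 1):
        if tokens[start:start + size] != concept_tokens:
            continue
        # Pad with spaces so that terms only match on whole-token boundaries.
        preceding = " " + " ".join(tokens[max(0, start - window):start]) + " "
        if any(f" {term} " in preceding for term in terms):
            return True
    return False
```

A fixed window like this ignores sentence structure, so it can mark a concept as negated when the trigger actually belongs to a different clause. That limitation is the kind of case the tree-based rules in tree_rules.py are meant to address.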
If you use this code, please cite the paper for this GitHub project (see below for BibTeX):
@article{weng2020clinical,
title={Clinical Text Summarization with Syntax-Based Negation and Semantic Concept Identification},
author={Weng, Wei-Hung and Chung, Yu-An and Tong, Schrasing},
journal={arXiv preprint arXiv:2003.00353},
year={2020}
}