A rule-based NLP system to extract quantitative smoking information (e.g., pack-year) from clinical notes
This package extracts the following quantitative smoking information from clinical notes:
- pack per day
- smoking year
- quit year (e.g., quit for 10 years)
- year at quit (e.g., quit in 2008)
- pack-year
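
The quantities above are related by simple arithmetic. The following is a minimal sketch that assumes the standard clinical definition of a pack-year (packs per day multiplied by years smoked). It is not the tool's extraction logic, and the function names `pack_years` and `years_since_quit` are hypothetical helpers introduced only for illustration.

```python
def pack_years(packs_per_day: float, years_smoked: float) -> float:
    """Standard definition: packs smoked per day times years smoked."""
    if packs_per_day < 0 or years_smoked < 0:
        raise ValueError("packs_per_day and years_smoked must be non-negative")
    return packs_per_day * years_smoked


def years_since_quit(note_year: int, quit_year: int) -> int:
    """Convert a 'year at quit' into a 'quit year' duration relative to the note date."""
    if quit_year > note_year:
        raise ValueError("quit_year cannot be later than note_year")
    return note_year - quit_year


# 1.5 packs per day for 20 years
print(pack_years(1.5, 20))           # 30.0
# a note written in 2018 for a patient who quit in 2008
print(years_since_quit(2018, 2008))  # 10
```

A helper like this can be useful for cross-checking extracted values, for example comparing an extracted pack-year against the product of the extracted packs per day and smoking years.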
- Java 1.8
The input is a UTF-8 encoded CSV file. The table must have no header row (data rows only) and 5 columns, in this order:
- note ID
- patient ID
- note Date
- note Type
- note text (columns 1-4 may contain dummy values)
The output is a UTF-8 encoded TSV file with the following columns:
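
As an illustration of the input layout, the sketch below writes a header-less, 5-column, UTF-8 CSV using Python's standard `csv` module. The file name `my_notes.csv`, the example rows, and the date format are assumptions made for this example. The sketch also assumes the tool accepts standard CSV quoting for note text that contains commas or quotes, which the description above does not state.

```python
import csv

# Hypothetical example rows: note ID, patient ID, note date, note type, note text.
rows = [
    ("N001", "P001", "2020-01-15", "progress note",
     "Patient smoked 1.5 packs per day for 20 years, quit in 2008."),
    ("N002", "P002", "2020-02-03", "discharge summary",
     'Reports a "30 pack-year" history; currently smokes 1 ppd.'),
]


def write_input_csv(path, rows):
    """Write rows as a UTF-8 CSV with no header and exactly 5 columns per row."""
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        for row in rows:
            if len(row) != 5:
                raise ValueError(f"expected 5 columns, got {len(row)}")
            writer.writerow(row)


write_input_csv("my_notes.csv", rows)
```

If only the note text matters for a given run, columns 1-4 can be filled with placeholder values, as noted above.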
- note ID
- patient ID
- note Date
- note Type
- extracted data type
- extracted data value
- a snippet showing where the extracted value is located in the text (50 characters before and after the value) (columns 1-4 are copied from the input data)
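
The 7-column output can be loaded for downstream analysis along these lines. This is a sketch under several assumptions: the TSV has no header row and no quoting, the field names in `Extraction` are hypothetical labels for the columns listed above, and the sample line (including the `PACK_YEAR` type label) is made up for illustration rather than taken from real tool output.

```python
import csv
import io
from typing import List, NamedTuple


class Extraction(NamedTuple):
    # Hypothetical field names for the 7 output columns described above.
    note_id: str
    patient_id: str
    note_date: str
    note_type: str
    data_type: str
    data_value: str
    snippet: str


def read_output_tsv(f) -> List[Extraction]:
    """Parse tab-separated rows; rows without exactly 7 fields are skipped."""
    reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [Extraction(*row) for row in reader if len(row) == 7]


# Made-up sample line standing in for a real output file.
sample = "N001\tP001\t2020-01-15\tprogress note\tPACK_YEAR\t30\thas a 30 pack-year smoking history\n"
records = read_output_tsv(io.StringIO(sample))
for r in records:
    print(r.note_id, r.data_type, r.data_value)
```

For a real run, replace the in-memory sample with `open(path, encoding="utf-8", newline="")` pointing at the generated TSV.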
- change to the project directory
- run java -jar RuleBaseSmokingInfoExtraction.jar, or use run.sh (modify the arguments as necessary)
- we provide sample.csv for testing; see run.sh
- this is a rule-based system
- we keep updating the rules to cover special cases
- we have released RuleBaseSmokingInfoExtraction.jar
- we will release the source code
Yang X, Yang H, Lyu T, Yang S, Guo Y, Bian J, Xu H, Wu Y.
A Natural Language Processing Tool to Extract Quantitative Smoking Status from Clinical Narratives.
2020 IEEE International Conference on Healthcare Informatics (ICHI), 2020, pp. 1-2.
doi: 10.1109/ICHI48887.2020.9374369.
PMID: 33173920; PMCID: PMC7654916.