biodatageeks/PhenoRerank

 
 

PhenoRerank

PhenoRerank contains the source code and pre-processed datasets for benchmarking phenotype annotators and re-ranking their results. To facilitate benchmarking, we provide wrapper code for several existing annotators: OBO, NCBO, Monarch Initiative, Doc2hpo, MetaMap, Clinphen, NeuralCR, and TrackHealth. We also developed a re-ranking model that improves annotator performance, precision in particular, by filtering out false positives based on contextual information. The model is pre-trained on a pretext task defined over the textual data in the Human Phenotype Ontology (i.e. term names, synonyms, definitions, and comments) and can be fine-tuned on a specific dataset for further improvement.

Getting Started

The following instructions will help you set up the programs and datasets and reproduce the benchmarking results.

Prerequisites

First, install a Python interpreter (tested with 3.6.10) and the following packages:

  • numpy (tested 1.18.5)
  • scipy (tested 1.8.0)
  • pandas (tested 1.0.5)
  • ftfy (tested 5.7)
  • apiclient (tested 1.0.4)
  • pyyaml (tested 6.0.1)
  • pymetamap [optional] (tested 0.1)
  • clinphen [optional] (tested 1.28)
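
To compare your environment against the tested versions listed above, a small check like the following may help. It is only a sketch and not part of PhenoRerank: the helper name `installed_versions` is hypothetical, and it assumes the distribution names on your system match the package names in the list.

```python
# Hypothetical helper, not shipped with PhenoRerank.
try:
    # Available on Python 3.8+
    from importlib.metadata import version, PackageNotFoundError
except ImportError:
    # Fallback for older interpreters such as 3.6 (requires setuptools)
    from pkg_resources import get_distribution
    from pkg_resources import DistributionNotFound as PackageNotFoundError

    def version(name):
        return get_distribution(name).version

# Required packages and the versions the README reports as tested.
TESTED = {
    "numpy": "1.18.5",
    "scipy": "1.8.0",
    "pandas": "1.0.5",
    "ftfy": "5.7",
    "apiclient": "1.0.4",
    "pyyaml": "6.0.1",
}


def installed_versions(packages):
    """Map each package name to its installed version, or None if missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


if __name__ == "__main__":
    for name, installed in installed_versions(TESTED).items():
        print(f"{name}: installed={installed}, tested={TESTED[name]}")
```

A value of `None` means the package was not found in the current environment.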

Download the external programs

  • Run the script install.sh to download and configure the external programs used for benchmarking.
  • Follow the instructions here to install MetaMap, and make sure that the locations of the programs skrmedpostctl and wsdserverctl are added to $PATH.
  • Follow the instructions here to install the dependencies of NeuralCR and download the model parameters. Then copy model_params, or create a soft link to it, in the folder where you will run the benchmark.

Obtain the API keys for some online tools

Follow the guidelines to get the API keys for NCBO and TrackHealth. Then assign each key to the API_KEY global variable in the corresponding wrapper, util/ncbo.py or util/trkhealth.py.

Locate the Pre-Generated Dataset

After cloning the repository and configuring the programs, you can download the pre-generated datasets and pre-trained model here.

Filename              Description
biolarkgsc.csv        Pre-processed BiolarkGSC+ dataset with document-level annotations
biolarkgsc_locs.csv   Pre-processed BiolarkGSC+ dataset with mention-level annotations
copd.csv              Pre-processed COPD-HPO dataset with document-level annotations
copd_locs.csv         Pre-processed COPD-HPO dataset with mention-level annotations

You can load a dataset into a Pandas DataFrame using the following code snippet.

import pandas as pd
# The files are tab-separated; read the 'id' column as strings rather than numbers.
data = pd.read_csv('XXX.csv', sep='\t', dtype={'id': str}, encoding='utf-8')
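
The `dtype={'id': str}` argument matters when identifiers look numeric. The sketch below illustrates this on a made-up, in-memory sample; the `text` column and the rows are assumptions for illustration and do not describe the actual dataset schema.

```python
import io

import pandas as pd

# Made-up two-row, tab-separated sample (not real dataset content).
sample = "id\ttext\n0012\tshort stature\n0345\tseizures\n"

# With dtype={'id': str}, identifiers are kept exactly as written.
as_str = pd.read_csv(io.StringIO(sample), sep="\t", dtype={"id": str})

# Without it, pandas infers integers and the leading zeros are lost.
as_default = pd.read_csv(io.StringIO(sample), sep="\t")

print(as_str["id"].tolist())      # ['0012', '0345']
print(as_default["id"].tolist())  # [12, 345]
```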

A Simple Example

You can benchmark the ncbo annotator on the biolarkgsc dataset using the following command:

python benchmark.py ncbo biolarkgsc -i ./data

This command searches for the dataset file biolarkgsc.csv in ./data and writes the output of the ncbo annotator to the file biolarkgsc_ncbo_preds.csv.

Re-rank the result

Please download the pre-trained model hpo_bert_onto.pth, or copy your own model, to the working folder in advance. Also place the pre-processed HPO dictionary file hpo_labels.csv and your prediction file in the folder where you run the following command.

python rerank.py --model bert_onto -u biolarkgsc --onto hpo_labels.csv --resume hpo_bert_onto.pth
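
As described in the introduction, the re-ranking model filters out false positives among an annotator's predictions. The sketch below illustrates only that filtering idea on made-up data. It is not the logic of rerank.py: the `Candidate` type, the `filter_candidates` helper, the scores, and the 0.5 threshold are all hypothetical.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """One predicted annotation (hypothetical structure for illustration)."""
    doc_id: str
    hpo_id: str
    score: float  # assumed relevance score produced by some re-ranker


def filter_candidates(candidates: List[Candidate], threshold: float = 0.5) -> List[Candidate]:
    """Keep candidates whose score reaches the threshold, highest score first."""
    kept = [c for c in candidates if c.score >= threshold]
    return sorted(kept, key=lambda c: c.score, reverse=True)


# Made-up annotator output with made-up scores.
candidates = [
    Candidate("doc1", "HP:0001250", 0.91),
    Candidate("doc1", "HP:0004322", 0.12),
    Candidate("doc2", "HP:0000252", 0.77),
]

kept = filter_candidates(candidates)
print([c.hpo_id for c in kept])  # ['HP:0001250', 'HP:0000252']
```

Dropping low-scoring candidates removes predictions, so in a scheme like this precision can rise while recall cannot.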

Evaluation

Once the prediction files are ready, rename them appropriately. You can then evaluate and compare the results using the following command.

python eval.py biolarkgsc method1.csv method2.csv method3.csv
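
For readers who want to see what a document-level comparison can look like, here is a minimal sketch of micro-averaged precision, recall, and F1 over sets of HPO IDs. It is an assumption about a typical metric and not the implementation in eval.py; the `micro_prf` function and the example data are hypothetical.

```python
def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F1 over per-document ID sets.

    gold, pred: dicts mapping a document id to a set of HPO IDs.
    """
    tp = fp = fn = 0
    for doc_id in set(gold) | set(pred):
        g = gold.get(doc_id, set())
        p = pred.get(doc_id, set())
        tp += len(g & p)   # predicted and annotated
        fp += len(p - g)   # predicted but not annotated
        fn += len(g - p)   # annotated but not predicted
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up gold annotations and predictions for two documents.
gold = {"doc1": {"HP:0001250", "HP:0001263"}, "doc2": {"HP:0000252"}}
pred = {"doc1": {"HP:0001250", "HP:0004322"}, "doc2": {"HP:0000252"}}

precision, recall, f1 = micro_prf(gold, pred)
print(round(precision, 3), round(recall, 3), round(f1, 3))  # 0.667 0.667 0.667
```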

Fine-tuning

For the best performance, you can fine-tune the re-ranking model on your own dataset, provided the dataset has sentence- or mention-level annotations. Use the following command to first convert the dataset into the appropriate format for training the re-ranking model.

python rerank.py -m train --noeval --model bert_onto --pretrained true -u biolarkgsc -f csv --onto hpo_labels.csv --pooler none --pdrop 0.1 --do_norm --norm_type batch --initln --earlystop --lr 0.0002 --maxlen 384 -j 10 -z 8 -g 0

Dataset Re-Generation

You can re-generate the datasets from the annotations of BiolarkGSC+ and COPD using the following commands:

python gendata.py -u biolarkgsc
python gendata.py -u copd
