This repository holds the code needed to perform Information Extraction using BERT models. To do this, pretrained Transformer models are loaded and then fine-tuned using the Python HuggingFace library. Documentation can be found in the HuggingFace Transformers Docs. The following instructions outline the process of setting up an environment, installing dependencies, and then training and testing a BERT model for the Named Entity Recognition (NER) task.
First, create a conda environment (preferably Python 3.9), activate it, and install the needed dependencies into it. This can be done with the following commands:

conda create -n bert_nlp python=3.9
conda activate bert_nlp
pip install -r requirements.txt
Here `bert_nlp` is the conda environment name, and the `requirements.txt` file can be found in this folder.
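As a quick, optional sanity check (not part of the original instructions), you can confirm that the environment is active and the HuggingFace library is importable:

```
python -c "import transformers; print(transformers.__version__)"
```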
Next, you will need to set up a configuration file that tells the training and evaluation processes where data and models are to be stored and loaded. An example configuration file entitled `biobert_barilla_properties.json` can be found in the properties folder. Make sure that the structure and keys of your JSON dictionary match this example. Some important values in this file include:
- `pretrain_path`: Path to the folder containing the pretrained model. For this challenge, it is called `biobert-base-cased-v1.2` and is located in the models folder.
- `out_folder`: Folder that your fine-tuned model will be saved to.
- `data_dir`: Folder that contains the training and test data. For this challenge, the data are in the `data/Barilla_NER` folder.

The rest of the parameters can be left the same for the first assignment.
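For illustration only, a minimal configuration might look like the sketch below. The three path keys are described above; the hyperparameter keys are assumptions based on the values mentioned later in these instructions (`learning_rate`, `weight_decay`, `batch_size`, `num_epochs`), so defer to the provided `biobert_barilla_properties.json` for the authoritative structure.

```json
{
  "pretrain_path": "models/biobert-base-cased-v1.2",
  "out_folder": "models/biobert_ner_finetuned",
  "data_dir": "data/Barilla_NER",
  "learning_rate": 5e-5,
  "weight_decay": 0.01,
  "batch_size": 16,
  "num_epochs": 3
}
```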
Once the necessary files have been located and the configuration file is complete, training and testing can take place. The model can be fine-tuned for the NER task by navigating to the `bert_ner` folder and running the following command:
python train_ner_model.py -config ../../path/to/biobert_ner_properties.json
This script will load the data, perform preprocessing and tokenization, train the model, and then save it to the `out_folder` specified in the configuration file. Once a model has been trained and saved, evaluation can be run by navigating to the `bert_ner` folder and using the following command:

python evaluate_ner_model.py -config ../../path/to/biobert_ner_properties.json
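To make the training step concrete, here is a minimal sketch of the kind of token-classification fine-tuning that `train_ner_model.py` performs with the HuggingFace Trainer API. This is not the repository's actual code: the tag set, paths, toy example, and hyperparameter values are illustrative assumptions; the real script reads them from the JSON configuration and the TSV data files.

```python
# Minimal sketch of Trainer-based NER fine-tuning (illustrative only).
from transformers import (AutoTokenizer, AutoModelForTokenClassification,
                          Trainer, TrainingArguments)

labels = ["O", "B-ENT", "I-ENT"]                 # hypothetical NER tag set
model_path = "models/biobert-base-cased-v1.2"    # the config's pretrain_path

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForTokenClassification.from_pretrained(
    model_path, num_labels=len(labels))

# One toy example standing in for the tokenized train.tsv data. Real
# preprocessing would align tags to wordpieces and use -100 to mask
# special/padding positions out of the loss.
encoding = tokenizer("Barilla produces pasta",
                     truncation=True, padding="max_length", max_length=16)
encoding["labels"] = [0] * 16                    # one label id per position
train_dataset = [encoding]                       # map-style dataset

args = TrainingArguments(
    output_dir="models/biobert_ner_finetuned",   # the config's out_folder
    learning_rate=5e-5,
    weight_decay=0.01,
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
trainer.save_model(args.output_dir)              # weights + config saved here
```

Evaluation follows the same pattern in reverse: the saved model is reloaded from `out_folder` and scored against the test set.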
The purpose of this lab is to familiarize you with training and evaluating BERT models using the HuggingFace library. For this, you will need to complete a few main steps:
- Set up a conda environment with Python 3.9.
- Install the dependencies needed for this code.
- Set up a configuration file with the attributes needed for training and evaluation.
- Run the training process and produce a saved model.
- Evaluate the model against the test set.
- If time permits, try testing different hyperparameter values such as `learning_rate`, `weight_decay`, `batch_size`, and `num_epochs`. These values can be changed in the configuration file.
- Create a folder `n2c2` in your scratch directory. Inside that, make a folder `pretrain`.
- Put the pretrained model `biobert-base-cased-v1.2` in the `pretrain` folder you just made.
- In your home directory, make a folder `data` (unless you have one already). Inside `data`, make another folder `ner_data_formatted` and put the `train.tsv` and `test.tsv` files in it.
- The trained models will be output to `$SCRATCH/n2c2/Models` with the name specified in the configuration file.
- The TensorBoard logging data will be output to `$SCRATCH/n2c2/ray_results/` with the name specified in the hyperparameter search function in `train_ner_model.py`. You can use `tensorboard --logdir=<log folder>` to view the TensorBoard dashboard.
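For convenience, the directory setup above can be scripted. This is a sketch assuming the `$SCRATCH` environment variable is set on your cluster; the `/path/to/...` sources are placeholders for wherever the model and data files are provided:

```bash
# Scratch-side folders for the pretrained model
mkdir -p $SCRATCH/n2c2/pretrain
# Copy the pretrained model into place (source path is illustrative)
cp -r /path/to/biobert-base-cased-v1.2 $SCRATCH/n2c2/pretrain/

# Home-side folders for the formatted NER data
mkdir -p ~/data/ner_data_formatted
cp /path/to/train.tsv /path/to/test.tsv ~/data/ner_data_formatted/
```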
Good luck!