A flexible PyTorch template for Natural Language Processing based on BERT.
Currently it only supports NLC (Natural Language Classification), NLI (Natural Language Inference), and other simple classification tasks. NER and Machine Comprehension will be supported in the future.
- Python >= 3.5 (3.6 recommended)
- PyTorch >= 0.4
- tqdm
- tensorboard >= 1.7.0 (optional, for tensorboardX) or tensorboard >= 1.14 (optional, for torch.utils.tensorboard)
- tensorboardX >= 1.2 (optional, for TensorboardX), see [Tensorboard Visualization](#tensorboardx-visualization)
- Write your own processor, like `ATECProcessor`:
```python
class ATECProcessor(BaseBertProcessor):

    def __init__(self, logger, config, data_name, data_path, bert_vocab_file, max_len=50, query_max_len=20,
                 target_max_len=20, do_lower_case=True, test_split=0.0, training=True):
        self.skip_row = 0
        super().__init__(logger, config, data_name, data_path, bert_vocab_file, max_len, query_max_len,
                         target_max_len, do_lower_case, test_split, training)

    def get_labels(self):
        """See base class."""
        return [u'0', u'1']

    def split_line(self, line):
        line = line.strip().split('\t')
        q, t, label = line[1], line[2], line[-1]
        return q, t, label
```
- You also need to implement the `get_labels` and `split_line` methods and set the variable `self.skip_row`; a second, purely illustrative example is sketched below.
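For reference, here is a minimal sketch of a processor for a hypothetical tab-separated file laid out as `query<TAB>target<TAB>label` with one header row. The class name, label set, and column layout are illustrative assumptions, not part of this repo.

```python
class ThreeWayPairProcessor(BaseBertProcessor):
    """Hypothetical processor: TSV rows of query<TAB>target<TAB>label with a header row."""

    def __init__(self, logger, config, data_name, data_path, bert_vocab_file, max_len=50, query_max_len=20,
                 target_max_len=20, do_lower_case=True, test_split=0.0, training=True):
        # Assumed: the raw file starts with a single header row that should be skipped.
        self.skip_row = 1
        super().__init__(logger, config, data_name, data_path, bert_vocab_file, max_len, query_max_len,
                         target_max_len, do_lower_case, test_split, training)

    def get_labels(self):
        # Full label set of the (hypothetical) task, as strings.
        return [u'0', u'1', u'2']

    def split_line(self, line):
        # Map one raw line to (query, target, label).
        q, t, label = line.strip().split('\t')
        return q, t, label
```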
- Move your data into the directory `data/RAW/`.
- Create a new configuration file `config/{DataName}_{ModelName}/config.json`, e.g. `config/ATEC_BERT/config.json`.
  `{DataName}` stands for the name of the dataset; `{ModelName}` stands for the name of the model.
- Adjust the `processor` configuration. The content of `config.json` is as follows (a sketch of how this entry is typically consumed is shown after the snippet):
```json
{
  "n_gpu": 1,
  "seed": 28,
  "processor": {
    "type": "ATECProcessor",                  // the name of the Processor class
    "args": {
      "data_name": "ATEC",                    // the DataName
      "bert_vocab_file": "bert-base-chinese",
      "data_path": "atec_nlp_sim_train.csv",  // the dataset file name
      "test_split": 0.2,
      "max_len": 63,
      "query_max_len": 20,
      "target_max_len": 20,
      "do_lower_case": true
    }
  },
  ...
}
```
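The `type`/`args` pattern usually means the template looks the class up by name and forwards `args` as keyword arguments. Below is a rough sketch of how such a config entry is typically consumed; the function name and module layout are illustrative, not this template's actual API.

```python
import json

def build_from_config(config, section, module, **extra_kwargs):
    """Instantiate config[section]['type'] from `module`, forwarding config[section]['args']."""
    entry = config[section]
    cls = getattr(module, entry['type'])          # e.g. the ATECProcessor class
    return cls(**extra_kwargs, **entry['args'])   # JSON "args" become keyword arguments

# Usage sketch (module name is an assumption):
# import processor as processor_module
# config = json.load(open('config/ATEC_BERT/config.json'))
# proc = build_from_config(config, 'processor', processor_module, logger=logger, config=config)
```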
In the commands below, replace `{DataName}_{ModelName}` with your own run name, e.g. `ATEC_BERT`:
```bash
# train
python train.py --config {DataName}_{ModelName}/config.json
# test
python test.py -r saved/models/{DataName}_{ModelName}/timestamp/~.pth
# evaluate on a separate dataset
python eval.py -r saved/models/{DataName}_{ModelName}/timestamp/~.pth -e $EVAL_DATA_PATH
# predict on a single pair of sentences
python predict.py -r saved/models/{DataName}_{ModelName}/timestamp/~.pth -s1 str1 -s2 str2
# serve the model
python service.py -r saved/models/{DataName}_{ModelName}/timestamp/~.pth
```
Parameter search is more complicated:
```bash
python ParameterSearch.py -sm random
```
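`ParameterSearch.py` itself is not documented here. As a rough idea of what a random search (`-sm random`) over the config could look like, the sketch below samples a few hyperparameters, writes a trial config under `config/ATEC_BERT/`, and launches `train.py`; the file names, sampled ranges, and the assumption that the config file is plain JSON are all illustrative, not the script's actual behaviour.

```python
import copy
import json
import random
import subprocess

def random_search(base_config='config/ATEC_BERT/config.json', n_trials=5):
    base = json.load(open(base_config))          # assumes the config file is plain JSON (no comments)
    for trial in range(n_trials):
        cfg = copy.deepcopy(base)
        # Sample a few hyperparameters; the ranges are only examples.
        cfg['optimizer']['args']['lr'] = 10 ** random.uniform(-5, -4)
        cfg['data_loader']['args']['batch_size'] = random.choice([32, 64, 96])
        rel_path = 'ATEC_BERT/search_trial_{}.json'.format(trial)
        with open('config/' + rel_path, 'w') as f:
            json.dump(cfg, f, indent=2)
        # train.py appears to resolve --config relative to the config/ directory.
        subprocess.run(['python', 'train.py', '--config', rel_path], check=True)
```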
Todo: ......
The code in this repo is an ATEC example of the template. Try `python train.py -c ATEC_BERT/config.json` to run it.

Config files are in `.json` format:
```json
{
  "n_gpu": 1,                                  // number of GPUs to use for training
  "seed": 28,                                  // random seed
  "processor": {
    "type": "ATECProcessor",
    "args": {
      "data_name": "ATEC",
      "bert_vocab_file": "bert-base-chinese",
      "data_path": "atec_nlp_sim_train.csv",   // dataset path
      "test_split": 0.2,                       // size of the test dataset
      "max_len": 63,                           // max length of the BERT input
      "query_max_len": 20,                     // currently unused
      "target_max_len": 20,                    // currently unused
      "do_lower_case": true
    }
  },
  "data_loader": {
    "type": "BertDataLoader",                  // selecting data loader
    "args": {
      "batch_size": 96,
      "shuffle": true,                         // shuffle training data before splitting
      "validation_split": 0.1,                 // size of the validation dataset: float (portion) or int (number of samples)
      "num_workers": 2                         // number of CPU processes used for data loading
    }
  },
  "arch": {
    "type": "BertOrigin",                      // name of the model architecture to train
    "args": {                                  // args of the model architecture
      "pretrained_model_name_or_path": "bert-base-chinese"  // BERT pretrained_model_name_or_path
    }
  },
  "optimizer": {
    "type": "BertAdam",
    "args": {
      "lr": 1e-4,
      "warmup": 0.1,
      "schedule": "warmup_linear"
    }
  },
  "loss": {
    "type": "cross_entropy_loss",
    "args": {
      "weights": [0.223, 1]                    // per-class loss weights
    }
  },
  "metrics": [
    "F1", "acc"
  ],
  "trainer": {
    "epochs": 50,                              // number of training epochs
    "save_dir": "saved/",                      // checkpoints are saved in save_dir/models/name
    "save_period": 1,                          // save checkpoints every save_period epochs
    "verbosity": 2,                            // 0: quiet, 1: per epoch, 2: full
    "monitor": "max val_F1",                   // mode and metric for model performance monitoring; set 'off' to disable
    "early_stop": 5,                           // number of epochs to wait before early stopping; set 0 to disable
    "gradient_accumulation_steps": 1,
    "tensorboardX": true                       // enable tensorboardX visualization
  }
}
```
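The `loss` entry with `"weights": [0.223, 1]` points at per-class weighting, e.g. to counter class imbalance. In plain PyTorch that roughly corresponds to the sketch below; it illustrates the idea and is not necessarily the template's exact `cross_entropy_loss` implementation.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(logits, target, weights=None, device='cpu'):
    """Weighted cross entropy: `weights` gives one scaling factor per class."""
    weight = torch.tensor(weights, device=device) if weights is not None else None
    return F.cross_entropy(logits, target, weight=weight)

# Usage sketch with the weights from the example config:
# logits: (batch, 2) model outputs, target: (batch,) gold labels in {0, 1}
logits = torch.randn(4, 2)
target = torch.tensor([0, 1, 1, 0])
loss = cross_entropy_loss(logits, target, weights=[0.223, 1.0])
```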
Modify the configurations in the `.json` config files, then run:
```bash
python train.py --config ATEC_BERT/config.json
```
You can resume from a previously saved checkpoint by:
```bash
python train.py --resume path/to/checkpoint
```
You can enable multi-GPU training by setting the `n_gpu` argument in the config file to a larger number.
If you configure fewer GPUs than are available, the first n devices will be used by default.
Specify the indices of the GPUs to use via the CUDA environment variable:
```bash
python train.py --device 2,3 -c ATEC_BERT/config.json
```
This is equivalent to:
```bash
CUDA_VISIBLE_DEVICES=2,3 python train.py -c ATEC_BERT/config.json
```
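Internally, `n_gpu` is typically handled by selecting the first n visible devices and wrapping the model in `torch.nn.DataParallel`. The sketch below shows that common pattern; it is not necessarily identical to this template's trainer code.

```python
import torch

def prepare_device(n_gpu_use):
    """Pick the first `n_gpu_use` visible GPUs (or fall back to CPU) and list their ids."""
    n_gpu = torch.cuda.device_count()
    n_gpu_use = min(n_gpu_use, n_gpu)            # clamp to what is actually available
    device = torch.device('cuda:0' if n_gpu_use > 0 else 'cpu')
    return device, list(range(n_gpu_use))

# Usage sketch:
# device, device_ids = prepare_device(config['n_gpu'])
# model = model.to(device)
# if len(device_ids) > 1:
#     model = torch.nn.DataParallel(model, device_ids=device_ids)
```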
TODO:
- Rename trainer to agent
- Finish the documentation
- Finish the parameter search function
- Monitor and control the agent via WeChat
- Implement NER
- Implement Machine Comprehension
This project is inspired by the [pytorch-template](https://github.com/victoresque/pytorch-template) project by victoresque.