1.1 Machine learning systems work in two steps:
A. Train the model on a labeled dataset.
B. Make inferences with the trained model.
Thus, whatever you want to do with the model, you must train it first. Afterwards, you can make predictions on the test set, or on any text you want to label.
1.2 Why BiLSTM+CRF?
For this question, please refer to the following materials:
- Neural Architectures for Named Entity Recognition
- CRF Layer on the Top of BiLSTM - 1
- CRF Layer on the Top of BiLSTM - 2
- CRF Layer on the Top of BiLSTM - 3
- CRF Layer on the Top of BiLSTM - 4
- CRF Layer on the Top of BiLSTM - 5
- CRF Layer on the Top of BiLSTM - 6
- How to understand putting a CRF on top of an LSTM? (如何理解LSTM后接CRF?)
The items in the configuration file are described below. Do not change the item key names at will. Use # to comment out a configuration item.
mode=api_service
# string: train/test/interactive_predict/api_service
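For example (illustrative only), a typical workflow is to edit this single item between runs:
mode=train
# first run: fit the model on train_file/dev_file and save checkpoints
mode=interactive_predict
# later run: load the saved checkpoint and predict on text you type in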
datasets_fold=data/example_datasets3
train_file=train.csv
dev_file=dev.csv
test_file=test.csv
delimiter=t
# string: (t: "\t", tab) | (b: " ", blank space) | (other, e.g., '|||', ...)
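As a rough sketch of what the delimiter means for the data files (this layout is an assumption; check the files under data/example_datasets3 for the authoritative format), a tab-delimited (delimiter=t) training file would hold one token and its label per line, e.g.:

```
Obama	B_PER
visited	O
Paris	B_LOC
```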
use_pretrained_embedding=False
token_emb_dir=data/example_datasets3/word.emb
Be aware of the following three paths.
vocabs_dir=data/example_datasets3/vocabs
log_dir=data/example_datasets3/logs
checkpoints_dir=checkpoints/BILSTM-CRFs-datasets3
Be very careful with the following settings.
label_scheme=BIO
# string: BIO/BIESO
The system supports at most two levels in the label scheme. You need to modify the source code to adapt it to more complicated labeling schemes.
label_level=2
# int, 1:BIO/BIESO; 2:BIO/BIESO + suffix
# max to 2
hyphen=_
# string: -|_, for connecting the prefix and suffix, e.g., `B_PER`, `I_LOC`
The suffixes used for the second-level labels.
suffix=[NR,NS,NT]
# unnecessary if label_level=1
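To make the two-level scheme concrete, here is a small illustrative Python snippet (not part of the project) composing the tag set implied by the settings above:

```python
# Illustrative only: composing the two-level tag set from
# label_scheme=BIO, hyphen=_, and suffix=[NR,NS,NT].
prefixes = ["B", "I"]          # from label_scheme=BIO
suffixes = ["NR", "NS", "NT"]  # from suffix=[NR,NS,NT]
hyphen = "_"                   # from hyphen=_
tags = ["O"] + [p + hyphen + s for p in prefixes for s in suffixes]
print(tags)  # ['O', 'B_NR', 'B_NS', 'B_NT', 'I_NR', 'I_NS', 'I_NT']
```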
labeling_level:
- for English: word (e.g., hello) or char (e.g., h)
- for Chinese: word (e.g., 你好) or char (e.g., 你)
labeling_level=word
# string: word/char
To measure the performance of the model, you have to specify the metrics. The following are the most commonly used indicators. Note that f1 is compulsory. You can define any other metrics in the code.
measuring_metrics=[precision,recall,f1,accuracy]
# string: accuracy|precision|recall|f1
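As a reminder of how these indicators relate, here is a minimal token-level sketch (the project's own implementation may compute them differently, e.g., at the entity level):

```python
# Illustrative only: precision, recall and f1 from true/false positive and
# false negative counts. Not the project's actual metric code.
def prf1(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

print(prf1(tp=90, fp=10, fn=20))  # (0.9, 0.818..., 0.857...)
```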
use_crf=True
cell_type=LSTM
# LSTM, GRU
biderectional=True
encoder_layers=1
The embedding_dim must be consistent with the dimension of the vectors in the token_emb_dir file.
embedding_dim=100
hidden_dim=100
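For orientation, the following is a minimal Keras-style sketch of the encoder these settings describe. It is not the repository's actual code, the vocabulary and tag sizes are hypothetical, and the CRF decoding layer enabled by use_crf=True is omitted:

```python
# A minimal sketch of the encoder implied by embedding_dim=100, hidden_dim=100,
# biderectional=True and encoder_layers=1. Illustration only.
import tensorflow as tf

vocab_size, num_tags = 5000, 7  # hypothetical sizes, for illustration only
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 100, mask_zero=True),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(100, return_sequences=True)),
    tf.keras.layers.Dense(num_tags),  # per-token emission scores fed to the CRF
])
model.build(input_shape=(None, 300))  # 300 = max_sequence_length (see below)
model.summary()
```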
Caution! Set as large a number as you can. The max_sequence_length is fixed after training; during inference, texts longer than this will be truncated.
max_sequence_length=300
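The truncation behaviour can be illustrated as follows (a sketch only; whether the tool cuts the front or the back of a long text is an assumption here):

```python
# Illustrative only: sequences are padded/truncated to max_sequence_length=300.
from tensorflow.keras.preprocessing.sequence import pad_sequences

token_ids = [[12, 7, 345, 9] * 200]  # one hypothetical 800-token sentence
padded = pad_sequences(token_ids, maxlen=300, padding="post", truncating="post")
print(padded.shape)  # (1, 300)
```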
We implement self-attention (Transformer style).
use_self_attention=False
attention_dim=500
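The idea behind these two items can be sketched as plain scaled dot-product self-attention over the encoder outputs; this is an illustration of the mechanism only, not the repository's implementation:

```python
# Illustrative numpy sketch of Transformer-style scaled dot-product
# self-attention; attention_dim names the projection size.
import numpy as np

def self_attention(h, d_attn=500, seed=42):
    rng = np.random.default_rng(seed)
    d_in = h.shape[-1]
    w_q, w_k, w_v = (rng.normal(size=(d_in, d_attn)) for _ in range(3))
    q, k, v = h @ w_q, h @ w_k, h @ w_v
    scores = q @ k.T / np.sqrt(d_attn)                  # (seq_len, seq_len)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over keys
    return weights @ v                                  # (seq_len, d_attn)

h = np.random.rand(300, 200)    # e.g., BiLSTM outputs: 2 * hidden_dim = 200
print(self_attention(h).shape)  # (300, 500)
```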
To use the GPU(s), set CUDA_VISIBLE_DEVICES=0,1,...
CUDA_VISIBLE_DEVICES=0
For reproducibility.
seed=42
epoch=300
batch_size=100
dropout=0.5
learning_rate=0.005
optimizer=Adam
#string: GD/Adagrad/AdaDelta/RMSprop/Adam
checkpoints_max_to_keep=3
print_per_batch=20
early_stop: if the model does not improve within `patient` iterations of training, the training process will be terminated.
is_early_stop=True
patient=5
# unnecessary if is_early_stop=False
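The early-stopping rule can be sketched as follows (illustrative only, not the project's code):

```python
# Stop when the dev score has not improved for `patient` consecutive evaluations.
def train_with_early_stopping(evaluate_dev, max_epoch=300, patient=5):
    best_score, waited = float("-inf"), 0
    for epoch in range(max_epoch):
        score = evaluate_dev(epoch)   # e.g., dev-set f1 after this epoch
        if score > best_score:
            best_score, waited = score, 0
        else:
            waited += 1
            if waited >= patient:
                print(f"early stop at epoch {epoch}")
                break
    return best_score

# toy usage: the score plateaus, so training stops long before epoch 300
print(train_with_early_stopping(lambda e: min(e, 10) * 0.1))
```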
checkpoint_name=model-CRFs
output_test_file=test.out
is_output_sentence_entity=True
output_sentence_entity_file=test.entity.out
# unnecessary if is_output_sentence_entity=False
It is unnecessary to change the default setting if you operate on the local host. If you expose the web page within the intranet, you may change the ip to 0.0.0.0.
ip=127.0.0.1
port=8000
- Once the settings of all parameters are fixed during training, they are not allowed to change during inference, e.g., test, interactive_predict, api_service.
- The training time per iteration of models with the attention module is much longer than that of models without it.
- attention_dim should be an even number.
- In the tools folder: statis.py can calculate statistics for your dataset; calcu_measure_testout.py can compute the metrics based on test.out and test.csv.
- ...