Companion code for our paper "Automatic punctuation restoration with BERT models", submitted to the XVII. Conference on Hungarian Computational Linguistics.
We present an approach for automatic punctuation restoration with BERT models for English and Hungarian. For English, we conduct our experiments on the TED Talks (IWSLT) dataset, a commonly used benchmark for punctuation restoration, while for Hungarian we evaluate our models on the Szeged Treebank. Our best models achieve a macro-averaged F1-score of 79.8 for English and 82.2 for Hungarian.
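At a high level, the task can be framed as token classification: the model reads unpunctuated text and predicts, for each token, the punctuation mark (if any) that should follow it. The sketch below illustrates this setup with a generic Hugging Face BERT encoder; the class name, label set, and checkpoint are illustrative, not the exact components used in this repository.

```python
# Minimal sketch of punctuation restoration as token classification.
# Label set and model name are illustrative placeholders.
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

PUNCT_LABELS = ["O", "COMMA", "PERIOD", "QUESTION"]  # hypothetical label set

class BertPunctuator(nn.Module):
    def __init__(self, model_name="bert-base-uncased", num_labels=len(PUNCT_LABELS)):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.classifier = nn.Linear(self.encoder.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        # Encode the unpunctuated input, then score each token against
        # the punctuation classes.
        hidden = self.encoder(input_ids=input_ids,
                              attention_mask=attention_mask).last_hidden_state
        return self.classifier(hidden)  # shape: (batch, seq_len, num_labels)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = BertPunctuator()
batch = tokenizer(["how are you today i am fine"], return_tensors="pt")
logits = model(batch["input_ids"], batch["attention_mask"])
predictions = logits.argmax(-1)  # per-token punctuation predictions
```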
.
├── docs
│   └── paper             # The submitted paper
├── notebooks             # Notebooks for data preparation/preprocessing
└── src
    └── neural_punctuator
        ├── base          # Base classes for training Torch models
        ├── configs       # YAML files defining the parameters of each model (see the loading sketch below)
        ├── models        # Torch model definitions
        ├── preprocessors # Preprocessor class
        ├── trainers      # Training logic
        ├── utils         # Utility scripts (logging, metrics, TensorBoard, etc.)
        └── wrappers      # Wrapper classes bundling the components needed for training/prediction
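Each model is parameterized by a YAML file under src/neural_punctuator/configs. A minimal sketch of reading such a config, assuming a hypothetical file name and keys (see the actual files for the real schema):

```python
# Minimal sketch of loading a model config; the file name and keys below
# are hypothetical placeholders, not the repo's actual schema.
import yaml

with open("src/neural_punctuator/configs/bert-base.yaml") as f:
    config = yaml.safe_load(f)

# e.g. inspect the encoder and trainer settings, if present
print(config.get("model"), config.get("trainer"))
```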
TED Talks dataset (English, IWSLT) - http://hltc.cs.ust.hk/iwslt/index.php/evaluation-campaign/ted-task.html
Szeged Treebank (Hungarian) - https://rgai.inf.u-szeged.hu/node/113
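Both corpora come with gold punctuation, so supervised training examples can be derived by stripping the punctuation from the text and keeping it as a per-word target. The sketch below shows this preprocessing idea with an illustrative label set; the repo's actual preprocessing lives in the notebooks and the Preprocessor class.

```python
# Minimal sketch of deriving (words, labels) pairs from punctuated text.
# The target inventory is illustrative, not the repo's exact label set.
TARGETS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION"}

def make_example(sentence):
    words, labels = [], []
    for token in sentence.split():
        mark = token[-1] if token[-1] in TARGETS else None
        words.append(token.rstrip(",.?").lower())
        labels.append(TARGETS[mark] if mark else "O")
    return words, labels

print(make_example("How are you? I am fine."))
# (['how', 'are', 'you', 'i', 'am', 'fine'],
#  ['O', 'O', 'QUESTION', 'O', 'O', 'PERIOD'])
```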
If you use our work, please cite the following paper:
@article{nagy2021automatic,
  title={Automatic punctuation restoration with {BERT} models},
  author={Nagy, Attila and Bial, Bence and {\'A}cs, Judit},
  journal={arXiv preprint arXiv:2101.07343},
  year={2021}
}
Attila Nagy, Bence Bial, Judit Ács
Budapest University of Technology and Economics - Department of Automation and Applied Informatics