Master thesis: Named Entity Recognition in Speech to Text Transcripts

Master thesis on Proper Capitalization in Speech to Text Transcripts using NER
This repository contains the code and results from the conducted experiments.

Abstract

Traditionally, named entity recognition (NER) research uses properly capitalized data for training and testing; however, these results give little insight into how such models may perform in scenarios where proper capitalization is not in place. In this thesis, we explore the capabilities of five fine-tuned BERT-based models for NER on all-lowercase text. Furthermore, the aim is to measure performance both for classifying named entity types correctly and for simply detecting that a named entity is present, so that a capitalization error can be corrected. Performance is assessed on all-lowercase data from the NorNE dataset and the Norwegian Parliamentary Speech Corpus.
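
As a rough illustration of the task, the sketch below runs a BERT-based NER model over a lowercased sentence with the Hugging Face `transformers` pipeline. The checkpoint name and the example sentence are assumptions for illustration, not necessarily the exact models or data used in the thesis.

```python
# Minimal sketch: NER on all-lowercase text with a BERT-based model.
# The checkpoint below (ScandiNER) is an assumed example, not necessarily
# one of the fine-tuned models from this thesis.
from transformers import pipeline

ner = pipeline(
    "token-classification",
    model="saattrupdan/nbailab-base-ner-scandi",
    aggregation_strategy="simple",  # merge word pieces into whole entities
)

# Lowercased input, as it would appear in a speech-to-text transcript.
text = "stortinget møtes i oslo på mandag."

for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```

Entities detected this way could then be used to restore the capitalization of the corresponding tokens in the transcript.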

Some results from the experiments

NorNE results

| Model      | No Boundary   | Strict Match  | Binary        |
|------------|---------------|---------------|---------------|
| SpaCy*     | 50.87         | 45.89         | 80.17         |
| NCRF++     | 42.33         | 32.20         | 73.66         |
| mBERT      | 82.15 (±0.50) | 79.97 (±0.41) | 95.51 (±0.56) |
| NorBERT    | 83.12 (±1.10) | 82.19 (±1.26) | 96.32 (±0.10) |
| NorBERT2   | 85.45 (±0.23) | 84.38 (±0.37) | 95.32 (±0.08) |
| NB-BERT    | 87.00 (±0.30) | 86.50 (±0.34) | 97.06 (±0.09) |
| Ner-Scandi | 87.06 (±0.56) | 86.58 (±0.45) | 97.07 (±0.13) |

NorNE results for all models. For the BERT models, scores are the average F1 across all seeds for a given model. No Boundary refers to NER evaluation where the named entity token span is not taken into account. Strict Match refers to NER evaluation where the named entity token span is included. Binary denotes the task of only detecting whether an entity is present, without an associated entity type. *Ten sentences were dropped from the SpaCy pipeline.
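
For reference, one way the three evaluation modes in the table could be computed is sketched below using `seqeval` (span-level) and `scikit-learn` (token-level). This is an interpretation of the metric definitions in the caption, not necessarily the thesis' exact evaluation code, and the toy label sequences are made up.

```python
# Assumed reading of the three metrics: Strict Match (span-level), No Boundary
# (token-level entity type), and Binary (entity vs. non-entity token).
from seqeval.metrics import f1_score as span_f1
from seqeval.scheme import IOB2
from sklearn.metrics import f1_score as token_f1

gold = [["B-PER", "I-PER", "O", "B-LOC"]]   # toy gold labels
pred = [["B-PER", "I-PER", "O", "I-LOC"]]   # toy predictions

# Strict Match: entity type AND exact span boundaries must both agree.
strict = span_f1(gold, pred, mode="strict", scheme=IOB2)

# No Boundary: per-token entity type, ignoring the B-/I- boundary prefix.
def strip_prefix(sequences):
    return [tag.split("-")[-1] for seq in sequences for tag in seq]

no_boundary = token_f1(strip_prefix(gold), strip_prefix(pred),
                       average="micro", labels=["PER", "LOC"])

# Binary: only whether a token belongs to any entity at all.
def binarize(sequences):
    return [0 if tag == "O" else 1 for seq in sequences for tag in seq]

binary = token_f1(binarize(gold), binarize(pred))

print(strict, no_boundary, binary)
```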

NPSC binary results


| Model      | Class      | Precision     | Recall        | F1            | Support |
|------------|------------|---------------|---------------|---------------|---------|
| SpaCy*     | 0          | 99.76         | 98.28         | 99.02         | 1013745 |
|            | 1          | 57.27         | 90.77         | 70.23         | 25723   |
|            | Macro avg. | 78.51         | 94.52         | 84.62         | 1039468 |
| NCRF++     | 0          | 97.38         | 99.77         | 98.56         | 1014791 |
|            | 1          | 85.85         | 34.04         | 48.75         | 41247   |
|            | Macro avg. | 91.61         | 66.91         | 73.66         | 1056038 |
| mBERT      | 0          | 99.41 (±0.06) | 99.64 (±0.03) | 99.52 (±0.02) | 1014791 |
|            | 1          | 91.09 (±0.70) | 86.18 (±1.19) | 88.56 (±0.39) | 41247   |
|            | Macro avg. | 95.25 (±0.32) | 92.91 (±0.58) | 94.04 (±0.20) | 1039468 |
| NorBERT    | 0          | 99.41 (±0.11) | 99.74 (±0.04) | 99.58 (±0.04) | 1014791 |
|            | 1          | 93.69 (±1.04) | 86.64 (±1.97) | 90.00 (±0.66) | 41247   |
|            | Macro avg. | 96.55 (±0.47) | 93.19 (±0.97) | 94.79 (±0.35) | 1039468 |
| NorBERT2   | 0          | 99.39 (±0.05) | 99.79 (±0.02) | 99.59 (±0.02) | 1014791 |
|            | 1          | 94.81 (±0.47) | 86.42 (±0.89) | 90.41 (±0.36) | 41247   |
|            | Macro avg. | 97.10 (±0.22) | 93.10 (±0.44) | 95.00 (±0.19) | 1039468 |
| NB-BERT    | 0          | 99.35 (±0.04) | 99.85 (±0.02) | 99.59 (±0.02) | 1014791 |
|            | 1          | 96.22 (±0.47) | 85.66 (±0.78) | 90.63 (±0.43) | 41247   |
|            | Macro avg. | 97.78 (±0.23) | 92.76 (±0.39) | 95.11 (±0.23) | 1039468 |
| Scandi-ner | 0          | 99.42 (±0.05) | 99.82 (±0.02) | 99.62 (±0.02) | 1014791 |
|            | 1          | 95.71 (±0.39) | 87.03 (±0.97) | 91.16 (±0.43) | 41247   |
|            | Macro avg. | 97.56 (±0.18) | 93.43 (±0.48) | 95.39 (±0.23) | 1039468 |

NPSC NorNE binary classification precision, recall, and F1 metrics for all models. Class 0 denotes the Not Capital Letter class and class 1 denotes the Capital Letter class. The Support column refers to the number of entries for a given class. *729 sentences were dropped from the SpaCy pipeline because the predicted sequence length differed from the gold-standard sequence length.
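
As a rough sketch of how per-class and macro-averaged numbers like these could be produced, the snippet below uses `scikit-learn`'s classification report on token-level binary labels. The reading of classes 0 and 1 follows the caption above; the toy label lists are made up for illustration, not thesis data.

```python
# Sketch of the token-level binary capitalization evaluation described above:
# class 1 = token should be capitalized, class 0 = it should not.
from sklearn.metrics import classification_report

gold = [0, 1, 0, 0, 1, 0, 1]  # toy gold labels, one per token
pred = [0, 1, 0, 1, 1, 0, 0]  # toy model predictions

# Reports per-class precision/recall/F1 and the macro average, as in the table.
print(classification_report(
    gold, pred,
    target_names=["Not Capital Letter (0)", "Capital Letter (1)"],
    digits=2,
))
```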





Resources

Link to the NorNE dataset: https://github.com/ltgoslo/norne
Link to the Norwegian Parliamentary Speech Corpus: https://www.nb.no/sprakbanken/en/resource-catalogue/oai-nb-no-sbr-58/
