Assessing the Impact of OCR Quality on Downstream NLP Tasks

This repository provides underlying code for the conference paper 'Assessing the Impact of OCR Quality on Downstream NLP Tasks'

What is this?

The paper assesses the impact of OCR quality on a variety of downstream NLP tasks using a dataset of OCR'd articles from 19th-century newspapers. This repository includes code for downloading the data and processing it into a Pandas DataFrame, along with code for each section of the paper (outlined further below).

Setup

The majority of the analysis is done in Python 3. You can create an environment for running this code using the Anaconda package manager and the environment file included in this repository.

Install the required packages

Note: While Conda environments are largely operating-system agnostic, we have only tested this environment on macOS. The pandarallel Python package, which we use for parallelizing some steps in the notebooks, only works on Windows if executed from the Windows Subsystem for Linux (WSL). If you want to run the notebooks from Windows directly, this should be possible by replacing parallel_apply with apply. The default Pandas apply function will then be used, and the affected steps will run more slowly.
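As an alternative to editing each call by hand, the fallback described in the note could be wrapped in a small helper. This is only a sketch and not code from the notebooks: `apply_with_fallback` is a hypothetical name, and it relies on the assumption that `parallel_apply` is only present on Pandas objects after `pandarallel.initialize()` has been called.

```python
def apply_with_fallback(data, func, **kwargs):
    """Use parallel_apply when pandarallel has added it, otherwise plain apply.

    `data` is expected to be a Pandas Series or DataFrame (hypothetical helper,
    not part of the notebooks).
    """
    # Assumption: pandarallel.initialize() attaches parallel_apply to Pandas
    # objects, so the attribute is missing if pandarallel was never set up
    # (for example when running natively on Windows).
    parallel = getattr(data, "parallel_apply", None)
    if callable(parallel):
        return parallel(func, **kwargs)
    # Fall back to the default, single-process Pandas apply.
    return data.apply(func, **kwargs)
```

A call such as `apply_with_fallback(df["text"], clean_text)` (with `df`, `"text"` and `clean_text` standing in for whatever the notebook uses) would then behave the same on every platform, only more slowly where pandarallel is unavailable.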

Install Anaconda following these instructions.
Create the ocr-evaluation environment:

conda env create -f environment.yml

Activate the ocr-evaluation environment:

conda activate ocr-evaluation

Contents

1) create_trove_dataframe.ipynb

This notebook covers the steps required to download the Overproof data and process it into a Pandas DataFrame. It will take some time to run the first time, so it will probably be easiest to 'run all cells' and return to the notebook a few hours later. The notebook outputs the DataFrame as a pickle file, which allows for quicker loading of the DataFrame in the subsequent notebooks.
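The "build once, then load from a pickle" pattern can be sketched as follows. This is an illustration using only the standard library, not the notebook's own code: `load_or_build` and the file name in the usage note are hypothetical, and Pandas users may prefer `DataFrame.to_pickle` and `pandas.read_pickle` for the same purpose.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if the pickle exists, else build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: reuse the result of an earlier run.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    # Slow path: run the expensive step (e.g. downloading and parsing the data).
    result = build()
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

For example, `load_or_build("trove.pkl", build_dataframe)` would call the hypothetical `build_dataframe` only on the first run. Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.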

2) dictionary_lookup_word_errorrate.ipynb

This notebook:

- plots dictionary lookup against string similarity between the OCR'd and human-corrected versions of the text
- produces plots comparing Jaccard and Levenshtein distance similarity
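For readers unfamiliar with these measures, the sketch below shows one common way to define them. The exact definitions used in the notebook (tokenisation, case handling, normalisation) are not stated here, so the choices below are assumptions, and all function names are hypothetical.

```python
def dictionary_lookup_score(tokens, vocabulary):
    """Proportion of tokens (lowercased) that appear in a reference vocabulary."""
    if not tokens:
        return 0.0
    return sum(token.lower() in vocabulary for token in tokens) / len(tokens)


def jaccard_similarity(tokens_a, tokens_b):
    """Size of the intersection of two token sets divided by size of their union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    if not set_a and not set_b:
        return 1.0  # assumption: two empty texts count as identical
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Levenshtein distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein_distance(a, b) / longest
```

Jaccard similarity ignores word order and repetition, whereas Levenshtein similarity is sensitive to every character, which is one reason the two can diverge on the same pair of texts.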

3) aligning_trove.ipynb

This notebook performs alignment between the two versions of the text. Further explanation of the approach taken is outlined in the notebook.
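To give a rough idea of what token-level alignment involves, here is a minimal sketch built on `difflib.SequenceMatcher` from the Python standard library. It is not the approach taken in the notebook, which is documented there; `align_tokens` is a hypothetical helper, and treating uneven spans as single groups is an assumption made for simplicity.

```python
from difflib import SequenceMatcher


def align_tokens(ocr_tokens, corrected_tokens):
    """Pair OCR tokens with corrected tokens; unmatched sides are None."""
    matcher = SequenceMatcher(a=ocr_tokens, b=corrected_tokens, autojunk=False)
    aligned = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        if tag == "equal" or (tag == "replace" and same_length):
            # One-to-one spans: pair the tokens position by position.
            aligned.extend(zip(ocr_tokens[i1:i2], corrected_tokens[j1:j2]))
        else:
            # Insertions, deletions and uneven replacements: keep each side
            # as one group, using None when a side is empty.
            ocr_span = " ".join(ocr_tokens[i1:i2]) or None
            corrected_span = " ".join(corrected_tokens[j1:j2]) or None
            aligned.append((ocr_span, corrected_span))
    return aligned
```

For instance, aligning `["Tbe", "quick", "brown", "fox"]` with `["The", "quick", "brown", "fox"]` pairs `"Tbe"` with `"The"` and leaves the remaining tokens matched to themselves.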

4) alignment_assessment.ipynb

This notebook evaluates the alignments created in the previous notebook.

5) linguistic_processing_trove.ipynb

This notebook assesses the impact of OCR on:

- Part-of-speech tagging accuracy (fine- and coarse-grained)
- Named entity recognition accuracy (matching type, matching type and IOB-tag)
  - Persons: f-score (by quality band)
  - Geopolitical entities: f-score (by quality band)
  - Dates: f-score (by quality band)
- Dependency parsing
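The two kinds of score listed above can be sketched as follows. This assumes that annotations produced on the human-corrected text serve as the reference and those produced on the OCR'd text as the prediction, that tokens have already been aligned one to one, and that entities are represented as `(start, end, label)` tuples. The function names and the labels used in the usage note are hypothetical rather than taken from the notebook.

```python
def tagging_accuracy(reference_tags, predicted_tags):
    """Share of aligned tokens whose predicted tag equals the reference tag."""
    if len(reference_tags) != len(predicted_tags):
        raise ValueError("tag sequences must be aligned token by token")
    if not reference_tags:
        return 0.0
    matches = sum(r == p for r, p in zip(reference_tags, predicted_tags))
    return matches / len(reference_tags)


def entity_f_score(reference_entities, predicted_entities, label):
    """F-score for one entity type, counting exact (start, end, label) matches."""
    reference = {e for e in reference_entities if e[2] == label}
    predicted = {e for e in predicted_entities if e[2] == label}
    true_positives = len(reference & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(reference)
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)
```

Scores per quality band could then be obtained by grouping articles by band and calling `entity_f_score` on each group with a label such as `"PERSON"`.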

6) topic_modelling_main.ipynb

This notebook goes through the steps of creating topic models with Latent Dirichlet Allocation (LDA), using the Gensim implementation.
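As a rough orientation, fitting an LDA model with Gensim typically involves tokenising the documents, building a dictionary, converting each document to a bag of words, and training the model. The sketch below follows that outline; the tokenisation, the parameter values and the function names are assumptions for illustration and do not reflect the settings used in the notebook.

```python
def tokenize(document, min_length=3):
    """Lowercase, split on whitespace and drop very short tokens."""
    return [t for t in document.lower().split() if len(t) >= min_length]


def train_lda(documents, num_topics=10, passes=5, random_state=0):
    """Fit a Gensim LDA model on a list of raw text documents."""
    # Imported lazily so that tokenize() remains usable without Gensim.
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    texts = [tokenize(doc) for doc in documents]
    dictionary = Dictionary(texts)  # maps each token to an integer id
    corpus = [dictionary.doc2bow(text) for text in texts]  # bag-of-words vectors
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=random_state,  # fixed seed for repeatable topics
    )
    return model, dictionary, corpus
```

After training, `model.print_topics()` lists the highest-weighted words for each topic, which is a quick way to compare models built from OCR'd and human-corrected text.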

7) topic_modelling_secondary.ipynb

This notebook performs an evaluation of the topic models created in the above notebook.

Language model notebooks

The notebooks for the language model evaluation rely on having a pre-trained language model to fine-tune. This language model will be released alongside a forthcoming paper. Once it has been released, this notebook will be updated with instructions on how to fine-tune the language model.

Acknowledgements

This work is part of the Living with Machines project. Living with Machines is a multidisciplinary programme funded by the Strategic Priority Fund which is led by UK Research and Innovation (UKRI) and delivered by the Arts and Humanities Research Council (AHRC).

License

This work is licensed under a Creative Commons Attribution 4.0 International License.
