This repository contains code underpinning the conference paper 'Assessing the Impact of OCR Quality on Downstream NLP Tasks'. The paper assesses the impact of OCR quality on a variety of downstream NLP tasks using a dataset of OCR'd articles from 19th-century newspapers. This repository includes code for downloading and processing the data into a pandas DataFrame, along with code for each section of the paper (outlined further below).
The majority of the analysis is done in Python 3. You can create an environment for running this code using the Anaconda package manager and the environment file included in this repository.
Note: while Conda environments are largely operating-system agnostic, we have only tested this environment on macOS. The pandarallel Python package, which we use to parallelize some steps in the notebooks, only works on Windows if executed from the Windows Subsystem for Linux (WSL). If you want to run the notebooks directly on Windows, this should be possible by replacing `parallel_apply` with `apply`, which falls back to the default (single-core, and therefore slower) pandas apply method.
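For example, a minimal sketch of that swap (the DataFrame, column, and function here are illustrative, not taken from the notebooks):

```python
import pandas as pd

df = pd.DataFrame({"ocr_text": ["tbe quick brown fox", "jumps ovcr the lazy dog"]})

# On macOS/Linux/WSL the notebooks parallelize with pandarallel:
#   from pandarallel import pandarallel
#   pandarallel.initialize()
#   df["n_tokens"] = df["ocr_text"].parallel_apply(lambda text: len(text.split()))

# On Windows, fall back to the standard single-core pandas apply:
df["n_tokens"] = df["ocr_text"].apply(lambda text: len(text.split()))
print(df)
```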
- Install Anaconda following these instructions.
- Create the `ocr-evaluation` environment: `conda env create -f environment.yml`
- Activate the `ocr-evaluation` environment: `conda activate ocr-evaluation`
All of the dependencies for these notebooks should be covered by the Conda environment above. The notebooks below are presented in the order they appear in the original paper.
This notebook:
- covers the steps required to download the Overproof data and process it into a pandas DataFrame
- will take some time to run the first time. It outputs the DataFrame as a pickle file, which allows quicker loading in the subsequent notebooks (it will probably be easiest to 'run all cells' and return to the notebook a few hours later); this caching pattern is sketched after the list.
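As a rough sketch of that caching pattern (the file name and columns are illustrative, not the notebook's actual ones):

```python
import pandas as pd

# First run: after downloading and parsing the Overproof data, cache the
# resulting DataFrame to disk as a pickle file.
df = pd.DataFrame({
    "ocr_text": ["tbe quick brown fox"],
    "corrected_text": ["the quick brown fox"],
})
df.to_pickle("overproof_df.pkl")

# Subsequent notebooks reload the cached DataFrame in seconds
# instead of re-downloading and re-processing everything.
df = pd.read_pickle("overproof_df.pkl")
```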
This notebook:
- plots dictionary lookup against string similarity between the OCR'd and human-corrected versions of the text
- produces plots comparing Jaccard and Levenshtein similarity (both measures are sketched after the list)
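As a standard-library sketch of the two measures (a simplified illustration, not the notebook's exact implementation):

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard similarity over the sets of word tokens in two strings."""
    set_a, set_b = set(a.split()), set(b.split())
    if not set_a | set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein_similarity(a: str, b: str) -> float:
    """Levenshtein edit distance, normalized to a similarity in [0, 1]."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                    # deletion
                            curr[j - 1] + 1,                # insertion
                            prev[j - 1] + (ch_a != ch_b)))  # substitution
        prev = curr
    return 1 - prev[-1] / max(len(a), len(b))


ocr, corrected = "tbe quick brown fox", "the quick brown fox"
print(jaccard_similarity(ocr, corrected))      # word-level overlap
print(levenshtein_similarity(ocr, corrected))  # character-level similarity
```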
This notebook performs alignment between the two versions of the text. Further explanation of the approach taken is outlined in the notebook.
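Purely as an illustration of the general idea of aligning two versions of a text (not necessarily the method this notebook uses), Python's standard library can produce an alignment like this:

```python
from difflib import SequenceMatcher

ocr = "tbe quick brown f0x"
corrected = "the quick brown fox"

# get_opcodes() describes how to turn the OCR'd text into the corrected
# text; 'equal' spans are aligned text, the rest flag OCR errors.
matcher = SequenceMatcher(None, ocr, corrected, autojunk=False)
for tag, i1, i2, j1, j2 in matcher.get_opcodes():
    print(f"{tag:8} {ocr[i1:i2]!r} -> {corrected[j1:j2]!r}")
```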
This notebook evaluates the alignments created in the previous notebook.
This notebook assesses the impact of OCR on the following (a sketch of the comparison setup follows the list):
- Part-of-speech tagging accuracy (fine- and coarse-grained)
- Named entity recognition accuracy (matching type, matching type and IOB-tag)
- Persons: f-score (by quality band)
- Geopolitical entities: f-score (by quality band)
- Dates: f-score (by quality band)
- Dependency parsing
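As a hedged sketch of the comparison setup, using spaCy (the model name and the naive one-to-one token pairing are assumptions for illustration; in practice the aligned token pairs from the alignment notebook would be used):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes this model is installed

ocr = "tbe quick brown fox jumps ovcr the lazy dog"
corrected = "the quick brown fox jumps over the lazy dog"
doc_ocr, doc_cor = nlp(ocr), nlp(corrected)

# Coarse-grained POS agreement between the two versions, pairing tokens
# naively by position (real code would use the alignments).
matches = sum(t1.pos_ == t2.pos_ for t1, t2 in zip(doc_ocr, doc_cor))
print(f"coarse POS agreement: {matches / len(doc_cor):.2%}")

# Named entities in each version, for comparing entity type and span.
print([(ent.text, ent.label_) for ent in doc_ocr.ents])
print([(ent.text, ent.label_) for ent in doc_cor.ents])
```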
This notebook goes through the steps of creating topic models with Latent Dirichlet Allocation (LDA), using the Gensim implementation (a minimal sketch follows).
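A minimal Gensim sketch of the training step (the toy corpus and parameter values are illustrative only):

```python
from gensim.corpora import Dictionary
from gensim.models import LdaModel

# Each document is a list of tokens; the real input would be the OCR'd articles.
texts = [
    ["steam", "engine", "railway", "locomotive"],
    ["cotton", "mill", "factory", "labour"],
    ["railway", "station", "engine", "travel"],
]

dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

lda = LdaModel(corpus=corpus, id2word=dictionary,
               num_topics=2, passes=10, random_state=0)
for topic_id, topic in lda.print_topics():
    print(topic_id, topic)
```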
This notebook performs an evaluation of the topic models created in the notebook above.
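One common way to evaluate topic models (not necessarily the metric used in this notebook) is topic coherence, e.g. via Gensim's CoherenceModel; this sketch reuses the `lda`, `texts`, and `dictionary` objects from the previous example:

```python
from gensim.models import CoherenceModel

# c_v coherence: higher scores indicate more semantically coherent topics.
coherence = CoherenceModel(model=lda, texts=texts,
                           dictionary=dictionary, coherence="c_v")
print(f"c_v coherence: {coherence.get_coherence():.3f}")
```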
The notebooks for the language model evaluation rely on a pre-trained language model to fine-tune. This language model will be released alongside a forthcoming paper; once it has been released, this notebook will be updated with instructions on how to fine-tune the language model.
This work is part of the Living with Machines project. Living with Machines is a multidisciplinary programme funded by the Strategic Priorities Fund, which is led by UK Research and Innovation (UKRI) and delivered by the Arts and Humanities Research Council (AHRC).
This work is licensed under a Creative Commons Attribution 4.0 International License.