Repository for code underlying the paper 'Assessing the Impact of OCR Quality on Downstream NLP Tasks'

Living-with-machines/lwm_ARTIDIGH_2020_OCR_impact_downstream_NLP_tasks

Assessing the Impact of OCR Quality on Downstream NLP Tasks

This repository provides the underlying code for the conference paper 'Assessing the Impact of OCR Quality on Downstream NLP Tasks'.

What is this?

The paper assesses the impact of OCR quality on a variety of downstream NLP tasks, using a dataset of OCR'd articles from 19th-century newspapers. This repository includes code for downloading the data and processing it into a Pandas DataFrame, along with code for each section of the paper (outlined further below).

Setup

The majority of the analysis is done in Python 3. You can create an environment for running this code using the Anaconda package manager and the environment file included in this repository.

Install the required packages

Note: While Conda environments are largely operating-system agnostic, we have only tested this environment on macOS. The pandarallel Python package, which we use to parallelize some steps in the notebooks, only works on Windows when executed from the Windows Subsystem for Linux (WSL). To run the notebooks from Windows directly, it should be possible to replace parallel_apply with apply. This uses the default Pandas apply function, so those steps will run more slowly.


  1. Install Anaconda following these instructions.

  2. Create the ocr-evaluation environment:

    conda env create -f environment.yml

  3. Activate the ocr-evaluation environment:

    conda activate ocr-evaluation
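
For the Windows case described in the note above, one way to avoid editing every call by hand is a small wrapper that uses parallel_apply when it is available and otherwise falls back to apply. This is only a sketch: the helper name apply_with_fallback and its use_parallel flag are hypothetical and are not part of the notebooks or of pandarallel.

```python
def apply_with_fallback(obj, func, use_parallel=True, **kwargs):
    """Apply func via obj.parallel_apply when available, else via obj.apply.

    obj is expected to be a pandas Series or DataFrame. pandarallel only
    attaches parallel_apply after pandarallel.initialize() has been called,
    so without that call this falls back to the standard, single-process
    pandas apply.
    """
    parallel = getattr(obj, "parallel_apply", None) if use_parallel else None
    if callable(parallel):
        return parallel(func, **kwargs)
    return obj.apply(func, **kwargs)
```

A call in a notebook might then look like apply_with_fallback(df["text"], len), where the column name "text" is purely illustrative. Passing use_parallel=False forces the plain apply path.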

Contents

All of the dependencies for these notebooks should be covered by using the above Conda Environment. The notebooks below are presented in the order they appear in the original paper.

This notebook

  • covers the steps required to download the Overproof data and process it into a Pandas DataFrame.
  • takes some time to run the first time. It saves the DataFrame as a pickle file so that the subsequent notebooks can load it quickly. It will probably be easiest to 'run all cells' and return to the notebook a few hours later.
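
The "build once, then reload from pickle" pattern described above can be sketched with the standard library alone. The function name load_or_build and the cache path are hypothetical; the notebook itself may well use pandas' own DataFrame.to_pickle and pandas.read_pickle instead.

```python
import pickle
from pathlib import Path


def load_or_build(cache_path, build):
    """Return the cached object if cache_path exists, else build and cache it."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        # Fast path: a previous run already saved the result.
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    result = build()  # slow path, e.g. downloading and parsing the data
    with cache_path.open("wb") as handle:
        pickle.dump(result, handle)
    return result
```

Here build is any zero-argument callable that returns a picklable object, such as a DataFrame. Pickle files should only be loaded if you created them yourself, since unpickling can execute arbitrary code.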

This notebook

  • plots dictionary lookup against string similarity between the OCR'd and human-corrected versions of the text
  • produces plots comparing Jaccard similarity and Levenshtein distance similarity
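
To make the two similarity measures concrete, here is a minimal pure-Python sketch. It assumes whitespace tokenisation for Jaccard similarity and normalisation by the longer string's length for Levenshtein similarity; the notebook may tokenise or normalise differently, or rely on a library implementation.

```python
def jaccard_similarity(text_a, text_b):
    """Jaccard similarity between the sets of whitespace-separated tokens."""
    tokens_a, tokens_b = set(text_a.split()), set(text_b.split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def levenshtein_similarity(a, b):
    """Levenshtein distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest
```

The two measures respond differently to OCR noise: a single misread character changes a whole token for Jaccard similarity, but only one edit for Levenshtein similarity.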

This notebook performs alignment between the two versions of the text. Further explanation of the approach taken is outlined in the notebook.

This notebook evaluates the alignments created in the previous notebook.

This notebook assesses the impact of OCR on:

  • Part-of-speech tagging accuracy (fine- and coarse-grained)
  • Named entity recognition accuracy (matching type, matching type and IOB-tag)
    • Persons: f-score (by quality band)
    • Geopolitical entities: f-score (by quality band)
    • Dates: f-score (by quality band)
  • Dependency parsing
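
As an illustration of a per-entity-type f-score, the sketch below compares entities found in the human-corrected text (treated as gold) with those found in the OCR'd text. It assumes entities are represented as (start, end, label) tuples and counts only exact span-and-label matches, which is a simplification and not necessarily the matching scheme used in the notebook. The function name and the label strings used in the example are likewise assumptions.

```python
def f_score_by_type(gold, predicted):
    """Precision, recall and F1 per entity label.

    gold and predicted are iterables of (start, end, label) tuples. An
    entity counts as correct only if its span and label both match exactly.
    """
    gold, predicted = set(gold), set(predicted)
    scores = {}
    for label in sorted({ent[2] for ent in gold | predicted}):
        gold_l = {ent for ent in gold if ent[2] == label}
        pred_l = {ent for ent in predicted if ent[2] == label}
        true_pos = len(gold_l & pred_l)
        # Guard against division by zero when a label is missing on one side.
        precision = true_pos / len(pred_l) if pred_l else 0.0
        recall = true_pos / len(gold_l) if gold_l else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores
```

For example, with gold entities [(0, 1, "PERSON"), (3, 4, "GPE")] and predicted entities [(0, 1, "PERSON"), (3, 4, "PERSON")], the GPE entity is missed and also counted as a spurious PERSON, lowering precision for PERSON and recall for GPE. Computing these scores separately for each OCR quality band would give a breakdown of the kind listed above.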

This notebook goes through the steps of creating topic models with Latent Dirichlet Allocation (LDA), using the Gensim implementation.

This notebook performs an evaluation of the topic models created in the above notebook.

Language model notebooks

The notebooks for the language model evaluation rely on having a pre-trained language model to fine-tune. This language model will be released alongside a forthcoming paper. Once it has been released, this section will be updated with instructions on how to fine-tune the language model.

Acknowledgements

This work is part of the Living with Machines project. Living with Machines is a multidisciplinary programme funded by the Strategic Priority Fund which is led by UK Research and Innovation (UKRI) and delivered by the Arts and Humanities Research Council (AHRC).

License


This work is licensed under a Creative Commons Attribution 4.0 International License.
