There is a growing interest in using Machine Learning to automatically produce metadata for GLAM (Galleries, Libraries, Archives and Museums) collections. This repository contains the source material for a Jupyter Book that moves through the steps of developing a machine learning model to classify book titles into 'crude' genres (fiction or non-fiction). In particular, we work with the British Library's "Microsoft Digitised Books" collection to automatically generate metadata for ~49,455 titles.
The Jupyter Book aims to give an overview of the broader pipeline involved in creating machine learning models, i.e. not just the model training process but also the steps before and after it.
Topics covered include:
- exploring our training data against the entire corpus and assessing the 'representativeness' of our digitised collection
- training an initial baseline model (see the baseline sketch after this list)
- assessing weaknesses in our model
- using weak supervision to create more training data (see the labeling-function sketch after this list)
- discussion of how to share models and data
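To make the baseline step concrete, here is a minimal sketch of a title classifier built with scikit-learn. The inline titles and labels are illustrative stand-ins, not drawn from the actual training data:

```python
# A minimal baseline sketch: TF-IDF character n-grams over book
# titles feeding a logistic regression classifier. The tiny inline
# dataset is illustrative only; the real notebooks train on the
# British Library annotations linked below.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

titles = [
    "The Mill on the Floss: a novel",
    "Tales of the North Riding",
    "A treatise on the steam engine",
    "Annals of the parish church",
]
labels = ["fiction", "fiction", "non-fiction", "non-fiction"]

# Character n-grams cope well with short, noisy catalogue titles.
model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
model.fit(titles, labels)
print(model.predict(["The history of British birds"]))
```

A cheap linear model like this gives a reference point against which later improvements can be measured.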
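As an illustration of the weak supervision step, here is a minimal labeling-function sketch using Snorkel. Whether the notebooks use Snorkel, and the specific heuristics and example titles shown here, are assumptions for illustration:

```python
# A weak supervision sketch: simple heuristic labeling functions
# vote on unlabelled titles, and Snorkel's LabelModel combines
# their noisy votes into training labels. The heuristics and
# example titles below are illustrative assumptions.
import pandas as pd
from snorkel.labeling import labeling_function, PandasLFApplier
from snorkel.labeling.model import LabelModel

ABSTAIN, NON_FICTION, FICTION = -1, 0, 1

# A handful of unlabelled titles standing in for the real corpus.
df = pd.DataFrame({"title": [
    "The Mystery of Edwin Drood: a novel",
    "A History of the English Church",
    "Poems and ballads",
]})

@labeling_function()
def lf_novel(x):
    # Titles that describe themselves as novels are almost always fiction.
    return FICTION if "novel" in x.title.lower() else ABSTAIN

@labeling_function()
def lf_history(x):
    # "History" in a title is a strong non-fiction signal.
    return NON_FICTION if "history" in x.title.lower() else ABSTAIN

# Apply the labeling functions, then combine their noisy votes.
L = PandasLFApplier(lfs=[lf_novel, lf_history]).apply(df=df)
label_model = LabelModel(cardinality=2, verbose=False)
label_model.fit(L_train=L, n_epochs=100, seed=42)
df["weak_label"] = label_model.predict(L=L)  # -1 where every LF abstained
print(df)
```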
We use several Python machine learning libraries in the notebooks.
While we aim to give an overview of the steps involved in training a machine learning model, we don't aim to provide a full introduction to machine learning or Natural Language Processing.
As part of this work we also share:
- the initial training data: https://doi.org/10.23636/BKHQ-0312 (also available on the Hugging Face Hub)
- the baseline model: available on Zenodo and on the Hugging Face Hub
- the improved model: https://huggingface.co/BritishLibraryLabs/bl-books-genre (see the loading sketch after this list)
- two demos:
- blog post
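The improved model can be loaded straight from the Hugging Face Hub. Here is a minimal sketch using the transformers pipeline API; the example title is illustrative, and the exact label names returned come from the model's config:

```python
from transformers import pipeline

# Load the released genre classifier directly from the Hugging Face Hub.
classifier = pipeline(
    "text-classification", model="BritishLibraryLabs/bl-books-genre"
)
# Returns a list of {"label": ..., "score": ...} dicts; the label
# names depend on the model's configuration.
print(classifier("The Wonderful Wizard of Oz"))
```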
This work was partially supported by Living with Machines. This project, funded by the UK Research and Innovation (UKRI) Strategic Priority Fund, is a multidisciplinary collaboration delivered by the Arts and Humanities Research Council (AHRC), with The Alan Turing Institute, the British Library and the Universities of Cambridge, East Anglia, Exeter, and Queen Mary University of London.