Submission 486, Chadha/Rashiti/Sibille (#60)
* feat: add submission 486

Co-authored-by: rashitig <[email protected]>
Co-authored-by: Tarun Chadha <[email protected]>
Co-authored-by: Christiane Sibille <[email protected]>

* fix: format citation syntax

* fix: add png file to repo

---------

Co-authored-by: rashitig <[email protected]>
Co-authored-by: Tarun Chadha <[email protected]>
Co-authored-by: Christiane Sibille <[email protected]>
4 people authored Sep 4, 2024
1 parent 031c530 commit 3c01a07
Showing 3 changed files with 173 additions and 0 deletions.
Binary file added submissions/486/images/graph.png
52 changes: 52 additions & 0 deletions submissions/486/index.qmd
@@ -0,0 +1,52 @@
---
submission_id: 486
categories: 'Session 7A'
title: From words to numbers. Methodological perspectives on large scale Named Entity Linking
author:
- name: Tarun Chadha
email: [email protected]
affiliations:
- ETH Zürich IT Services
- name: Gentiana Rashiti
email: [email protected]
orcid: 0009-0005-6799-4358
affiliations:
- ETH Zürich Library
- name: Christiane Sibille
email: [email protected]
orcid: 0000-0003-3689-2154
affiliations:
- ETH Zürich Library

keywords:
- Machine Learning
- Named Entity Linking
- Named Entity Recognition
- Historical Data
- Natural Language Processing
abstract: Named Entity Linking (NEL) describes the recognition, disambiguation, and linking of so-called «Named Entities» (such as people, places, and organizations) in text. Machine-assisted linking of entities helps to identify historical actors in large source corpora and thus contributes significantly to digital approaches in historical research. However, applying NEL to historical data presents unique challenges, ranging from poor OCR and alternate spellings to people in historical texts being under-represented in contemporary databases. Because we often have only sparse specific information about an entity in its direct context, we are developing a robust, modular, and scalable workflow in which we «embed» people by the context in which they appear. This provides additional information, enabling disambiguation even when only limited data is available and allowing NEL to be applied to large text corpora. Such techniques have been used and described in works such as [@10.1007/978-3-030-29563-9_13] and [@vasilyev2022namedentitylinkingentity]. By developing this pipeline and the corresponding embedding knowledge base(s) of historical entities, we aim to enable the use of such methods in the Swiss GLAM landscape.
date: 09-02-2024
bibliography: references.bib
---

## Introduction

Named entity recognition, disambiguation, and linking are pivotal methods in Natural Language Processing (NLP) applied to historical research. These methods present unique and complex challenges in the context of historical texts [@bunout2023; @luthra2022unsilencingcolonialarchivesautomated; @10.1145/3604931]. They grapple with the complexities arising from context-dependent meanings of named entities, as well as the issues of polysemy, homonymy, and naming variations.

Historically, solutions ranged from basic string matching to intricate rule-based heuristics. While these methods are still widely used, they often fall short in terms of scalability, generalization, and accuracy, particularly when compared to current machine-learning techniques. Recent advances have seen a shift towards leveraging contextual embeddings to achieve groundbreaking accuracy in these tasks, as evidenced by seminal works such as @yamada-etal-2016-joint; @ganea-hofmann-2017-deep; and @chen2020contextualizedendtoendneuralentity.

Vector embeddings are an essential tool in NLP for representing words as numerical vectors. When applied appropriately, they capture the semantic content of words depending on the context in which they appear. For instance, in sentences such as «I opened an account at the bank» and «Beavers build dams in river banks», the word «bank» would be embedded differently. On the other hand, the vector embeddings for «I sat down on the chair» and «I lowered myself onto the seat» would be «close» in the vector space, as they express similar content.
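
To make this concrete, the following minimal sketch (not part of our pipeline) computes contextual sentence embeddings and compares them with cosine similarity; it assumes the `sentence-transformers` package and the publicly available `all-MiniLM-L6-v2` model, but any sentence-embedding model would work.

```python
# Minimal sketch: contextual embeddings and cosine similarity.
# Assumes the `sentence-transformers` package and the public
# `all-MiniLM-L6-v2` model; any sentence-embedding model would do.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I sat down on the chair",          # similar in meaning ...
    "I lowered myself onto the seat",   # ... to this sentence
    "I opened an account at the bank",  # different content
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities: the first two sentences should score
# noticeably higher with each other than either does with the third.
print(util.cos_sim(embeddings, embeddings))
```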

For linking named entities in a text, e.g. persons, this means that we embed them based on the context in which they appear. If there are two viable candidates for a match between a name and a person (for example, the same first name, last name, and time period), but the name we are searching for appears in an article about architecture and one candidate is an architect while the other is a medical doctor, we can take this semantic context into account as an additional parameter when calculating a possible match.
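
A hypothetical example of this kind of context-based disambiguation is sketched below; the candidate descriptions, identifiers, and context sentence are invented for illustration, and the embedding model is the same assumed model as above.

```python
# Hypothetical disambiguation between two candidates sharing the same name.
# Candidate descriptions and identifiers are invented for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

context = "The article discusses the design of a new concert hall facade."
candidates = {
    "candidate-architect": "Hans Muster (1890-1960), Swiss architect and urban planner",
    "candidate-physician": "Hans Muster (1888-1955), Swiss physician and surgeon",
}

context_embedding = model.encode(context, convert_to_tensor=True)
scores = {
    entity_id: float(util.cos_sim(context_embedding,
                                  model.encode(description, convert_to_tensor=True)))
    for entity_id, description in candidates.items()
}

# The architect should receive the higher score for an architectural context.
best_match = max(scores, key=scores.get)
print(best_match, scores)
```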

## Methods
In our presentation, we will give a glimpse of the current state of our ambitious project, which aims to create a robust and scalable pipeline for applying embeddings-based NEL to historical texts. In our work, we focus on three key aspects. The first is an embeddings-based linking and disambiguation workflow applied to a historical corpus of Swiss magazines (E-Periodica) that uses Wikipedia, the Gemeinsame Normdatei (GND), and – since our primary use cases deal with historical material from Switzerland – the Historical Dictionary of Switzerland (HDS) as reference knowledge bases. This part aims to develop a performant and modular pipeline that recognizes named entities in retro-digitized texts and links them to so-called authority files (Normdaten) such as the GND. With this workflow, we will help to identify historical actors in source material and contribute to the in-depth FAIRification of large datasets through persistent identifiers at the text level. Our proposed pipeline is modular with respect to the embedding model, enabling performance comparisons across different embedding models and leaving room for future, improved embedding models that capture semantic similarities even better than current popular open-source models such as BERT.
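
A schematic sketch of how such a modular pipeline could be organised is shown below; the class and method names (`NELPipeline`, `Candidate`, `lookup`, `encode`) are illustrative assumptions, not our actual implementation. The NER component, knowledge-base lookup, and embedding model are treated as interchangeable building blocks, which is what allows different embedding models to be compared within the same workflow.

```python
# Schematic sketch of a modular NEL pipeline; names are illustrative,
# not the project's actual API. The NER component, knowledge base, and
# embedding model are interchangeable building blocks.
import math
from dataclasses import dataclass


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class Candidate:
    identifier: str    # e.g. a GND, HDS, or Wikipedia identifier
    description: str   # short text used to embed the candidate
    score: float = 0.0


class NELPipeline:
    def __init__(self, ner, knowledge_base, embedder):
        self.ner = ner            # callable: text -> [(mention, context), ...]
        self.kb = knowledge_base  # object with .lookup(mention) -> [Candidate, ...]
        self.embedder = embedder  # object with .encode([text, ...]) -> [vector, ...]

    def link(self, text: str) -> list[Candidate]:
        """Recognise mentions, retrieve candidates, rank them by context similarity."""
        linked = []
        for mention, context in self.ner(text):
            candidates = self.kb.lookup(mention)
            if not candidates:
                continue
            context_vector = self.embedder.encode([context])[0]
            for candidate in candidates:
                candidate_vector = self.embedder.encode([candidate.description])[0]
                candidate.score = cosine(context_vector, candidate_vector)
            linked.append(max(candidates, key=lambda c: c.score))
        return linked
```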

Secondly, we plan to use this case study to reflect upon the interpretation of metrics provided by algorithmic models and their relevance to historical research methodology. We will focus on three key areas: Contextual Sensitivity, Ambiguity Resolution, and Computational Efficiency. By concentrating on these aspects, we will provide comprehensive insight into the models' operational capabilities, particularly in large-scale historical text analysis. Given the challenges of retro-digitized historical data (OCR quality, heterogeneous contents in large collections, etc.), it is necessary not only to select models and methods appropriate to the specific needs of such material but also to create representative ground truth data for OCR, NER, and NEL. Furthermore, scale considerations drive our case study, as some of our use cases comprise millions of pages.
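
To illustrate the kind of ground-truth-based evaluation this implies, the following sketch scores predicted links against gold links as precision, recall, and F1; the (document, span, entity) triple format and the identifiers are invented for this example and do not reflect our annotation schema.

```python
# Rough sketch: precision, recall, and F1 for NEL predictions against
# manually created ground truth. The (document, span, entity) triple
# format and the identifiers are invented for illustration.
def nel_scores(predicted: set, gold: set) -> dict[str, float]:
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {("doc1", (10, 25), "gnd:0000001"), ("doc1", (40, 52), "hds:0000002")}
predicted = {("doc1", (10, 25), "gnd:0000001"), ("doc1", (60, 70), "gnd:0000003")}
print(nel_scores(predicted, gold))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```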

Finally, we will discuss the role of GLAM (galleries, libraries, archives, and museums) institutions as drivers of change and facilitators, especially when it comes to the use of their collections as data [@padilla_2023_8342171].

![Pipeline of end-to-end Named Entity Linking.](images/graph.png)

## Conclusion
Current solutions for NEL still lack accuracy and scalability. At the same time, such enrichment will become a standard process for GLAM institutions, allowing them to offer enriched data layers to their users as a service. This raises several challenges: the technical challenge of improving the linking workflow itself, the challenge of documenting the workflow in a transparent and reproducible form, and finally, the methodological challenge of negotiating and interpreting the results at the intersection of GLAM institutions, data science, and historical research.
121 changes: 121 additions & 0 deletions submissions/486/references.bib
@@ -0,0 +1,121 @@
@book{bunout2023,
url = {https://doi.org/10.1515/9783110729214},
title = {Digitised Newspapers – A New Eldorado for Historians? Reflections on Tools, Methods and Epistemology},
editor = {Estelle Bunout and Maud Ehrmann and Frédéric Clavert},
publisher = {De Gruyter Oldenbourg},
address = {Berlin, Boston},
doi = {10.1515/9783110729214},
isbn = {9783110729214},
year = {2023},
lastchecked = {2024-09-02}
}
@misc{chen2020contextualizedendtoendneuralentity,
title={Contextualized End-to-End Neural Entity Linking},
author={Haotian Chen and Andrej Zukov-Gregoric and Xi David Li and Sahil Wadhwa},
year={2020},
eprint={1911.03834},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/1911.03834},
}
@article{10.1145/3604931,
author = {Ehrmann, Maud and Hamdi, Ahmed and Pontes, Elvys Linhares and Romanello, Matteo and Doucet, Antoine},
title = {Named Entity Recognition and Classification in Historical Documents: A Survey},
year = {2023},
issue_date = {February 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {56},
number = {2},
issn = {0360-0300},
url = {https://doi.org/10.1145/3604931},
doi = {10.1145/3604931},
abstract = {After decades of massive digitisation, an unprecedented number of historical documents are available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve, and explore information from this ‘big data of the past’. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical, and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.},
journal = {ACM Comput. Surv.},
month = {sep},
articleno = {27},
numpages = {47},
keywords = {Named entity recognition and classification, historical documents, natural language processing, digital humanities}
}
@inproceedings{ganea-hofmann-2017-deep,
title = "Deep Joint Entity Disambiguation with Local Neural Attention",
author = "Ganea, Octavian-Eugen and
Hofmann, Thomas",
editor = "Palmer, Martha and
Hwa, Rebecca and
Riedel, Sebastian",
booktitle = "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing",
month = sep,
year = "2017",
address = "Copenhagen, Denmark",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D17-1277",
doi = "10.18653/v1/D17-1277",
pages = "2619--2629",
abstract = "We propose a novel deep learning model for joint document-level entity disambiguation, which leverages learned neural representations. Key components are entity embeddings, a neural attention mechanism over local context windows, and a differentiable joint inference stage for disambiguation. Our approach thereby combines benefits of deep learning with more traditional approaches such as graphical models and probabilistic mention-entity maps. Extensive experiments show that we are able to obtain competitive or state-of-the-art accuracy at moderate computational costs.",
}
@misc{luthra2022unsilencingcolonialarchivesautomated,
title={Unsilencing Colonial Archives via Automated Entity Recognition},
author={Mrinalini Luthra and Konstantin Todorov and Charles Jeurgens and Giovanni Colavizza},
year={2022},
eprint={2210.02194},
archivePrefix={arXiv},
primaryClass={cs.DL},
url={https://arxiv.org/abs/2210.02194},
}
@InProceedings{10.1007/978-3-030-29563-9_13,
author="Nozza, Debora
and Sas, Cezar
and Fersini, Elisabetta
and Messina, Enza",
editor="Douligeris, Christos
and Karagiannis, Dimitris
and Apostolou, Dimitris",
title="Word Embeddings for Unsupervised Named Entity Linking",
booktitle="Knowledge Science, Engineering and Management",
year="2019",
publisher="Springer International Publishing",
address="Cham",
pages="115--132",
abstract="The huge amount of textual user-generated content on the Web has incredibly grown in the last decade, creating new relevant opportunities for different real-world applications and domains. In particular, microblogging platforms enables the collection of continuously and instantly updated information. The organization and extraction of valuable knowledge from these contents are fundamental for ensuring profitability and efficiency to companies and institutions. This paper presents an unsupervised model for the task of Named Entity Linking in microblogging environments. The aim is to link the named entity mentions in a text with their corresponding knowledge-base entries exploiting a novel heterogeneous representation space characterized by more meaningful similarity measures between words and named entities, obtained by Word Embeddings. The proposed model has been evaluated on different benchmark datasets proposed for Named Entity Linking challenges for English and Italian language. It obtains very promising performance given the highly challenging environment of user-generated content over microblogging platforms.",
isbn="978-3-030-29563-9"
}
@misc{vasilyev2022namedentitylinkingentity,
title={Named Entity Linking with Entity Representation by Multiple Embeddings},
author={Oleg Vasilyev and Alex Dauenhauer and Vedant Dharnidharka and John Bohannon},
year={2022},
eprint={2205.10498},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2205.10498},
}
@inproceedings{yamada-etal-2016-joint,
title = "Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation",
author = "Yamada, Ikuya and
Shindo, Hiroyuki and
Takeda, Hideaki and
Takefuji, Yoshiyasu",
editor = "Riezler, Stefan and
Goldberg, Yoav",
booktitle = "Proceedings of the 20th {SIGNLL} Conference on Computational Natural Language Learning",
month = aug,
year = "2016",
address = "Berlin, Germany",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/K16-1025",
doi = "10.18653/v1/K16-1025",
pages = "250--259",
}
@misc{padilla_2023_8342171,
author = {Padilla, Thomas and
Scates Kettler, Hannah and
Varner, Stewart and
Shorish, Yasmeen},
title = {Vancouver Statement on Collections as Data},
month = sep,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.8342171},
url = {https://doi.org/10.5281/zenodo.8342171}
}
