Submission 486, Chadha/Rashiti/Sibille (#60)
* feat: add submission 486

Co-authored-by: rashitig <[email protected]>
Co-authored-by: Tarun Chadha <[email protected]>
Co-authored-by: Christiane Sibille <[email protected]>

* fix: format citation syntax

* fix: add png file to repo

---------

Co-authored-by: rashitig <[email protected]>
Co-authored-by: Tarun Chadha <[email protected]>
Co-authored-by: Christiane Sibille <[email protected]>
4 people authored Sep 4, 2024
1 parent 031c530 commit 3c01a07
Showing 3 changed files with 173 additions and 0 deletions.
Binary file added submissions/486/images/graph.png
52 changes: 52 additions & 0 deletions submissions/486/index.qmd
@@ -0,0 +1,52 @@
---
submission_id: 486
categories: 'Session 7A'
title: From words to numbers. Methodological perspectives on large scale Named Entity Linking
author:
- name: Tarun Chadha
email: [email protected]
affiliations:
- ETH Zürich IT Services
- name: Gentiana Rashiti
email: [email protected]
orcid: 0009-0005-6799-4358
affiliations:
- ETH Zürich Library
- name: Christiane Sibille
email: [email protected]
orcid: 0000-0003-3689-2154
affiliations:
- ETH Zürich Library

keywords:
- Machine Learning
- Named Entity Linking
- Named Entity Recognition
- Historical Data
- Natural Language Processing
abstract: Named Entity Linking (NEL) describes the recognition, disambiguation, and linking of so-called «Named Entities» (such as people, places, and organizations) in text. Machine-assisted linking of entities helps to identify historical actors in large source corpora and thus contributes significantly to digital approaches in historical research. However, applying NEL to historical data presents unique challenges, ranging from poor OCR and alternate spellings to people in historical texts being under-represented in contemporary databases. Because we often have only sparse specific information about an entity in its direct context, we are developing a robust, modular, and scalable workflow in which we «embed» people by the context in which they appear. This provides additional information, enabling disambiguation even when only limited data is available and allowing NEL to be applied to large text corpora. Such techniques have been used and described in works such as [@10.1007/978-3-030-29563-9_13] and [@vasilyev2022namedentitylinkingentity]. By developing this pipeline and the corresponding embedding knowledge base(s) of historical entities, we aim to enable the use of such methods in the Swiss GLAM landscape.
date: 09-02-2024
bibliography: references.bib
---

## Introduction

Named entity recognition, disambiguation, and linking are pivotal methods in Natural Language Processing (NLP) applied to historical research. These methods present unique and complex challenges in the context of historical texts [@bunout2023; @luthra2022unsilencingcolonialarchivesautomated; @10.1145/3604931]. They grapple with the complexities arising from context-dependent meanings of named entities, as well as the issues of polysemy, homonymy, and naming variations.

Historically, solutions ranged from basic string matching to intricate rule-based heuristics. While these methods are still widely used, they often fall short in terms of scalability, generalization, and accuracy, particularly when compared to current machine-learning techniques. Recent advances have seen a shift towards leveraging contextual embeddings to achieve groundbreaking accuracy in these tasks, as evidenced by seminal works such as @yamada-etal-2016-joint; @ganea-hofmann-2017-deep; and @chen2020contextualizedendtoendneuralentity.

Vector embeddings are an essential tool in NLP for representing words as numerical vectors. When applied appropriately, they capture the semantic content of words depending on the context in which they appear. For instance, in sentences such as «I opened an account at the bank» and «Beavers build dams in river banks», the word «bank» would be embedded differently. On the other hand, the vector embeddings for «I sat down on the chair» and «I lowered myself onto the seat» would be «close» in the vector space, as they express similar content.
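
To make this concrete, the following minimal sketch (not part of our pipeline) computes contextual sentence embeddings and compares them with cosine similarity; it assumes the `sentence-transformers` package and the publicly available `all-MiniLM-L6-v2` model, but any sentence-embedding model would work.

```python
# Minimal sketch: contextual embeddings and cosine similarity.
# Assumes the `sentence-transformers` package and the public
# `all-MiniLM-L6-v2` model; any sentence-embedding model would do.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "I sat down on the chair",          # similar in meaning ...
    "I lowered myself onto the seat",   # ... to this sentence
    "I opened an account at the bank",  # different content
]
embeddings = model.encode(sentences, convert_to_tensor=True)

# Pairwise cosine similarities: the first two sentences should score
# noticeably higher with each other than either does with the third.
print(util.cos_sim(embeddings, embeddings))
```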

For linking named entities in a text, e.g. persons, this means that we embed them based on the context in which they appear. If there are two viable candidates for a match between a name and a person (for example, the same first name, last name, and time period), but the name we are searching for appears in an article about architecture and one candidate is an architect while the other is a medical doctor, we can take this semantic context into account as an additional parameter when calculating a possible match.
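
A hypothetical example of this kind of context-based disambiguation is sketched below; the candidate descriptions, identifiers, and context sentence are invented for illustration, and the embedding model is the same assumed model as above.

```python
# Hypothetical disambiguation between two candidates sharing the same name.
# Candidate descriptions and identifiers are invented for illustration.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

context = "The article discusses the design of a new concert hall facade."
candidates = {
    "candidate-architect": "Hans Muster (1890-1960), Swiss architect and urban planner",
    "candidate-physician": "Hans Muster (1888-1955), Swiss physician and surgeon",
}

context_embedding = model.encode(context, convert_to_tensor=True)
scores = {
    entity_id: float(util.cos_sim(context_embedding,
                                  model.encode(description, convert_to_tensor=True)))
    for entity_id, description in candidates.items()
}

# The architect should receive the higher score for an architectural context.
best_match = max(scores, key=scores.get)
print(best_match, scores)
```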

## Methods
In our presentation, we will give a glimpse of the current state of our ambitious project, which aims to create a robust and scalable pipeline for applying embeddings-based NEL to historical texts. In our work, we focus on three key aspects. The first is an embeddings-based linking and disambiguation workflow applied to a historical corpus of Swiss magazines (E-Periodica) that uses Wikipedia, the Gemeinsame Normdatei (GND), and – since our primary use cases deal with historical material from Switzerland – the Historical Dictionary of Switzerland (HDS) as reference knowledge bases. This part aims to develop a performant and modular pipeline that recognizes named entities in retro-digitized texts and links them to so-called authority files (Normdaten) such as the GND. With this workflow, we will help to identify historical actors in source material and contribute to the in-depth FAIRification of large datasets through persistent identifiers at the text level. Our proposed pipeline is modular with respect to the embedding model, enabling performance comparisons across different embedding models and leaving room for future, improved embedding models that capture semantic similarities even better than current popular open-source models such as BERT.
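
A schematic sketch of how such a modular pipeline could be organised is shown below; the class and method names (`NELPipeline`, `Candidate`, `lookup`, `encode`) are illustrative assumptions, not our actual implementation. The NER component, knowledge-base lookup, and embedding model are treated as interchangeable building blocks, which is what allows different embedding models to be compared within the same workflow.

```python
# Schematic sketch of a modular NEL pipeline; names are illustrative,
# not the project's actual API. The NER component, knowledge base, and
# embedding model are interchangeable building blocks.
import math
from dataclasses import dataclass


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity between two plain Python vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


@dataclass
class Candidate:
    identifier: str    # e.g. a GND, HDS, or Wikipedia identifier
    description: str   # short text used to embed the candidate
    score: float = 0.0


class NELPipeline:
    def __init__(self, ner, knowledge_base, embedder):
        self.ner = ner            # callable: text -> [(mention, context), ...]
        self.kb = knowledge_base  # object with .lookup(mention) -> [Candidate, ...]
        self.embedder = embedder  # object with .encode([text, ...]) -> [vector, ...]

    def link(self, text: str) -> list[Candidate]:
        """Recognise mentions, retrieve candidates, rank them by context similarity."""
        linked = []
        for mention, context in self.ner(text):
            candidates = self.kb.lookup(mention)
            if not candidates:
                continue
            context_vector = self.embedder.encode([context])[0]
            for candidate in candidates:
                candidate_vector = self.embedder.encode([candidate.description])[0]
                candidate.score = cosine(context_vector, candidate_vector)
            linked.append(max(candidates, key=lambda c: c.score))
        return linked
```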

Secondly, we plan to use this case study to reflect upon the interpretation of metrics provided by algorithmic models and their relevance to historical research methodology. We will focus on three key areas: Contextual Sensitivity, Ambiguity Resolution, and Computational Efficiency. By concentrating on these aspects, we will provide comprehensive insight into the models' operational capabilities, particularly in large-scale historical text analysis. Given the challenges of retro-digitized historical data (OCR quality, heterogeneous contents in large collections, etc.), it is necessary not only to select models and methods appropriate to the specific needs of such material but also to create representative ground truth data for OCR, NER, and NEL. Furthermore, scale considerations drive our case study, as some of our use cases comprise millions of pages.
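
To illustrate the kind of ground-truth-based evaluation this implies, the following sketch scores predicted links against gold links as precision, recall, and F1; the (document, span, entity) triple format and the identifiers are invented for this example and do not reflect our annotation schema.

```python
# Rough sketch: precision, recall, and F1 for NEL predictions against
# manually created ground truth. The (document, span, entity) triple
# format and the identifiers are invented for illustration.
def nel_scores(predicted: set, gold: set) -> dict[str, float]:
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


gold = {("doc1", (10, 25), "gnd:0000001"), ("doc1", (40, 52), "hds:0000002")}
predicted = {("doc1", (10, 25), "gnd:0000001"), ("doc1", (60, 70), "gnd:0000003")}
print(nel_scores(predicted, gold))  # {'precision': 0.5, 'recall': 0.5, 'f1': 0.5}
```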

Finally, we will discuss the role of GLAM (galleries, libraries, archives, and museums) institutions as drivers of change and facilitators, especially when it comes to the use of their collections as data [@padilla_2023_8342171].

![Pipeline of end-to-end Named Entity Linking.](images/graph.png)

## Conclusion
Current solutions for NEL still lack accuracy and scalability. At the same time, such enrichment will become a standard process for GLAM institutions, allowing them to offer enriched data layers to their users as a service. This raises several challenges: the technical challenge of improving the linking workflow itself, the challenge of documenting the workflow in a transparent and reproducible form, and finally, the methodological challenge of negotiating and interpreting the results at the intersection of GLAM institutions, data science, and historical research.
121 changes: 121 additions & 0 deletions submissions/486/references.bib
@@ -0,0 +1,121 @@
@book{bunout2023,
url = {https://doi.org/10.1515/9783110729214},
title = {Digitised Newspapers – A New Eldorado for Historians? Reflections on Tools, Methods and Epistemology},
editor = {Estelle Bunout and Maud Ehrmann and Frédéric Clavert},
publisher = {De Gruyter Oldenbourg},
address = {Berlin, Boston},
doi = {10.1515/9783110729214},
isbn = {9783110729214},
year = {2023},
lastchecked = {2024-09-02}
}
@misc{chen2020contextualizedendtoendneuralentity,
title={Contextualized End-to-End Neural Entity Linking},
author={Haotian Chen and Andrej Zukov-Gregoric and Xi David Li and Sahil Wadhwa},
year={2020},
eprint={1911.03834},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/1911.03834},
}
@article{10.1145/3604931,
author = {Ehrmann, Maud and Hamdi, Ahmed and Pontes, Elvys Linhares and Romanello, Matteo and Doucet, Antoine},
title = {Named Entity Recognition and Classification in Historical Documents: A Survey},
year = {2023},
issue_date = {February 2024},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
volume = {56},
number = {2},
issn = {0360-0300},
url = {https://doi.org/10.1145/3604931},
doi = {10.1145/3604931},
abstract = {After decades of massive digitisation, an unprecedented number of historical documents are available in digital format, along with their machine-readable texts. While this represents a major step forward with respect to preservation and accessibility, it also opens up new opportunities in terms of content mining and the next fundamental challenge is to develop appropriate technologies to efficiently search, retrieve, and explore information from this ‘big data of the past’. Among semantic indexing opportunities, the recognition and classification of named entities are in great demand among humanities scholars. Yet, named entity recognition (NER) systems are heavily challenged with diverse, historical, and noisy inputs. In this survey, we present the array of challenges posed by historical documents to NER, inventory existing resources, describe the main approaches deployed so far, and identify key priorities for future developments.},
journal = {ACM Comput. Surv.},
month = {sep},
articleno = {27},
numpages = {47},
keywords = {Named entity recognition and classification, historical documents, natural language processing, digital humanities}
}
@inproceedings{ganea-hofmann-2017-deep,
title = "Deep Joint Entity Disambiguation with Local Neural Attention",
author = "Ganea, Octavian-Eugen and
Hofmann, Thomas",
editor = "Palmer, Martha and
Hwa, Rebecca and
Riedel, Sebastian",
booktitle = "Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing",
month = sep,
year = "2017",
address = "Copenhagen, Denmark",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/D17-1277",
doi = "10.18653/v1/D17-1277",
pages = "2619--2629",
abstract = "We propose a novel deep learning model for joint document-level entity disambiguation, which leverages learned neural representations. Key components are entity embeddings, a neural attention mechanism over local context windows, and a differentiable joint inference stage for disambiguation. Our approach thereby combines benefits of deep learning with more traditional approaches such as graphical models and probabilistic mention-entity maps. Extensive experiments show that we are able to obtain competitive or state-of-the-art accuracy at moderate computational costs.",
}
@misc{luthra2022unsilencingcolonialarchivesautomated,
title={Unsilencing Colonial Archives via Automated Entity Recognition},
author={Mrinalini Luthra and Konstantin Todorov and Charles Jeurgens and Giovanni Colavizza},
year={2022},
eprint={2210.02194},
archivePrefix={arXiv},
primaryClass={cs.DL},
url={https://arxiv.org/abs/2210.02194},
}
@InProceedings{10.1007/978-3-030-29563-9_13,
author="Nozza, Debora
and Sas, Cezar
and Fersini, Elisabetta
and Messina, Enza",
editor="Douligeris, Christos
and Karagiannis, Dimitris
and Apostolou, Dimitris",
title="Word Embeddings for Unsupervised Named Entity Linking",
booktitle="Knowledge Science, Engineering and Management",
year="2019",
publisher="Springer International Publishing",
address="Cham",
pages="115--132",
abstract="The huge amount of textual user-generated content on the Web has incredibly grown in the last decade, creating new relevant opportunities for different real-world applications and domains. In particular, microblogging platforms enables the collection of continuously and instantly updated information. The organization and extraction of valuable knowledge from these contents are fundamental for ensuring profitability and efficiency to companies and institutions. This paper presents an unsupervised model for the task of Named Entity Linking in microblogging environments. The aim is to link the named entity mentions in a text with their corresponding knowledge-base entries exploiting a novel heterogeneous representation space characterized by more meaningful similarity measures between words and named entities, obtained by Word Embeddings. The proposed model has been evaluated on different benchmark datasets proposed for Named Entity Linking challenges for English and Italian language. It obtains very promising performance given the highly challenging environment of user-generated content over microblogging platforms.",
isbn="978-3-030-29563-9"
}
@misc{vasilyev2022namedentitylinkingentity,
title={Named Entity Linking with Entity Representation by Multiple Embeddings},
author={Oleg Vasilyev and Alex Dauenhauer and Vedant Dharnidharka and John Bohannon},
year={2022},
eprint={2205.10498},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2205.10498},
}
@inproceedings{yamada-etal-2016-joint,
title = "Joint Learning of the Embedding of Words and Entities for Named Entity Disambiguation",
author = "Yamada, Ikuya and
Shindo, Hiroyuki and
Takeda, Hideaki and
Takefuji, Yoshiyasu",
editor = "Riezler, Stefan and
Goldberg, Yoav",
booktitle = "Proceedings of the 20th {SIGNLL} Conference on Computational Natural Language Learning",
month = aug,
year = "2016",
address = "Berlin, Germany",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/K16-1025",
doi = "10.18653/v1/K16-1025",
pages = "250--259",
}
@misc{padilla_2023_8342171,
author = {Padilla, Thomas and
Scates Kettler, Hannah and
Varner, Stewart and
Shorish, Yasmeen},
title = {Vancouver Statement on Collections as Data},
month = sep,
year = 2023,
publisher = {Zenodo},
doi = {10.5281/zenodo.8342171},
url = {https://doi.org/10.5281/zenodo.8342171}
}
