diff --git a/README.md b/README.md index a14575f..6e6a3b0 100644 --- a/README.md +++ b/README.md @@ -52,10 +52,10 @@ This project is maintained by [@digihistch24](https://github.com/digihistch24). ## Roadmap -- [ ] Add all submissions to the Book of Abstracts +- [x] Add all submissions to the Book of Abstracts - [ ] Add a Table of Contents to the Book of Abstracts - [ ] Add an introduction to the Book of Abstracts -- [ ] Add styling to the Book of Abstracts +- [x] Add styling to the Book of Abstracts ## Contributing diff --git a/index.qmd b/index.qmd index 19bbb4f..2e50e34 100644 --- a/index.qmd +++ b/index.qmd @@ -33,7 +33,7 @@ This book of abstracts contains all the papers and posters presented at the conf We are pleased to present this collection, which reflects the state of the art in digital history, and anticipate that the discussions it stimulates will make a significant contribution to the field. -::: {.callout-note} + ## Paper diff --git a/styles.css b/styles.css index 0be3edd..c44c7f8 100644 --- a/styles.css +++ b/styles.css @@ -95,3 +95,56 @@ scrollbar-color: #a5d7d2 white; scrollbar-width: thin; } + +/* General print styles */ +@media print { + body { + padding: 0; + } + /* Hide navigation, search, and footer elements */ + #quarto-header nav.navbar, + #quarto-code-tools-source, + #quarto-back-to-top { + display: none; + } + + /* Set margins for A4 paper */ + @page { + size: A4; + margin: 2cm; + } + + /* Page breaks to ensure proper structure */ + #title-block-header { + page-break-after: always; + } + + figure { + page-break-inside: avoid; + } + + h1, + h2, + h3, + h4, + h5, + h6 { + page-break-after: avoid; + } + + /* Prevent orphans and widows */ + p, + h1, + h2, + h3 { + widows: 2; + orphans: 2; + } + + /* Ensure that background colors and images are printed */ + body { + background-color: white; + /* Ensures clean background */ + -webkit-print-color-adjust: exact; + } +} diff --git a/submissions/405/index.qmd b/submissions/405/index.qmd index 986cf58..29dc6b9 100644 --- a/submissions/405/index.qmd +++ b/submissions/405/index.qmd @@ -24,10 +24,10 @@ bibliography: references.bib --- ## Introduction -Data-driven approaches bring extensive opportunities for research to analyze large volumes of data, and gain new knowledge and insights. This is considered especially beneficial for implementation in the humanities and social sciences [@weichselbraun2021]. Application of data-driven research methodologies in the field of history requires a sufficient source base, which should be accurate, transparently shaped and large enough for robust analysis [@braake2016]. -Web archives preserve valuable resources that can be drawn upon to analyze the development of the websites and even the whole domains through the decades, and provide access to them [@brugger2018]. At first glance, the volumes of data captured are impressive and suggest the opportunity for big data research practices. For example, the Web Crawls collection of the Internet Archive alone includes 80.2 PB of data [@webcrawls2024]. At the same time, the web-archived collections expose a set of other characteristics relevant to big data and this can be challenging for their efficient use. Such features include, for instance, a high level of velocity, exhaustive in scope and diverse in variety [@kitchin2014], which require addressing and resolving specific issues. 
-This research focuses on museums’ presence on the web, describes opportunities for implementation of data-driven research, and identifies challenges faced by researchers. In particular, in the paper the opportunity to extract data, to investigate the complexity of structure of the archived websites, and to analyze the content are addressed. At the same time, the findings are relevant to other studies devoted to the use of the archived web in computational research in the humanities. +Data-driven approaches bring extensive opportunities for research to analyze large volumes of data, and gain new knowledge and insights. This is considered especially beneficial for implementation in the humanities and social sciences [@weichselbraun2021]. Application of data-driven research methodologies in the field of history requires a sufficient source base, which should be accurate, transparently shaped and large enough for robust analysis [@braake2016]. +Web archives preserve valuable resources that can be drawn upon to analyze the development of the websites and even the whole domains through the decades, and provide access to them [@brugger2018]. At first glance, the volumes of data captured are impressive and suggest the opportunity for big data research practices. For example, the Web Crawls collection of the Internet Archive alone includes 80.2 PB of data [@webcrawls2024]. At the same time, the web-archived collections expose a set of other characteristics relevant to big data and this can be challenging for their efficient use. Such features include, for instance, a high level of velocity, exhaustive in scope and diverse in variety [@kitchin2014], which require addressing and resolving specific issues. +This research focuses on museums’ presence on the web, describes opportunities for implementation of data-driven research, and identifies challenges faced by researchers. In particular, in the paper the opportunity to extract data, to investigate the complexity of structure of the archived websites, and to analyze the content are addressed. At the same time, the findings are relevant to other studies devoted to the use of the archived web in computational research in the humanities. ## Data-Driven Research of the Museums’ Web Presence @@ -36,14 +36,18 @@ When working with web archives, the main strategies to apply data-driven methods Both web archives offer open APIs to obtain large datasets suitable for data-driven research [@api2024; @troveapi2024]. However, even obtaining data is challenging. The statistics of the Internet Archive show that the preserved version of the MET museum website [@met2024] was saved 20,519 times between November 11, 1996, and July 28, 2023, including 10,867,395 captures of text/HTML files, corresponding to 6,559,761 URLs [@summary_met2023]. The statistics for the National Museum of Australia on the Internet Archive show 353,134 captures of text/HTML files, which relates to 175,999 URLs [@summary_nma2023]. At the same time, the attempt to download all the timestamps using the Wayback Machine Downloader returns only 1,342,067 files for the MET website. The same is relevant for the NMA website obtained from the Internet Archive. The stage of obtaining the dataset requires more attention to the APIs and gaining reliable data. Difficulties related to obtaining data and building datasets have been addressed partially by creating research infrastructures to work with web archived materials. 
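Before committing to a bulk download, the scale of such a collection can be probed directly against the Internet Archive's public CDX server. The following sketch (Python with the `requests` library; the parameter values are illustrative choices, not those used in the study) counts captures, distinct URLs and distinct content digests for a small sample of HTML captures from the metmuseum.org domain:

```python
import requests

# Public Wayback CDX server; fields and filters follow its documented API.
CDX = "https://web.archive.org/cdx/search/cdx"
params = {
    "url": "metmuseum.org",
    "matchType": "domain",   # include captures from subdomains as well
    "output": "json",
    "fl": "timestamp,original,mimetype,statuscode,digest",
    "filter": "mimetype:text/html",
    "from": "1996",
    "to": "2023",
    "limit": "2000",         # keep the sample small for illustration
}
rows = requests.get(CDX, params=params, timeout=60).json()
header, captures = rows[0], rows[1:]

unique_urls = {row[1] for row in captures}
unique_versions = {row[4] for row in captures}   # digest = content hash
print(f"{len(captures)} captures, {len(unique_urls)} distinct URLs, "
      f"{len(unique_versions)} distinct content versions")
```

Such ad-hoc queries only gauge the collection; the research infrastructures discussed below address obtaining and processing the data at scale.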
The Internet Archive introduced several initiatives to collect, store and provide access to preserved materials and process data, such as Archive-It [@archiveit2024] and ARCH (Archives Research Compute Hub) [@arch2024]. The GLAM Workbench [@glam2024] has been created to analyze materials from the Australian Web Archive, the Internet Archive and several other web archives. Initially focused on Australian and New Zealand digital platforms, the GLAM Workbench offers a range of solutions based on the use of Jupyter notebooks for exploring and using data from GLAM institutions, including web archival data. These infrastructures support researchers in finding solutions to various problems in obtaining and processing data, opening up wide opportunities to explore the archived web. Regarding the topic of museums on the web, the GLAM Workbench is particularly valuable because some examples in the notebooks already focus on the Australian web domain, and the code from the notebooks can easily be adapted to address the NMA’s web presence. Using these research infrastructures is also beneficial for solving some technical issues related to the limited capacity of personal computers to handle large amounts of data [@robertson2022]. Data-driven approaches require not only obtaining information in as complete a form as possible but also assessment of the available data, which can be considered a step in source criticism. Analysis of the distribution of data can be based on a URL analysis of the data preserved in the web archives. URL analysis serves as a necessary step in source assessment because it can reveal the temporal distribution of the data, identify gaps, trace the regularity of crawling and updating of the website, specify the distribution of file formats, and identify other characteristics related to the complexity of the websites and their changes over time. After crawling, the URL string becomes part of the identification of information in the web archive, enriched with a timestamp. URLs collected from the web archive provide a series of characteristics such as protocol, domain and subdomain names, path, timestamps, and parameters, all of which are valuable sources of information. Some of these parameters are more important for tracing the technical side of the website’s history. For example, tracing the http and https protocols gives us an idea about the adoption of security measures and technological upgrades. Some other characteristics provide valuable information for studying the content of the website. -Identification of the subdomains contributes to our understanding complexity and segmentation of the museum’s website and the presentation of information to the website’s visitors. Often the subdomains have been used for specific projects within the museum’s activities and can be studied separately from the ‘main’ website due to their own structure and content. Both of the considered case studies through their history included the subdomain structures and experimented over the years to find more suitable approaches to the complexity of the website. Subdomains could have a different design and non-overlapping content so that they can be located as a large web structure within the museums’ activities. The analysis of the subdomains facilitates reconstruction of their life cycle, sustainability and use in comparison with the main domain.
Building the network of the domain and the subdomains serves as a way to identify the important webpages through the most connected nodes (web pages) and the edges (hyperlinks). In this regard, the main website performs as a metastructure that encompasses various substructures (such as subdomains). Therefore, data-driven approaches are helpful in the analysis of functional segmentation, autonomy and integration processes within such a museum’s web universe. +Identification of the subdomains contributes to our understanding of the complexity and segmentation of the museum’s website and the presentation of information to the website’s visitors. Often the subdomains have been used for specific projects within the museum’s activities and can be studied separately from the ‘main’ website due to their own structure and content. Both of the case studies considered here included subdomain structures throughout their history and experimented over the years to find more suitable approaches to the complexity of the website. Subdomains could have a different design and non-overlapping content, so that they can be regarded as large web structures within the museums’ activities. The analysis of the subdomains facilitates reconstruction of their life cycle, sustainability and use in comparison with the main domain. Building the network of the domain and the subdomains serves as a way to identify the important webpages through the most connected nodes (web pages) and the edges (hyperlinks). In this regard, the main website performs as a metastructure that encompasses various substructures (such as subdomains). Therefore, data-driven approaches are helpful in the analysis of functional segmentation, autonomy and integration processes within such a museum’s web universe. The challenge in the identification of the subdomains refers not only to the inherent incompleteness of preserved data but also to the deficient methods of obtaining data from the web archives. Our experience shows that, when deriving millions of URLs from the same domain, the API of the Internet Archive returns errors and the processing collapses. Also, we are able to get some subdomains of the metmuseum.org website from the Archive-It platform (The Metropolitan Museum of Art Web Archive, 2024), but the MET has preserved the data systematically only since 2019, and for this reason identification of the subdomains from a historical perspective can be challenging; researchers need to seek better ways to access this data. The GLAM Workbench has a notebook devoted to obtaining subdomains. The code can also be adjusted to any web domain for searching the Internet Archive (and some other web archives). Regarding the research of the NMA and subdomains, the GLAM Workbench offers a highly powerful tool to consider the NMA’s website as a part of the gov.au domain [@exploring_govau2024]. The sub-subdomains of the nma.gov.au website can be analyzed around the main museum’s website, and at the same time the museum’s website can be considered in connection with other websites from the gov.au domain. Such an approach has a strong potential for discoveries related to the positioning of the museum’s website along with the other 1825 third-level domains of the governmental segment, identifying the unique webpages and other characteristics.
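To make the network perspective described above concrete, here is a toy sketch (Python with the `networkx` library; the pages and hyperlinks are invented and do not reproduce the archived sites) that builds a small graph of a main domain, its subdomains and their links, and ranks the most connected nodes:

```python
import networkx as nx

# Invented nodes and hyperlinks, standing in for archived (sub)domain pages.
G = nx.DiGraph()
G.add_edges_from([
    ("www.example-museum.org", "store.example-museum.org"),
    ("www.example-museum.org", "projects.example-museum.org"),
    ("projects.example-museum.org", "www.example-museum.org"),
    ("www.example-museum.org", "www.example-museum.org/collection"),
    ("store.example-museum.org", "www.example-museum.org/collection"),
])

# The most connected pages hint at the structural hubs of this web universe.
centrality = nx.degree_centrality(G)
for node, score in sorted(centrality.items(), key=lambda x: -x[1]):
    print(f"{score:.2f}  {node}")
```

On real data, the same centrality measures would be computed over hyperlinks extracted from the archived pages themselves.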
The archived web is a complex resource that encompasses a large amount of heterogeneous data [@brugger2013]. The single webpage may include various formats of information, and analysing the whole website requires finding the appropriate methods. Reducing complexity and investigating data separated according to format is a widely accepted method of analysing a website’s content. Textual analysis is one variant of such research, in which only the texts are taken for study. Building a corpus of texts from the museum websites preserved in the Danish Web Archive gave insights into the development of the Danish museums on the web and the identification of the attributes specific to the museum clusters [@skov2024]. Separating the content according to format and selecting a particular type of data for analysis may appear to be a simple task. However, building a corpus is a very complex task which requires defining appropriate approaches to how to obtain data and what type of data to include in the corpus. There are many pitfalls to consider: the transfer of dynamic web content to the static version in the web archive inescapably changes the nature of the data and requires decisions on how to shape the dataset for analysis [@brugger2010]. Moreover, not all the textual data represent the same level of data. Another issue is the multiplication of data when the same page has been crawled and preserved in the web archive several times. The Internet Archive and the GLAM Workbench offer different solutions in this regard. The Internet Archive provides users with unique identifiers (‘digest’) of every captured URL. If the content at the same URL has changed, the hash sum and subsequently the identifier will vary as well. This helps to treat the web pages differently if their content is diverse. At the same time, the significance of the changes cannot be assessed from a distance. The GLAM Workbench provides code, published as a Jupyter notebook, to harvest textual data from the required archived webpages [@harvesting2024]. Data can also be obtained from lists of URLs, which helps to process large numbers of webpages automatically. Textual data analysis is a well-established sphere of computational humanities. However, the complexity of a website goes significantly beyond text alone. Analysis of images can be beneficial for many reasons. One such task is to reveal the selection processes in publishing images on the websites in general and in the digital collections in particular. Art museums had to identify the priorities in publishing pictures and develop specific strategies for that. We do not know much about selection processes, especially in the early years of the web, and how these solutions evolved due to the influences of political and cultural events, movements and actions. Data-driven research is able to identify and highlight these trends. To analyze the currently published collections, some museums offer open APIs for obtaining metadata about museum objects. Both the MET and the NMA provide open access to their collections through open APIs [@collection_api2024; @museum_api2024]. Access to the metadata of the publicly available objects is provided on the digital platforms. At the same time, the metadata is limited to information about the objects and does not include metadata regarding their web presence, including the date of the first publication online. In this regard, the timestamps from the web archives can be considered a valuable resource to analyze the publishing processes from a historical perspective.
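As a hedged illustration of combining the two kinds of evidence, the sketch below queries The Met's public Collection API for one object and then asks the Wayback Machine CDX server for the earliest capture of the corresponding object page as a proxy for its online publication date (the object ID and URL pattern are used here only as examples):

```python
import requests

MET_API = "https://collectionapi.metmuseum.org/public/collection/v1/objects"
CDX = "https://web.archive.org/cdx/search/cdx"

object_id = 436535  # example object ID
meta = requests.get(f"{MET_API}/{object_id}", timeout=30).json()

# The collection API describes the object but not when its page went online;
# the earliest Wayback capture of the object URL can serve as a proxy.
page_url = f"https://www.metmuseum.org/art/collection/search/{object_id}"
rows = requests.get(
    CDX,
    params={"url": page_url, "output": "json", "fl": "timestamp", "limit": "1"},
    timeout=60,
).json()
first_capture = rows[1][0] if len(rows) > 1 else None

print(meta.get("title"), "| earliest capture:", first_capture)
```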
At the same time, in the web archive the preserved pictures are disconnected from the metadata about the image, and this gap requires finding specific solutions to connect image and metadata to make discoveries easier. Apart from texts and images, the websites incorporate other formats of data, and their use in research is more problematic for analysis. The museums presented multimedia content on the web, including videos and animations, and conducted podcasts, among other things. All of this and other content is valuable for our understanding of their evolution on the web. At the same time, these types of content are very challenging for web archiving [@muller2021], so specific methodologies should be developed for their systematic preservation and then for the subsequent analysis, including data-driven practices. - ## Conclusion -Web archives provide wide opportunities for the implementation of data-driven research in the analysis of museums’ web presence. The archived websites require a thorough source criticism and evaluation of the available data for gaining new insights into the evolution of museums’ activities online. Studying the cases of the MET and the NMA is possible via a large amount of data preserved on the Internet Archive and Trove. However, the robust analysis is challenging due to various factors. Researchers need to investigate new ways to obtain the data from the web archives, identify incompleteness and biases, to evaluate data and diversity of the file formats, and to select the best approaches to address them. Analysis of web content remains challenging and requires the development of innovative solutions to exploit data-driven research. At the same time, some of the issues can be gradually resolved based on the developing tools and digital research infrastructures, first of all, ARCH and the GLAM Workbench. Ultimately, data-driven research on the museums’ web presence has a great potential for new discoveries but at the same time, it is a complex endeavor. +Web archives provide wide opportunities for the implementation of data-driven research in the analysis of museums’ web presence. The archived websites require thorough source criticism and an evaluation of the available data for gaining new insights into the evolution of museums’ activities online. Studying the cases of the MET and the NMA is possible via the large amount of data preserved on the Internet Archive and Trove. However, robust analysis is challenging due to various factors. Researchers need to investigate new ways to obtain data from the web archives, to identify incompleteness and biases, to evaluate the data and the diversity of file formats, and to select the best approaches to address them. Analysis of web content remains challenging and requires the development of innovative solutions to exploit data-driven research. At the same time, some of the issues can be gradually resolved with developing tools and digital research infrastructures, above all ARCH and the GLAM Workbench. Ultimately, data-driven research on the museums’ web presence has great potential for new discoveries, but at the same time it is a complex endeavor.
+ +## References + +::: {#refs} +::: diff --git a/submissions/427/index.qmd b/submissions/427/index.qmd index dab9735..7df15f1 100644 --- a/submissions/427/index.qmd +++ b/submissions/427/index.qmd @@ -146,4 +146,4 @@ Although our journey of refactoring has just begun, we are already seeing the be ## References ::: {#refs} -::: \ No newline at end of file +::: diff --git a/submissions/428/index.qmd b/submissions/428/index.qmd index 62b07ec..54f10eb 100644 --- a/submissions/428/index.qmd +++ b/submissions/428/index.qmd @@ -2,59 +2,66 @@ submission_id: 428 categories: 'Session 1B' title: Tables are tricky. Testing Text Encoding Initiative (TEI) Guidelines for FAIR upcycling of digitised historical statistics. - author: name: Gabi Wuethrich orcid: 0000-0002-9055-2743 email: gabi.wuethrich@ub.uzh.ch affiliation: University of Zurich, University Library - keywords: - Text recognition - Table structure - TEI - abstract: | This project on digital data management explores the use of XML structures, specifically the Text Encoding Initiative (TEI), to digitize historical statistical tables from Zurich's 1918 pandemic data. The goal was to make these health statistics tables reusable, interoperable, and machine-readable. Following the retro-digitization of statistical publications by Zurich's Central Library, the content was semi-automatically captured with OCR in Excel and converted to XML using TEI guidelines. However, OCR software struggled to accurately capture table content, requiring manual data entry, which introduced potential errors. Ideally, OCR tools would allow for direct XML export from PDFs. The implementation of TEI for tables remains a challenge, as TEI is primarily focused on running text rather than tabular data, as noted by TEI pioneer Lou Burnard. Despite these challenges, TEI data processing offers opportunities for conceptualizing tabular data structures and ensuring traceability of changes, especially in serial statistics. An example is a project using early-modern Basle account books, which were "upcycled" following TEI principles. Additionally, TEI's structured approach could help improve the accuracy of table text recognition in future projects. - date: 08-15-2024 - bibliography: references.bib --- ## Introduction + In 2121, nothing is as it once was: a nasty virus is keeping the world on tenterhooks – and people trapped in their own four walls. In the depths of the metaverse, contemporaries are searching for data to compare the frightening death toll of the current killer virus with its predecessors during the Covid-19 pandemic and the «Spanish flu». There is an incredible amount of statistical material on the Covid-19 pandemic in particular, but annoyingly, this is only available in obscure data formats such as .xslx in the internet archives. They can still be opened with the usual text editors, but their structure is terribly confusing and unreadable with the latest statistical tools. If only those digital hillbillies in the 2020s had used a structured format that not only long-outdated machines but also people in the year 2121 could read... Admittedly, very few epidemiologists, statisticians and federal officials are likely to have considered such future scenarios during the pandemic years. Quantitative social sciences and the humanities, including medical and economic history, but also memory institutions such as archives and libraries, should consciously consider how they can sustainably preserve the flood of digital data for future generations. 
Thus, the sustainable processing and storage of statistical printed data from the time of the First World War makes it possible to gain new insights into the so-called "Spanish flu", e.g. in the city of Zurich, even today. The publications by the Statistical Office of the City of Zurich, which were previously only available in “analog” paper format, have been digitized by the Zentralbibliothek (Central Library, ZB) Zurich as part of Joël Floris' Willy Bretscher Fellowship 2022/2023 (@Floris2023). This project paper has been written in the context of this digitisation project, as issues regarding digital recording, processing, and storage of historical statistics have always occupied quantitative economic historians “for professional reasons”. The basic idea of this paper is to prepare tables with historical health statistics in a sustainable way so that they can be easily analysed using digital means. The aim was to capture the statistical publications retro-digitized by the ZB semi-automatically with OCR in Excel tables and to prepare them as XML documents according to the guidelines of the Text Encoding Initiative (TEI), a standardized vocabulary for text structures. To do this, it was first necessary to familiarise myself with TEI and its appropriate modules, and to apply them to a sample table in Excel. To be able to validate the Excel table manually transferred to XML, I then developed a schema based on the vocabularies of XML and TEI. This could then serve as the basis for an automated conversion of the Excel tables into TEI-compliant XML documents. Such clearly structured XML documents should ultimately be relatively easy to convert into formats that can be read into a wide variety of visualisation and statistical tools. ## Data description + A table from the monthly reports of the Zurich Statistical Office serves as an example data set. The monthly reports were digitised as high-resolution pdfs with underlying Optical Character Recognition (OCR) based on Tesseract by the Central Library's Digitisation Centre (DigiZ) as part of the Willy Bretscher Fellowship project. They are available on the ZB’s Zurich Open Platform (ZOP, @SASZ1919), including detailed metadata information. They were published by the Statistical Office of the City of Zurich as a journal volume under this title between 1908 and 1919, and then as «Quarterly Reports» until 1923. The monthly reports each consist of a 27-page table section with individual footnotes, and conclude with a two-page explanatory section in continuous text. For this study, the data selection is limited to a table for the year 1914 and the month of January (@SASZ1919). In connection with Joël Floris' project, which aims at obtaining quantitative information on Zurich's demographic development during the «Spanish flu» from the retro-digitisation project, it was obvious to focus on tables with causes of death. The corresponding table number 12, entitled «Die Gestorbenen (in der Wohnbev.) nach Todesursachen und Alter» («The Deceased (in the Resident Pop.) by Cause of Death and Age»), can be found on page seven of the monthly report. It contains monthly data on causes of death, broken down by age group and gender, as well as comparative figures for the same month of the previous year. The content of this table is to be prepared below in the form of a standardized XML document with an associated schema that complies with the TEI guidelines.
## Methods for capturing historical tables in XML + The source of inspiration for this project paper was a pioneering research project originally based at the University of Basle. In the research project, the annual accounts of the city of Basle from 1535 to 1610 were digitally edited (@Burghartz2015). Technical implementation was carried out by the Center for Information Modeling at the University of Graz. Based on a digital text edition prepared in accordance with the TEI standard, the project manages to combine facsimile, web editing in HTML, and table editing via an RDF (Resource Description Framework) and XSLT (eXtensible Stylesheet Language Transformations) in an exemplary manner. The edition thus allows users to compile their own selection of booking data in a "data basket" for subsequent machine-readable analysis. In an accompanying article, project team member Georg Vogeler describes the first-time implementation of a numerical evaluation and how "even extensive holdings can be efficiently edited digitally" [@Vogeler2015]. However, as mentioned, the central basis for this is XML processing of the corresponding tabular information based on the TEI standard. -This project is based on the April 2022 version (4.4.0) of the TEI guidelines (@Burnard2022). They include a short chapter on the preparation of tables, formulas, graphics, and music. And even the introduction to Chapter 14 is being rather cautious with regard to TEI application for table formats, warning that layout and presentation details are more important in table formats than in running text, and that they are already covered more comprehensively by other standards and should be prepared accordingly in these notations. -On asking the TEI-L mailing list whether it made sense to prepare historical tables with the TEI table module, the answers were rather reserved (https://listserv.brown.edu/cgi-bin/wa?A1=ind2206&L=TEI-L#24). Only the Graz team remained optimistic that TEI could be used to process historical tables, albeit in combination with an RDF including a corresponding ontology. Christopher Pollin also provided github links via TEI to the DEPCHA project, in which they are developing an ontology for annotating transactions in historical account books. +This project is based on the April 2022 version (4.4.0) of the TEI guidelines (@Burnard2022). They include a short chapter on the preparation of tables, formulas, graphics, and music. Even the introduction to Chapter 14 is rather cautious with regard to the application of TEI to table formats, warning that layout and presentation details are more important in tables than in running text, and that they are already covered more comprehensively by other standards and should be prepared accordingly in those notations. +On asking the TEI-L mailing list whether it made sense to prepare historical tables with the TEI table module, the answers were rather reserved (<https://listserv.brown.edu/cgi-bin/wa?A1=ind2206&L=TEI-L#24>). Only the Graz team remained optimistic that TEI could be used to process historical tables, albeit in combination with an RDF including a corresponding ontology. Christopher Pollin also provided GitHub links via TEI-L to the DEPCHA project, in which they are developing an ontology for annotating transactions in historical account books. ## Table structure in TEI-XML + Basically, the TEI schema treats a table as a special text element consisting of row elements, which in turn contain cell elements. This basic structure was used to code Table 12 from 1914, which I transcribed manually as an Excel file.
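A minimal sketch of this row/cell structure, generated with Python's standard library as one might do when converting a spreadsheet, could look as follows; the column headings and values are invented placeholders, and the label and #sum attributes anticipate the conventions described in the following paragraphs:

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
table = ET.Element("table", {"xmlns": TEI_NS})

# First row: column headings, marked with role="label".
header = ET.SubElement(table, "row", {"role": "label"})
for heading in ["Todesursache", "0-1 J.", "1-15 J.", "Total"]:
    ET.SubElement(header, "cell", {"role": "label"}).text = heading

# One data row: the first cell names the cause of death (role="label"),
# the last cell carries the row total (ana="#sum"). Values are invented.
row = ET.SubElement(table, "row")
values = ["Grippe", "2", "1", "3"]
for i, value in enumerate(values):
    attrs = {}
    if i == 0:
        attrs["role"] = "label"
    if i == len(values) - 1:
        attrs["ana"] = "#sum"
    ET.SubElement(row, "cell", attrs).text = value

print(ET.tostring(table, encoding="unicode"))
```

Generating the XML programmatically in this way would also be one possible route towards the automated Excel-to-TEI conversion mentioned above.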
Because exact formatting including precise reproduction of the frame lines is very time-consuming, the frame lines in the project work only served as structural information and are not included as topographical line elements as TEI demands. Long dashes, which correspond to zero values in the source, are interpreted as empty values in the TEI-XML. I used the resulting worksheet as the basis for the TEI-XML annotation, in which I also added some metadata. -I then had to create an adapted local schema as well as a TEI header, before structuring the table’s text body. Suitable heading ("head") elements are the title of the table, the table number as a note and the «date» of the table. The first table row contains the column headings and is assigned the role attribute "label" accordingly. The third-last cell of each row contains the row total, which I have given the attribute "ana" for analysis and the value "#sum" for total, following the example of the Basle Edition. -The first cell of each row again names the cause of death and must therefore also be labelled with the role attribute "label". The second last row shows the sum of the current monthly table, which is why it is given the "#sum" attribute for all respective cells. Finally, the last line shows the total for the previous year's month. It is therefore not only marked with the sum attribute, but also with a date in the label cell. A potential confounding factor for later calculations is the row "including diarrhea", which further specifies diseases of the digestive organs but must not be included in the column total. Accordingly, it is provided with another analytical attribute called "#exsum". As each cell in the code represents a separate element, the »digitally upcycled table 12 in XML format ultimately extends over a good 550 lines of code, which I’m happy to share on request. +I then had to create an adapted local schema as well as a TEI header, before structuring the table’s text body. Suitable heading ("head") elements are the title of the table, the table number as a note and the «date» of the table. The first table row contains the column headings and is assigned the role attribute "label" accordingly. The third-last cell of each row contains the row total, which I have given the attribute "ana" for analysis and the value "#sum" for total, following the example of the Basle Edition. +The first cell of each row again names the cause of death and must therefore also be labelled with the role attribute "label". The second-last row shows the sum of the current monthly table, which is why it is given the "#sum" attribute for all respective cells. Finally, the last row shows the total for the previous year's month. It is therefore not only marked with the sum attribute, but also with a date in the label cell. A potential confounding factor for later calculations is the row "including diarrhea", which further specifies diseases of the digestive organs but must not be included in the column total. Accordingly, it is provided with another analytical attribute called "#exsum". As each cell in the code represents a separate element, the «digitally upcycled» table 12 in XML format ultimately extends over a good 550 lines of code, which I’m happy to share on request. ## Challenges and problems + An initial problem already arose during the OCR-based digitisation. The Central Library (ZB)'s Tesseract-based OCR software, which specializes in continuous text, simply failed to capture the text in the tables.
I therefore first had to transcribe the table by hand, which is error-prone. In principle, however, it is irrelevant in TEI in which format the original text was created. The potential for errors when transferring Excel data into the "original" XML is also high, especially if the table is complex and/or detailed. Ideally, i.e. with a clean OCR table, it ought to be possible to export OCR content in pdfs to XML. The ZB’s DigiZ confirmed in conversation that they are no longer happy with the OCR quality and are considering improvements with regard to precision. Due to the extremely short instructions for table preparation in TEI, I underestimated the variety of different text components that TEI offers. The complexity of TEI is not clear from the rough overview of the individual chapters and their introductory descriptions. This only became clear while adjusting table 12 to TEI standards. As I became accustomed to TEI, its limitations regarding table preparation also became more evident: it is fundamentally geared towards structuring continuous text rather than text forms where the structure or layout also indicates meaning, as is the case with tables. The conversion of the sample table into XML and the preparation of an associated TEI schema, which is reduced to the elements present in the sample document yet remains valid against the TEI standard, proved to be time-consuming code work. Thus, both the sample XML and the local schema each comprise over 500 lines of code – and this basically for only a single – though complex – table with a small amount of metadata. In addition, the extremely comprehensive and complex TEI schema on which my XML document is based is not suitable for implementation in Excel. As a result, I had to prepare an XML table schema that was as general as possible, which may be used to convert the Excel tables into XML in the future, thus reducing the error potential of the XML conversion. ## Ideas for Project Expansion -Because, as mentioned, the OCR output of the tables in this case is not usable, it should now be crucial for any digitisation project to achieve high-quality OCR of the retro-digitised tables. Table recognition is definitely an issue in economic history research, and there are several open source development tools around on Git-Repositories, which yet have to set a standard, however. -Ideally, the tables recognized in this way would then provide better text structures in the facsimile. With the module for the transcription of original sources, TEI offers extensive possibilities for linking text passages in the transcription with the corresponding passages in the facsimiles. Such links could ideally be used as training data for text recognition programs to improve their performance in the area of table recognition. Other TEI elements that lend structure to the table, such as the dividing lines and the long dashes for the empty values, could also serve as such structural recognition features. -Additional important TEI elements such as locations and gender would further increase the content of the TEI XML format. Detailed metadata, as e.g. provided by the retro-digitized version of the ZOP, can be easily integrated into the TEI header area "xenodata". Finally, in view of the complex structure of the tables, it is essential to understand and implement XSLT (eXtensible Stylesheet Language Transformation) for automated structuring, and as a basis for RDF used e.g. by the Graz team.
+ +Because, as mentioned, the OCR output of the tables in this case is not usable, it should now be crucial for any digitisation project to achieve high-quality OCR of the retro-digitised tables. Table recognition is definitely an issue in economic history research, and there are several open-source development tools around in Git repositories, but none of them has yet set a standard. +Ideally, the tables recognized in this way would then provide better text structures in the facsimile. With the module for the transcription of original sources, TEI offers extensive possibilities for linking text passages in the transcription with the corresponding passages in the facsimiles. Such links could ideally be used as training data for text recognition programs to improve their performance in the area of table recognition. Other TEI elements that lend structure to the table, such as the dividing lines and the long dashes for the empty values, could also serve as such structural recognition features. +Additional important TEI elements such as locations and gender would further increase the content of the TEI XML format. Detailed metadata, as e.g. provided by the retro-digitized version of the ZOP, can be easily integrated into the TEI header area "xenodata". Finally, in view of the complex structure of the tables, it is essential to understand and implement XSLT (eXtensible Stylesheet Language Transformation) for automated structuring, and as a basis for RDF used e.g. by the Graz team. ## Conclusion + So far, tables seem to have had a shadowy existence within the Text Encoding Initiative (TEI) – or, as TEI pioneer Lou Burnard remarked on the TEI mailing list in response to my question whether TEI processing of tables made sense: "Tables are tricky". The main reason for this probably lies in the continuous-text orientation of existing tools and users, who are also less interested in numerical formats. In principle, however, preparation according to the TEI standard offers the opportunity to think conceptually about the function of tabularly structured data and to make changes, e.g. in serial sources such as statistical tables, comprehensible. The clearly structured text processing of TEI could provide a basis for improving the still rather poor quality of text recognition programs when recording tables. And a platform-independent, non-proprietary data structure such as XML would be almost indispensable for the sustainable long-term archiving of "digitally born" statistics, which have experienced a boom in recent years, and especially during the pandemic. After all, our descendants should also be able to access historical statistics during the next one. + +## References + +::: {#refs} +::: diff --git a/submissions/429/index.qmd index eab89f6..93475cb 100644 --- a/submissions/429/index.qmd +++ b/submissions/429/index.qmd @@ -17,7 +17,6 @@ keywords: - students participation - 3D reconstruction - spinning mill - abstract: | The Techn'hom Time Machine project aims to offer a digital reconstruction of a spinning mill, integrating buildings in their environment, machines, and activities. It involves engineering students from the Belfort-Montbéliard University of Technology in all aspects of the project. They are able to discover and practise on software that they will have to use in their future professional activity (Revit, Catia, Blender, Unity…).
Some students are able to discover entire conceptual fields that are rarely covered in their course, such as the notions of ontology and RDF database. A special relationship between history and digital technology underlies this work: students have a choice about which software to use, their results directly impact the evolution of the project, and they learn the importance of the overall organization. These students, unfamiliar with the humanities and the specific problems associated with them, are at the same time discovering these disciplines and their difficulties, thus opening up their perspectives. key-points: @@ -32,8 +31,6 @@ bibliography: references.bib Part of the national Lab In Virtuo project (2021-2024), the Techn'hom Time Machine project, initiated in 2019 by the Belfort-Montbéliard University of Technology, aims to study and digitally restore the history of an industrial neighborhood, with teacher-researchers but also students as co-constructors [@Gasnier2014; @Gasnier2020, p. 293]. The project is thus located at the interface of pedagogy and research. The Techn'hom district was created after the Franco-Prussian War of 1870 with two companies from Alsace: the Société Alsacienne de Constructions Mécaniques, nowadays Alstom; and the Dollfus-Mieg et Compagnie (DMC) spinning mill, in operation from 1879 to 1959. The project aims to create a “Time Machine” of these industrial areas, beginning with the spinning mill. We seek to restore in four dimensions (including time) buildings and machines with their operation, but also to document and model sociability and know-how, down to the gestures and feelings. The resulting “Sensory Realistic Intelligent Virtual Environment” should allow both researchers and the general public to virtually discover places and “facts” taking place in the industry, but also to interact with them or even make modifications. - - ## Study and training areas The project is carried out within a technology university and, as such, is designed to include the participation of engineering students. They can apply and develop skills previously covered in a more basic way in their curriculum. This constitutes an investment for students in the acquisition of skills that can subsequently be reused in their professional lives as engineers. In the current state, four main axes exist concerning the inclusion of students in the Techn’hom Time Machine project: @@ -44,22 +41,18 @@ The project is carried out within a technology university and, as such, is desig * Integration of those elements in the same virtual environment on Unity. Historical sources are crucial in all axes, since many artifacts no longer exist, have been heavily modified and/or are inaccessible. Modeling is based on handwritten or printed writings, plans, iconography, and surviving heritage. This imposes a disciplinary opening for engineering students, untrained in the manipulation and analysis of such sources, and who may feel distant from issues linked to the human and social sciences. - ## Project progress To date, thirty-two students have been included in the project. Each of the four axes was allocated between four and twelve students depending on opportunities and needs. In addition to the scientific contribution, student reports make it possible to evaluate their point of view on this training, with all critical perspective retained. - ### Modeling: the software question This axis has currently involved twelve students, and has led to the complete or partial modeling of six machines.
It involves reverse-engineering machines from very partial data, using software designed for rendering much more recent mechanisms. Students are assigned to work on small projects whose results are not necessarily directly usable. This offers the advantage of an exploratory and critical approach, by having a student take over the project of a previous one. Students were thus responsible for creating the model, but also for defining the software used. The first machine modeled, a twisting machine, was the subject of two successive works, linked to a change in modeling software. The first student used Blender, directing his work “on the optimization of models rather than on precision” and “took the initiative to abandon coherence”, offering “parts very close to the base material from a visual point of view but absolutely not reliable from a measurement point of view” [@Bogacz2019, pp. 11-12]. The following year, a second group was tasked with restoring consistency in this model, but realized that their colleague's choices prevented such an achievement: pieces were too inaccurate, and conversion to a kinematic CAD model was impossible [@Castagno2020, pp. 11, 13]. They therefore remade the model on Catia, without realistic texture. The team working on another machine proposed another solution: on Catia, they “‘imagined’ missing parts”, paying attention to their mechanical coherence, while using Keyshot to obtain a more visually attractive final result [@Paulin2020, pp. 15-16]. This questioning also occurred with the integration of buildings and machines in Unity: models produced by specialized software are each quite efficient, but too heavy and ill-optimized to all be integrated in the same simulation. Students working on this topic thus have to take the models and reduce them in order to optimize performance, losing part of the precision [@Bozane2022, pp. 4-6, 10]. The freedom left to students in technical solutions thus made it possible, by authorizing research and free experimentation, to identify the configurations most likely to meet the needs of the project as a whole. - ### Which data model? Similarly, the “distribution” of tests among students provided insights into the appropriate type of data model. The Techn’hom Time Machine project was initially supposed to rely on a “classic” relational database. The first student to work on setting up said database quickly realized that a historical database involves “a certain complexity in its design”, necessitating a table for abstract concepts “most difficult to define”, and a table for specifying the types of links between actors, but without specifying in advance all possible types of relationships [@Garcia2020, pp. 20, 23]. In short, the student realized that, for a system as complex as a human society, a relational database quickly shows its limits. In fact, even if this first student still managed to create a relational database, the next two underlined its complexity: “the number of tables in the database makes reading difficult” [@Ruff2022, p. 7], and it was difficult to “precisely complete [it]” [@Marais2020, pp. 9, 16]. A fourth student, tasked with taking up the previous work to refine it and make a functional application, concluded with the support of teacher-researchers that this database simply did not allow a historical reality to be described precisely enough, and pointed to the need to use an RDF graph database [@Echard2023, pp. 15-16, 21].
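To illustrate why a graph model absorbs such heterogeneity more easily than a fixed table layout, here is a small hedged sketch using the `rdflib` library; the namespace, resources and relation names are invented for the example and do not reproduce the project's actual ontology:

```python
from rdflib import Graph, Literal, Namespace, RDF

THM = Namespace("https://example.org/technhom/")   # invented namespace
g = Graph()

worker = THM["person/jeanne_martin"]               # invented person
machine = THM["machine/twisting_machine_01"]       # invented machine

g.add((worker, RDF.type, THM.Worker))
g.add((machine, RDF.type, THM.Machine))
g.add((worker, THM.operates, machine))
g.add((worker, THM.employedBy, THM["org/dmc"]))
# A previously unforeseen kind of relation is simply another triple,
# not a new join table in a fixed relational schema.
g.add((machine, THM.installedIn, THM["building/spinning_mill"]))
g.add((machine, THM.commissionedIn, Literal("1891")))

print(g.serialize(format="turtle"))
```

Adding a new type of relationship is just another triple, whereas the relational design described above required anticipating every link type in advance.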
This solution, actually adopted, therefore comes once again from a series of works allowing a self-critique of the entire project, helping to define effective solutions. - ## Reflective feedback from students Beyond these contributions to the scientific project, this program also aims to offer training to students. The point that emerges most clearly from students' reports, before any technical consideration or skills acquisition, relates to the discovery of the human sciences and their methodologies. @@ -72,14 +65,17 @@ Despite this initial blockage, students developed solutions - starting with awar This need to delve into sources implied for students the discovery, through practice, of the ins and outs of human sciences research. Typically, with data modeling, working from real data brings a certain advantage: working from “concrete cases […] helped us to understand how to articulate [several] ontologies and thus develop a strategy to combine them effectively into a coherent whole” [@Echard2023, pp. 23, 32]. Likewise, for buildings, the comparison of sources led students to perceive inconsistencies, and thus to “note the importance of reading all the archives and not just a few because errors may be present” [@Pic2020, pp. 3-4]. Some also emphasize the “difficulty of exploiting numerous bibliographic resources” in terms of synthesis capacities and working time [@Bogacz2019, p. 6; @Castagno2020, p. 15], but also the pleasure of “learning to read archives” [@Paulin2020, p. 20]. The novelty of the practice compared to a classic engineering curriculum is well summed up by one of the teams: “This type of task requires patience and a methodology completely different from what we have habit of doing. The difficulty or even the impossibility of finding the desired information taught us to put ourselves in the shoes of a historian who must at certain times make hypotheses in order to continue his work.” [@Castagno2020, p. 15]. - ### Project management Whatever the students’ specific project, it generally appeared to be a first in their training, positioning them as researchers over several months. This induced a “complete autonomy” [@Garcia2020, p. 8] underlined by all reports, often even before gains in competence. One, whose project was “the most significant project he had to carry out”, “learned the management” of his organization [@Bogacz2019, p. 16]. Another “learned to manage a project in [his] free time” [@LeGuilly2022, p. 10], and a third “learned to work efficiently and manage projects independently” [@Echard2023, pp. 9, 40]. Faced with complex and non-linear projects, students emphasize the “need to do a lot of research to use the right method to work correctly”, and to propose solutions on their own [@Marchal2021, pp. 9-10, 27]. Finally, the sheer volume of work is underlined, with projects requiring “time to understand the documents, research into software functionalities as well as a considerable investment” [@Pic2020, p. 16]. Participation in the project can appear as “a first professional experience (...) The experience gained during the internship is immense” [@Garcia2020, p. 39]. Beyond each individual piece of work, some students also reflect on the overall project. In particular, they suffered from a lack of communication with their predecessor on the same subject, “making the task more difficult”, leading to the risk of “wasting (...) time understanding what the other had already understood” [@Castagno2020, p. 15]. This experience led to an awareness of the importance of good communication or documentation.
Students therefore suggested organizing “video conferences between old and new groups”, and that “each group [should] bring together important documents in a separate file” during project transitions. They applied the lesson to their own report, by “explaining as best as possible what [they] had understood”, with concrete recommendations [@Castagno2020, pp. 15-16]. - ## Conclusion Students’ involvement in the Techn’hom Time Machine project leads to bidirectional enrichment. The project benefits from the possibility of distributed work and from multiple sources of proposals, making it possible to test several options in parallel on a given subject. Students deepen their knowledge of diverse software, while being introduced to the human sciences and project management. Gains in technical skills are often implied in the reports, obviously being an integral part of the expectations of any engineering school project. The acquisition of more fundamental knowledge can also be identified, with the discovery of some entirely new technologies. An interest in the historical dimension is also mentioned, as well as human contacts with researchers and workers. Finally, the very fact of participating in a digital humanities project, atypical in itself, appears as a source of satisfaction. + +## References + +::: {#refs} +::: diff --git a/submissions/431/index.qmd index 51ff9e4..7df333f 100644 --- a/submissions/431/index.qmd +++ b/submissions/431/index.qmd @@ -39,11 +39,8 @@ Data skills in RAG can be divided into data collection, data entry and data anal The RAG started with a Microsoft Access database as a multi-user installation. In 2007, the switch was made to a client-server architecture, with MS Access continuing to serve as the front end and a Microsoft SQL server being added as the back end. This configuration had to be replaced in 2017, as regular software updates for the client and server had been neglected. As a result, it was no longer possible to update the MS Access client to the new architecture in good time, and the server, which was still running the outdated MS SQL Server 2005, increasingly posed a security risk. In addition, publishing the data on the internet was only possible to a limited extent, as a fragmented export from the MS SQL server to a MySQL database with a PHP front end was required. In 2017, it was therefore decided to switch to a new system [@gubler2020]. - - ![Fig. 1: Former frontend of the RAG project for data collection in MS Access 2003.](images/RAG Eingabemaske MS Access.jpg) - Over one million data records on people, events, observations, locations, institutions, sources and literature were to be integrated in a database migration - a project that had previously been considered for years without success. After an evaluation of possible research environments, nodegoat was chosen [@vanBree_Kessels2013]. Nodegoat was a tip from a colleague who had attended a nodegoat workshop [@gubler2021]. With nodegoat, the RAG was able to implement the desired functions immediately: - Location-independent data collection thanks to a web-based front end. @@ -52,12 +49,10 @@ Over one million data records on people, events, observations, locations, instit - Research data can be published directly from nodegoat without the need to export it to other software. - From then on, the RAG research team worked with nodegoat in a live environment in which the data collected can be made available on the Internet immediately after a brief review.
This facilitated the exchange with the research community and the interested public and significantly increased the visibility of the research project. The database migration to nodegoat meant that the biographical details of around 10,000 people could be published for the first time, which had previously not been possible due to difficulties in exporting data from the MS SQL server. On 1 January 2018, the research teams at the universities in Bern and Giessen then began collecting data in nodegoat, starting with extensive standardisation of the data. Thanks to a multi-change function in nodegoat, these standardisations could now be carried out efficiently by all users. Institutions where biographical events took place (e.g. universities, schools, cities, courts, churches, monasteries) were newly introduced. ![Fig. 2: Frontend of the RAG project for data collection in nodegoat.](images/RAG nodegoat frontend for data collection.png) - ## Methodology These institutions were assigned to the events accordingly, which forms the basis for the project's method of analysis: analysing the data according to the criteria 'incoming' and 'outgoing' [@gubler2022]. The key questions here are: Which people, ideas or knowledge entered an institution or space? @@ -72,7 +67,6 @@ How was this knowledge shared and passed on there? Spaces are considered both as ![Fig. 5: Network: Jurists with a doctorate from the University of Basel 1460-1550, data: repac.ch, 07/2024.](images/Network of Basel jurists 1460-1550.png) - ## Data literacy Students and researchers working on the RAG project can acquire important data skills. We can make a distinction, as noted above, between the skills required to collect, enter and analyse the biographical data. Key learning content related to the data entry process for students working in the RAG project includes: @@ -125,7 +119,6 @@ How have the described data competences changed since the start of the project i - Collaboration: Web-based research environments have made collaboration much easier and more transparent. Teams are now able to follow each other's progress in real time, making the location of the work less important and communication smoother. - ## Human and artificial intelligence Regarding data collection, entry, and analysis, artificial intelligence significantly impacts several, though not all, tasks within the RAG project. @@ -134,12 +127,16 @@ Regarding data collection, entry, and analysis, artificial intelligence signific ![Fig. 7: Example settings for the algorithm for reconciling textual data in nodegoat.](images/nodegoat text reconciliation settings.png) - - Data entry: In this area, human intelligence remains crucial. In-depth specialist knowledge of the historical field under investigation is essential, particularly concerning the history of universities and knowledge in the European Middle Ages and the Renaissance. Due to the heterogeneous and often fragmented nature of the sources, AI cannot yet replicate this expertise. The nuanced understanding required to interpret historical events and their semantic levels still necessitates human insight. - Data analysis: While AI support for data entry is still limited, it is much greater for data analysis. The epistemological framework has expanded considerably not only in digital prosopography and digital biographical research, but in history in general. Exploratory data analysis in particular will become a key methodology in history through the application of AI, as the toy sketch below illustrates.
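This is a toy sketch of such an exploratory step, in the spirit of the 'incoming'/'outgoing' method described above (Python with pandas); the records are invented and are not drawn from the RAG data:

```python
import pandas as pd

# Invented event records: one row per biographical event at an institution.
events = pd.DataFrame([
    {"person": "P001", "institution": "University of Basel", "year": 1472, "direction": "incoming"},
    {"person": "P001", "institution": "University of Basel", "year": 1478, "direction": "outgoing"},
    {"person": "P002", "institution": "University of Basel", "year": 1501, "direction": "incoming"},
    {"person": "P002", "institution": "Court of Savoy",      "year": 1505, "direction": "incoming"},
])

# How many people entered or left each institution, and in which period?
summary = (
    events.groupby(["institution", "direction"])
          .agg(people=("person", "nunique"),
               first_year=("year", "min"),
               last_year=("year", "max"))
)
print(summary)
```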
-## Conclusion +## Conclusion Since the 1990s, digital resources and tools have become increasingly prevalent in historical research. However, skills related to handling data remain underdeveloped in this field. This gap is not due to a lack of interest from students, but rather stems from a chronic lack of available training opportunities. This situation has gradually improved in recent years, with a growing number of courses and significant initiatives promoting digital history. Nevertheless, the responsibility now lies with academic chairs to take a more proactive role in integrating a sustainable range of digital courses into the general history curriculum. It is crucial that data literacy becomes a fundamental component of the training for history students, particularly considering their future career prospects and the increasingly complex task of evaluating information, including the critical use of artificial intelligence methods, tools and results. This applies especially to the methodology of source criticism, which is now more important than ever in the evaluation of AI-generated results. In addition to formal teaching, more project-based learning should be offered to support students in acquiring digital skills. + +## References + +::: {#refs} +::: diff --git a/submissions/438/index.qmd index 4ea9902..02d5138 100644 --- a/submissions/438/index.qmd +++ b/submissions/438/index.qmd @@ -36,9 +36,9 @@ Classical visual analysis is limited in its ability to deal with video game imag While the framework can identify different game modes and the different functionalities those screens can encompass, more intricate details can escape an analysis. FAVR distinguishes between _tangible_, _intangible_, and _negative space_, as well as _agents_, _in-game_ and _off-game elements_, and _interfaces_. Whereas the aspect of space concerns the overall composition of the screen, the second set of attributes circumscribes the construction of the image. Intangible space, for example, is concerned with information relevant to gameplay but without the direct agency of the player. Examples are life bars or a display of the current score. As another example, off-game denotes decorative background elements. Being of a time-based and interactive nature, some of the relevant information only unfolds as animation or through player interaction. Further, not all visual mediations of the games’ operation are represented as expected in software interfaces or classic visual compositions. -![Fig. 1: Barbarian (Palace Software Inc, 1987, Amiga). Our character is attacking and blood spurting forth indicates our hit was successful.](images/barbarian_amiga_in-game.png) +![Fig. 1: Barbarian (Palace Software Inc, 1987, Amiga). Our character is attacking and blood spurting forth indicates our hit was successful.](images/barbarian_amiga_in-game.png){width=100%} -![Fig. 1: Barbarian (Palace Software Inc, 1987, DOS). A similar scene from the game's DOS version. Technical limitations, such as a limited color palette, can be a difficult factor to implement and port the same game on another system, raising questions about the tension between technology and design.](images/barbarian_dos_in-game.png) +![Fig. 2: Barbarian (Palace Software Inc, 1987, DOS). A similar scene from the game's DOS version.
Technical limitations, such as a limited color palette, can make it difficult to implement and port the same game on another system, raising questions about the tension between technology and design.](images/barbarian_dos_in-game.png){width=100%} A simple example could be blood spurting from an agent, which can be any gameplay-relevant character on screen. The blood holds information relevant to the player, indicating that the character on screen got hurt and may prompt a change in play behavior. Whereas a life bar can represent the player’s character health, such indications are usually absent for enemies. Some video games also play with the distinction between in- and off-game planes. In Final Fight (Capcom, 1989, Arcade), our character walks from left to right in a raging city and, on the way, fights numerous enemies entering the screen from left and right. The off-game plane, the background, is composed of run-down houses and alleyways. At one point during the game, those houses’ doors start to open and spawn enemies as well. This mixes up the formerly established convention of what visual information is relevant for gameplay in terms of interactive and decorative elements. @@ -60,6 +60,11 @@ To be able to analyze large quantities of video game images towards their functi The _Framework for the Analysis of Visual Representation in Video Games_ is a welcome vantage point for my research inquiry. After translating the framework into a linked open ontology, further work is needed to refine and expand it to encompass more subtle aspects of video game interfaces. Whereas the ontology developed so far works on a formal level, I have yet to research to what extent FAVR can be leveraged for applications of distant viewing of larger video game image corpora. Despite being implemented only in a limited form so far, the FAVR has proven to be a valuable tool in analyzing video game images towards their formal, discursive, and historical aspects. +## References + +::: {#refs} +::: + ## Media List - Fig. 1-2: Screenshots from [Barbarian (1987) - MobyGames](https://www.mobygames.com/game/253/barbarian/), accessed July 09, 2024 @@ -76,9 +81,3 @@ The _Framework for the Analysis of Visual Representation in Video Games_ is a we [^5]: [favr-ontology/examples/ball-raider-1987-main-gameplay.json](https://github.com/thgie/favr-ontology/blob/main/examples/ball-raider-1987-main-gameplay.json), accessed July 09, 2024 [^6]: [Convolutional neural network - Wikipedia](https://en.wikipedia.org/wiki/Convolutional_neural_network), accessed July 19, 2024 [^7]: [Transformer (deep learning architecture) - Wikipedia](https://en.wikipedia.org/wiki/Transformer_(deep_learning_architecture)), accessed July 19, 2024 - - diff --git a/submissions/443/index.qmd index f31854e..b6d0636 100644 --- a/submissions/443/index.qmd +++ b/submissions/443/index.qmd @@ -39,9 +39,9 @@ bibliography: references.bib --- ## Introduction -The Impresso project pioneers the exploration of historical media content across temporal, linguistic, and geographical boundaries. In its initial phase (2017-2020), the project developed a scalable infrastructure for Swiss and Luxembourgish newspapers, featuring a powerful search interface. The second phase, beginning in 2023, expands the focus to connect media archives across languages and modalities, creating a Western European corpus of newspaper and broadcast collections for transnational research on historical media.
+The Impresso project pioneers the exploration of historical media content across temporal, linguistic, and geographical boundaries. In its initial phase (2017-2020), the project developed a scalable infrastructure for Swiss and Luxembourgish newspapers, featuring a powerful search interface. The second phase, beginning in 2023, expands the focus to connect media archives across languages and modalities, creating a Western European corpus of newspaper and broadcast collections for transnational research on historical media. -We introduce Impresso 2 and review the evolution from the first to the second project. We also discuss the specific challenges to connecting newspaper and radio, our efforts to adapt text mining and exploration tools to the affordances of historical material derived from different modalities, and our approach to conducting comparative and data-driven historical research using semantically enriched sources, accessible through both graphical and API-based interfaces. +We introduce Impresso 2 and review the evolution from the first to the second project. We also discuss the specific challenges to connecting newspaper and radio, our efforts to adapt text mining and exploration tools to the affordances of historical material derived from different modalities, and our approach to conducting comparative and data-driven historical research using semantically enriched sources, accessible through both graphical and API-based interfaces. ## Media Monitoring of the Past: The Impresso Projects @@ -57,10 +57,9 @@ Building on the achievements of Impresso 1, Impresso 2 envisions a comprehensive To achieve this goal, Impresso 2 collaborates with [21 European partners](https://impresso-project.ch/consortium/associated-partners/), including national libraries, archives, newspapers, cultural heritage institutions, and international research networks. - ## Challenges in Enriching and Integrating Historical Digitised Newspaper and Radio Archives -As the mass media of their time, historical newspapers and radio broadcasts offer a daily account of the past and are key to the study of historical media ecosystems. Since the 19th century in print and the 1920s on air, these media have disseminated news, opinion, and entertainment, reported on events, and offered insights into the daily lives of past societies. +As the mass media of their time, historical newspapers and radio broadcasts offer a daily account of the past and are key to the study of historical media ecosystems. Since the 19th century in print and the 1920s on air, these media have disseminated news, opinion, and entertainment, reported on events, and offered insights into the daily lives of past societies. ::: {#fig-1} @@ -68,7 +67,7 @@ As the mass media of their time, historical newspapers and radio broadcasts offe ::: -They have both shaped and been shaped by their social, cultural, and political environments. Until now, these sources have mostly been studied in isolation, resulting in a plethora of parallel national histories [@fickers_hybrid_2018]. Although there has been a 'transnational turn' toward broader geographical and temporal perspectives [@badenoch_airy_2013;@fickers_transnational_2012;@cronqvist_entangled_2017], in media history, transnational and transmedia perspectives are rare, particularly when focusing on the distribution of content rather than institutional histories. +They have both shaped and been shaped by their social, cultural, and political environments. 
Until now, these sources have mostly been studied in isolation, resulting in a plethora of parallel national histories [@fickers_hybrid_2018]. Although there has been a 'transnational turn' toward broader geographical and temporal perspectives [@badenoch_airy_2013;@fickers_transnational_2012;@cronqvist_entangled_2017], in media history, transnational and transmedia perspectives are rare, particularly when focusing on the distribution of content rather than institutional histories. How can we accurately connect large collections of newspapers and radio, provide effective means for their exploration in historical research, and put content-based transmedia history into practice? Impresso 2 undertakes a multi-dimensional approach to integrating historical newspapers and broadcasts, focusing on four interconnected areas (see also @fig-1): @@ -77,7 +76,7 @@ How can we accurately connect large collections of newspapers and radio, provide 3. **Conducting original historical research**, which advances understanding of historical media ecosystems through various case studies, while also defining methods for data-driven analysis and identifying diverse user needs for data and interface design. 4. **Designing and implementing new interfaces** by creating different entry points to the data and its enrichments. -In pursuing these objectives, Impresso 2 faces a set of unique challenges arising from the central issue of aligning and connecting newspaper and radio archives. +In pursuing these objectives, Impresso 2 faces a set of unique challenges arising from the central issue of aligning and connecting newspaper and radio archives. ::: {#fig-2} @@ -91,23 +90,21 @@ In many ways familiar sources, digitised radio and newspapers have however never ### The Impresso Corpus: A Silo-Breaking, Transmedia and Transnational Corpus -The foundation of Impresso 2 is the creation of a large corpus of print and broadcast media collections across several countries. Building on the newspapers collected in the first Impresso project, the corpus expands to a geographically and historically coherent set of neighbouring countries, encompassing both newspaper and radio archives from each. +The foundation of Impresso 2 is the creation of a large corpus of print and broadcast media collections across several countries. Building on the newspapers collected in the first Impresso project, the corpus expands to a geographically and historically coherent set of neighbouring countries, encompassing both newspaper and radio archives from each. To begin, let us review the core characteristics of the two media sources. Newspapers were disseminated and are preserved in print, while radio broadcasts, originally transmitted as audio signals, are preserved through audio recordings or typescripts (the text read by the speaker). Newspapers were typically published daily, with one issue per day, whereas radio broadcasts followed a highly variable rhythm, documented in radio schedules available in dedicated magazines, unpublished internal listings, and newspapers (see @fig-2). From a digitised archive perspective, newspaper materials consist of facsimiles and their transcriptions obtained via optical character recognition (OCR), whereas radio materials include facsimiles and OCR for typescripts and radio schedules, and audio recordings and transcripts generated through automatic speech recognition (ASR) for broadcasts. These materials encompass different modalities, such as text and image for newspapers, and text and sound for radio.
- **Data Acquisition and Sharing Framework** Acquiring such a large and diverse corpus is a lengthy and complex endeavour with several obstacles: collections include various data elements (metadata, facsimiles, audio records, transcripts) with differing copyright statuses and are held by institutions across multiple countries, each with its own jurisdiction and legal constraints based on data owners’ preferences and institutional policies. We aim to make these sources accessible to researchers for operations such as viewing, searching, and exporting. To address these issues, we have developed a dual approach: operational, by implementing a rigorous organisation to conduct dialogue and facilitate data exchange with our partners; and legal, by designing a modular data sharing and access framework that respects copyright and institutional constraints while maximising research opportunities through differentiated access. We believe that this operational and legal basis will help to break down institutional barriers. **Rich Contextual Information for Historical Research** -The practical acquisition of the corpus also provides an opportunity to deepen our understanding of the sources, which is essential for their use. Although radio and newspaper archival records come with standard metadata, this information is often heterogeneous and varies significantly in content, quantity and quality across collections and institutions. Additionally, there are other sources of contextual knowledge, including unspoken or unwritten information. Two ongoing initiatives aim to further document the production and preservation of these archives to provide rich contextual information for historical research, all the more important in this multi-source context. +The practical acquisition of the corpus also provides an opportunity to deepen our understanding of the sources, which is essential for their use. Although radio and newspaper archival records come with standard metadata, this information is often heterogeneous and varies significantly in content, quantity and quality across collections and institutions. Additionally, there are other sources of contextual knowledge, including unspoken or unwritten information. Two ongoing initiatives aim to further document the production and preservation of these archives to provide rich contextual information for historical research, all the more important in this multi-source context. The first seeks to leverage the information contained in newspaper directories, following the approach outlined in [@beelen_bias_2022]. As a starting point for the Swiss context, we are focusing on extracting semi-structured information from the Swiss Press Bibliography published by Fritz Blaser in 1956 [@alma991017981185503976]. This bibliography documents in great detail some 480 newspapers and around 1,000 periodicals published in Switzerland between 1803 and 1958. It offers rich insights into the origins and history of Swiss newspapers, which can be used to enhance the documentation of newspapers in the Impresso corpus and interface, as well as support the study of the newspaper ecosystem in Switzerland. A similar approach will be used to trace the development of radio programmes through radio schedules and magazines. The second initiative focuses on radio, adopting an oral history perspective to gain a better understanding not only of radio archives, but also of each archive custodian. What important aspects about these archives might we be unaware of?
The ‘Oral History of Radio Archivists’ (OHRA) addresses this by conducting semi-structured interviews with Impresso audio partners. These interviews explore topics such as archival preservation and documentation policies, digitisation priorities, and the evolution of data quality over time, with the aim of providing complementary narrative descriptions of radio archives. - ### Enriching and Connecting Historical Media Collections A critical step towards Impresso’s vision is the application of text and image processing techniques to the corpus. We aim to enrich and connect media sources through multiple layers of semantic enrichment, ultimately represented and connected in a common vector space. This endeavour involves three main steps; we discuss here only the first and the last. @@ -116,14 +113,19 @@ A critical step towards Impresso’s vision is the application of text and image First, we aim to define an appropriate and consistent data representation framework for both radio and newspaper digital materials. By ‘representation’, we refer to how data or information is effectively encoded and structured in a machine-readable format. The world of digitised newspapers is well understood, with a clear and consistent structure: a title consists of issues, which include pages containing articles and images. Each issue's content is organised into sections, a framework that intersects with journalistic genres, exhibits distinct characteristics in both layout and content, and evolves over time. In contrast, digitised radio broadcasts present a more complex and heterogeneous landscape, lacking a shared definition of what constitutes a ‘standard’ broadcasting unit. There are varying levels of content organisation and an uneven distribution of material over time. Drawing from concrete evidence from files on disk, we carefully inventory all source elements for both media types, refining the terminology for newspapers and radio components – an often challenging process. With this groundwork, we then explore how to align collections’ structure and compositional units of both media sources. This alignment design, developed in parallel with data acquisition and in collaboration with archive partners, also informs and influences the rendering of sources in the interface. -Second, we elevate the corpus to a consistent and higher-quality level, through the assessment and optimisation of OCR and ASR quality, along with the homogenisation of content item segmentation and classes. The objective in this regard is to create “bridges” between radio broadcast and newspaper content types, enabling the retrieval of specific categories–such as editorials, sports-related (sub)sections, or radio schedules–across a set of titles, languages and a specific period in all digitised material. +Second, we elevate the corpus to a consistent and higher-quality level, through the assessment and optimisation of OCR and ASR quality, along with the homogenisation of content item segmentation and classes. The objective in this regard is to create “bridges” between radio broadcast and newspaper content types, enabling the retrieval of specific categories – such as editorials, sports-related (sub)sections, or radio schedules – across a set of titles, languages and a specific period in all digitised material.
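To make the idea of such a shared representation more concrete, here is a minimal sketch of how newspaper and radio materials could be mapped onto a common content-item structure; the field names are illustrative assumptions, not Impresso's actual data model:

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative only: field names are assumptions, not the Impresso schema.
@dataclass
class ContentItem:
    item_id: str
    media_type: str          # "newspaper" or "radio"
    date: str                # ISO date of publication or broadcast
    language: str
    content_class: str       # e.g. "editorial", "sports", "radio_schedule"
    transcript: str          # OCR text (newspapers, typescripts) or ASR text (audio)
    facsimile_url: Optional[str] = None
    audio_url: Optional[str] = None

# A newspaper article and a radio bulletin can then be queried uniformly,
# e.g. retrieving all sports-related items regardless of the source medium.
def sports_items(items: list[ContentItem]) -> list[ContentItem]:
    return [i for i in items if i.content_class == "sports"]
```

Once both media are expressed in a structure of this kind, the “bridges” mentioned above reduce to filtering and joining on shared fields such as the content class, date, or language.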
**Cross-lingual connection of sources.** After enriching historical sources with semantic information to facilitate content-related search facets and support the exploration and comparison of people, locations, keyphrases, topics, and semantic classes across time periods, languages, and media channels (the second step not discussed here), a crucial task is establishing meaningful connections at the content level across both media, both mono- and cross-lingually. This process shifts focus away from structure, concentrating instead on content and its enrichments, with the goal of computing effective vector representations. This is relatively easy to achieve on a monolingual basis, but becomes more challenging cross-lingually, where the goal is to compute meaningful similarities of multilingual representations across languages. -### Unlocking the Dynamics of Historical Media Ecosystems +### Unlocking the Dynamics of Historical Media Ecosystems The focus of historical research activities within Impresso 2 is on examining influence—encompassing actors, themes, and formats—within a transnational media ecosystem. In our context, influence refers to the ability of individuals, groups, organisations or even texts (e.g. books, articles) to shape, direct, or alter narratives, imagery, content, opinions, behaviours, or outcomes related to a particular subject, issue, or topic represented by and through the media. Several case studies, which are both conceptually and methodologically complementary, investigate influences from various perspectives: external to the media, within the media ecosystem, within individual institutions, and across different content formats. This transnational and transmedia research aims to revisit and compare media autonomy, examining interactions between different media forms, such as radio and newspapers, beyond their competitive aspects. A central challenge is to extend the historian’s traditional skill of making meaningful comparisons from incomplete and diverse sources to a large-scale, multilingual and transmedia corpus interconnected and enriched with semantic annotations. To this end, we are developing a comparative framework that leverages our data, tools, and interfaces. ### Interfaces To facilitate research on our data, we are developing two complementary research-oriented user interfaces: the Impresso web app, a powerful graphical user interface, and the Impresso data lab, built around a user-oriented API. While the search interface offers a more traditional entry point to explore sources, enabling close reading and the compilation of research datasets, the data lab is designed for automating access and enabling computational analysis. Our efforts focus on providing easy programmatic access to the corpus and its enrichments through a public API and the Impresso Python library, allowing users to annotate external documents with Impresso tools and import external annotations into the web app (annotation and import services), and ensuring transparency with comprehensive documentation, including datasets, models, notebook galleries, and tutorials. To further inform the design of the Impresso Datalab, we are surveying current approaches and realisations of data labs for computational humanities research [@beelen_surveying_2024].
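As an illustration of what such programmatic access could look like, the following minimal sketch retrieves search results from a hypothetical JSON endpoint; the URL, parameters, and response fields are assumptions for illustration, not the actual Impresso API or Python library:

```python
import requests

# Hypothetical endpoint and parameters, for illustration only.
BASE_URL = "https://example.org/impresso-api/search"
params = {"q": "olympiade", "media": "newspaper,radio", "limit": 5}

response = requests.get(BASE_URL, params=params, timeout=30)
response.raise_for_status()

# The response structure below is likewise assumed.
for item in response.json().get("items", []):
    print(item.get("date"), item.get("media"), item.get("title"))
```

The real service will differ in detail, but the point stands: once enrichment layers are exposed through an API, the same corpus can feed both the graphical interface and notebook-based computational analysis.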
+ +## References + +::: {#refs} +::: diff --git a/submissions/444/index.qmd index cbe980f..4d4828e 100644 --- a/submissions/444/index.qmd +++ b/submissions/444/index.qmd @@ -29,13 +29,16 @@ bibliography: references.bib --- ## Introduction + This paper discusses data and practices related to an ongoing digital humanities consortium project *Constellations of Correspondence – Large and Small Networks of Epistolary Exchange in the Grand Duchy of Finland* (CoCo; Research Council of Finland, 2021–2025). The project aggregates, analyses and publishes 19th-century epistolary metadata from letter collections of Finnish cultural heritage (CH) organisations on a Linked Open Data service and as a semantic web portal (the ‘CoCo Portal’), and it consists of three research teams, bringing together computational and humanities expertise. We focus exclusively on metadata, considering it to be part of the cultural heritage and a fruitful starting point for research, providing access, for example, to 19th-century epistolary culture and archival biases. The project started with a Webropol survey addressed to over 100 CH organisations to get an overview of the preserved 19th-century letters and the Finnish public organisations willing to share their letter metadata with us. Currently the CoCo Portal includes seven CH organisations and four online letter publications, with the metadata of over 997,000 letters and 95,000 actors (senders and recipients of letters). ## The Data and the Portal + The CoCo Data Model is based on international standards, such as CIDOC CRM [@Doerr], Dublin Core, and ICA Records in Contexts, to promote interoperability with other datasets. The model supports the modelling of the relevant properties of letter metadata (Letter, Actor, Place, and Time-Span, provenance, i.e. MetadataRecord, and archival/collection level information) from the source datasets. To represent actors in these different source datasets, we use an adaptation of the Open Archives Initiative Object Reuse and Exchange (OAI-ORE) proxy concept. The collected metadata are transformed into linked open data using an automated transformation pipeline [@Drobac], which consists of several steps. First, each received dataset is processed into an intermediate RDF format. Then the data are harmonised with the CoCo Data Model and enriched by linking the recognised actors and places to external resources, such as Wikidata, the Finnish National Biographies, as well as the Finnish AcademySampo [@Drobac] and the BiographySampo [@Tamper]. Finally, the transformation pipeline produces a harmonised dataset of correspondence metadata. The availability of the aggregated letter metadata in linked open data format facilitates data exploration and the answering of humanities research questions, whether through the semantic portal or through SPARQL queries. The project’s semantic portal allows users to search, browse and analyse the letters, archives, actors, and places in the CoCo dataset. It is based on the Sampo model [@Hyvönen] and is implemented using the Sampo-UI programming framework [@Ikkala]. The user interface works on a faceted search paradigm, which allows the user to search, e.g. for letters sent by a certain person, letters from a certain period or letters kept in a certain organisation. Data such as the sending places can be visualised on a map, and other visualisations include the yearly distributions of letters, top correspondents, and correspondence networks.
The portal also offers some network analysis figures. ## Constant Datawork in an Interdisciplinary Team + At this stage (mid 2024), the three consortium teams have been working together for three full years. Two of the teams (Aalto University and the University of Helsinki) have a computational profile and are responsible for data modelling and transformation and the development of the user interface (semantic portal), and the team based at the Finnish Literature Society maintains relations with the data providers (CH organisations), but also participates in the first stages of data processing, for example by manually harmonising finding aids in Word format to prepare them for algorithmic processing, and by performing quality checks on automatically processed data. Members of all teams participate in research activities (conferences, seminars, etc.) in their fields, but we mostly publish co-authored papers, currently focusing on what we call ‘critical collection history’. The starting point for this interdisciplinary work was fortuitous, as the key members had years of experience in – to borrow a term from Jo Guldi [-@Guldi] – hybrid teamwork. Guldi prefers the word hybrid to the more common interdisciplinary because ‘it is more specific to the kind of aliveness, the ongoing exchange, that I have in mind as the basis for the genesis of a new field. ... I can be interdisciplinary in solitude ... But hybrid teams require ongoing support and thinking between people trained in and who identify as members of far-flung disciplines’ [@Guldi]. Thinking together across disciplines requires a strong commitment to developing a common language and an atmosphere in which it is possible to express uncertainty and ask questions. Despite the previous experience, working with the specificities of archival metadata and transforming them into ‘big metadata’ has forced us to study and develop a new shared vocabulary that reflects the particular characteristics or ‘localities’ [@Loukissas] of this particular data in the context of Linked Open Data and the chosen humanities questions. One of the realisations and lessons of the project is the importance of collection-level information (identifying the so-called records creators and the fonds, i.e. the archival collections). Data models developed in previous projects dealing with epistolary metadata (usually based on curated and published editions of letters) did not highlight these characteristics, and all the datasets we acquired from archives, libraries and museums conveyed this information in different ways. In order to understand and visualise this aspect of the data, we needed to combine a practical understanding of archival work, theoretical reflections from cultural heritage studies and critical archival theory, and domain expertise in cultural heritage data ontologies and models. @@ -43,7 +46,8 @@ Such hybrid thinking takes time, especially when the aim is both to create a new The funding mechanism – a three-part consortium project – makes it possible to work across different universities and research institutions to gain expertise that is not available in any one place. The downside, however, is that we are working in different organisations; this is not a fashionable digital humanities ‘LAB’.
All three teams in the project work in the Helsinki metropolitan area, which makes face-to-face meetings possible, but even with the time and effort invested in team work and the digital tools for easy communication (Slack, Trello, etc.), we still find from time to time that the algorithmic premises used by the computational team do not stand up to historical scrutiny, or that the research assistants have misinterpreted instructions for manual data processing. ## Portal Test Users and their Feedback -We realised early on that it was extremely important to gather user experience before actually launching the CoCo portal. So, in early February 2024, we opened the portal to a test group of 17 people for a period of 2.5 months. The group was assembled partly through an open call on social media and at some conferences, and partly by asking specific people to join. The volunteers were mostly academic humanities researchers and invited specialists from two museums with 19th-century letter collections. We provided the testers with background material on the portal and the CoCo data endpoint documentation, the organisations whose data was currently in the portal, with user instructions, and with some questions or tasks to help them get started. We asked five specific questions to which we wanted answers, but we welcomed all kinds of feedback. + +We realised early on that it was extremely important to gather user experience before actually launching the CoCo portal. So, in early February 2024, we opened the portal to a test group of 17 people for a period of 2.5 months. The group was assembled partly through an open call on social media and at some conferences, and partly by asking specific people to join. The volunteers were mostly academic humanities researchers and invited specialists from two museums with 19th-century letter collections. We provided the testers with background material on the portal and the CoCo data endpoint documentation, the organisations whose data was currently in the portal, with user instructions, and with some questions or tasks to help them get started. We asked five specific questions to which we wanted answers, but we welcomed all kinds of feedback. Building an engaged and motivated community of test users proved to be a challenge. The opening online session, where the project’s portal experts taught how to use it, received only four participants. However, they were active in asking questions and satisfied with the introduction. We also offered the possibility of a joint online closing session to discuss the experiences, but only two people registered, so we cancelled it. All this shows that once the initial excitement of joining the test group had worn off, it was difficult to reach the testers and very difficult to engage them in the test or create a group spirit. People got lost in their own research and daily work. In the end, however, after a reminder, we received feedback from eight testers, i.e. from half of the group. We are still waiting for feedback from the later test group of archivists in the Finnish Literary Society. The feedback can be divided into two groups: comments on errors in the data, mostly errors in the disambiguation of actors, and comments on the functionalities and performance of the portal. The latter was what we had hoped for. 
It is interesting to note that the testers, who work in CH organisations and are used to working with collection management systems and have even catalogued archival material themselves, commented more on the functionalities, while the researchers clearly focused on data errors. In our view, this shows how much hands-on experience with databases affects a humanist’s ability to study mass data material and to move into digital humanities. You have to change your mindset, as our own experience in the project has shown. We also noticed that working only with the metadata of letters was a barrier for some of the testers; in the wish list for the future development of the portal, digitised letters or their transliterations stood out. A positive result from the project’s point of view was that the testers found the portal easy to use, the user interface with its four perspectives clear, and the data offered useful both for their own research and for information services in CH organisations. They were able to find unknown connections, relationships and people in the data. The metadata available seemed to respond well to their research questions and to encourage further research. The ability to query by place was particularly appreciated. @@ -55,3 +59,8 @@ The feedback has provided us with an insight into the difficulties that users, b ![](images/CoCo_Portal.png) ::: + +## References + +::: {#refs} +::: diff --git a/submissions/445/index.qmd index 12ea0ea..f942243 100644 --- a/submissions/445/index.qmd +++ b/submissions/445/index.qmd @@ -31,9 +31,9 @@ Scholars and interested laypeople who want to adequately deal with historical to ## Teaching ATR online – the setting -Digital methods allow a more comprehensive range of users to assign, analyse and interpret sources digitally. This increased availability of data (mainly in the form of digitized images) also protects the original documents. Recognising handwriting with the help of machine learning, known as ATR, has been greatly improved and is becoming increasingly important in various disciplines. Machine learning methods, especially deep learning, have been used for complex evaluation decisions for a number of years.[@muehlberger_transforming_2019] For text recognition, especially for recognising handwriting, ATR can achieve far better results than conventional Optical Character Recognition (OCR).
However, many non-standardised fonts and layouts will lead to high error rates in the recognition processes, which is why it is essential to clean up data manually. The reading order of individual lines or blocks of text, in particular, poses major challenges for machine transcription tools. -Even ATR itself presents several hurdles; the existing tools are often only intuitive to use to a limited extent. Due to the error rates described above, cleaning up the automatically recognized texts by hand is essential. New users must familiarise themselves with these processes, whether this is on their own at home or in a mentored university or non-academic course. In addition, text recognition itself is only part of the learning curve: to work independently with ATR, it is also necessary to recognise when which form of text and layout recognition makes sense, where it is worth investing time to save more time later on and how to proceed with the output. For these reasons, the ongoing DIZH project PATT (Potentials of Advanced Text Technologies: Machine Learning-based Text Recognition)[@noauthor_potentials_2024] at the University of Zurich and the Zurich University of Applied Sciences is currently developing an open-source e-learning module teaching students, young researchers, and the interested public (in the sense of citizen science) how to use ATR. +Even ATR itself presents several hurdles; the existing tools are often only intuitive to use to a limited extent. Due to the error rates described above, cleaning up the automatically recognized texts by hand is essential. New users must familiarise themselves with these processes, whether this is on their own at home or in a mentored university or non-academic course. In addition, text recognition itself is only part of the learning curve: to work independently with ATR, it is also necessary to recognise when which form of text and layout recognition makes sense, where it is worth investing time to save more time later on and how to proceed with the output. For these reasons, the ongoing DIZH project PATT \(Potentials of Advanced Text Technologies: Machine Learning-based Text Recognition\)[@noauthor_potentials_2024] at the University of Zurich and the Zurich University of Applied Sciences is currently developing an open-source e-learning module teaching students, young researchers, and the interested public (in the sense of citizen science) how to use ATR. Developing an exhaustive learning module is a desideratum, as many researchers working in history or linguistics today want to work with automated text and layout recognition. This complex digital skill set involves the critical categorisation of the machine's feedback. Although the steps involved in manuscript reading are becoming more efficient, they also require new skills: the work no longer centres on direct engagement with handwriting or print, but on the efficient and task-appropriate correction of the results of automated text and layout recognition, often bringing in the original sources later in for a combined distant and close reading. @@ -51,8 +51,13 @@ We use a spider diagram model to communicate the various influences on the benef ![Our model showing identified factors for the use of ATR in historical projects](images/SpiderDiagrammATR.png) -A high heterogeneity of handwriting could affect the accuracy of ATR as the recognition software might have difficulties to recognize the text consistently. 
A higher degree of heterogeneity requires a broader model based on larger amount of training data.[@hodel_general_2021] Large amounts of text often mean that ATR needs to be able to work efficiently and scalable to process large amounts of data. For small amounts of text, the focus could be on the level of detail of the recognition or more manual correction. A close reading would require a more precise and detailed analysis of the texts, which means that the ATR models must be accurate and able to recognize fine details, while distant reading focuses more on the recognition of patterns and trends. A broad research question might require a general analysis of many texts, which means that the ATR models should be versatile and robust (e.g. based on TrOCR models)[@li_trocr_2022]. A specific research question might mean the ATR must focus on specific details and accurate detections. In summary, working with ATR requires a careful balance between the quantity and type of texts, the desired accuracy and detail of the analysis, and the heterogeneity of the manuscripts to be analysed. The diagram helps to organise these factors visually and to understand their interactions. +A high heterogeneity of handwriting could affect the accuracy of ATR, as the recognition software might have difficulty recognizing the text consistently. A higher degree of heterogeneity requires a broader model based on a larger amount of training data.[@hodel_general_2021] Large amounts of text often mean that ATR needs to work efficiently and scale to process large amounts of data. For small amounts of text, the focus could be on the level of detail of the recognition or more manual correction. A close reading would require a more precise and detailed analysis of the texts, which means that the ATR models must be accurate and able to recognize fine details, while distant reading focuses more on the recognition of patterns and trends. A broad research question might require a general analysis of many texts, which means that the ATR models should be versatile and robust (e.g. based on TrOCR models) [@li_trocr_2022]. A specific research question might mean the ATR must focus on specific details and accurate detection. In summary, working with ATR requires a careful balance between the quantity and type of texts, the desired accuracy and detail of the analysis, and the heterogeneity of the manuscripts to be analysed. The diagram helps to organise these factors visually and to understand their interactions. ## Our difficulties We have already recognised some difficulties for our module and would like to address them briefly: The fast-moving nature of tools and products prevents us from providing precise instructions and means that we can only provide a general introduction to the technology rather than the software(-suites) themselves. Some of this software requires a licence or operates on a pay-per-use basis and is therefore not a viable option for everyone. When referring to these products, our open-source teaching module also provides free advertising for paid tools. On the other hand, free tools have disadvantages, which means they are not helpful in all cases. We, therefore, must find a good balance between these two poles. In the technical realisation of our teaching module, we are limited by the options developed during the relaunch. We, therefore, must develop our teaching module within the structures of the existing options.
We would furthermore like to set up a FAQ page on ATR, but this requires us to be able to collect and identify these problems and questions systematically. + +## References + +::: {#refs} +::: diff --git a/submissions/447/index.qmd index bc84e31..df0b474 100644 --- a/submissions/447/index.qmd +++ b/submissions/447/index.qmd @@ -78,6 +78,7 @@ This is the goal of Geovistory. It is conceived as a virtual research and data p ## Geovistory as a Research Environment Geovistory aims to be a comprehensive research environment that accompanies scholars throughout the whole research cycle. Geovistory includes: + - The *Geovistory Toolbox*, which allows researchers to manage and curate their projects' research data. The Toolbox is freely accessible for all individual projects. Each research project works on its own data perspective but at the same time directly contributes to a joint knowledge graph. - A joint *Data repository* that connects and links the different research projects under a unique and modular ontology, thus creating a large Knowledge Graph. - The Geovistory *Publication platform* (), where data is published using the RDF framework and can be accessed via the community page or project-specific webpages, their graphical search tools, or a SPARQL endpoint (see the query sketch below). @@ -94,6 +95,7 @@ As per current terms of service, all data produced in the information layer of G ## The aim of breaking information silos The goal of producing and publishing FAIR research data is to break the information silos that hinder the sharing and reusing of scientific data. However, achieving interoperability hinges on two critical components [@beretta2024a]: + - Firstly, the unambiguous identification of real-world entities (e.g., persons, places, concepts) with unique identifiers (e.g., URIs in Linked Open Data) and the establishment of links between identical entities across different projects (e.g., ensuring that the entity "Paris" is identified by the same URI in all projects); - Secondly, the utilization of explicit ontologies that can be aligned across projects. Nevertheless, mapping between ontologies may prove challenging, or even unfeasible, particularly when divergent structural frameworks are employed (e.g., an event-centric ontology may have limited compatibility with an object-centric one). @@ -124,3 +126,8 @@ Geovistory has been designed as a comprehensive research environment tailored by The forthcoming years mark a critical juncture for Geovistory, as the tools and infrastructures of the environment recently transitioned into the public domain. This necessary change will ease future collaboration with public institutions within Europe, but a greater share of public funding will be needed to ensure the sustainability of the ecosystem. Nonetheless, the Digital Humanities ecosystem remains unstable, attributed to the lack of sustained funding for infrastructural initiatives by national funding agencies and the absence of cohesive coordination among institutions. To ameliorate this landscape, prioritizing the establishment of robust collaborations and partnerships among diverse tools and infrastructures in Switzerland and Europe is imperative. Leveraging the specialized expertise of each institution holds the promise of engendering a harmonized and synergistic, distributed environment conducive to scholarly pursuits.
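To illustrate what access via such a SPARQL endpoint looks like in practice (referred to in the Publication platform bullet above), the following minimal Python sketch runs a generic query; the endpoint URL is a placeholder and the query assumes nothing about Geovistory's ontology beyond standard RDFS labels:

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Placeholder endpoint URL; substitute the actual SPARQL endpoint of the platform.
endpoint = SPARQLWrapper("https://example.org/sparql")
endpoint.setReturnFormat(JSON)

# Generic query: list a few entities together with their human-readable labels.
endpoint.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?entity ?label WHERE {
        ?entity rdfs:label ?label .
    } LIMIT 10
""")

results = endpoint.query().convert()
for row in results["results"]["bindings"]:
    print(row["entity"]["value"], "-", row["label"]["value"])
```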
+ +## References + +::: {#refs} +::: diff --git a/submissions/450/index.qmd index a42fbe2..d0e77e4 100755 --- a/submissions/450/index.qmd +++ b/submissions/450/index.qmd @@ -25,67 +25,65 @@ bibliography: references.bib --- ## Introduction -GIS (Geographic Information Systems) have become increasingly valuable in spatial history research since the mid-1990s, and is particularly useful for analyzing socio-spatial dynamics in historical contexts [@kemp_what_2009, p. 16; @gregory_historical_2007, p.1]. My PhD research applies GIS to examine and compare the development of public urban green spaces, namely public parks and playgrounds, in the port cities of Hamburg and Marseille, between post-WWII urban reconstruction and the First Oil Shock in 1973. The management and processing of data concerning green space evolution in GIS allow visualization of when and where parks were created, and how these reflect socio-spatial differentiations. This layering of information offers ways to evaluate historical data and construct arguments, while also helping communicate the project to a wider audience. To critically assess the application of GIS in historical research, I will use the SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis framework. This popular business consultancy approach [@minsky_are_2021] serves here as a structure for systematic reflection on how digital methods can enhance historical research and where caution is needed. The goal is to provoke critical thinking about when using GIS genuinely support research beyond producing impressive visuals, and to explore the balance between close and distant reading of historical data. +GIS (Geographic Information Systems) have become increasingly valuable in spatial history research since the mid-1990s, and are particularly useful for analyzing socio-spatial dynamics in historical contexts [@kemp_what_2009, p. 16; @gregory_historical_2007, p.1]. My PhD research applies GIS to examine and compare the development of public urban green spaces, namely public parks and playgrounds, in the port cities of Hamburg and Marseille, between post-WWII urban reconstruction and the First Oil Shock in 1973. The management and processing of data concerning green space evolution in GIS allow visualization of when and where parks were created, and how these reflect socio-spatial differentiations. This layering of information offers ways to evaluate historical data and construct arguments, while also helping communicate the project to a wider audience. To critically assess the application of GIS in historical research, I will use the SWOT (Strengths, Weaknesses, Opportunities, Threats) analysis framework. This popular business consultancy approach [@minsky_are_2021] serves here as a structure for systematic reflection on how digital methods can enhance historical research and where caution is needed. The goal is to provoke critical thinking about when using GIS genuinely supports research beyond producing impressive visuals, and to explore the balance between close and distant reading of historical data. ## Strengths -GIS are composed of layers of data sets and are mainly used for mapping, georeferencing, and data analysis (e.g. spatial analysis). The data can be varied in what it represents but must be linked to spatial parameters for it to be positioned in a visualization for analysis of spatial information [@wheatley_spatial_2005, pp.1,8]. GIS layers for historical studies can be viewed as models of data sets.
They represent specific topics in time and space simplistically, and spark reflection [@van_ruymbeke_modeliser_2021, p.8]. The screen shot of my QGIS workspace (@fig-1) shows the city of Marseille with parks marked in various stages of planning in the early 1970s. This was the time when longstanding Mayor Gaston Defferre launched the large-scale greening initiative Mille Points Verts pour Marseille [@g_envoi_1971]. Defferre and his team’s goal was to react to a growing ecological awareness and increase the number of green spaces for a more livable city [@prof_j_no_1971]. They also organized events to include and educate citizens and to garner their support for the upcoming elections [@noauthor_semaine_1971]. The majority of parks created within Mille Points Verts remain until today, with only a handful of additional parks added after the mid-1970s. This is visible when the layer with parks and gardens from a 2018 dataset provided by the government of Marseille is selected (@fig-2). The strength of GIS layering is evident when we apply distant reading techniques: skimming over the model we see a display of spatial relations of park distribution and location as well as their connection to time.\ -In order for this visualization to take shape, I produced and assembled data. Specifically, I selected data and went through the process of closely reading my historical sources, learning to understand them and to think through their meaning. In this way maps are social documents. By themselves, they do not reveal anything yet. But by superimposing visualizations, GIS can reveal thinking processes of the data curators and map creators [cf. for example @jones_mapping_2021]. + +GIS are composed of layers of data sets and are mainly used for mapping, georeferencing, and data analysis (e.g. spatial analysis). The data can be varied in what it represents but must be linked to spatial parameters for it to be positioned in a visualization for analysis of spatial information [@wheatley_spatial_2005, pp.1,8]. GIS layers for historical studies can be viewed as models of data sets. They represent specific topics in time and space simplistically, and spark reflection [@van_ruymbeke_modeliser_2021, p.8]. The screen shot of my QGIS workspace (@fig-1) shows the city of Marseille with parks marked in various stages of planning in the early 1970s. This was the time when longstanding Mayor Gaston Defferre launched the large-scale greening initiative Mille Points Verts pour Marseille [@g_envoi_1971]. Defferre and his team’s goal was to react to a growing ecological awareness and increase the number of green spaces for a more livable city [@prof_j_no_1971]. They also organized events to include and educate citizens and to garner their support for the upcoming elections [@noauthor_semaine_1971]. The majority of parks created within Mille Points Verts remain until today, with only a handful of additional parks added after the mid-1970s. This is visible when the layer with parks and gardens from a 2018 dataset provided by the government of Marseille is selected (@fig-2). The strength of GIS layering is evident when we apply distant reading techniques: skimming over the model we see a display of spatial relations of park distribution and location as well as their connection to time.\ +In order for this visualization to take shape, I produced and assembled data. Specifically, I selected data and went through the process of closely reading my historical sources, learning to understand them and to think through their meaning. 
In this way maps are social documents. By themselves, they do not reveal anything yet. But by superimposing visualizations, GIS can reveal the thinking processes of the data curators and map creators [cf. for example @jones_mapping_2021]. ::: {#fig-1} -![Planned parks (1970-71) within the 'Mille Points Verts' initiative.](images/Fig.1.png){#fig-1} +![Planned parks (1970-71) within the 'Mille Points Verts' initiative.](images/Fig.1.png) ::: ::: {#fig-2} -![This image can be deceiving as it seems that many parks do not overlap. Figure 1 shows approximate planning locations. The 2018 state of parks therefore shows the current locations of many of the planned ones from 1970-71.](images/Fig.2.png){#fig-2} +![This image can be deceiving as it seems that many parks do not overlap. Figure 1 shows approximate planning locations. The 2018 state of parks therefore shows the current locations of many of the planned ones from 1970-71.](images/Fig.2.png) ::: - - ## Weaknesses The curation of data, although an important and empowering step for the historian and GIS researcher, also reveals the weaknesses of GIS: the mismatch between GIS requirements (in terms of data structuring and quality) and the imperfection of historical data. GIS software is created for geographers, not historians. Everything in GIS is structured data, and the software therefore cannot handle the ambiguities natural to historical sources. Sources, in whatever form the historian collects them, must first be organized, selected, tabularized, geocoded and/or georeferenced [@kemp_what_2009, p.16-17]. The caveat here is that a historian’s data is hardly ever complete. Missing records shape both what we can and cannot analyze – especially when working with GIS. In text-based historical narration, the researcher can explain gaps and postulate why this may be the case. GIS do not allow for gaps and thus we can only produce models and visualizations with the numerical evidence available.\ The visualization presented here helps to model different states of an object: the park. As my data is not complete, discrepancies between the mapped data and the on-the-ground reality occur, especially since the planned parks had vague names sometimes only matching the name of an entire neighborhood. This raises the question of how to capture temporality. How can the aspect of time appear on a two-dimensional visualization? Rendering the time layer onto the spatial one demands creativity and an awareness that time is something constructed [Massey speaks of “implicit imaginations of time and space” @massey_for_2005, p. 22].\ From the map-making perspective, time significantly impacts the creation process. GIS work is time-consuming and labor-intensive. It involves meticulous manual searching, assembling, and layering of data. However, linking to the overarching topic of this conference, AI may offer new possibilities. Tools such as Transkribus allow users to apply machine learning to filter specific elements from document sets. LLMs can then process this information into CSV files for GIS software. While not yet revolutionary, as these tools evolve, AI could become useful in extracting numerical evidence from textual sources. For geocoding of places, AI would greatly aid efficiency and relieve the researcher of tedious manual work. However, at this point, LLMs such as Claude AI and ChatGPT still hallucinate considerably. - ## Opportunities + AI-assisted data extraction presents a gateway to think about opportunities.
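To show where such AI-assisted extraction could plug into a GIS workflow, here is a minimal sketch under stated assumptions: it presumes a hypothetical CSV of park records (name, planning year, coordinates) extracted from archival documents, and it is not part of the project's actual pipeline:

```python
import pandas as pd
import geopandas as gpd

# Hypothetical CSV extracted from archival sources; column names are illustrative.
parks = pd.read_csv("marseille_parks_1971.csv")  # columns: name, year_planned, lon, lat

# Build a point layer in WGS84 and export it as a GeoPackage that QGIS can load as a layer.
gdf = gpd.GeoDataFrame(
    parks,
    geometry=gpd.points_from_xy(parks["lon"], parks["lat"]),
    crs="EPSG:4326",
)
gdf.to_file("mille_points_verts.gpkg", layer="planned_parks_1971", driver="GPKG")
```

Even with this kind of automation, the geocoded points still require the same source-critical checking as any manually assembled layer.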
Researchers could focus more on experimenting with design and layering by automating time-consuming tasks. For example, mapping supports spatial thinking and perception by integrating the crucial 'where' element: Where are specific features located? How near or far is one place from another? Depending on what obstacles or facilitators are in place a park may be close in measured distance but far in terms of accessibility if there is no bridge or tunnel to e.g. cross a motorway, or water element. Therefore, how are different locations related? How can we perceive and understand distance?\ -The screenshot here shows a handful of ‘wheres’ (@fig-3). They reveal where the majority of parks are located, their proximity to neighborhoods, the types of surrounding communities, and their connections to amenities and public infrastructure. This approach enables comparisons across different scales. For instance, I can compare park distribution between Hamburg and Marseille and track their development over time.\ +The screenshot here shows a handful of ‘wheres’ (@fig-3). They reveal where the majority of parks are located, their proximity to neighborhoods, the types of surrounding communities, and their connections to amenities and public infrastructure. This approach enables comparisons across different scales. For instance, I can compare park distribution between Hamburg and Marseille and track their development over time.\ These questions prompted by the use of GIS direct both user and observer back to the original sources for close reading. A GIS model can spark interest in a topic and motivate the researcher to dig deeper on what these layers mean and how they were created. Ideally GIS should be used as a starting point for in depth analysis. In the case of Marseille and Hamburg, the development of public urban green spaces was what inspired me to look more closely at the historical circumstances. Hamburg, for instance, has a long history of creating expansive green areas with the support of private patrons. Marseille does not have a comparable patronage system. Instead, municipal expropriation rendered private villas and their gardens public.\ GIS are a powerful tool that serve multiple functions in research [@wheatley_spatial_2005, p.8]. They “can play a role in generating ideas and hypotheses at the beginning of a project” and serve as valuable instruments for analysis and evaluation [@brewer_basic_2006, p.S36]. By modeling research hypotheses and findings, e.g. maps can be used to effectively communicate to diverse audiences – from the general public to specialized groups such as urban planners and municipal governments, relevant to my field of historical urban planning research.\ -A particularly compelling aspect of GIS is their ability to visually represent power relations (@fig-4). This feature bridges the gap between historical analysis and contemporary urban planning, making it an invaluable tool in understanding the evolution of urban spaces. The visualization of Marseille reveals that the majority of parks are located towards the center and south of the city and does not necessarily correspond to the population density. The south of Marseille is where villas abound and thus the upper and upper-middle class live. The majority of the HLM (housing at moderate rent) are located towards the north, where living conditions are condensed, and political representation is low. 
What is more, if I select the layers showing where most immigrants and workers live today, a lack of green spaces is visible (@fig-5) (@fig-6) (@fig-7).\ -Connecting this once more to close reading of the sources: when Mille Points Verts was launched, planners scavenged locations for green space creation. The HLM neighborhoods were marked as unsuitable for participation in this program: People living in social housing would “misuse” the parks by playing soccer on them or walking across the grass [@noauthor_amenagement_1970]. This shows complexity of space perception and power imbalance [@van_ruymbeke_modeliser_2021, pp.7]. +A particularly compelling aspect of GIS is their ability to visually represent power relations (@fig-4). This feature bridges the gap between historical analysis and contemporary urban planning, making it an invaluable tool in understanding the evolution of urban spaces. The visualization of Marseille reveals that the majority of parks are located towards the center and south of the city and does not necessarily correspond to the population density. The south of Marseille is where villas abound and thus the upper and upper-middle class live. The majority of the HLM (housing at moderate rent) are located towards the north, where living conditions are condensed, and political representation is low. What is more, if I select the layers showing where most immigrants and workers live today, a lack of green spaces is visible (@fig-5) (@fig-6) (@fig-7).\ +Connecting this once more to close reading of the sources: when Mille Points Verts was launched, planners scavenged locations for green space creation. The HLM neighborhoods were marked as unsuitable for participation in this program: People living in social housing would “misuse” the parks by playing soccer on them or walking across the grass [@noauthor_amenagement_1970]. This shows complexity of space perception and power imbalance [@van_ruymbeke_modeliser_2021, pp.7]. ::: {#fig-3} -![Zoom in on the harbor area where also the "pénétrante nord" is located (built in the late 1960s). Along this main road many parks were planned but not built.](images/Fig.3.png){#fig-3} +![Zoom in on the harbor area where also the "pénétrante nord" is located (built in the late 1960s). Along this main road many parks were planned but not built.](images/Fig.3.png) ::: ::: {#fig-4} -![Population 2012 layer turned on (natural breaks). This visualization presents a caveat: the northern part does not appear densely populated. This is not the case as the northern neighborhoods are very hilly, thus HLM apartment blocks house a large number of people in a small space.](images/Fig.4.png){#fig-4} +![Population 2012 layer turned on (natural breaks). This visualization presents a caveat: the northern part does not appear densely populated. This is not the case as the northern neighborhoods are very hilly, thus HLM apartment blocks house a large number of people in a small space.](images/Fig.4.png) ::: ::: {#fig-5} -![Zoom in on harbor area. Population 2012 (natural breaks).](images/Fig.5.png){#fig-5} +![Zoom in on harbor area. Population 2012 (natural breaks).](images/Fig.5.png) ::: ::: {#fig-6} -![Zoom in on harbor area. Immigrants 2012 (natural breaks).](images/Fig.6.png){#fig-6} +![Zoom in on harbor area. Immigrants 2012 (natural breaks).](images/Fig.6.png) ::: ::: {#fig-7} -![Zoom in on harbor area. Workers 2012 (natural breaks).](images/Fig.7.png){#fig-7} +![Zoom in on harbor area. 
Workers 2012 (natural breaks).](images/Fig.7.png) ::: - - ## Threats Yet all these opportunities are ambiguous and “entrusting machines with the memory of human activity can be frightening”. The last element of the SWOT analysis, threats, rounds off these reflections. Although it is crucial to encourage critical thinking through the mapping of, for example, political representation and wealth distribution of a city, it also shows my personal convictions. I wish to demonstrate which voices were not heard in the planning of these spaces, which people were not considered when decisions were made. I am biased when I start with the premise that there is inequality. The map, objective as it may seem, never is. The book *How to Lie with Maps* provocatively shows the power of maps to create a strong, and perhaps deceiving, narrative:\ *"Map users generally are a trusting lot: they understand the need to distort geometry and suppress features, and they believe the cartographer really does know where to draw the line, figuratively as well as literally. […] Yet cartographers are not licensed, and many mapmakers competent in commercial art or the use of computer workstations have never studied cartography. Map users seldom, if ever, question these authorities, and they often fail to appreciate the map's power as a tool of deliberate falsification or subtle propaganda"* [@monmonier_how_1996, pp.1].\ - -People working with GIS can have all kinds of skill levels and interests. I, for example, am not a GIS specialist and relatively new to using the tool. Still, I can easily manipulate my model to paint various pictures, if I wish to do so. I can turn on different layers and focus on the number of immigrants per neighborhood, I can change the classification for the choropleth map and create entirely different impressions, or I can simply change the basemap and take away the context of terrain, transportation systems, etc. (cf. @fig-4). The quote speaks of an almost blind trust in maps, which shows once more that we must always be critical observers of the things we consume and historians should always want to be curios fact checkers.\ +People working with GIS can have all kinds of skill levels and interests. I, for example, am not a GIS specialist and relatively new to using the tool. Still, I can easily manipulate my model to paint various pictures, if I wish to do so. I can turn on different layers and focus on the number of immigrants per neighborhood, I can change the classification for the choropleth map and create entirely different impressions, or I can simply change the basemap and take away the context of terrain, transportation systems, etc. (cf. @fig-4). The quote speaks of an almost blind trust in maps, which shows once more that we must always be critical observers of the things we consume, and historians should always want to be curious fact-checkers.\ A map is a series of decisions and it reflects the biography of both the maker and the observer. It is the responsibility of the historian working with GIS to be as transparent as possible regarding the choices made to display a historical development or state. It is the responsibility of the observer to use the map as a starting point for close reading, interpretation and analysis rather than the end point and a fact. We must remember “80% of GIS is about transforming, manipulating and managing spatial data”[@jones_lesson_2022].
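To make the classification point above concrete, here is a minimal sketch (plain numpy and pandas, with invented population figures for ten hypothetical neighborhoods) of how the very same attribute values fall into different classes under an equal-interval versus a quantile scheme, the kind of quiet choice that reshapes a choropleth map:

```python
import numpy as np
import pandas as pd

# Invented attribute values, e.g. immigrant population per neighborhood.
values = pd.Series([120, 150, 180, 210, 260, 900, 950, 1020, 2400, 3100])

k = 4  # number of classes shown on the map

# Equal-interval breaks: split the value range into k bands of equal width.
equal_breaks = np.linspace(values.min(), values.max(), k + 1)

# Quantile breaks: put roughly the same number of neighborhoods in each class.
quantile_breaks = values.quantile(np.linspace(0, 1, k + 1)).to_numpy()

classified = pd.DataFrame({
    "value": values,
    "equal_interval": pd.cut(values, bins=equal_breaks, labels=False, include_lowest=True),
    "quantile": pd.cut(values, bins=quantile_breaks, labels=False, include_lowest=True),
})
print(classified)
```

In this toy example, equal intervals lump half the neighborhoods into the lowest class and leave one class empty, while quantiles spread them evenly across all four: two different visual stories told from identical data.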
- ## Conclusion In conclusion the use of GIS in historical research and analysis requires the researcher to stay true to the principles of the craft of the historian: source criticism, the ability to subsume information, create a strong narrative, document the process of source manipulation and proper source citation. As historians we should be aware of the power of storytelling – no matter which medium we use. An audience’s spatial understanding can be enhanced via GIS models, serving as a support system of sorts. All the more reason why GIS in historical analysis must be used and consumed critically and consciously. By embracing this complexity, we can use GIS for historical reflections, enhancing our understanding of spatial and temporal dynamics in historical contexts. +## References +::: {#refs} +::: diff --git a/submissions/452/index.qmd b/submissions/452/index.qmd index d408714..f4aa5e1 100644 --- a/submissions/452/index.qmd +++ b/submissions/452/index.qmd @@ -18,19 +18,14 @@ author: email: wissam.al-kendi@utbm.fr affiliations: - UTBM - keywords: - Demographic history - Industrialization - Handwritten Text Recognition - -abstract: - "The Belpop project aims to reconstruct the demographic behavior of the population of a mushrooming working-class town during industrialization: Belfort. Belfort is a hapax in the French urban landscape of the 19^th^ century, as the demographic growth of its main working- class district far outstripped that of the most dynamic Parisian suburbs. The underlying hypothesis is that the massive Alsatian migration that followed the 1870-71 conflict, and the concomitant industrialization and militarization of the city, profoundly altered the demographic behavior of the people of Belfort. - +abstract: | + The Belpop project aims to reconstruct the demographic behavior of the population of a mushrooming working-class town during industrialization: Belfort. Belfort is a hapax in the French urban landscape of the 19^th^ century, as the demographic growth of its main working- class district far outstripped that of the most dynamic Parisian suburbs. The underlying hypothesis is that the massive Alsatian migration that followed the 1870-71 conflict, and the concomitant industrialization and militarization of the city, profoundly altered the demographic behavior of the people of Belfort. This makes Belfort an ideal place to study the sexualization of social relations in 19^th^-century Europe. These relationships will first be understood through the study of out-of-wedlock births, in their socio-cultural and bio-demographic dimensions. In the long term, this project will also enable to answer many other questions related to event history analysis, a method that is currently undergoing major development, thanks to artificial intelligence (AI), and which is profoundly modifying the questions raised by historical demography and social history. - The contributions of deep learning make it possible to plan a complete analysis of Belfort's birth (ECN) and death (ECD) civil registers (1807-1919), thanks to HTR methods applied to these sources (two interdisciplinary computer science-history theses in progress). This project is part of the SOSI CNRS ObHisPop (Observatoire de l'Histoire de la Population française: grandes bases de données et IA), which federates seven laboratories and aims to share the advances of interdisciplinary research in terms of automating the constitution of databases in historical demography. 
Challenges also include linking (matching individual data) the ECN and ECD databases, and eventually the DMC database (DMC is the city's main employer of women)." - date: 07-22-2024 --- @@ -47,12 +42,12 @@ The Belfort Civil Registers of Birth comprise 39,627 birth declarations inscribe ![A sample image from the ECN dataset with each four components](images/sample%20from%20dataset.png) Studying these invaluable resources is crucial for understanding the expansion of civilizations within Belfort. This necessitates establishing a knowledge database comprising all the information offered by these registers utilizing Artificial intelligence techniques, such as printed and handwritten text recognition models. -The development of these models requires a training dataset to address the challenges imposed by these historical documents, such as text style variation, skewness, and overlapping words and text lines. -Two stages have been carried on to construct the training dataset. First, manual transcription of 1,010 declarations and 984 marginal annotations with a total of 21,939 text lines, 189,976 words, and 1,177,354 characters. This stage involves employing structure tags like XML tags to identify the characteristics of the declarations. -Second, an automatic text line detection method is utilized to extract the text lines within the primary paragraphs and the marginal annotation images in a polygon boundary to preserve the handwritten text. The method is developed based on analyzing the gaps between two consecutive text lines within the images. The detection process initially identifies the core of the text lines regardless of text skewness. Moreover, the process identifies the gaps based on the identified cores. The number of gaps between each two lines is determined based on a predefined parameter value. Each gap is analyzed by examining the density of black pixels within the central third of the gap region. If the black pixel density is low, a segment point is placed at the center. Otherwise, a histogram analysis is performed to identify minimum valleys, which are then used as potential segment points. Finally, all the localized segment points are connected to form the boundary of the text in a polygon shape. +The development of these models requires a training dataset to address the challenges imposed by these historical documents, such as text style variation, skewness, and overlapping words and text lines. +Two stages have been carried on to construct the training dataset. First, manual transcription of 1,010 declarations and 984 marginal annotations with a total of 21,939 text lines, 189,976 words, and 1,177,354 characters. This stage involves employing structure tags like XML tags to identify the characteristics of the declarations. +Second, an automatic text line detection method is utilized to extract the text lines within the primary paragraphs and the marginal annotation images in a polygon boundary to preserve the handwritten text. The method is developed based on analyzing the gaps between two consecutive text lines within the images. The detection process initially identifies the core of the text lines regardless of text skewness. Moreover, the process identifies the gaps based on the identified cores. The number of gaps between each two lines is determined based on a predefined parameter value. Each gap is analyzed by examining the density of black pixels within the central third of the gap region. If the black pixel density is low, a segment point is placed at the center. 
Otherwise, a histogram analysis is performed to identify minimum valleys, which are then used as potential segment points. Finally, all the localized segment points are connected to form the boundary of the text in a polygon shape. The Intersection over Union (IoU), Detection Rate (DR), Recognition Accuracy (RA), and F-Measure (FM) metrics have been employed to provide a comprehensive evaluation of different performance aspects of the method, achieving accuracies of 97.5% IoU, 99% DR, 98% RA, and 98.50% FM for the detection of the text lines within the primary paragraphs. Moreover, the marginal annotations exhibit accuracies of 93.1%, 96%, 94%, and 94.79% across the same metrics, respectively. A structured data tool has been developed for correlating the extracted text line images with their corresponding transcriptions at both the paragraph and text line levels by generating .xml files. These files structure the information within the registers based on the reading order of the components within the document and assign a unique index number for each. Additionally, several essential properties are incorporated within each component block, including the component name, the coordinates within the image, and the corresponding transcribed text. The .xml file generation processes are ongoing to expand the structured declarations to enrich the dataset essential for training artificial intelligence models. Belfort Civil Registers of Death (ECD) are composed of 39,238 death declarations with 18,381 fully handwritten certificates and 20,857 hybrid certificates. This corpus spans from 1807 to 1919. ECDs have the same resolution (300 dpi) and the same structure as the Civil Registers of Birth (ECN). The information given by each declaration is somewhat different: the name, the age, the profession of the deceased, the place of death, and even the profession of the witness can be found. Concerning ECDs, a different strategy was chosen for the text segmentation and the data extraction: the Document Attention Network (DAN). This recently published network is used to remove the pre-segmentation step, which is highly beneficial given the heterogeneity of our dataset. It was developed for the recognition of handwritten datasets such as READ 2016 and RIMES 2009. Moreover, this architecture can focus on relevant parts of the document, improving precision and helping to identify and extract specific segments of interest. The choice was also made because this network is very efficient in handling large volumes of data while maintaining data integrity. -The DAN architecture is made of a Fully Convolutional Network (FCN) encoder to extract feature maps of the input image. This type of network is the most popular approach for pixel-pixel document layout analysis because it maintains spatial hierarchies. Then, a transformer is used as a decoder to predict sequences of variable length. Indeed, the output of this network is a sequence of tokens describing characters of the French language or layout (beginning of paragraph or end of page for instance). These layout tokens or tags were made to structure the layout of a register double page and to unify the ECD and ECN datasets. The ECD training dataset was built by picking around four certificates each year of the full dataset. For the handwritten records (1807-1885) the first two declarations of the double page were annotated and the first four for the hybrid records (1886-1919).
This led to annotating 460 declarations for the first period and 558 declarations for the second one to give a total of 1118 annotated death certificates. We are currently verifying these annotations to start the pre-training phase of the DAN in the coming months. \ No newline at end of file +The DAN architecture is made of a Fully Convolutional Network (FCN) encoder to extract feature maps of the input image. This type of network is the most popular approach for pixel-pixel document layout analysis because it maintains spatial hierarchies. Then, a transformer is used as a decoder to predict sequences of variable length. Indeed, the output of this network is a sequence of tokens describing characters of the French language or layout (beginning of paragraph or end of page for instance). These layout tokens or tags were made to structure the layout of a register double page and to unify the ECD and ECN datasets. The ECD training dataset was built by picking around four certificates each year of the full dataset. For the handwritten records (1807-1885) the first two declarations of the double page were annotated and the first four for the hybrid records (1886-1919). This led to annotating 460 declarations for the first period and 558 declarations for the second one to give a total of 1118 annotated death certificates. We are currently verifying these annotations to start the pre-training phase of the DAN in the coming months. diff --git a/submissions/453/index.qmd b/submissions/453/index.qmd index bb62367..9448ac4 100644 --- a/submissions/453/index.qmd +++ b/submissions/453/index.qmd @@ -34,7 +34,7 @@ On the basis of this analysis, it seems essential to introduce training in digit ## Master’s course in digital methodology for historical research -These considerations stem not only from my work as a CNRS researcher who has spent the last fifteen years building collaborative information systems for research (symogih.org, ontome.net, geovistory.org)[@francesco_beretta_donnees_2024], in line with the vision that, as the DFG White Paper points out, "digital infrastructure is essential for research and must be built for long-term service", but also from ten years of experience in teaching digital methodology at bachelor and master level in history, first at the University of Lyon 3 and for the last four years at the University of Neuchâtel, which currently offers courses in digital methodology in the master's programmes in Historical Sciences and in Regional Heritage and Digital Humanities. +These considerations stem not only from my work as a CNRS researcher who has spent the last fifteen years building collaborative information systems for research \(symogih.org, ontome.net, geovistory.org\)[@francesco_beretta_donnees_2024], in line with the vision that, as the DFG White Paper points out, "digital infrastructure is essential for research and must be built for long-term service", but also from ten years of experience in teaching digital methodology at bachelor and master level in history, first at the University of Lyon 3 and for the last four years at the University of Neuchâtel, which currently offers courses in digital methodology in the master's programmes in Historical Sciences and in Regional Heritage and Digital Humanities. But at this point an essential question arises: what should be taught to history students to help them make the most of the digital transition and build a new paradigm? Looking at recent handbooks, e.g. 
[@antenhofer_digital_2023; @doring_digital_2022; @schuster_routledge_2021], or at educational resources like the [programminghistorian.org](https://programminghistorian.org/en/) project, we can see a huge variety of approaches and areas of application of digital methods, and often the answer to the question depends on the own field of research and experience. In this sense, I will not provide a somewhat abstract review of the literature, and existing courses, but rather share some aspects of my own approach in the hope that they may be of some use or inspiration to others. @@ -64,4 +64,9 @@ At the end of the process, students formulate some possible answers to their res I observed in all these years that if the students invest some time in practising the exercises and follow the learning cycle in this kind of apprenticeship by example during the two semesters, they can achieve amazing results (e.g. [Militant.e.s pour le droits des femmes](https://github.com/AliaBrah/militants_droit_femmes/wiki) and [Fashion Designers](https://github.com/czeacach/fashion_designers/wiki)). But at the same time I have to admit that the learning curve is steep, because in just one year students learn the basics of conceptual modelling, SQL, SPARQL, Python and the essential concepts of various data analysis methods. As well as versioning with GIT and putting data and notebooks online. On the one hand, a certain pedagogical investment is necessary, especially to support students who have less of a natural inclination towards digital technology. On the other hand, the more technical part of this method should be introduced at bachelor level, like GitHub versioning and Python. At the University of Neuchâtel, a brand new minor in Digital Humanities has been introduced in the bachelor's programme, which will enable students who have taken it to benefit more from the master's courses. -As far as the Master's thesis is concerned, it seems that the conceptual modelling and the setting up of a database for the input of information extracted from sources are the most useful, while the venture into collecting data available on the web as a basis for the Master's thesis does not yet seem attractive. However, there are exceptions, as shown by a work using the [Refuge Huguenot database](http://refuge-huguenot.ish-lyon.cnrs.fr/), which I will present in my paper. In conclusion, it seems that at the moment students that take this course can only reach the level of transformative change. But experience shows that it is only with the development of appropriate research infrastructure and the emergence of a wider community of digital disciplinary practices that we will be able to provide students with a context that will allow them to achieve the enabling and substitutive changes, and thus bring about an effective paradigm shift. It is up to the new generations to make this happen. \ No newline at end of file +As far as the Master's thesis is concerned, it seems that the conceptual modelling and the setting up of a database for the input of information extracted from sources are the most useful, while the venture into collecting data available on the web as a basis for the Master's thesis does not yet seem attractive. However, there are exceptions, as shown by a work using the [Refuge Huguenot database](http://refuge-huguenot.ish-lyon.cnrs.fr/), which I will present in my paper. In conclusion, it seems that at the moment students that take this course can only reach the level of transformative change. 
But experience shows that it is only with the development of appropriate research infrastructure and the emergence of a wider community of digital disciplinary practices that we will be able to provide students with a context that will allow them to achieve the enabling and substitutive changes, and thus bring about an effective paradigm shift. It is up to the new generations to make this happen. + +## References + +::: {#refs} +::: diff --git a/submissions/454/index.qmd b/submissions/454/index.qmd index 50d989e..a83956b 100644 --- a/submissions/454/index.qmd +++ b/submissions/454/index.qmd @@ -43,15 +43,13 @@ Our data collection approach is based on supervised semantic segmentation. Semat The trained segmentation model is used to automate the recognition of the remaining cadastral plats for three cadastral series of Lausanne [@melotte_plans_1722;@berney_plan_1831;@deluz_plans_1886], which total 570 plats. In a first step, the sheets were georeferenced and separated from the legend. Second, the semantic segmentation was applied directly on the georeferenced images. Third, the resulting predictions masks were manually reviewed, and the most salient inconsistencies were corrected. Finally, the raster masks were vectorized, as described in [@vaienti_machine-learning-enhanced_2023], resulting in 69,083 extracted geometries. The data collection workload is estimated to 15 workdays for the specific annotation of 42 plats, 5 workdays for the initial correction of the predictions, and 2 workdays (excluding research and development) for the automatic segmentation and vectorisation. By contrast, the estimated workload for manually vectorising the whole corpus would be over 260 days. - ![Illustration of the raster/pixel annotation of the 1886 cadastre of Lausanne by Louis Deluz, using five semantic classes (red: built, dark grey: non-built, light grey: road network, violet: water, white: contours, black: background).](images/segmentation.svg){#fig-1} ## Persistence We adopt two complementary approaches for studying the structure of land ownership. The first is based on the measure of persistence in the parcel fabric. The second, which will be presented later, focuses on the social structure of land ownership, through the analysis of owners. Regarding persistence, two processes are considered: the division of a plot into two or more plots, or the fusion –i.e. the grouping– of two or more plots together. We call continuity the state in which parcel fabric is unchanged. Finally, discontinuity denotes a situation in which the parcel fabric has been otherwise altered by rezoning. -Fusion and division are measured separately. The detection of both processes is based on the spatial matching of the parcels in two distinct historical layers. Only child parcels whose overlap with the parent is at least 50% are considered for division, and vice versa for fusion. For fusion, the measure of shape similarity is based on the computation of the absolute difference of turning functions [@arkin_efficiently_1991] between the child parcels and the parent one, and vice versa. This method, which intuitively focuses on a shape's salient angles, excels at comparing shapes which exhibit inequal level of detail or resolution, which is precisely the case for the extracted geometries, originating from diverse series[^1]. This process allows us to establish maps of land persistence and continuity (@fig-2, @fig-4, and @fig-6), whose results will be presented and discussed in a chronological order. 
The relative code is published along with this article. - +Fusion and division are measured separately. The detection of both processes is based on the spatial matching of the parcels in two distinct historical layers. Only child parcels whose overlap with the parent is at least 50% are considered for division, and vice versa for fusion. For fusion, the measure of shape similarity is based on the computation of the absolute difference of turning functions [@arkin_efficiently_1991] between the child parcels and the parent one, and vice versa. This method, which intuitively focuses on a shape's salient angles, excels at comparing shapes which exhibit inequal level of detail or resolution, which is precisely the case for the extracted geometries, originating from diverse series[^1]. This process allows us to establish maps of land persistence and continuity (@fig-2, @fig-4, and @fig-6), whose results will be presented and discussed in a chronological order. The relative code is published along with this article. ![Dynamics of land plot persistence in Lausanne between 1722 and 1831. The red areas underwent fusion processes, whereas division dynamics are observed in blue areas. Magenta indicates continuity. ](images/persistence_1727_1831_light.svg){#fig-2} @@ -59,28 +57,22 @@ Fusion and division are measured separately. The detection of both processes is ![Expansion of the city outside its historical enclosure between 1722 (top) and 1831 (bottom).](images/center_berney_v2.svg){#fig-3} - - While the changes observed in the first time stratum (1722-1831, @fig-2) might appear somehow disordered at first sight, it is in fact due to the complexity of the dynamics observed. At the beginning of the 18th century, the development of the city was mainly contained within its enclosure, to the exception of the three faubourgs: Halle, Martheray and Chêne[^2]. The demand for space was stimulating the densification of buildings within the enclosure [@rickli_lausanne_1978]. The Rôtillon neighbourhood is a good example of this phenomenon. The vegetable gardens found there, whose yield was probably not very profitable since they were in the shadow of the hill of Bourg, were replaced by small buildings. Facing continued demand for space, the city's medieval fortifications were progressively dismantled. The new buildings and the reorganization of land parcels at that time (@fig-2) met four distinct demands. The first two impacted development in the immediate vicinity of the city (@fig-3). First, infrastructures, such as the chapel and the charity school of Valentin, the casino to the south, and the "maison de force" (prison) to the east. Second, the expansion of productive areas, concentrating mainly along the Flon, downstream (sawmills, mills) and upstream (mills, tanneries) of the town. The two remaining demands impacted not only the close suburbs, but the whole commune. The first was the multiplication of dispersed rural housing, with the construction of several small agricultural estates scattered across the territory. We can also observe the emergence of periurban land estates, such as the most spectacular mansion in Mon Repos, owned by Alexandre Perdonnet (@fig-3), but also the new estate of Bellerive (@fig-2), reconstructed around 1787 by the Francillon family, for example [@grandjean_lausanne_1982]. Farms and estates had contrasting effects on the redrawing of parcels, fueling both fusions and divisions. 
It is noteworthy, however, that their impact is above all localized and dispersed; hence the strong alternation between both dynamics observed in @fig-2. Finally, let us also highlight that many plots to the south-east and south-west of the town show remarkable persistence. These are mainly vineyards, which stability also suggests continuity in the economic and cultural value of these lands. - ![Dynamics of land plot persistence in Lausanne between 1831 and 1883.](images/persistence_1831_1888_light.svg){#fig-4} ![Morcellation of the land in Prélaz (west), as depicted in the 1883 cadastre.](images/prelaz_renove.svg){#fig-5} - This stability contrasts with the dynamics observed in the 19th and 20th centuries (@fig-4). Between 1831 and 1883, the prevailing trend was to extend the city westwards (to Prélaz, @fig-5) and southwards (to Georgette), in the direction of the new railway station. This expansion is taking place precisely at the expense of the vineyards. It is characterized by the division of agricultural plots into smaller subdivisions on which mainly apartment buildings are built. Near Ponthaise[^3], we note a marked fusion trend. This is due to the purchase of fields by the State of Vaud and the Commune for the construction of new training grounds and barracks, to replace the place d'armes in Montbenon, reallocated for the new Federal Court of Justice. It is noteworthy that the plots of land located in the historic city center remain relatively stable, despite the first major works including the construction of the Grand Pont and the vaulting of the Flon and Louve rivers. - ![Dynamics of land plot persistence in Lausanne between 1883 and 2023.](images/persistence_1888_2020_light.svg){#fig-6} ![](images/center_renove.svg) ![Dynamics of parcel fusions in the city center between 1883 (top) and 2023 (bottom). ](images/center_2020.svg){#fig-7} - From 1883 onwards, on the contrary, the layout of the downtown area changed significantly (@fig-6, @fig-7). Following the 1896 publication of the Schnetzler report on health conditions, most of the buildings, deemed dilapidated, and particularly the entire Rôtillon district, were demolished to be replaced by modern housing. The scale of this work can be seen in the impressive dynamics of plot mergers, through which old buildings were being replaced by larger constructions. Urban sprawl, on the other hand, is characterized by a dynamic of morcellation, which no longer impacts only the former vineyards, but in fact all agricultural land, which was sold in allotments to meet the high demand for city. The main exceptions are large infrastructures such as the Milan, Mon Repos, Valency and Bourget public parks, the Montoie cemetery, the Sébeillon train marshalling yard, the Blécherette airport, the CHUV hospital and the Rovéréaz agricultural estate. However, despite the time that separates both cadastres, several plots of land remain, such as the urban houses in Maupas. ## Social structure of land ownership @@ -91,7 +83,6 @@ The aggregated results of that step are described in @fig-8. The first thing tha ![Representative samples of owners (n=60) from the 1831 cadastre, ranked by total area owned (blue, in hectares), along with the number of servants employed in their household (red). 
The occupation of each owner, extracted from the 1832 or 1835 population census, is indicated after their names, in the labels of the horizontal axis.](images/owners_chart.svg){#fig-8} - ## General discussion and Conclusion Spatial matching and the measure of morphological proximity is key to harnessing the wealth of geodata and operationalizing the analysis of plot dynamics. Spatial analysis of the territory reveals complex, protean trends that would be difficult to uncover without a spatial, quantitative and comparative approach based on digitized sources. Although certain manual steps, such as georeferencing, are still required, segmentation and vectorization are instrumental to processing such large cartographic corpora. @@ -243,7 +234,6 @@ def calculate_angles(polygon: Polygon): return turning_function - def compare_polygons(polygon1: Polygon, polygon2: Polygon): '''Compare two polygons using their turning functions''' @@ -474,4 +464,9 @@ cv2.imwrite('persistence_1888_2020_light.png', light_img) # Light mode [^1]: Note that here we use the absolute difference L1 instead of the squared difference L2, which is favored by Arkin. This is intentional, as we believe that shape dissimilarities should be equally weighted in the present use case. [^2]: Here, we use the historical spelling of place names, as given in the sources. [^3]: idem. -[^4]: In comparison, we can easily estimate the distribution of fortune in Switzerland in 2022 by interpolating the aggregated statistics of the Federal Statistical Office using cubic spline, which yields a value of 0.82, still corresponding to a less unequal distribution. \ No newline at end of file +[^4]: In comparison, we can easily estimate the distribution of fortune in Switzerland in 2022 by interpolating the aggregated statistics of the Federal Statistical Office using cubic spline, which yields a value of 0.82, still corresponding to a less unequal distribution. + +## References + +::: {#refs} +::: diff --git a/submissions/455/index.qmd b/submissions/455/index.qmd index 5bf963d..432c9cc 100644 --- a/submissions/455/index.qmd +++ b/submissions/455/index.qmd @@ -62,4 +62,4 @@ At the end of this class, students did find inspiration from these source materi ## Conclusion: Moving Forward -Today, we presented a dataset of transnational nature. Our team has rediscovered and standardized the transnational elements within the Rockefeller Foundation source materials in what we believe are the most meaningful ways. We have also started using this database to teach students database literacy. To conclude this presentation, we invite your feedback and comments on how we can promote this dataset globally, making it meaningful and alive for people specializing in different regions. \ No newline at end of file +Today, we presented a dataset of transnational nature. Our team has rediscovered and standardized the transnational elements within the Rockefeller Foundation source materials in what we believe are the most meaningful ways. We have also started using this database to teach students database literacy. To conclude this presentation, we invite your feedback and comments on how we can promote this dataset globally, making it meaningful and alive for people specializing in different regions. 
diff --git a/submissions/456/index.qmd index 7b2b44b..60bfc9b 100644 --- a/submissions/456/index.qmd +++ b/submissions/456/index.qmd @@ -40,14 +40,14 @@ bibliography: references.bib ## Introduction -Among digital approaches to historical data, we can identify various types of scholarship and methodologies practiced in creating transcription corpora of documents that are valuable for research, specifically quantitative analyses. Demographic, cadastral, and geographic sources are particularly notable for their transcripts, which are instrumental in generating datasets that can be used or reused across different disciplinary contexts. +Among digital approaches to historical data, we can identify various types of scholarship and methodologies practiced in creating transcription corpora of documents that are valuable for research, specifically quantitative analyses. Demographic, cadastral, and geographic sources are particularly notable for their transcripts, which are instrumental in generating datasets that can be used or reused across different disciplinary contexts. From a disciplinary perspective, it is essential to observe certain trends that have fundamentally transformed the methodology of making historical data accessible and studying them. These trends extend the usability of historical data beyond the individual scholar's interpretation to a broader community. The creation of datasets that can then serve different communities of study is increasingly practiced by scholars; the data, before their interpretations, acquire an autonomous value, an important authorial legitimacy. Just as is already currently practiced in the natural sciences, the provision of datasets becomes crucial to history as well. Computational approaches, however, follow trends common to other fields, and the increasing use of artificial intelligence to extract, refine, realign, and understand historical data compels the integration of clear and well-described protocols to inform the datasets that are created. ## Lausanne and Venice Census Datasets -The first example are the IIIF (International Image Interoperability Framework) protocols. These protocols allow the community to employ computational approaches to use these sources as a foundation for refining techniques to extract the embedded information. Open sources are made possible by heritage institutions that recognize the added value of studying their collections with computational approaches, thus transforming them into indispensable objects of research for interdisciplinary and ever-expanding community. +The first example is the set of IIIF (International Image Interoperability Framework) protocols. These protocols allow the community to employ computational approaches to use these sources as a foundation for refining techniques to extract the embedded information. Open sources are made possible by heritage institutions that recognize the added value of studying their collections with computational approaches, thus transforming them into indispensable objects of research for an interdisciplinary and ever-expanding community. Another significant trend is the creation of datasets through the extraction of information contained in sources by diverse communities from various disciplinary fields. These communities might be interested in the computational methodologies, extraction techniques, or the historical content of the extracted data.
The field of Digital Humanities, addressing the latter aspect, has effectively highlighted that each element extracted from historical documents should, whenever possible, be directly referenced to the document from which it was taken. This can be achieved through IIIF-compatible datasets. Moreover, the methods for extracting, processing, and normalizing the data must always be thoroughly explained and are subject to scholarly critique and evaluation. Emphasizing the importance of maintaining the link between extracted data and original documents ensures transparency and accuracy, thus enhancing the credibility and usability for digital historical research. @@ -72,7 +72,7 @@ This means that millions of units of information can be made "searchable," even By applying these principles to historical datasets, we can ensure that the data extracted is not only accessible and traceable, but also continuously refined and validated through collaborative efforts and methodological advances. -In "Names of Lausanne", @petitpierre_1805-1898_2023 automatically extracted 72 census records of Lausanne. The complete dataset covers a century of historical demography in Lausanne (1805-1898), corresponding to 18,831 pages, and nearly 6 million cells (@fig-1). +In "Names of Lausanne", @petitpierre_1805-1898_2023 automatically extracted 72 census records of Lausanne. The complete dataset covers a century of historical demography in Lausanne (1805-1898), corresponding to 18,831 pages, and nearly 6 million cells (@fig-1). ![Sample page from the “Recensements de la ville de Lausanne”; (AVL) Archives de la Ville de Lausanne](images/fig1.png){#fig-1} @@ -92,9 +92,13 @@ For named entity recognition, a "people" dataset was created, linking individual ![Screenshot of the search interface for the city of Venice. In the example, the filter search targeted the surname “Pasqualigo" in the dataset of Napoleonic cadaster (1808); Under the dataset name, the nature of the data extraction is specified.](images/fig4.png){#fig-4} -An initial version of this dataset will be published and will already allow some key aspects of land tenure to be studied, however, not all fields of information have been standardized, which is why we can build on this version later to improve machine reading and produce new corrections. +An initial version of this dataset will be published and will already allow some key aspects of land tenure to be studied, however, not all fields of information have been standardized, which is why we can build on this version later to improve machine reading and produce new corrections. These projects underscore the transformative impact of OCR and HTR technologies, coupled with language models, on the extraction and correction processes of historical documents. The challenge lies in consistently documenting the computational process origins within datasets, ensuring users can evaluate the reliability of the data. As the quality of data extraction and transcription improves, new historical narratives may emerge, emphasizing the critical need to track data versions and correct older datasets to prevent potential inaccuracies. 
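As a small illustration of the IIIF-based linking discussed above, the sketch below assembles a IIIF Image API URL that points back to the exact region of a scanned page from which a single value was transcribed. The server, page identifier, and pixel coordinates are invented placeholders rather than references to the actual Lausanne or Venice collections; only the surname echoes the search example in @fig-4.

```python
def iiif_region_url(server: str, identifier: str,
                    x: int, y: int, w: int, h: int,
                    size: str = "max") -> str:
    """Build a IIIF Image API URL for a rectangular region of a scanned page.

    The pattern is {server}/{identifier}/{region}/{size}/{rotation}/{quality}.{format}.
    """
    region = f"{x},{y},{w},{h}"
    return f"{server}/{identifier}/{region}/{size}/0/default.jpg"

# A transcribed cell can carry its provenance alongside the extracted text,
# so every datum remains checkable against the original image.
record = {
    "surname": "Pasqualigo",                  # value extracted by HTR
    "source_image": iiif_region_url(
        "https://iiif.example.org/iiif/3",    # hypothetical IIIF server
        "cadastre_1808_p0042",                # hypothetical page identifier
        x=512, y=1380, w=640, h=90,           # bounding box of the cell, in pixels
    ),
}
print(record["source_image"])
```

Storing such region URLs next to each extracted value is one way of keeping a dataset IIIF-compatible in the sense described above: the transcription can always be re-examined against the pixels it came from.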
+[^1]: -[^1]: https://github.com/dhlab-epfl/venice-owners-1808 \ No newline at end of file +## References + +::: {#refs} +::: diff --git a/submissions/457/index.qmd b/submissions/457/index.qmd index ca216fb..fdd9e8a 100644 --- a/submissions/457/index.qmd +++ b/submissions/457/index.qmd @@ -71,3 +71,8 @@ The overall goal of the project is to contribute to theory formation in digital ## Acknowledgements {#acknowledgements .unnumbered} This research was supported by the Swiss National Science Foundation (SNSF) under grant no. 105211_204305. + +## References + +::: {#refs} +::: diff --git a/submissions/458/index.qmd b/submissions/458/index.qmd index 5e6231b..ceaffe9 100644 --- a/submissions/458/index.qmd +++ b/submissions/458/index.qmd @@ -39,28 +39,32 @@ In our 2021 paper, “Network of Words: A Co-Occurrence Analysis of Nation-Build We now propose to re-examine the writings of Chen Duxiu (1879-1942), the co-founder of the Chinese Communist Party, by using the innovative Large Language Models (LLMs) LLMA and ChatGPT.[@ChatGPT_1] We intend to leverage ChatGPT to detect changes in Chen’s thinking about the theory and practice of Marxism and Communism. Through embedding and prompt engineering, we plan to extract topic sentences, generate summary statements and estimate topic opinions on these summaries. Our work introduces two key innovations: 1. the utilization of advanced AI methodologies grounded in extensive language models, as opposed to traditional statistical techniques that often relied on oversimplified assumptions, and -2. an extension of our analysis to sentences rather than individual words, allowing for richer contextual understanding. +2. an extension of our analysis to sentences rather than individual words, allowing for richer contextual understanding. -A brief biography of Chen Duxiu is as follows: As a young boy Chen chafed against the traditional preparation to study for the all-important civil service exam. He deplored the ineffectiveness of gaining governmental positions through rote memorization of ancient classics. Surprisingly he placed first in the entry level of this exam. Soon Chen left to study in Japan, where he first encountered Western democratic philosophies such as those of John Stuart Mills, Jean Jacques Rousseau and Montesquieu. He concluded that the only way to save China was to overthrow the dynasty. Returning to China in 1903 he joined an assassination squad and published newspapers rallying his countrymen to fight foreign imperialism and to overthrow the dynasty. One of his journals, called Xin Qingnian 新青年 [New Youth], whose contributors ultimately consisted of some of the most respected and celebrated scholars and intellectuals of the time, made Chen a celebrated public intellectual. Together these authors pushed through the national language reform, denounced the restrictive Confucian ethos, and advocated scientific thinking, democracy, and individual freedom. -With the disappointing outcome from the Treaty of Versailles in 1919, whereby China’s hope to recover German occupied Shandong peninsula was dashed, Chen and many of his colleagues became disillusioned with Western style democracy. Within two years, Chen co-founded the Chinese Communist Party (CCP) with the help of Russian Comintern agents and turned his attention to political activism. Politics proved to be treacherous, however, and Chen was scapegoated for the failure of Comintern policies in China and ousted from the CCP in 1929. 
He was jailed by Chiang Kai-shek, head of the Nationalist Party (GMD), from 1932-1937, and released from prison at the onset of the Resist Japan war. Distancing himself from both the CCP and the GMD, Chen became a political pariah whose writings few dared to publish. Undaunted, he continued to comment on the state of Chinese politics and died in penurious circumstances in 1942. +A brief biography of Chen Duxiu is as follows: As a young boy Chen chafed against the traditional preparation to study for the all-important civil service exam. He deplored the ineffectiveness of gaining governmental positions through rote memorization of ancient classics. Surprisingly he placed first in the entry level of this exam. Soon Chen left to study in Japan, where he first encountered Western democratic philosophies such as those of John Stuart Mills, Jean Jacques Rousseau and Montesquieu. He concluded that the only way to save China was to overthrow the dynasty. Returning to China in 1903 he joined an assassination squad and published newspapers rallying his countrymen to fight foreign imperialism and to overthrow the dynasty. One of his journals, called Xin Qingnian 新青年 [New Youth], whose contributors ultimately consisted of some of the most respected and celebrated scholars and intellectuals of the time, made Chen a celebrated public intellectual. Together these authors pushed through the national language reform, denounced the restrictive Confucian ethos, and advocated scientific thinking, democracy, and individual freedom. +With the disappointing outcome from the Treaty of Versailles in 1919, whereby China’s hope to recover German occupied Shandong peninsula was dashed, Chen and many of his colleagues became disillusioned with Western style democracy. Within two years, Chen co-founded the Chinese Communist Party (CCP) with the help of Russian Comintern agents and turned his attention to political activism. Politics proved to be treacherous, however, and Chen was scapegoated for the failure of Comintern policies in China and ousted from the CCP in 1929. He was jailed by Chiang Kai-shek, head of the Nationalist Party (GMD), from 1932-1937, and released from prison at the onset of the Resist Japan war. Distancing himself from both the CCP and the GMD, Chen became a political pariah whose writings few dared to publish. Undaunted, he continued to comment on the state of Chinese politics and died in penurious circumstances in 1942. -What were Chen’s final views on Western democracy, capitalism, and communism? In his youthful optimism, he declared France to have gifted humanity with three powerful concepts: human rights, evolutionary theory, and socialism.[@socialism_2] Caught in the power struggle between Stalin and Trotsky, Chen lost his political leadership. Tragically he also lost two of his sons at the hands of the GMD. Did the vicissitudes of life affect his thinking? We turn to his essays to find out. +What were Chen’s final views on Western democracy, capitalism, and communism? In his youthful optimism, he declared France to have gifted humanity with three powerful concepts: human rights, evolutionary theory, and socialism.[@socialism_2] Caught in the power struggle between Stalin and Trotsky, Chen lost his political leadership. Tragically he also lost two of his sons at the hands of the GMD. Did the vicissitudes of life affect his thinking? We turn to his essays to find out. 
## Methodology -The central objective is to compare the efficacy of our earlier co-occurrence network methodology with the novel ChatGPT approach. We seek the best method to detect Chen’s ideological evolution over time. We used the corpus of Chen’s writing, consisting of 892 articles and 1,347,699 characters.[@character_3] From this collection we selected fifteen articles that were salient in his thoughts on political theory, written in the years 1914-1940.[@1930_4] +The central objective is to compare the efficacy of our earlier co-occurrence network methodology with the novel ChatGPT approach. We seek the best method to detect Chen’s ideological evolution over time. We used the corpus of Chen’s writing, consisting of 892 articles and 1,347,699 characters.[@character_3] From this collection we selected fifteen articles that were salient in his thoughts on political theory, written in the years 1914-1940.[@1930_4] The analysis was performed using various Python libraries and tools. Transformers were used for loading pre-trained language models, FAISS for efficient similarity search and clustering, Jieba for Chinese text tokenization, and Docx for handling Word documents. Numpy, Pandas, and Sklearn were used for numerical operations and data handling, and Langchain was used for managing document schemas and prompts. ## Preprocessing and Text Tokenization + The Chinese text was tokenized using Jieba, which segmented the text into meaningful tokens. Sentence Detection: A custom function called “chinese_sentence_detector “was implemented to detect sentences based on punctuation marks specific to Chinese (e.g., "。", "!", "?"). ## Embedding and Similarity Analysis -We loaded the Colossal-LLaMA-2-7b-base model. Sentences from the documents were embedded using the loaded model. The embeddings were stored in a NumPy array. Next, we computed pair-wise cosine similarity scores between the embeddings of all sentences. + +We loaded the Colossal-LLaMA-2-7b-base model. Sentences from the documents were embedded using the loaded model. The embeddings were stored in a NumPy array. Next, we computed pair-wise cosine similarity scores between the embeddings of all sentences. ## Query Processing + The question: “Is this article in favor of communism or capitalism?” [这篇文章是支持共产主义还是支持资本主义?"] was embedded using the same model and tokenizer. The FAISS index was used to search for the top 5 sentences most similar to the query. The indices and distances of the closest matches were retrieved. ## Output Generation + The sentences retrieved from the similarity search provided the context for answering the query. The following steps illustrate the workflow for a specific set of documents: @@ -73,38 +77,52 @@ The following steps illustrate the workflow for a specific set of documents: For the second part of the analysis, we employed a combination of natural language processing (NLP) techniques and the GPT-4 API to classify political opinions in textual documents. The primary objective was to determine whether the texts supported communism or capitalism. We used Python for automating the analysis process, integrating various tools and libraries to achieve this. ## Question Design + We formulated a question to extract the political sentence from the text. The question posed to the GPT-4 model was: "Is this article in favor of communism or capitalism?” [这篇文章是支持共产主义还是支持资本主义?"] ## API Request + The openai.ChatCompletion.create method was utilized to interact with the GPT-4 API. 
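A minimal sketch of what such a call could look like with the legacy openai Python client is given below; the model identifier, message layout, and exact wording are illustrative assumptions rather than the study's verbatim prompt.

```python
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# Full text (or the retrieved top-5 sentences) of one of Chen's articles.
document_text = "..."

response = openai.ChatCompletion.create(
    model="gpt-4",  # assumed model identifier
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": document_text},
        {"role": "user", "content": (
            "这篇文章是支持共产主义还是支持资本主义? "
            "Give a score from 0 (strong support for communism) to 4 "
            "(strong support for capitalism) and explain your reasoning "
            "in 200 to 300 words."
        )},
    ],
)

answer = response["choices"][0]["message"]["content"]
print(answer)  # contains the numeric score and the explanation
```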
We constructed a prompt with the following messages: -* System Message: Provided context to the model, indicating that it was functioning as a helpful assistant. +* System Message: Provided context to the model, indicating that it was functioning as a helpful assistant. * User Messages: Included the embedded document content, followed by instructions for providing a score on a scale from 0 to 4 (where 0 indicates strong support for communism and 4 indicates strong support for capitalism) and a detailed explanation within 200 to 300 words. ## Response Extraction + The response from the API was parsed to extract the generated content, which included the classification score and explanatory text. In summary, our method leveraged the GPT-4 API to analyze and classify political opinions in text documents. By embedding the documents and querying the model with specific questions and instructions, we automated the classification process and systematically stored the results for subsequent review. ## Results + The summaries of the fifteen articles produced by Llama were perfunctory and inconclusive. At times Llama changed the order of sentences in an effort to answer the query, but it often generated a text that was more a random collection of sentences than a cohesive answer to the question. ChatGPT4 produced a much more systematic and accurate summary for each article. However, we detected five articles that variously missed the intent of the essay, force-fitted the answer, or gave a wrong numerical coding. Of the remaining ten articles, ChatGPT4 best answered the question “Is this article in support of communism or capitalism?” when the essay consistently argued for one point of view, such as in “Shehui zhuyi piping” [A critique of socialism] and “Makesi xueshuo” [A study of Marxism]. Where ChatGPT4 ran into problems was when the writing was nuanced and contained multiple points of view. Because of its effort to answer the question, it sometimes force-fitted the answer by implying connections that did not exist. We briefly describe these five issues below. ### Missing the context and the main argument of the article + In his 1914 article, “Aiguo xin yu zijue xin” [Patriotism and self-awareness], Chen famously shocked his readers by concluding that a nation which did not care for its people should not expect support from its citizens, and that such a nation should be allowed to die. Lacking such context, ChatGPT4 picked up on Chen’s description of an ideal nation but did not underscore the novelty of his perspective. It coded the article as -1 (mildly in support of communism). The article never mentioned communism, and at the same time, the command was to rate articles from 0 to 4, not -2 to 2. ### Trying too hard to answer the question + In the 1915 article, “Fa lanxi ren yu jinshi wenming” [The French and contemporary civilization], Chen praised socialism and mentioned French socialists such as Henri de Saint-Simon, Gracchus Babeuf and Charles Fourier. ChatGPT4 concluded that Chen was in favor of socialism but not rejecting capitalism. It erroneously stated that socialism was the same as communism, a term which did not appear in the text. ### Inaccurate coding + In the 1916 article, “Tan zhengzhi” [Talk about politics], Chen strongly attacked the bourgeois exploitation of the working class, which was accurately picked up by ChatGPT4. However, it gave the article a +1 code, meaning it was somewhat in favor of capitalism. This is surprising since it concluded that the article was in favor of communism.
Similarly, in the 1924 article, “Women de huita” [Our response], Chen was defending the reason why the CCP was part of the GMD (an ill-fated decision ordered by Stalin), and strongly argued for class struggle between the bourgeoisie and the proletariat. ChatGPT4 recognized this interpretation but surprisingly rated the article as +1 (mildly in support of capitalism). ### Missed by a wide margin + Historians often cited the conflicting tone of “Gei Xiliu de xin” [A letter to Xiliu], one of Chen’s last essays, as indicative of his ambivalence about communism. He railed against Stalin’s dictatorship, the lack of freedoms of speech, assembly, publishing, the right to strike, and the right to an opposition party in Russia. He wrote that in his opinion, bourgeois democracy and proletarian democracy are similar in nature but differ only as a matter of degrees.[@degrees_5] Bourgeois democracy, Chen explained, was the culmination of a struggle by millions of people over six hundred years, one that gifted the world with the three big principles of science, democratic system and socialism. “We must recognize,” he stated, “that the incomplete democracy of England, France and the United States deserves to be protected.” Chen created a table comparing the freedoms people enjoyed in a capitalist democracy versus those under Stalin and Hitler, and asked, “Would any Communist dare curse bourgeois democracy?” ChatGPT4 picked up the anti-fascist sentiment but missed Chen’s ambivalence toward bourgeois democracy. Instead, it described the article as Chen’s way to use the struggle between fascist and Western democratic countries to advance the causes of the proletariat. We found this conclusion to be off the mark and an attempt to force the conclusion. ## Conclusion -ChatGPT 4 is able to detect strong and consistent sentiments in articles, but would force fit its answer and rationalize connections that did not exist. It does less well when multiple viewpoints are expressed, and it does not know the historical context. Llamas is unable to perform the analysis at all. + +ChatGPT 4 is able to detect strong and consistent sentiments in articles, but would force-fit its answer and rationalize connections that did not exist. It does less well when multiple viewpoints are expressed, and it does not know the historical context. Llama is unable to perform the analysis at all. In conclusion, the use of ChatGPT 4 for analyzing historical texts presents both significant advantages and notable challenges. ChatGPT 4 has demonstrated its capability to detect strong and consistent sentiments in articles, providing systematic and accurate summaries in many cases. However, it occasionally forces connections that do not exist, particularly in texts with nuanced or multiple viewpoints. This limitation highlights the importance of historical context, which the model sometimes misses, leading to misinterpretations of the larger point. Conversely, the Llama 2-7b model was unable to perform the analysis adequately, emphasizing that there may be a need for larger models. Our study underscores the need for continuous refinement in leveraging AI methodologies for historical text analysis. Future work should focus on enhancing the model’s contextual understanding and its ability to handle complex, multi-faceted arguments, thereby improving its overall efficacy in historical and ideological analysis. -Through our analysis, we gained deeper insights into the evolution of Chen Duxiu's thoughts.
In his early writings, Chen was a fervent supporter of Western democratic ideals. However, his disillusionment with Western democracy, especially after the Treaty of Versailles, led him to co-found the Chinese Communist Party and embrace Marxist ideologies. Despite this shift, Chen's later writings revealed a nuanced perspective. He criticized Stalin's dictatorship and acknowledged the merits of bourgeois democracy, indicating an ambivalence toward both Western and Communist ideologies. This complexity in Chen's thought underscores the importance of a refined approach in analyzing historical texts, which current AI models are still striving to fully achieve. \ No newline at end of file +Through our analysis, we gained deeper insights into the evolution of Chen Duxiu's thoughts. In his early writings, Chen was a fervent supporter of Western democratic ideals. However, his disillusionment with Western democracy, especially after the Treaty of Versailles, led him to co-found the Chinese Communist Party and embrace Marxist ideologies. Despite this shift, Chen's later writings revealed a nuanced perspective. He criticized Stalin's dictatorship and acknowledged the merits of bourgeois democracy, indicating an ambivalence toward both Western and Communist ideologies. This complexity in Chen's thought underscores the importance of a refined approach in analyzing historical texts, which current AI models are still striving to fully achieve. + +## References + +::: {#refs} +::: diff --git a/submissions/459/index.qmd b/submissions/459/index.qmd index 4934638..888a895 100644 --- a/submissions/459/index.qmd +++ b/submissions/459/index.qmd @@ -12,19 +12,15 @@ author: orcid: 0000-0002-0905-2056 email: johanna.schuepbach@unibas.ch affiliations: - - University of Basel - + - University of Basel keywords: - Data Literacy - Academic Libraries - Digital Humanities - Experience Report - abstract: | Libraries are finding their place in the field of data literacy and the opportunities as well as challenges of supporting students and researchers in the field of Digital Humanities. Key aspects of this development are research data management, repositories, libraries as suppliers of data sets, digitisation and more. Over the past few years, the library has undertaken steps to actively bring itself into teaching and facilitate the basics of working with digital sources. The talk shares three experience reports of such endeavours undertaken by subject librarians of the Digital Humanities Work Group (AG DH) at the University Library Basel (UB). - date: 07-26-2024 - --- ## Introduction @@ -34,8 +30,8 @@ As of today, there have been three distinct formats in which the AG DH has intro ## Research Seminar/Semester Course -To this end, the AG DH organised a semester course in close collaboration with Prof. Dr. phil. Erik Petry with whom they have created and then co-taught a curriculum introducing various DH tools and methods to be tried out using the UB's holdings on the topic of the first Zionist Congresses in Basel. The course was attended by MA students from the subjects of History, Jewish Studies and Digital Humanities. This research seminar was designed to provide an introduction to digital methods. -We have divided our course into different phases. The first introduction to work organisation, data management and data literacy was followed by sessions that combined the basics of the topic and introductions to digital methods. 
We focussed on different forms of sources: images, maps and text, with one session being dedicated to each type. This meant we could offer introductions to a broad spectrum of DH tools and methods such as digital storytelling and IIIF, geomapping and working with GIS, and transcription, text analysis and topic modelling. As a transition to the third phase of the project, we organised a session in which we presented various sources either from the University Library or from other institutions – the Basel-Stadt State Archives and the Jewish Museum Switzerland. The overall aim of the course was to enable students to apply their knowledge directly. To this end, they developed small projects in which they researched source material using digital methods and were able to visualise the results of their work. In the third phase of the course, students were given time to work on their own projects. In a block event at the end of the semester, the groups presented their projects and the status of their work. We were able to see for ourselves the students’ exciting approaches and good realisations. +To this end, the AG DH organised a semester course in close collaboration with Prof. Dr. phil. Erik Petry with whom they have created and then co-taught a curriculum introducing various DH tools and methods to be tried out using the UB's holdings on the topic of the first Zionist Congresses in Basel. The course was attended by MA students from the subjects of History, Jewish Studies and Digital Humanities. This research seminar was designed to provide an introduction to digital methods. +We have divided our course into different phases. The first introduction to work organisation, data management and data literacy was followed by sessions that combined the basics of the topic and introductions to digital methods. We focussed on different forms of sources: images, maps and text, with one session being dedicated to each type. This meant we could offer introductions to a broad spectrum of DH tools and methods such as digital storytelling and IIIF, geomapping and working with GIS, and transcription, text analysis and topic modelling. As a transition to the third phase of the project, we organised a session in which we presented various sources either from the University Library or from other institutions – the Basel-Stadt State Archives and the Jewish Museum Switzerland. The overall aim of the course was to enable students to apply their knowledge directly. To this end, they developed small projects in which they researched source material using digital methods and were able to visualise the results of their work. In the third phase of the course, students were given time to work on their own projects. In a block event at the end of the semester, the groups presented their projects and the status of their work. We were able to see for ourselves the students’ exciting approaches and good realisations. The course was also a good experience for us subject librarians. Above all, we benefited from the broad knowledge in our team as well as the opportunity to gain new insights and experiences in select areas of DH. We particularly appreciated the good collaboration with Prof. Dr. Petry, who treated us as equal partners and experts. Despite the positive experience, this format is not sustainable: The effort involved in creating an entire semester course exceeds the resources available to regularly offer similar semester courses. 
Nevertheless, for this pilot project of the AG DH, the effort was justified because the course allowed us to make our holdings visible and they were researched. ## Data Literacy – a Session Within an Existing IDM Semester Course @@ -60,7 +56,7 @@ While this session was also very dense, content wise, by hosting it at the UB an ## Conclusion -These three different formats highlight some of the chances but also challenges the AG DH faces with regards to their work on with and for students and researchers, and the experiences and feedback from these different formats throw an important light on the role of the UB in the task of teaching skills in this field. +These three different formats highlight some of the chances but also challenges the AG DH faces with regards to their work on with and for students and researchers, and the experiences and feedback from these different formats throw an important light on the role of the UB in the task of teaching skills in this field. Generally it can be said that it needs an active involvement from and by the AG DH to get into the teaching spaces. Either through directly talking with professors/teaching staff and offering to collaborate with them in contributing to their planned classes or by getting involved in existing course formats like the IDM semester courses. It can thus be shown that libraries play a key role in imparting knowledge and skills as well as guardians of cultural property in their function as reliable and long-lasting institutions. We also want to highlight aspects that can still be improved. Above all, this concerns the awareness and attractiveness of such services as well as cooperation with researchers and teachers from all subject areas that work digitally, and history in particular. The questions that drive the AG DH are many and varied: What are the needs of researchers and students? What do you need from your university library? Where do you see the possibility for the library to support and raise awareness with working with historical documents? diff --git a/submissions/460/index.qmd b/submissions/460/index.qmd index cfbc83e..7e6bac5 100644 --- a/submissions/460/index.qmd +++ b/submissions/460/index.qmd @@ -23,9 +23,11 @@ bibliography: references.bib --- ## Introduction + As part of the author’s PhD project, ‘Glass and its makers in Estonia, c. 1550–1950: an archaeological study,’ the genealogical data about 1,248 migrant glassworkers and their family members working in Estonia from the 16th–19th century were collected using archival records and newspapers. The goal was to use information about key life events to trace the life histories of the glassworkers and their families from childhood to old age to gain an understanding of the community and the industry through one of its most important aspects – the workforce. It was hoped that the data will also assist in identifying the locations and names of glassworks during the period under study. In this paper, the author reflects on the process of this documentary archaeology research. The data collection, storage, and visualisation process are described, followed by the results of the study which have been included in a doctoral dissertation [@mythesis] and a research article [@reppo2023d]. ## Data collection + The aim of this part of the PhD project was to collate, visualise, and publish data on the key life events of migrant glassworkers in post-medieval Estonia. Information on 1,248 individuals was obtained who are mostly of German origin. 
This list is in no way complete but provides information about workers and their family members connected with the glass industry from the 16th century until the 1840s–1860s when the reliance on foreign workers started to lessen due to the abolishment of serfdom in Estonia which allowed locals access to skilled professions previously inaccessible to them [@mythesis, pp. 52]. The data were collected, tabulated, and made Open Access via DataDOI [@datadoi] as a raw dataset [@reppo2023b]. The following life events were considered – birth, baptism, marriage(s), and death. Both the date and place were included where possible to identify migration routes to and within Estonia. With baptisms, the number of godparents as well as names of all the godparents in the order listed in the church records were included. In total, the dataset has 1,249 rows and 22 columns. But how to find, access, and organise data about more than 1,200 individuals at this scale? @@ -35,7 +37,8 @@ In addition to previously published sources and some additional archival informa As the dissertation and most of the articles connected to this thesis were written in English, all collected data was translated into English. For many of the entries in the dataset, the place name in the original source was in German. The currently used name is given first with the German version in brackets, for example, ‘Latvia, Suntaži (Sunzel).’ For Estonian place names, the German version is mostly not given but can be found in the Dictionary of Estonian Place Names (KNR; @knr). For the workers’ profession, the translated version is given first with the title from the original source, for example, ‘Hollow glass maker (Hohlgläser).’ For surnames, there is some change from German to Russian to Estonian and from one church warden to another. The most common variations of a surname are given in brackets – for example, ‘Kilias (Kihlgas).’ This translation is not included for the glassworks as all used names and other details such as coordinates, operation dates, owners, and so on are given in another dataset [@reppo2023c]. ## National Archives of Estonia -From the NAE, data were collected by identifying records using the Archival Information System [@ais], the name register for the Lutheran congregations [@luterikpn], and Saaga [@saaga]. With AIS and Saaga, it was possible to find references to records only available as paper copies at the NAE reading rooms in Tartu and Tallinn but also access digitised records, most of which were church books. NAE has estimated that around 34 million images of their physical records have been made available online which is roughly 5% of their collection [@rahvusarhiiv]. NAE adopted Transkribus, an AI-powered platform developed to transcribe and recognise historical handwritten documents and text in October 2022 [@ratranskribus] but a limited number of records are searchable through this feature at present. + +From the NAE, data were collected by identifying records using the Archival Information System [@ais], the name register for the Lutheran congregations [@luterikpn], and Saaga [@saaga]. With AIS and Saaga, it was possible to find references to records only available as paper copies at the NAE reading rooms in Tartu and Tallinn but also access digitised records, most of which were church books. NAE has estimated that around 34 million images of their physical records have been made available online which is roughly 5% of their collection [@rahvusarhiiv].
NAE adopted Transkribus, an AI-powered platform developed to transcribe and recognise historical handwritten documents and text, in October 2022 [@ratranskribus], but a limited number of records are searchable through this feature at present. Unfortunately, none of the records related to the glassworkers’ life events under consideration in this study have been added yet. To test the employability of Transkribus as a non-expert user, a handful of 17th-century documents in Swedish were run through Transkribus [@transkribus] by the author to identify the location of a glassworks in Pärnu, Estonia. These records did not yield the results that were hoped for but using Transkribus did speed up the process, even if the transcribed text needed corrections. @@ -46,17 +49,26 @@ The indexes mentioned above are based on these records and list the last name wi Although the crowdsourced indexes allowed identifying the records which included the glassworkers, and most of these were indeed digitised, the use of records from NAE during this study was certainly affected by the need to use traditional research methods to retrieve the information. Thus, thousands of pages of church books were combed through to compile the raw dataset after identifying the parishes with the highest number of glassworks. With further help from transcription services, the process of collecting basic data about key life events of the glassworkers and their family members could be streamlined further. Whilst some 17th-century records were uploaded to Transkribus for transcription to speed up the process of collecting very straightforward data for the individuals – dates and locations of key life events – future studies would certainly be facilitated by the built-in Transkribus engine on NAE. ## National Library of Estonia + Further information about the glassworkers and their family members was collected from the Digital archive of Estonian newspapers [@digar] which is managed by the NLE. The newspapers available via this database were published from 1811 onwards, with some earlier exceptions. Unlike NAE, this collection employs Optical Character Recognition (OCR). The use of OCR for these records did significantly speed up the process of research. There were obviously errors, for example where OCR was unable to detect the layout of the text or where the print ink had bled. The database allows corrections from users. As the author of this study did correct the errors in recognised characters in the sources used for this study, future searches for other researchers should be less error-prone. ## Publication + Publication of raw datasets in Estonian archaeology is a new phenomenon and has been particularly rare for material culture studies which this study was a part of [@mythesis, pp. 38]. In addition to adhering to FAIR principles, the publication of this dataset is tied to an unusual situation – the author is the only archaeologist in Estonia studying post-medieval glass. In fact, three large datasets were published as part of this dissertation – one on archaeological finds [@reppo2023a], another on the workers [@reppo2023b], and a third one on the glassworks themselves [@reppo2023c] – to avoid a research monopoly and encourage other researchers to study the post-medieval glass industry in Estonia. The raw dataset was published Open Access under a CC-BY 4.0 licence via DataDOI, a free data repository managed by the University of Tartu library, which provides the dataset with a persistent interoperable identifier.
As mentioned above, the dataset of life events is tabulated and has 22 columns and 1,249 lines. It is accompanied by a metadata file which includes details on the project, the references, and other information relevant to the raw data. ## Visualisation + One of the goals of this study was to visualise the data to provide easily legible images (charts, models, drawings) which encompass the entirety of the collected data. The data were visualised using Gephi, an open-source visualisation program, by extracting the raw data using pivot tables in Microsoft Excel and wrangling the data to remove unnecessary details and columns. This proved that the data is mutable and suitable for network analysis. For Gephi, this data needed to be sorted into nodes and edges, which allows visualising the connections between several points of data by means of lines. After cleaning the data, the format was transformed from a Microsoft Excel table (.XLSX) to a .CSV file to run the model. In the model, the node (point) size is representative of the number of connections to the place or family. Glassworks are differentiated from birth, marriage, and death locations by the ‘GW’ (glassworks) in the name. In this model, marriages between families and the connections of those families to places are plotted based on their places of origin, birth, baptism, marriage, and death. With further data wrangling it would be possible to show the connections of the glassworkers and their family members within the larger community beyond marriages by analysing the connections of those individuals who appear as godparents. ## Results + This study explored the network of connections between 1,248 migrant glassworkers and their family members working in Estonia from the 16th–19th century, using Transkribus, OCR, and Gephi as the main tools. A complete list of workers during this period was not the goal of this study. The raw dataset was published via DataDOI, an Open Access repository managed by the University of Tartu library, in accordance with FAIR principles. The data shows that key factors in building and maintaining the glass community were godparenting and marriages between the families. In addition to tracing migration to, within, and from Estonia, the data also allowed identifying the makers of some archaeological glass artefacts and the locations and names of glassworks. + +## References + +::: {#refs} +::: diff --git a/submissions/462/index.qmd index c905885..fb8c168 100644 --- a/submissions/462/index.qmd +++ b/submissions/462/index.qmd @@ -46,9 +46,9 @@ When combining different sources, we must carefully select and interpret results After a short introduction on methodology, the following research questions will be addressed in this extended abstract: -* What can we discover about the interest burden on real estate? -* Did a higher burden of interest lead to an increased number of seizure procedures? -* Who made use of seizure procedures and how does this use relate to interest claims? +* What can we discover about the interest burden on real estate? +* Did a higher burden of interest lead to an increased number of seizure procedures? +* Who made use of seizure procedures and how does this use relate to interest claims? ## Information Extraction @@ -82,7 +82,7 @@ As our annotated dataset is very small, the extraction performance for many clas ### Measuring interest burden on real estate The obvious indicators of the amount of interest a property is burdened with are descriptions of interests.
We may find these in different places: In the form of lists kept by different institutions which received interest, as part of the description of properties when they are mentioned in the documents, most often when a property is sold, and whenever an annuity is established. -The lists of interest payments are problematic to use as they take the perspective of the beneficiary of the interest and do not mention if that property is also burdened with interest by other institutions or persons. The documents tracking lending of annuities would work, but their recognition by our automated system is still lacking. The descriptions of properties are thus a good choice, they are well detected by our automated annotation and contain information about all beneficiaries. +The lists of interest payments are problematic to use as they take the perspective of the beneficiary of the interest and do not mention if that property is also burdened with interest by other institutions or persons. The documents tracking lending of annuities would work, but their recognition by our automated system is still lacking. The descriptions of properties are thus a good choice, they are well detected by our automated annotation and contain information about all beneficiaries. ```{python} #| label: fig-3 @@ -193,7 +193,6 @@ plt.savefig("./images/fig_5.png") plt.show() ``` - Since seizure procedures were linked to interests, one could expect that interest descriptions would be more frequent in the years before the event. In fact, when comparing the 10 years before a seizure with all documents of the same decade, the relative number of entities mentioned in interest descriptions shows no difference. Thus, owing interests in itself was no reason for increased numbers of seizures. When looking at the number of entities mentioned in each interest description, one sees that in certain periods of time, this proportion was much higher for documents leading to seizure events than in the average document (see @fig-5). This is the case mainly for the time period between 1480 and 1540. In the next part, we contrast the entities in the interest descriptions to the entities taking part in seizures as claimants. @@ -223,7 +222,6 @@ plt.grid(True) plt.show() ``` - ## Conclusions The relationship between interest burden and seizures proves to be investigable for the first time with our data. However, it is a complex relationship. The probability of a seizure increases with rising interest burden only during certain periods when seizures were infrequent. The many institutional creditors are reflected in the interest descriptions but not in the number of seizures. Further investigations need to determine whether institutions were generally more patient, whether this was due to lower interest burden, or whether our models fail to adequately capture institutions. This requires unambiguous identification of institutions, which still needs to be carried out. @@ -251,3 +249,8 @@ On a methodological level, it becomes clear that successful recognition of entit | , | O | O | | sonstfrei| O | O | | [DESC] | O | O | + +## References + +::: {#refs} +::: diff --git a/submissions/464/index.qmd b/submissions/464/index.qmd index 61e6896..8d1ebad 100644 --- a/submissions/464/index.qmd +++ b/submissions/464/index.qmd @@ -63,3 +63,8 @@ Existing and developing controlled vocabularies and gazetteers for locations [e. 
## Conclusion This paper presents work in progress from a section of an ongoing dissertation project, discussing preliminary findings and continued research angles. There are further considerations to be made regarding the funding, available infrastructure and size of institutions, the priority-setting of collections, and institutional guidelines that determine if, why and how a collection is digitized. Marked changes can already be seen in the approach of institutions towards their online collections, in the direction of the tenets of the FAIR principles, making their objects more findable (DOIs, consistently maintained platforms), accessible (digitization, machine-readable metadata), interoperable (standards, nomenclature, vocabularies, IIIF) and even reusable (open licensing of images and metadata). This is also seen in how institutions are becoming increasingly sensitized to a broader audience engaging with their collections, with explicit use of the FAIR and, in some cases, the CARE principles in their online collection presentation and the steps taken to get there [@Carroll.2021]. + +## References + +::: {#refs} +::: diff --git a/submissions/465/index.qmd index 5cfa09d..b1f2e60 100644 --- a/submissions/465/index.qmd +++ b/submissions/465/index.qmd @@ -26,11 +26,13 @@ bibliography: references.bib --- ## Introduction + Over the last few years, Machine Learning applications have become more and more popular in the humanities and social sciences in general, and therefore also in history. Handwritten Text Recognition (HTR) and various tasks of Natural Language Processing (NLP) are now commonly employed in a plethora of research projects of various sizes. Even for PhD projects it is now feasible to research large corpora like serial legal sources, which would not be possible entirely by hand. This acceleration of research processes implies fundamental changes to how we think about sources, data, research and workflows. In history, Machine Learning systems are typically used to speed up the production of research data. As the output of these applications is never entirely accurate or correct, this raises the question of how historians can use machine-generated data together with manually created data without propagating errors and uncertainties to downstream tasks and investigations. ## Facticity + The question of the combined usability of machine-generated and manually generated data is also a question of the reliability or facticity of data. Data generated by humans are not necessarily complete and correct either, as they are a product of human perception. For example, creating transcriptions depends on the respective transcription guidelines and individual text understanding, which can lead to errors. However, we consider transcriptions by experts as correct and use them for historical research. This issue is even more evident in the field of editions. Even very old editions with methodological challenges are valued for their core content. Errors may exist, but they are largely accepted due to the expertise of the editors, treating the output as authorised. This pragmatic approach enables efficient historical research. Historians trust their ability to detect and correct errors during research.
Francesco Beretta represents data, information, and knowledge as a pyramid: data form the base, historical information (created from data through conceptual models and critical methods) forms the middle, and historical knowledge (produced from historical information through theories, statistical models and heuristics) forms the top [@berettaDonneesOuvertesLiees2023, fig. 3]. Interestingly, however, he makes an important distinction regarding digital data: "Digital data does not belong to the epistemic layer of data, but to the layer of information, of which they are the information technical carrier" [Translation: DW. Original Text: "[L]les données numériques n’appartiennent pas à la strate épistémique des données, mais bien à celle de l’information dont elles constituent le support informatique.", @berettaDonneesOuvertesLiees2023, p. 18] @@ -40,6 +42,7 @@ Andreas Fickers adds that digitization transforms the nature of sources, affecti The concept of *factoids* introduced by Michele Pasin and John Bradley, is central to this argument. They define factoids as pieces of information about one or more persons in a primary source. Those factoids are then represented in a semantic network of subject-predicate-object triples [@pasinFactoidbasedProsopographyComputer2015, pp. 89-90]. This involves extracting statements from their original context, placing them in a new context, and outsourcing verification to later steps. Therefore, factoids can be contradictory. Francesco Beretta applies this idea to historical science, viewing the aggregation of factoids as a process aiming for the best possible approximation of facticity [@berettaDonneesOuvertesLiees2023, p. 20]. The challenge is to verify machine output sufficiently for historical research and to assess the usefulness of the factoid concept. Evaluating machine learning models and their outputs is crucial for this. ## Qualifying Error Rates + Evaluating the output of a machine learning system is not trivial. Models can be evaluated using various calculated scores, which is done continuously during the training process. However, these performance metrics are statistical measures that generally refer to the model and are based on a set of test data. Even the probabilities output by machine learning systems when applied to new data are purely computational figures, only partially suitable for quality assurance. This verification is further complicated by the potentially vast scale of the output. Therefore, historical science must find a pragmatic way to translate statistical evaluation metrics into qualitative statements and identify systematic sources of error. In automatic handwriting recognition, models are typically evaluated using character error rate (CER). These metrics only tell us the percentage of characters or words incorrectly recognised compared to a ground truth. They do not reveal the distribution of these errors, which is important when comparing automatic and manual transcriptions. For detailed HTR model evaluation, CERberus is being developed [@haverals2023cerberus]. This tool compares ground truth with HTR output from the same source. Instead of calculating just the character error rate, it breaks down the differences further. Errors are categorised into missing, excess, and incorrectly recognised characters. Additionally, a separate CER is calculated for all characters and Unicode blocks in the text, aggregated into confusion statistics that identify the most frequently confused characters. 
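For readers less familiar with the metric, a minimal sketch of a plain CER computation (edit distance divided by the length of the ground truth, shown here with an invented transcription pair rather than project data) illustrates how little a single score reveals about where the errors lie:

```python
# Minimal CER sketch: Levenshtein distance between ground truth and HTR output,
# divided by the length of the ground truth. The example strings are invented.
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (0 if the characters match)
            ))
        prev = curr
    return prev[-1]

def cer(ground_truth: str, recognised: str) -> float:
    return levenshtein(ground_truth, recognised) / len(ground_truth)

print(cer("Anno 1652 den 3. Martii", "Anno 1632 den 3. Marty"))  # roughly 0.13
```

The same score of roughly 0.13 could stem from a few harmless one-off confusions or from one systematically misread character form, which is precisely the gap that the confusion statistics of CERberus are meant to close.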
Confusion plots are generated to show the most common errors for each character. These metrics do not pinpoint specific errors but provide a more precise analysis of the model's behaviour. CERberus cannot evaluate entirely new HTR output without comparison text but is a valuable tool for Digital History, revealing which character forms are often confused and guiding model improvement or post-processing strategies. @@ -49,14 +52,17 @@ In other machine learning applications, such as named entity recognition (NER), The qualitative error analysis presented here does not solve the question of authorizing machine learning output for historical research. Instead, it provides tools to assess models more precisely and analyse training and test datasets. Such investigations extend the crucial source criticism in historical science to digital datasets and the algorithms and models involved in their creation. This requires historians to expand their traditional methods to include new, less familiar areas. ## Three Strategic Directions + In the following last part of this article, the previously raised questions and problem areas will be consolidated, from which three strategic directions for digital history will be derived. These will be suggestions for how the theory, methodology, and practice of Digital History could evolve to address and mitigate the identified problem areas. The three perspectives should not be viewed in isolation or as mutually exclusive. Instead, they are interdependent and should work together to meet the additional challenges. ### Direction 1: Formulating Clear Needs + When data is collected or processed into information in the historical research process a certain pragmatism is involved. Ideally, such a project would fully and consistently transcribe the entire collection with the same thoroughness, but in practice, a compromise is often found between completeness, correctness, and pragmatism. Often, for one's own research purposes, it is sufficient to transcribe a source only to the extent that its meaning can be understood. This compromise has not fully transitioned into Digital History. Even if a good CER is achieved, there is pressure to justify how these potential errors are managed in the subsequent research process. This skepticism is not fundamentally bad, and the epistemological consequences of erroneous machine learning output are worthy of discussion. Nonetheless, the resulting text is usually quite readable and usable. Thus, I argue that digital history must more clearly define and communicate its needs. However, it must be remembered that Digital History also faces broader demands. Especially in machine learning-supported research, the demand for data interoperability is rightly emphasised. Incomplete or erroneous datasets are, of course, less reusable by other research projects. ### Direction 2: Creating Transparency + The second direction for digital history is to move towards greater transparency. The issue of reusability and interoperability of datasets from the first strategic direction can be at least partially mitigated by transparency. As Hodel et al. convincingly argued, it is extremely sensible and desirable for projects using HTR to publish their training data. This allows for gradual development towards models that can generalise as broadly as possible [@hodelGeneralModelsHandwritten2021, pp. 7-8]. If a CERberus error analysis is conducted for HTR that goes beyond the mere CER, it makes sense to publish this alongside the data and the model. 
With this information, it is easier to assess whether it might be worthwhile to include this dataset in one's own training material. Similarly, when NER models are published, an extended evaluation according to Fu et al. helps to better assess the performance of a model for one's own dataset. @@ -64,8 +70,14 @@ As Hodel et al. convincingly argued, it is extremely sensible and desirable for Pasin and Bradley, in their prosopographic graph database, indicate the provenance of each data point and who captured it [@pasinFactoidbasedProsopographyComputer2015, 91-92]. This principle could also be interesting for Digital History, by indicating in the metadata whether published research data was generated manually or by a machine, ideally with information about the model used and the annotating person for manually generated data. Models provide a confidence estimate with their prediction, indicating how likely the prediction is correct. The most probable prediction would be treated as the first factoid. The second or even third most probable predictions from the system could provide additional factoids that can be incorporated into the source representation. These additional pieces of information can support the further research process by allowing inconsistencies and errors to be better assessed and balanced. ### Direction 3: Data Criticism and Data Hermeneutics + The shift to digital history requires an evaluation and adjustment of our hermeneutic methods. This ongoing discourse is not new, and Torsten Hiltmann has identified three broad directions: first, the debate about extending source criticism to data, algorithms, and interfaces; second, the call for computer-assisted methods to support text understanding; and third, the theorization of data hermeneutics, or the "understanding of and with data" [@hiltmann2024, p. 208]. -Even though these discourse strands cannot be sharply separated, the focus here is primarily on data criticism and hermeneutics. The former can fundamentally orient itself towards classical source criticism. Since digital data is not given but constructed, it is crucial to discuss by whom, for what purpose, and how data was generated. This is no easy task, especially when datasets are poorly documented. Therefore, the call for data and model criticism is closely linked to the plea for more transparency in data and model publication. +Even though these discourse strands cannot be sharply separated, the focus here is primarily on data criticism and hermeneutics. The former can fundamentally orient itself towards classical source criticism. Since digital data is not given but constructed, it is crucial to discuss by whom, for what purpose, and how data was generated. This is no easy task, especially when datasets are poorly documented. Therefore, the call for data and model criticism is closely linked to the plea for more transparency in data and model publication. + +In the move towards data hermeneutics, a thorough rethinking of the factoid principle can be fruitful. If, as suggested above, the second or even third most likely predictions of a model are included as factoids in the publication of research data, this opens up additional perspectives on the sources underlying the data. From these new standpoints, the data -- and thus the sources -- can be analyzed and understood more thoroughly.
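A minimal sketch of how such machine-generated factoids, including lower-ranked predictions, confidence values, and provenance, could be recorded alongside published research data is given below; all field names and example values are illustrative assumptions rather than an established schema.

```python
# Illustrative sketch: top-k model predictions recorded as factoids with
# confidence and provenance. Field names and example values are invented.
from dataclasses import dataclass

@dataclass
class Factoid:
    source_ref: str    # reference to the underlying source (signature, page, line)
    statement: str     # the extracted piece of information
    rank: int          # 1 = most probable prediction, 2 = second most probable, ...
    confidence: float  # model confidence for this prediction
    produced_by: str   # model name and version, or the annotating person

factoids = [
    Factoid("register 1573, fol. 12r", "PER: Hans Meyer", 1, 0.87, "ner-model-v0.3"),
    Factoid("register 1573, fol. 12r", "PER: Hans Mayer", 2, 0.09, "ner-model-v0.3"),
    Factoid("register 1573, fol. 12r", "PER: Hans Meyer", 1, 1.00, "manual annotation"),
]

# Downstream steps can filter by confidence or contrast manual and machine-generated
# statements instead of silently discarding the alternatives.
reliable = [f for f in factoids
            if f.confidence >= 0.8 or f.produced_by == "manual annotation"]
```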
Additionally, this allows for a more informed critique of the data, and extensive transparency also mitigates the "black box" problem of interpretation described by Silke Schwandt [@schwandtOpeningBlackBox2022]. If we more precisely describe and reflect on how we generate digital data from sources as historians, we will find that our methods are algorithmic [@schwandtOpeningBlackBox2022, pp. 81-82]. This insight can also support the understanding of how machine learning applications work. Data hermeneutics thus requires both a critical reflection of our methods and a more transparent approach to data and metadata. + +## References -In the move towards data hermeneutics, a thorough rethinking of the factoid principle can be fruitful. If, as suggested above, the second or even third most likely predictions of a model are included as factoids in the publication of research data, this opens up additional perspectives on the sources underlying the data. From these new standpoints, the data -- and thus the sources -- can be analyzed and understood more thoroughly. Additionally, this allows for a more informed critique of the data, and extensive transparency also mitigates the "black box" problem of interpretation described by Silke Schwandt [@schwandtOpeningBlackBox2022]. If we more precisely describe and reflect on how we generate digital data from sources as historians, we will find that our methods are algorithmic [@schwandtOpeningBlackBox2022, pp. 81-82]. This insight can also support the understanding of how machine learning applications work. Data hermeneutics thus requires both a critical reflection of our methods and a more transparent approach to data and metadata. \ No newline at end of file +::: {#refs} +::: diff --git a/submissions/468/index.qmd b/submissions/468/index.qmd index 9b0432b..457e923 100644 --- a/submissions/468/index.qmd +++ b/submissions/468/index.qmd @@ -75,7 +75,6 @@ director film. It became, in addition to the state, an important financial suppo And it opened up for them a new, pre-dominantly urban audience that began to be interested in the peasant-mountain world for a variety of reasons. - ![Fig. 1: Milk transport with a handcart and a horse-drawn cart, shown in a remarkable split screen. Film still from the last of the three Swiss milk films (1923–1929), entitled *Wir und die Milch* (1929).[^1]](images/Figure1.jpg) @@ -202,4 +201,4 @@ Geschichte der arbeitenden Tiere in der Zeit der Massenmotorisierung (1950-1980) Fribourg 2023, [YouTube (25.06.2024)](https://youtu.be/_XVWdHNQxv8). [^7]: Moser Peter/Wigger Andreas, Working Animals. Hidden modernisers made visible, in: Video Essays in Rural -History, 1 (2022), [https://www.ruralfilms.eu/essays/videoessay_1_EN.html](https://www.ruralfilms.eu/essays/videoessay_1_EN.html) [16.08.2024]. \ No newline at end of file +History, 1 (2022), [https://www.ruralfilms.eu/essays/videoessay_1_EN.html](https://www.ruralfilms.eu/essays/videoessay_1_EN.html) [16.08.2024]. diff --git a/submissions/469/index.qmd b/submissions/469/index.qmd index e7ffbd5..d27f413 100644 --- a/submissions/469/index.qmd +++ b/submissions/469/index.qmd @@ -31,7 +31,7 @@ abstract: | Consequently, we demonstrated that self-help had consistently played a pivotal role in the Foundation’s development strategy since the Foundation’s inception. Furthermore, we scrutinised the roles ascribed by the Foundation to various actors in the development process. 
While the Foundation initially regarded the State as the primary actor in development, by the study period’s end, new participants such as private companies, communities, and individuals had become integral to this process. All the necessary data and scripts to reproduce this presentation can be found [here](https://github.com/ivanldf13/Master-thesis-). --- -# Introduction +## Introduction In our presentation, we will explore how Digital Humanities tools can be used to analyse the concept of development from a historiographical perspective. We will begin with a brief introduction to the topic, followed by an overview of our primary sources. The core of our presentation will focus on the methodology, where we will justify our choice of Structural Topic Modelling over other techniques like Hierarchical Clustering on Principal Components. Finally, we will present the results of our analysis and some remarks. @@ -41,29 +41,24 @@ From the outset, governmental and non-governmental actors have been involved in [^2]: From now on referred to as the Foundation -# Primary sources +## Primary sources -We chose as primary sources the Foundation’s Annual Reports for two reasons. The first one is quantitative. The Annual Reports were published annually from 1915 to 2016, with the 1913 and 1914 reports issued jointly in 1915. With this extensive temporal coverage, the Foundation lends itself as an excellent observatory to study the evolution of the concept of development before and after the emergence of this concept. +We chose as primary sources the Foundation’s Annual Reports for two reasons. The first one is quantitative. The Annual Reports were published annually from 1915 to 2016, with the 1913 and 1914 reports issued jointly in 1915. With this extensive temporal coverage, the Foundation lends itself as an excellent observatory to study the evolution of the concept of development before and after the emergence of this concept. -The second reason is qualitative. The main objective of annual reports is to communicate the activities of the Foundation, its financial operations, its priorities, its vision of the issues it faces, and a self-assessment of its own actions in the past and those to be adopted in the future. Although the structure of the annual reports has changed over time, the content has remained stable. The Foundation presents with them a summary of its activities but also presents a narrative that seeks to communicate the reasoning and justification behind the Foundation’s activities. In this sense, the annual reports are a showcase in which the Foundation displays, promotes and justifies its values. +The second reason is qualitative. The main objective of annual reports is to communicate the activities of the Foundation, its financial operations, its priorities, its vision of the issues it faces, and a self-assessment of its own actions in the past and those to be adopted in the future. Although the structure of the annual reports has changed over time, the content has remained stable. The Foundation presents with them a summary of its activities but also presents a narrative that seeks to communicate the reasoning and justification behind the Foundation’s activities. In this sense, the annual reports are a showcase in which the Foundation displays, promotes and justifies its values. Moreover, since these reports are public, they serve two functions. The first is purely functional. 
The reports inform the reader of the Foundation’s activities, its financial state, and other relevant details. The second function is symbolic. As Peter Goldmark Jr. (president of the Foundation from 1988 to 1997) noted, philanthropic foundations lack the three disciplines American life has: the test of the markets, the test of the elections and the press that analyses every move. [@rockefeller_foundation_annual_1998, pp. 3] Therefore, the Foundation uses the annual reports as a form of self-evaluation, as a way to make itself accountable to the public and to offer a promotion and justification of the values that guide its activities. [@rockefeller_foundation_annual_1955, pp. 3] - -# Methodology and its twists and turns +## Methodology and its twists and turns Confronted with the enormous number of reports to be analysed and inspired by the working paper “Bankspeak” by Moretti and Pestre, [@moretti_bankspeak_2015] we undertook a quantitative analysis of the language used in these reports. Then, guided by the results of this analysis, we interpreted the activities and institutions in which the Foundation was involved to reconstruct the evolution of its concept of development. - We began our quantitative analysis by importing the PDF reports into R using the ‘tidy’ principle [@silge_tidytext_2016, pp.1] and then performing the necessary text cleaning to reduce the size of the corpus. This increased the efficiency and effectiveness of the analysis.[@gurusamy_preprocessing_2014] We then proceeded with the analysis itself. - -Initially, we employed basic text analysis techniques, namely counting the most frequent words per year and per period and using the TF-IDF. These techniques yielded promising results but were insufficient. Although the Foundation had the same objective throughout the period – “*to promote the well-being of mankind throughout the world*” – ,[@rockefeller_foundation_annual_1915, pp.7; @rockefeller_foundation_annual_1964, pp.3; @rockefeller_foundation_annual_2014, pp.3] it used different words in absolute and relative terms to describe and justify its activities. - +Initially, we employed basic text analysis techniques, namely counting the most frequent words per year and per period and using the TF-IDF. These techniques yielded promising results but were insufficient. Although the Foundation had the same objective throughout the period – “*to promote the well-being of mankind throughout the world*” – ,[@rockefeller_foundation_annual_1915, pp.7; @rockefeller_foundation_annual_1964, pp.3; @rockefeller_foundation_annual_2014, pp.3] it used different words in absolute and relative terms to describe and justify its activities. However, in terms of visualisation, precision and displaying temporal dynamics, the capabilities of these two techniques are worse than those of Hierarchical Clustering on Principal Components (HCPC) and Structural Topic Modelling (STM). Moreover, the former techniques are unable to create clusters and topics, unlike the latter two. - We continued with the HCPC, using only nouns, as this part of speech is the most suitable for analysing topics.[@suh_socialterm-extractor_2019, pp.2] This technique confirmed the findings of the absolute frequency analysis and the TF-IDF. That is, there is structure in the use of words by the Foundation, as reflected both in the biplot created by the Correspondence Analysis (CA) necessary to perform the HCPC and in the final clusters.
In the biplot in @fig-1.top25, the documents are organised in a temporal manner and, being together with each other, this indicates that they favour and avoid the same words regardless of the number of words in each document.[@becue-bertaut_textual_2019, pp.18-19] Specifically, we observed that the Foundation used more frequently terms such as ‘infection’ or ‘hookworm’ and less frequently terms such as ‘resilience’ or ‘climate’ at the beginning of the period. Furthermore, when clustering after the CA and analysing the words contained in each cluster, it is observed that the Foundation, over time, diversifies the topics in which it engages, following a chronological trend. However, the visualisation of the clusters does not significantly enhance our understanding of the matter. ![Top 25 contributors to the two first dimensions](images/Figure%201.Top_25_Contributors.png){#fig-1.top25} @@ -77,7 +72,7 @@ Next, we employed the STM using also only nouns. As a topic modelling technique, ![Table with the topical content](images/Topics_DHCH.png){#tbl-topics} -Once we chose the number of topics, we obtained two lists of nouns associated with each topic, as shown in @tbl-topics. One list groups the nouns most likely to appear in each topic (Highest Prob list), while the other groups those that are frequent and exclusive (FREX list). These lists allow us to discover the central topics without our prior biases. We then named each topic using both lists and analysed the most representative reports for each topic. Therefore, this approach is a mixture of the methods suggested by Roberts et al.[-@roberts_structural_2014, pp.1068] and Grajzl & Murrell.[-@grajzl_toward_2019, pp.10] +Once we chose the number of topics, we obtained two lists of nouns associated with each topic, as shown in @tbl-topics. One list groups the nouns most likely to appear in each topic (Highest Prob list), while the other groups those that are frequent and exclusive (FREX list). These lists allow us to discover the central topics without our prior biases. We then named each topic using both lists and analysed the most representative reports for each topic. Therefore, this approach is a mixture of the methods suggested by Roberts et al.[-@roberts_structural_2014, pp.1068] and Grajzl & Murrell.[-@grajzl_toward_2019, pp.10] ![Topical prevalence of the topics correlated with time](images/plot.expected.topic.proportions.png){#fig-topical_prev} @@ -87,13 +82,12 @@ Subsequently, we calculated the frequency of each topic in our corpus, as shown Furthermore, by using STM with temporal metadata, we identified which topics the Foundation addressed in its annual reports and their statistical relation to time. This approach enabled us to observe how the frequency of particular topics changed over time. The distribution of these topics over time is illustrated in @fig-top_corr_time. - Finally, using the Highest Prob and FREX lists @tbl-topics, and the most prominent reports of each topic, we examined the activities and institutions in which the Foundation was involved to reconstruct the development concept and the ideas underpinning it. -# Results and conclusion +## Results and conclusion This approach provided an innovative way to understand the main topics in which the Foundation was involved in its promotion of development. 
Despite having the ultimate goal “to promote well-being of mankind throughout the world”, before the coining of the concept of development, the Foundation was already engaged in activities later considered development-related. These activities were strongly influenced by the political, epistemic, and economic context. Thus, we observed how the first layer of meaning of development – health – was gradually joined by economic, social, cultural, and finally, environmental layers. However, this methodology proved inefficient in analysing the role of the self-help mentality and the market-oriented mentality. To address this, we had to perform a close reading to conclude the centrality of both in the Foundation's thinking, especially in the 21st century. Indeed, throughout its existence, the Foundation sought to ensure that the actors it helped to develop became autonomous agents who could solve their problems without recourse to third parties. Furthermore, we observed how the importance of these actors in the development process also changed over time. At the beginning of the period, the Foundation conceived of the State as the primary catalyst for development. By the end of the period, it advocated development involving the State, private enterprise, civil society, and individuals. As the State’s credibility as a guarantor of rights and provider of welfare-related services wanes, the Foundation encourages individuals to find their own means to cope with the risks present in contemporary society without waiting for help from the State. -This limitation of STM revealed the importance of working hypotheses created through a sound bibliographical review and the hermeneutical work of the historian, despite the use of new methodologies. It was only through the insights gained from the bibliographical review that we anticipated a change in the role of different political actors in the development arena and recognised the significance of the self-help and market-oriented mentality in the Foundation’s development concept. When interpreting the STM results, we found that we could not answer these questions solely with the digital tools. Consequently, we had to conduct a close reading to address these issues, highlighting the critical role of hermeneutical work both in analysing the results of Digital Humanities tools and in the close reading exercise. \ No newline at end of file +This limitation of STM revealed the importance of working hypotheses created through a sound bibliographical review and the hermeneutical work of the historian, despite the use of new methodologies. It was only through the insights gained from the bibliographical review that we anticipated a change in the role of different political actors in the development arena and recognised the significance of the self-help and market-oriented mentality in the Foundation’s development concept. When interpreting the STM results, we found that we could not answer these questions solely with the digital tools. Consequently, we had to conduct a close reading to address these issues, highlighting the critical role of hermeneutical work both in analysing the results of Digital Humanities tools and in the close reading exercise. 
diff --git a/submissions/473/index.qmd b/submissions/473/index.qmd index 2becfc4..f3e8f75 100644 --- a/submissions/473/index.qmd +++ b/submissions/473/index.qmd @@ -47,3 +47,8 @@ Currently, the interfaces developed are mostly internal tools that help us sharp **Team members** Moritz Greiner-Petter, Mario Schulze, Sarine Waltenspül + +## References + +::: {#refs} +::: diff --git a/submissions/474/index.qmd b/submissions/474/index.qmd index 716fecf..1d055ed 100644 --- a/submissions/474/index.qmd +++ b/submissions/474/index.qmd @@ -44,6 +44,7 @@ Digital literacy and source criticism of born-digital objects employing basic co Another approach to source criticism inspired by digital forensics is to read between the lines, or "between the bits," more precisely. Thorsten Ries, for example, has demonstrated that text files can contain much more than they might reveal at first sight, i.e., overcoming the perspective of screen essentialism [@riesDigitalHistoryBorndigital2022, p. 176-178]. In his examples, he reads the Revision Identifier for Style Definition (RSID) automatically attached to each MS Word file to make statements about a file's creation and revision history. Similarly, Trevor Owens has demonstrated that important information about a file's history can be retrieved simply by changing its file-type extension and thus opening it with different software [@owensTheoryCraftDigital2018, p. 43]. In Owens' example, he opens an .mp3 music file with a simple text editor by changing the extension to .txt, which enables him literally to "read" all of the file's metadata. Depending on the scheme employed by the file managing system, this reading "against the grain" might reveal information that is not accessible with a simple right-click. Similarly, it might be worth a try to open a file of unknown format with the vim editor for a first inspection. ## System and Environment: Contextualizing digital objects + All elements of the expanded and updated version of source criticism outlined above point to an increased attention towards the systems and environments into which the production and processing of digital objects are embedded. On a basic level, computation always relies on specific logical, material, and technical systems and environments, i.e., the operating system, hardware, storage media, exchange formats, transmission protocols, etc. Inspired by platform studies and the "new materialism," recent research on digital objects has emphasized the platform character of all digital media and objects and argued for their understanding as "assemblages" [@owensTheoryCraftDigital2018; @zuanniTheorizingBornDigital2021]. This line of research emphasizes the multiple relations and dependencies of all digital objects to systems and environments. All data has to be organized according to certain file formats and standards to be transferable and processable; file formats require specific applications to be read and manipulated; applications and programs, in turn, rely on operating systems, which again are bound to specific hardware configurations and must be maintained and updated, and so on. With networked computing, web-based applications, and cloud storage, complex and nested platforms and assemblages have become the norm. Consequently, any concept of digital literacy or data literacy must incorporate critical reflection on the relations, dependencies, and determinations of systems and infrastructure [@grayDataInfrastructureLiteracy2018]. 
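As a minimal illustration of the file-level inspection described above (changing an extension, opening a file in a plain text editor or vim), the leading bytes of a file can be read directly instead of trusting its extension; the sketch below is illustrative only and the path is hypothetical.

```python
# Illustrative sketch: read a file "against the grain", ignoring its extension.
# The path is hypothetical.
path = "unknown_object.dat"

with open(path, "rb") as f:
    header = f.read(16)

# Leading "magic numbers" often reveal the actual format,
# e.g. b"%PDF" for PDF or b"PK\x03\x04" for ZIP-based formats such as .docx.
print(header)
print(header.hex(" "))
```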
This is in line with recent research in science and technology studies and the materialistic turn in the history of computing, which center on connectivity and reliance on large and complex infrastructure networks in their studies [@parksSignalTrafficCritical2015; @gallowayProtokollKontrolleUnd2014; @edwardsIntroductionAgendaInfrastructure2009]. Here again, following the historical unfolding and development of these infrastructures helps to understand both their general functionality and their specifics, which are sometimes more the result of traditions and path dependencies than of technical necessity. In the same way that historians trace back the provenance, perspective, and implicit presuppositions of a "classic" paper-based source, they must reflect on the system-environment of a digital object, its relations to it, its location within it, and the epistemic consequences of that positionality and relations. @@ -55,8 +56,14 @@ This quote touches upon a different meaning of the term "environment," referring Including context, systems, and environment into the analysis and reflection of computing and born-digital objects is therefore at the same time a productive research agenda for the history of computing and an approach to an updated variant of source criticism and general digital literacy. Historians, trained to contextualize and situate information provided by sources within specific historic, social, cultural, and spatial contexts, can apply their instruments of critique and evaluation easily to digital objects and additionally provide guidance to the formulation of a general digital literacy. ## Contextualization and Critique + This essay has so far argued that historians already have some valuable methods and approaches at their disposal to adapt their inquiries to novel, digital-born artifacts and media, and that they need to incorporate knowledge about the basic principles of computing and its history into their toolbox to be able to make sense of new media and archives. However, it is pivotal to keep in mind that providing interpretation and critique remains their core task. Decoding a digital artifact, tracing the history of its emergence, and understanding its relation to its technical environment serve but one objective: to make claims and arguments about its meaning. This is and remains a fundamentally critical approach that does not exclude a reflection on the methods themselves. Zack Lischer-Katz, for example, reminds us that digital forensics were not developed by and for historians, but serve a specific task in police investigations and the courtroom: "However, caution must be exercised when considering forensics as a guiding approach to archives. The epistemological basis of forensic science embeds particular assumptions about knowledge and particular systems of verification and evidence that are based on hierarchical relations of power, positivist constructions of knowledge, and the role of evidence [...] A critical approach to the tools of digital forensics by archivists and media scholars requires thinking through how the forensic imagination may impose forms of knowing that reproduce particular power relations" [@lischer-katzStudyingMaterialityMedia2016, p. 5-6]. At a very basic level, historians and humanists, in general, are particularly strong exactly when their findings are more than just an addition and re-arrangement of available information. 
Using the distinctions between symbols and signals, Berry and colleagues have formulated an eloquent reminder of this task: "Digital humanists must address the limits of signal processing head-on, which becomes even more pressing if we also consider another question brought about by the analogy to Shannon and Weaver’s model of communication. The sender-receiver model describes the transmission of information. The charge of the digital humanities is, instead, the production of knowledge. An uncritical trust in signal processing becomes, from this perspective, quite problematic, insofar as it can confuse information for knowledge, and vice versa. [...] Neither encoding or coding (textual analysis) is in fact a substitute for humanistic critique (understood in the broad sense)" [@berryNoSignalSymbol2019]. Critique is and must remain the central concern of historians. This critique must be directed at the authenticity and credibility of born-digital objects and the systems that produce them. To do so, historians must learn from the tools and approaches of computer forensics. But what distinguishes historians from forensic experts is that they don't stop at the limits of the technical systems but extend their contextualization to the broader cultural, economic, and social structures that enable the development of specific technologies. This is why the historian's perspective and approach are indispensable for general digital literacy. + +## References + +::: {#refs} +::: diff --git a/submissions/480/index.qmd b/submissions/480/index.qmd index 0a5f075..b8479e6 100644 --- a/submissions/480/index.qmd +++ b/submissions/480/index.qmd @@ -63,3 +63,8 @@ The data and edition platform [_hallerNet_](https://hallernet.org/) opens up his ## Conclusion When reconstructing the whole interaction in which floras and herbaria interplayed, difficulties arise in integrating digital approaches to historical correspondence networks [e.g. @EdmondsonEdelstein2019] with digitization methods for floras and herbaria, which are located in different scientific disciplines. The challenge for the data and edition platform [_hallerNet_](https://hallernet.org/) is therefore to find interdisciplinary solutions. With tools, methods and workflows of the digital humanities, traceable relations between text, scans and structural data are determined in an innovative way. This makes it possible to relate today's botanical authority data systematically to historical information such as changing plant names, specimens, locality information and plant collectors. For the interoperability of the data, orientation towards the [Darwin Core](https://dwc.tdwg.org/) standard is mandatory; for sustainable editorial quality, the [_TEI_](https://tei-c.org/) guidelines are. Originally developed in the natural sciences, the [_FAIR data principles_](https://www.go-fair.org/fair-principles/) became a standard in the humanities (especially for _GLAM_ institutions), and thus serve as an overarching guideline; in particular, _FAIR_ guarantees the sustainable handling of data, which therefore remains 'reusable' for future generations of users because the normalization and flexibilization processes can be traced in detail. With this integration of different disciplinary standards and different types of sources, _hallerNet_ could become a dynamic and cross-collection instrument for the interdisciplinary research of historical plants and biodiversity in Switzerland in the period before 1850.
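To give one hypothetical illustration of what orientation towards Darwin Core can mean in practice (the record below is a placeholder for illustration only, not actual hallerNet data), a historical specimen mention might be aligned with standard terms roughly as follows:

```python
# Hypothetical alignment of a historical specimen mention with Darwin Core terms
# (https://dwc.tdwg.org/); all values are placeholders, not hallerNet data.
historical_record = {
    "historical_name": "Gentiana lutea",  # plant name as given in the source
    "collector": "A. v. Haller",          # spelling as found in the correspondence
    "place": "Bernese Oberland",          # verbatim locality description
    "year": "1758",
}

darwin_core_record = {
    "basisOfRecord": "PreservedSpecimen",
    "scientificName": historical_record["historical_name"],
    "recordedBy": historical_record["collector"],
    "locality": historical_record["place"],
    "eventDate": historical_record["year"],
    "country": "Switzerland",
}
print(darwin_core_record)
```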
The current transformation of _hallerNet_ into the national collaborative platform _République des Lettres_ will further strengthen this potential. + +## References + +::: {#refs} +::: diff --git a/submissions/482/index.qmd b/submissions/482/index.qmd index a864c81..98ae1e0 100644 --- a/submissions/482/index.qmd +++ b/submissions/482/index.qmd @@ -39,7 +39,7 @@ Furthermore, digital methods allow us to quantitatively study broad changes in p ## Available Digitized Sources -Because patents constitute a legal claim of exclusivity, a detailed description of the invention is required by the law, inter alia to allow courts to assess whether competing technical devices or processes constitute illegal imitations. From the second half of the nineteenth century onward, most countries systematically published these descriptions, so-called patent specifications. We rely on a large corpus of these documents that have been digitized by patent offices for their own current activity (especially for assessing the novelty of patent applications). +Because patents constitute a legal claim of exclusivity, a detailed description of the invention is required by the law, *inter alia* to allow courts to assess whether competing technical devices or processes constitute illegal imitations. From the second half of the nineteenth century onward, most countries systematically published these descriptions, so-called patent specifications. We rely on a large corpus of these documents that have been digitized by patent offices for their own current activity (especially for assessing the novelty of patent applications). Our dataset currently includes around 4 million patent specifications from four large industrial countries: France, Germany, the United Kingdom and the United States. These countries represent major players in the discussions around the Paris Convention. Their residents also account for a large proportion of patents taken abroad. Furthermore, it has been proposed that they constitute distinct "patent cultures" that have constituted models for other countries [@Gooday2020]. This corpus however reproduces the historiographical tendency to neglect smaller and less-studied countries. To account for this, we plan to include patent specifications from additional states at a later point. @@ -55,7 +55,6 @@ Exploration leads to a further possible operation: matching the drawings printed ![Example of two pages from different patents featuring the same drawing (left: French patent 325,985; right: German patent 142,688).](images/plates-example.png){#fig-1} - ## Implementing Image Matching In recent years, computational exploration and analysis of images have attracted a growing interest in digital history [@Arnold2019; @Arnold2023; @Wevers2020]. Among other approaches, historians have used convolutional neural networks (CNNs) to generate numerical representations of images, so-called *image embeddings*, and then find similar pictures. Pre-trained CNNs have been shown to be useful on historical documents even though they were built for another purpose, typically classifying colored photographs in categories such as `iPod` and `hair_spray`, as in the ImageNet training data. @@ -69,3 +68,8 @@ Best results were obtained by combining a CNN and SIFT. Image embeddings and an ## Back to Exploration and Future Work Our early results prompt further exploration, leading to new insights. For instance, some of the matches point to metadata errors, e.g. wrong country indications in French patents. 
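A minimal sketch of this kind of two-stage matching (embedding-based candidate retrieval, followed by SIFT verification) is shown below; it assumes PyTorch/torchvision and OpenCV and uses hypothetical file paths, model choice and thresholds rather than the project's actual implementation.

```python
# Sketch only: pre-trained CNN embeddings for candidate retrieval, SIFT keypoint
# matching for verification. Model choice, threshold and paths are assumptions.
import cv2
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet50_Weights.DEFAULT          # torchvision >= 0.13
cnn = models.resnet50(weights=weights)
cnn.fc = torch.nn.Identity()                       # drop the ImageNet classifier
cnn.eval()
preprocess = weights.transforms()

def embed(path: str) -> torch.Tensor:
    """Return a 2048-dimensional embedding for one drawing."""
    image = Image.open(path).convert("RGB")
    with torch.no_grad():
        return cnn(preprocess(image).unsqueeze(0)).squeeze(0)

def cosine(a: torch.Tensor, b: torch.Tensor) -> float:
    return torch.nn.functional.cosine_similarity(a, b, dim=0).item()

def sift_matches(path_a: str, path_b: str, ratio: float = 0.75) -> int:
    """Count SIFT matches that pass Lowe's ratio test."""
    sift = cv2.SIFT_create()
    gray_a = cv2.imread(path_a, cv2.IMREAD_GRAYSCALE)
    gray_b = cv2.imread(path_b, cv2.IMREAD_GRAYSCALE)
    _, des_a = sift.detectAndCompute(gray_a, None)
    _, des_b = sift.detectAndCompute(gray_b, None)
    pairs = cv2.BFMatcher().knnMatch(des_a, des_b, k=2)
    return sum(1 for p in pairs if len(p) == 2 and p[0].distance < ratio * p[1].distance)

# Hypothetical candidate pair, e.g. plates from FR 325,985 and DE 142,688
fr, de = "fr_325985_plate.png", "de_142688_plate.png"
if cosine(embed(fr), embed(de)) > 0.8:              # cheap embedding filter
    print("SIFT matches:", sift_matches(fr, de))    # stricter verification
```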
It also leads us to question some of our assumptions. Assuming that a patent in one country would have one corresponding patent in another, we used embeddings to get, for each segment in country A, the two segments most similar to it in country B. However, in our early results, one French patent matched *six* different German patents. This suggests that we might need to apply SIFT to compare each segment A to a greater number of similar B segments. Future work also includes combining the matching of drawings with the matching of other data points. All in all, our use of computer vision methods, while not yet robust enough to answer our research questions, yields promising results demonstrating that the concept of internationalization can be operationalized in this way. + +## References + +::: {#refs} +::: diff --git a/submissions/486/index.qmd b/submissions/486/index.qmd index 9468c90..c586aa6 100644 --- a/submissions/486/index.qmd +++ b/submissions/486/index.qmd @@ -45,6 +45,7 @@ Vector embeddings are an essential tool used in NLP to represent words as numeri Regarding linking named entities in a text, e.g. persons, this would mean that we embed them based on the context in which they appear. If there are two viable options (such as the same first name, last name, and time period) for a match between a name and a person, but the name we are searching for appears in an article about architecture and one of the two options is an architect and the other a medical doctor, we can now take into account this semantic context as an additional parameter to calculate a possible match. ## Methods + In our presentation, we will show a glimpse of the current state of our ambitious project, which aims to create a robust and scalable pipeline for applying embeddings-based NEL to historical texts. In our work, we focus on three key aspects. Firstly, on an embeddings-based linking and disambiguation workflow applied to a historical corpus of Swiss magazines (E-Periodica) that uses Wikipedia, Gemeinsame Normdatei (GND), and – since our primary use cases deal with historical material from Switzerland – the Historical Dictionary of Switzerland (HDS) as reference knowledge bases. This part aims to develop a performant and modular pipeline to recognize named entities in retro-digitized texts and link them to so-called authority files (Normdaten), e.g., the German Authority File (GND). With this workflow, we will help to identify historical actors in source material and contribute to the in-depth FAIRification of large datasets through persistent identifiers on the text level. Our proposed pipeline is modular with respect to the embedding model, enabling performance comparison across different embedding model choices and leaving room for future improved embedding models, which capture semantic similarities even better than current popular open-source models such as BERT. Secondly, we plan to use this case study to reflect upon the interpretation of metrics provided by algorithmic models and their relevance in historical research methodology. We will focus on three key areas: Contextual Sensitivity, Ambiguity Resolution, and Computational Efficiency. By focusing on these aspects, we will provide a comprehensive insight into the models' operational capabilities, particularly in large-scale historical text analysis.
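The disambiguation idea sketched above (an architect versus a medical doctor sharing the same name) can be illustrated with a small, self-contained example; the embedding model and the candidate descriptions below are assumptions for illustration, not the project's actual pipeline.

```python
# Illustration only: embed a mention in its context and compare it with short
# descriptions of candidate entities. Model and candidates are assumed.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

mention_in_context = (
    "The article on the station's renovation quotes Hans Muster, "
    "who drew up the plans for the new concourse."
)
candidates = {
    "Hans Muster (architect, 1890-1960)": "Swiss architect known for railway buildings.",
    "Hans Muster (physician, 1885-1952)": "Physician and public health official in Bern.",
}

mention_emb = model.encode(mention_in_context, convert_to_tensor=True)
for label, description in candidates.items():
    candidate_emb = model.encode(description, convert_to_tensor=True)
    score = util.cos_sim(mention_emb, candidate_emb).item()
    print(f"{label}: {score:.3f}")
```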
Given the challenges of retro-digitized historical data (OCR quality, heterogeneous contents in large collections, etc.), it is necessary not only to select models and methods appropriate to the specific needs of such material but also to create representative ground truth data for OCR, NER, and NEL. Furthermore, scale considerations drive our case study, as some of our use cases consist of millions of pages. @@ -54,4 +55,10 @@ Finally, we will discuss the role of GLAM (galleries, libraries, archives, and m ![Pipeline of end-to-end Named Entity Linking.](images/graph.png) ## Conclusion -Current solutions for NEL need more accuracy and scalability. At the same time, such enrichment processes will become standard processes for GLAM institutions so that they can offer enriched data layers to their users as a service. This raises several challenges: The technical challenge to improve the linking workflow itself, the challenge to document the workflow in a transparent and reproducible form, and finally, the methodological challenge to negotiate and interpret the results at the intersection of GLAM institutions, data science, and historical research. + +Current solutions for NEL still lack sufficient accuracy and scalability. At the same time, such enrichment processes will become standard processes for GLAM institutions so that they can offer enriched data layers to their users as a service. This raises several challenges: the technical challenge of improving the linking workflow itself, the challenge of documenting the workflow in a transparent and reproducible form, and, finally, the methodological challenge of negotiating and interpreting the results at the intersection of GLAM institutions, data science, and historical research. + +## References + +::: {#refs} +::: diff --git a/submissions/687/index.qmd b/submissions/687/index.qmd index e539974..5b1a89d 100644 --- a/submissions/687/index.qmd +++ b/submissions/687/index.qmd @@ -18,14 +18,16 @@ bibliography: references.bib --- ## Introduction + As historians today, we profit from an unmatched availability of historical sources online, with most of the information contained in these sources digitally accessible. This greatly facilitates the use of computer-assisted methods to support or augment historical analyses. How and when to use which methods in a research endeavor are questions that cannot easily be answered, as the application of appropriate techniques more often than not is something to be clarified or revised during a project. Therefore, we need to find a way not only to teach computer-assisted methods to history students, but also to enable them to conceptualize a historical research project and to solve technical problems along the way, empowering them to develop and apply different methods in a practical and inspiring way. In the following, I will discuss an approach that proposes designing semester-long courses with a thematic focus, where students progressively learn how to use computational tools through continuous engagement with a historical source. ## Motivation and Course Design + In a text-based field like history, techniques such as text recognition, text/data mining or natural language processing are very valuable for historical analyses [see for example @jockers_text-mining_2016]. However, university courses for history students should go beyond merely teaching a specific technique.
They should also equip digital novices with the skills to navigate the digital realm, whether that involves (basic) computer skills, effective collaboration on projects or questions related to data management [for two recent handbooks on how to teach digital history see @battershill_using_2022; @guiliano_primer_2022]. Over the years, I have experimented with various course designs, with introductions to specific software as well as to programming languages. Approaching the topic from the perspective of a course based on programming in order to analyse historical sources, however, has consistently produced the best results in both project outcomes and course evaluations.[^4] Now, with the rise of large language models, it has been argued that AI can easily generate any script, prompting some to question the necessity of teaching programming. For effective use of this technology, though, learning basic programming skills is essential. Relying on AI generated output without understanding its mechanics will result in mistakes, unnoticed misinterpretations, and, eventually, useless research. By learning the basics of scripting, students not only acquire the ability to perform their own analyses on a data set, but also learn how to use generative AI productively, enabling them to critically assess, correct and refine the output. -The current curriculum at the Department of History at the University of Basel does not include foundational, semester-long courses that cover digital literacy or computational skills on a broad basis for all students. Since 2022, however, a self-paced introductory course to digital history has become a mandatory part of the first semester [@serif_introduction_2022]. This course provides an initial overview of digital methods and their use for historical research, along with a practical component where students learn to apply different methods to a corpus. They are gently introduced to the command line,[^3] learning about APIs, regular expressions, string extraction, automation and other relevant techniques, as well as ways to visualize first results. By encountering a computer-based approach to historical sources early in their studies, students become more aware of the subject when planning their courses for the following semesters. +The current curriculum at the Department of History at the University of Basel does not include foundational, semester-long courses that cover digital literacy or computational skills on a broad basis for all students. Since 2022, however, a self-paced introductory course to digital history has become a mandatory part of the first semester [@serif_introduction_2022]. This course provides an initial overview of digital methods and their use for historical research, along with a practical component where students learn to apply different methods to a corpus. They are gently introduced to the command line,[^3] learning about APIs, regular expressions, string extraction, automation and other relevant techniques, as well as ways to visualize first results. By encountering a computer-based approach to historical sources early in their studies, students become more aware of the subject when planning their courses for the following semesters. In the absence of a comprehensive introductory course, the digital history courses I offer still begin at a basic level. 
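An exercise at this level might, for example, combine regular expressions with a simple count, as in the following minimal sketch (the source file is hypothetical):

```python
# A small first task of the kind described: extract four-digit years from a
# digitized text with a regular expression and count them. Filename is hypothetical.
import re
from collections import Counter

with open("digitized_source.txt", encoding="utf-8") as f:
    text = f.read()

years = re.findall(r"\b1[6-9]\d{2}\b", text)   # crude match for 1600-1999
print(Counter(years).most_common(10))
```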
By showing students the command line as a way to use the computer, I aim to dispel any unfounded fears and encourage a different way of thinking: A task that initially seems overwhelming can be broken down into several small steps, leading to its completion. I let students work on a small multi-step task that increases their motivation and demonstrates the potential relevance of these methods for historical research. The courses are student-project based, and while we also discuss some examples of digital history projects and reflect on the methods used [reading assignments include @romein_state_2020; @graham_exploring_2022; and @lemercier_quantitative_2019], the focus lies on learning by doing, that is, on working with their own material towards the completion of their project. @@ -42,11 +44,14 @@ Through small programming exercises, students learn principles of automation, st [^4]: Either R or Python is taught, as both offer a wide range of packages and libraries for humanities data as well as abundant tutorials for different methods. ## Conclusion and Outlook -The described courses provide an opportunity for every student to learn how to use computer-assisted methods for historical research. From the course evaluations we know that the courses have been largely appreciated, and that there is a strong demand for more classes of this kind. Furthermore, a significant portion of the participants come from other humanities disciplines, as their own curricula lack equivalent courses. Admittedly, the learning curve is quite steep, and the pace at the beginning is fast. In the current setting, this is unavoidable, but those who persevere often enjoy experimenting with their new skills and achieve unexpected results. + +The described courses provide an opportunity for every student to learn how to use computer-assisted methods for historical research. From the course evaluations we know that the courses have been widely appreciated, and that there is a strong demand for more classes of this kind. Furthermore, a significant portion of the participants come from other humanities disciplines, as their own curricula lack equivalent courses. Admittedly, the learning curve is quite steep, and the pace at the beginning is fast. In the current setting, this is unavoidable, but those who persevere often enjoy experimenting with their new skills and achieve unexpected results. So far, only a few students choose to focus on computational analyses for their bachelor's or master's thesis,[^5] mostly because they do not feel fully confident with their new skill set (and also because potential supervisors often lack sufficient expertise to support them). Consequently, changes in the humanities curriculum seem necessary if we aim to educate more students in digital methods for historical research. With the increasing prominence of large language models, it seems all the more crucial to ensure that future historians can produce verifiable and reproducible results, leveraging computer-assisted methods both effectively and meaningfully. -[^5]: Some of the underlying ideas for the analytic part in the master thesis of @dickmann_topographien_2022 was developed by him in a course of mine in fall 2020, see https://github.com/LarsDIK/avis-analysis. +[^5]: Some of the underlying ideas for the analytic part in the master thesis of @dickmann_topographien_2022 were developed by him in a course of mine in fall 2020, see <https://github.com/LarsDIK/avis-analysis>.
-### References +## References +::: {#refs} +::: diff --git a/submissions/poster/440/index.qmd b/submissions/poster/440/index.qmd index b122796..f47c780 100644 --- a/submissions/poster/440/index.qmd +++ b/submissions/poster/440/index.qmd @@ -43,7 +43,7 @@ The structure of this paper begins with an overview of the Schweizer Filmwochens The Schweizer Filmwochenschau was shown every week in the supporting programme of Swiss cinemas between 1940 and 1975 in the three national languages German, French and Italian. The programmes were commissioned by the Federal Council and featured news from the worlds of politics, culture, society and sport. In 2015, the Swiss Film Archive (*Cinémathèque suisse*)[^4], Memoriav[^5] and the SFA made the films available in the three languages as part of a joint project[^6]. The digitisation made it possible to preserve analogue films consisting of 35mm nitrate and acetate elements, positives and negatives in digital formats, as well as to provide new access. The digitised film collection J2.143*#20 with 1,651 editions and a total running time of approximately 200 hours is part of the Federal Archives’ online access platform[^7]. Depending on their access authorisation, users can search for a variety of metadata and primary data or download the films directly via video streaming. Like all other formats, this film collection is described according to the International Standard Archival Description and indexed down to document level in the archive tree. -Given the rapidly growing volume of analogue and digital born archival content[^8], in-depth exploration and indexing conducted solely by humans is hardly conceivable for the SFA. In this context, the SFA see automation and AI as resource-efficient ways of complementing traditional archiving work and offering innovative services to users. The new Archipanion Filmwochenschau platform[^9] extends the conventional search methods to include content-based discovery in the Schweizer Filmwochenschau. Archipanion is a service provided by 4eyes GmbH[^10] and is implemented on top of vitrivr[^11] [@sauter_general_2024]. +Given the rapidly growing volume of analogue and digital-born archival content[^8], in-depth exploration and indexing conducted solely by humans is hardly conceivable for the SFA. In this context, the SFA sees automation and AI as resource-efficient ways of complementing traditional archiving work and offering innovative services to users. The new Archipanion Filmwochenschau platform[^9] extends the conventional search methods to include content-based discovery in the Schweizer Filmwochenschau. Archipanion is a service provided by 4eyes GmbH[^10] and is implemented on top of vitrivr[^11] [@sauter_general_2024]. ## The deployment of vitrivr and Archipanion @@ -69,7 +69,7 @@ AI introduces both opportunities and challenges in research. Archipanion, specia Despite the transformative promise of predictive capabilities in AI, their effectiveness depends on fundamental factors such as the quality of training data, the accuracy of feature extraction methods, and the appropriateness of similarity metrics. Tools such as vitrivr have demonstrated progress in retrieval tasks, yet there remains a critical need for rigorous research to fully understand AI mechanisms and validate their practical applications.
Human expertise remains essential in interpreting and contextualising AI-generated insights to ensure that query results are not only useful but also meaningful in advancing research agendas. -The integration of AI into archival research presents both innovative methodologies and potential limitations, as highlighted by Jaillant and Aske [-@jaillant_are_2023]. Their research underscores the importance of critical engagement with AI tools, and advocates for interdisciplinary collaboration to effectively harness the capabilities of AI while navigating its inherent biases and limitations. They argue that while AI facilitates improved access to and analysis of archival materials, including textual and visual content, its use requires ongoing scrutiny and refinement to ensure scholarly integrity and accuracy. This perspective calls for an ongoing dialogue between AI practitioners, humanities scholars, and archival professionals to optimise AI technologies. +The integration of AI into archival research presents both innovative methodologies and potential limitations, as highlighted by Jaillant and Aske [-@jaillant_are_2023]. Their research underscores the importance of critical engagement with AI tools, and advocates for interdisciplinary collaboration to effectively harness the capabilities of AI while navigating its inherent biases and limitations. They argue that while AI facilitates improved access to and analysis of archival materials, including textual and visual content, its use requires ongoing scrutiny and refinement to ensure scholarly integrity and accuracy. This perspective calls for an ongoing dialogue between AI practitioners, humanities scholars, and archival professionals to optimise AI technologies. Collaborative efforts are essential not only to refine AI applications in archival settings, but also to maximise their societal benefits and relevance. It is crucial to address various risks associated with AI, such as potential biases in algorithmic decision-making, privacy and security issues, and challenges in maintaining the long-term accessibility and usability of digital archives. @@ -77,16 +77,20 @@ Collaborative efforts are essential not only to refine AI applications in archiv Ultimately, the SFA remains committed to improving the accessibility and usability of its archival collections for both users and staff. The implementation of Archipanion represents a significant leap forward in improving access to the Filmwochenschau and other digital holdings. By harnessing AI technologies, the SFA not only facilitates the exploration and retrieval of historical materials, but also provides researchers with advanced tools for in-depth analysis and discovery. Looking ahead, the SFA continues to explore innovative ways to enhance search capabilities and ensure that its vast archival resources remain accessible and relevant in the digital age. Collaborative efforts are essential not only to refine AI applications in archival settings but also to amplify their societal benefits and relevance, thus preserving historical records and advancing scholarly research. 
- -[^1]: Swiss Federal Archives: -[^2]: Minutes of the Federal Council (1848-1972): -[^3]: tcc-metadata-anonymization: -[^4]: National Film Archive: -[^5]: Memoriav: -[^6]: Filmbestand Schweizer Filmwochenschau (1940-1975): -[^7]: Schweizer Filmwochenschau, 1940-1975 (Series): -[^8]: Swiss Federal Archives’ facts and figures: -[^9]: Schweizerisches Bundesarchiv Filmwochenschauen: -[^10]: 4eyes: -[^11]: vitrivr: -[^12]: DeepL API: +[^1]: Swiss Federal Archives: +[^2]: Minutes of the Federal Council (1848-1972): +[^3]: tcc-metadata-anonymization: +[^4]: National Film Archive: +[^5]: Memoriav: +[^6]: Filmbestand Schweizer Filmwochenschau (1940-1975): +[^7]: Schweizer Filmwochenschau, 1940-1975 (Series): +[^8]: Swiss Federal Archives’ facts and figures: +[^9]: Schweizerisches Bundesarchiv Filmwochenschauen: +[^10]: 4eyes: +[^11]: vitrivr: +[^12]: DeepL API: + +## References + +::: {#refs} +::: diff --git a/submissions/poster/463/index.qmd b/submissions/poster/463/index.qmd index a71af85..6737c28 100644 --- a/submissions/poster/463/index.qmd +++ b/submissions/poster/463/index.qmd @@ -55,3 +55,8 @@ The central aspects of *transcriptiones* are accessibility, transparency, collab ## Conclusion *transcriptiones* provides the infrastructure for sharing and editing transcriptions, which it understands as research data. By doing so, it takes this type of data to the age of FAIR and open research data. As an open and collaborative platform that requires metadata during uploads to ensure proper attribution to the source and offers various search strategies, it ensures that transcriptions are findable. Accessibility is guaranteed through the free web application, which allows viewing transcriptions without registration as well as through the various export formats and the API. The latter is also an important cornerstone in providing transcriptions and metadata interoperably. Reusability is achieved through the plethora of metadata and the versioning of edited transcriptions and metadata [For further information about what the FAIR data principles are, see @wilkinsonFAIRGuidingPrinciples2016]. At the same time, *transcriptiones* prompts a reconsideration of the perception of transcriptions, encouraging contributors to open up their work to collaboration. All these parts play together towards understanding transcriptions as invaluable research data which is worth gathering, sharing, enhancing and documenting so that many historians can use them for downstream research. + +## References + +::: {#refs} +::: diff --git a/submissions/poster/466/index.qmd b/submissions/poster/466/index.qmd index 39b665c..91e7d59 100644 --- a/submissions/poster/466/index.qmd +++ b/submissions/poster/466/index.qmd @@ -36,7 +36,6 @@ author: email: katrin.fuchs@unibas.ch affiliations: - University of Basel - date: 08-28-2024 --- @@ -47,4 +46,4 @@ One of the key accomplishments of this endeavor is the successful application of When combined with property information, the extracted data offers a unique opportunity to visualize historical events and transactions on Geographical Information Systems. This process allows for analyzing normative and semantic shifts in the real estate market over time, shedding light on historical changes in language and practice. -Ultimately, this project signifies a milestone in historical data analysis. Machine learning techniques have matured so that even extensive datasets and intricate finding aids can be effectively processed. 
As a result, innovative approaches to large-scale historical data analysis are now within reach, offering new perspectives on dynamic urban economies during pre-modern times. This venture showcases how technological approaches and humanities deliberations go hand in hand to understand complex patterns in economic history. \ No newline at end of file +Ultimately, this project signifies a milestone in historical data analysis. Machine learning techniques have matured so that even extensive datasets and intricate finding aids can be effectively processed. As a result, innovative approaches to large-scale historical data analysis are now within reach, offering new perspectives on dynamic urban economies during pre-modern times. This venture showcases how technological approaches and humanities deliberations go hand in hand to understand complex patterns in economic history. diff --git a/submissions/poster/472/index.qmd b/submissions/poster/472/index.qmd index 8af36fd..1c63923 100644 --- a/submissions/poster/472/index.qmd +++ b/submissions/poster/472/index.qmd @@ -26,4 +26,4 @@ Following the introduction of the first community space for the research communi Digital method and source criticism has become one of the central challenges of the digital humanities. Until now, research data has generally been published on institutional repositories or platforms such as Zenodo, but without the kind of quality control that is customary for journal articles. As a result, datasets often remain unused for further processing because it remains unclear what quality the research data has and what it might be suitable for. -From the experience of the first funding phase of Discuss Data, it has become clear that more energy must be put into attracting data curators in order to ensure that the community spaces are supported by the community in the long term. Positive examples are needed for this. For example, the integration of discussions as micropublications could help to demonstrate the individual added value. \ No newline at end of file +From the experience of the first funding phase of Discuss Data, it has become clear that more energy must be put into attracting data curators in order to ensure that the community spaces are supported by the community in the long term. Positive examples are needed for this. For example, the integration of discussions as micropublications could help to demonstrate the individual added value. diff --git a/submissions/poster/476/index.qmd b/submissions/poster/476/index.qmd index d9d2fac..8e5873d 100644 --- a/submissions/poster/476/index.qmd +++ b/submissions/poster/476/index.qmd @@ -39,17 +39,20 @@ We began by manually creating a semi-formal qualitative model based on Piketty's ## LLMs and the automatic generation of historiographical diagrams: starting with a small article Our initial exploration will involve using Google’s LLM (Gemini 1.5 Pro) to convert a concise historical article by Piketty into a simplified causal diagram. This article will be A Historical Approach to Property, Inequality and Debt: Reflections on Capital in the 21st Century . -Our previous experience with manually constructing a causal model based on Piketty's work highlighted the potential for automation using LLMs. LLMs have demonstrated remarkable capabilities in various domains, including understanding and generating code, translating languages, and even creating different creative text formats. 
We believe that LLMs can be trained to analyze historical texts, identify causal relationships, and automatically generate corresponding diagrammatic models. This could significantly enhance our ability to visualize and comprehend complex historical narratives, making implicit connections explicit and facilitating further exploration and analysis. +Our previous experience with manually constructing a causal model based on Piketty's work highlighted the potential for automation using LLMs. LLMs have demonstrated remarkable capabilities in various domains, including understanding and generating code, translating languages, and even creating different creative text formats. We believe that LLMs can be trained to analyze historical texts, identify causal relationships, and automatically generate corresponding diagrammatic models. This could significantly enhance our ability to visualize and comprehend complex historical narratives, making implicit connections explicit and facilitating further exploration and analysis. Historiographical theories explore the nature of historical inquiry, focusing on how historians represent and interpret the past. The use of diagrams has been considered as a means to enhance the communication and understanding of these complex theories. -Diagrams have been utilized to represent causal narratives in historiography, providing a visual means to support historical understanding and communicate research findings effectively Diagrams have indeed been employed to represent historiographical theories, particularly to illustrate causal narratives and enhance the clarity of historical explanations. +Diagrams have been utilized to represent causal narratives in historiography, providing a visual means to support historical understanding and communicate research findings effectively Diagrams have indeed been employed to represent historiographical theories, particularly to illustrate causal narratives and enhance the clarity of historical explanations. On the other hand, Large Language Models (LLMs) have been increasingly integrated into various aspects of coding, from understanding and generating code to assisting in software development and customization. These models leverage vast amounts of data to provide support for a range of programming-related tasks. LLMs are proving to be versatile tools in the realm of coding, capable of understanding, generating, and customizing code across various programming languages and applications. They offer improvements in code-related tasks, user-friendly interactions, and support for low-resource languages. However, challenges such as bias in code generation and the need for human oversight in code review remain. Overall, LLMs are becoming an integral part of the software development process, offering both opportunities and areas for further research and development. - ## Benefits and implications of that research The ability to automatically generate historiographical diagrams using LLMs offers several potential benefits: -- Enhanced understanding of complex historical narratives: Visual representations can clarify intricate causal relationships and make historical analysis more accessible to a wider audience. -- Identification of uncertainties and biases: LLMs can be trained to recognize subtle markers of uncertainty and bias within historical texts, encouraging critical engagement with historical interpretations. 
-- Efficiency and scalability: Automating the process of diagram generation would save time and resources, allowing researchers and teachers to explore a wider range of historical topics and narratives. +- Enhanced understanding of complex historical narratives: Visual representations can clarify intricate causal relationships and make historical analysis more accessible to a wider audience. +- Identification of uncertainties and biases: LLMs can be trained to recognize subtle markers of uncertainty and bias within historical texts, encouraging critical engagement with historical interpretations. +- Efficiency and scalability: Automating the process of diagram generation would save time and resources, allowing researchers and teachers to explore a wider range of historical topics and narratives. + +## References +::: {#refs} +::: diff --git a/submissions/poster/484/index.qmd b/submissions/poster/484/index.qmd index 44b2ad8..d65d774 100644 --- a/submissions/poster/484/index.qmd +++ b/submissions/poster/484/index.qmd @@ -29,4 +29,4 @@ The four partner libraries are currently working on a project (“Google Books f The central question is how libraries, as cultural and memory institutions, can offer relatively generic infrastructure in the digital space and keep it stable while still being able to use it flexibly enough for very specific research questions and methods. -As part of the poster session, we will present the results of the preliminary project and would like to explore these further with the audience. \ No newline at end of file +As part of the poster session, we will present the results of the preliminary project and would like to explore these further with the audience.