Introducing the ACDC Project, Part I: Training Data Production and the Diversity of the Islamicate Manuscript Tradition
One of OpenITI’s major deliverables in our most recent round of work is the Automatic Collation for Diversifying Corpora (ACDC) project and ensuing tool, which we are now making available for wider use and experimentation: the relevant code and instructions for installation and use are available on GitHub, and additional data and documentation are forthcoming. ACDC is one component in our multi-pronged strategy to significantly boost handwritten text recognition for Arabic script as manifest within the manuscript tradition, a tradition of text production that dates from around the time of the Qur’an’s codification all the way down to the present (in some parts of the Islamicate world, manuscript production for various purposes remains very much alive!). In this week’s blog entry, I’d like to briefly introduce the project (or, rather, allow one of our other team members to do so), then focus on one aspect of its production: the creation of training data via selection of a limited manuscript corpus and then manual transcription of exemplars from that corpus. We’ll see the rationale that lay behind our selections and, in so doing, get a valuable glimpse of the immense diversity (and, from a technical standpoint, the challenges!) of the Islamicate manuscript tradition. We’ll also reflect on the nature of ‘canonicity’ in Islamicate texts and the multiple forms of use that these manuscripts underwent in their ‘working lifetimes.’ We’ll return to ACDC later in the spring semester, as we explore its capacities further and finish documenting and archiving the relevant data.
Our team member David Smith (Northeastern University), the main driver behind the ACDC project, has just released a video introduction and tutorial for the project and the ensuing tool, wherein he explains the logic of the project and gives a detailed walk-through of its use. Do have a watch!
The method was published in this paper: David A. Smith, Jacob Murel, Jonathan Parkes Allen, and Matthew Thomas Miller, "Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition," CHR 2023: Computational Humanities Research Conference, December 6–8, 2023, Paris, France.
Library of Congress PK6450 .G2 1593