Automatic Collation for Diversifying Corpora (ACDC)

OpenITI/ACDC

Introducing the ACDC Project, Part I: Training Data Production and the Diversity of the Islamicate Manuscript Tradition

One of OpenITI’s major deliverables in our most recent round of work is the Automatic Collation for Diversifying Corpora (ACDC) project and the tool that resulted from it, which we are now making available for wider use and experimentation: the relevant code and instructions for installation and use are available on GitHub, and additional data and documentation will be forthcoming. ACDC is one component in our multi-pronged strategy to significantly boost handwritten text recognition for Arabic script as manifested within the manuscript tradition, a tradition of text production that dates from around the time of the Qur’an’s codification all the way down to the present (in some parts of the Islamicate world, manuscript production for various purposes remains very much alive!). In this week’s blog entry, I’d like to briefly introduce the project (or, rather, allow one of our other team members to do so), then focus on one aspect of its production: the creation of training data via selection of a limited manuscript corpus and then manual transcription of exemplars from that corpus. We’ll see the rationale behind our selections and, in so doing, get a valuable glimpse of the immense diversity (and, from a technical standpoint, the challenges!) of the Islamicate manuscript tradition, and reflect on the nature of ‘canonicity’ in Islamicate texts and the multiple forms of use that these manuscripts underwent in their ‘working lifetimes.’ We’ll return to ACDC later in the spring semester, as we explore its capacities further and finish documenting and archiving the relevant data.

Our team member David Smith (Northeastern University), the main driver behind the ACDC project, has just released a video introduction and tutorial for the project and the resulting tool, in which he explains the logic of the project and gives a detailed walk-through of its use. Do have a watch!

The method was published in this paper: David A. Smith, Jacob Murel, Jonathan Parkes Allen, Matthew Thomas Miller: "Automatic Collation for Diversifying Corpora: Commonly Copied Texts as Distant Supervision for Handwritten Text Recognition", CHR 2023: Computational Humanities Research Conference, December 6-8, 2023, Paris, France.

Library of Congress PK6450 .G2 1593
