This repository contains the data used to perform topic modeling on premodern texts; the article that discusses the application and the outcomes can be found here: https://doi.org/10.57813/20220623-153139-0
There are two folders:
- textual data
- topics explorer
The folder "textual data" contains the txt-files and the stopword list. The txt-files were produced in two different ways. For the medieval manuscripts and for the two editions of the chronicle of Jakob Twinger and of the Oberrheinische Chronik, the software Transkribus was used for automatic text recognition. The resulting output was minimally cleaned. During text recognition, words with diacritics were often split into separate strings, e.g. "brů" and "der"; these strings were rejoined and the diacritic was resolved, resulting in one string, e.g. "bruoder". Abbreviations were expanded, e.g. "andˀ" to "ander". In addition, information about the digitizing institution, which appears on some digital copies and was therefore recognised as text, was deleted completely.
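The cleaning described above can be illustrated with a minimal Python sketch. It is not the procedure that was actually used for this repository. The two mappings only cover the examples given above, the code points (U+016F for "ů", U+02C0 for "ˀ") are assumptions about the encoding, and `known_words` is a hypothetical word list that keeps word-final diacritics (as in "zů dem") from being merged with the following word.

```python
import unicodedata

# Illustrative mappings only: they cover the two examples from this README.
# The code points are assumptions about how the signs are encoded.
DIACRITICS = {"\u016f": "uo"}      # "ů" -> "uo"
ABBREVIATIONS = {"\u02c0": "er"}   # "ˀ" -> "er"


def resolve(token):
    """Replace diacritic and abbreviation signs in a single token."""
    for sign, replacement in {**DIACRITICS, **ABBREVIATIONS}.items():
        token = token.replace(sign, replacement)
    return token


def clean_text(text, known_words):
    """Rejoin words split after a diacritic, then resolve all signs.

    known_words is a hypothetical set of resolved word forms; two tokens
    are only merged if the merged form is in it. Whitespace, including
    line breaks, is collapsed to single spaces.
    """
    tokens = unicodedata.normalize("NFC", text).split()
    cleaned = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        # A token ending in a diacritic may be the first half of a split word.
        if i + 1 < len(tokens) and token[-1] in DIACRITICS:
            candidate = resolve(token + tokens[i + 1])
            if candidate in known_words:
                cleaned.append(candidate)
                i += 2
                continue
        cleaned.append(resolve(token))
        i += 1
    return " ".join(cleaned)
```

With `known_words = {"bruoder"}`, the call `clean_text("brů der", known_words)` returns "bruoder", while "zů dem" is left as two words. Any automatic merge of this kind should still be checked by hand.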
Four different HTR/HTR+ models were used for text recognition:
- Dresden, UB, Mscr. F 98: Medieval_Scripts_M2 (model ID 35164) (CITLab HTR+)
- Freiburg, UB, Hs. 471: Charter Scripts XIII-XV_M4 (model ID 6091753) (CITLab HTR+)
- Heidelberg, UB, Cpg 116: German_Kurrent_XVI-XVIII_M1 (model ID 19584) (CITLab HTR+)
- Heidelberg, UB, Cpg 475: Thun Missiven M3 (model ID 431) (CITLab HTR)
- München, BSB, Cgm 568: Charter Scripts XIII-XV_M4 (model ID 6091753) (CITLab HTR+)
- Stuttgart, LB, HB V 22: Thun Missiven M3 (model ID 431) (CITLab HTR)
- Wolfenbüttel, HAB, Cod. 16.17.: Charter Scripts XIII-XV_M4 (model ID 6091753) (CITLab HTR+)
(As of autumn 2022, Transkribus has suspended the HTR+ engine and only supports models trained with PyLaia. While there are PyLaia models that fit the scripts of the manuscripts used here, their transcription output might differ considerably from the output achieved with the models listed above.)
For the edition of the chronicle of Ulrich Richental, the text of the Aulendorfer version was used. The text of the edited chronicle of Petermann Etterlin could be reused. For the edition of "Das Leben des Heiligen Ulrich", I could also reuse an existing full text; unfortunately, the edition is not openly accessible, so I cannot provide the text here.
The stopword list is an extended version of the Middle High German stopword list of the Classical Language Toolkit.
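To illustrate how such a list is typically applied before topic modeling, here is a minimal Python sketch. It assumes that the stopword file holds one word per line, which may not match the actual file; the function names are hypothetical, and the Topics Explorer applies the list itself when you load it there.

```python
def load_stopwords(lines):
    """Build a stopword set from an iterable of lines (assumed: one word per line)."""
    return {line.strip().lower() for line in lines if line.strip()}


def remove_stopwords(text, stopwords):
    """Lowercase the text, split it on whitespace, and drop all stopwords."""
    return [token for token in text.lower().split() if token not in stopwords]
```

An open file object can be passed directly to `load_stopwords`, since iterating over it yields its lines.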
The folder "topics explorer" contains the output of the topic modeling performed with the software DARIAH Topics Explorer. The output consists of several csv-files that can be used for visualisations.
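As a starting point for working with the csv-files, the following Python sketch finds the strongest topic for each document. The table layout is an assumption (first column: document name, remaining columns: one weight per topic) and should be checked against the actual files; the function name is hypothetical.

```python
import csv
import io


def dominant_topics(csv_file):
    """Return {document: (topic, weight)} for a document-topic table.

    Assumed layout: a header row with topic names from the second column on,
    then one row per document with the document name in the first column.
    """
    reader = csv.reader(csv_file)
    topics = next(reader)[1:]
    result = {}
    for row in reader:
        if not row:
            continue  # skip blank lines
        weights = [float(value) for value in row[1:]]
        best = max(range(len(weights)), key=weights.__getitem__)
        result[row[0]] = (topics[best], weights[best])
    return result


# In-memory stand-in for a real file; the names and values are made up.
sample = io.StringIO(",topic0,topic1\ndoc_a,0.2,0.8\ndoc_b,0.7,0.3\n")
print(dominant_topics(sample))
```

For a real file, open it with `open(path, newline="", encoding="utf-8")` and pass the file object to the function.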
If you want to run the analysis yourself, there is one caveat: the Topics Explorer may not yet have been adapted to recent updates of your operating system. It should always work if you download the source code and run the programme in a virtual environment. If it does not, you can still try the provided notebooks. Having a look at the Topics Explorer issues, or opening a new one, might lead to a quick solution.