The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. Since 2017, the corpus has been further curated and supplemented with metadata where appropriate. It consists of PAGE XML files that contain annotations of the text and structure.

OCR-D/gt_structure_text

gt_structure_text

The OCR-D Ground Truth text and structure corpus was created between 2015 and 2017. Since 2017, the corpus has been further curated and supplemented with metadata where appropriate. It consists of PAGE XML files that contain annotations of the text and structure. The data is based on transcriptions stored in the German Text Archive (DTA) (https://www.deutschestextarchiv.de/).

Metadata

Language: eng, fra, deu, heb, lat
Format: Page-XML
Time: 1500-1900
GT Type: data_structure_and_text
License: CC-BY-SA-4.0
Transcription Guidelines: OCR-D Ground Truth Guidelines https://ocr-d.de/en/gt-guidelines/trans/
Project: OCR-D
Project-URL: https://ocr-d.de/

Sources

The volume of transcriptions:

TextLine  Page  TxtRegion  ImgRegion  GraphRegion  TabRegion  SepRegion  MathRegion  MusicRegion  NoiseRegion
6609      217   1648       1          74           3          141        1           4            17
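
Totals like these can in principle be recomputed from the PAGE XML files themselves. The following is a minimal sketch, not the procedure used to produce the figures above. It assumes that the abbreviated column names correspond to the PAGE element names `TextRegion`, `ImageRegion`, `GraphicRegion`, `TableRegion`, `SeparatorRegion` and `MathsRegion`, and it matches elements by local name so that the PAGE schema version in the namespace does not matter. The function names, the `SAMPLE` document and the directory layout are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# PAGE element names ASSUMED to correspond to the abbreviated table columns
# (TxtRegion -> TextRegion, SepRegion -> SeparatorRegion, and so on).
ELEMENTS = (
    "TextLine", "Page", "TextRegion", "ImageRegion", "GraphicRegion",
    "TableRegion", "SeparatorRegion", "MathsRegion", "MusicRegion",
    "NoiseRegion",
)


def local_name(tag):
    """Strip the '{namespace}' prefix so any PAGE schema version matches."""
    return tag.rsplit("}", 1)[-1]


def count_elements(root):
    """Count the elements of interest below one PAGE XML root element."""
    counts = Counter()
    for element in root.iter():
        name = local_name(element.tag)
        if name in ELEMENTS:
            counts[name] += 1
    return counts


def count_corpus(directory):
    """Sum the counts over every *.xml file below a (hypothetical) directory."""
    total = Counter()
    for path in sorted(Path(directory).rglob("*.xml")):
        total += count_elements(ET.parse(path).getroot())
    return total


# Tiny hand-written stand-in for a real PAGE XML file, for illustration only.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1"><TextLine id="l1"/><TextLine id="l2"/></TextRegion>
    <SeparatorRegion id="s1"/>
  </Page>
</PcGts>"""

sample_counts = count_elements(ET.fromstring(SAMPLE))
print(dict(sample_counts))
```

Running `count_corpus` on a local checkout would give one combined counter. Whether it reproduces the table exactly depends on how nested regions and non-PAGE XML files (for example METS files) are treated, which this sketch does not address.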

List of transcriptions

document TxtRegion ImgRegion LineDrawRegion GraphRegion TabRegion ChartRegion SepRegion MathRegion ChemRegion MusicRegion AdRegion NoiseRegion UnknownRegion CustomRegion TextLine Page
justi_abhandlung01_1758 37 1 1 131 4
lohenstein_agrippina_1665 56 3 1 109 3
brenz_abentmal_1550 22 89 4
basilius_legendi_1515 12 2 82 3
nn_mirabilia_1500 10 2 58 3
rhegius_artzney_1529 12 1 80 3
schiller_raeuber_1781 15 2 54 2
huebner_handbuch_1696 26 4 4 78 3
luz_blitz_1784 17 1 4 110 4
vespucci_insule_1506 7 62 2
bebel_frau_1879 20 3 164 4
nn_lied_1515 6 25 1
luther_babstum_1526 7 2 51 2
ballenstedt_delatio_1777 26 3 98 3
pinder_epiphanie_1506 31 1 5 169 4
trota_mordtbrenner_1540 20 2 44 2
petrarca_psalmi_1506 13 2 64 3
loeber_heuschrecken_1693 15 1 3 87 3
clauren_mimil_1815 44 1 206 9
pistoris_regiment_1506 12 90 3
estor_rechtsgelehrsamkeit02_1758 44 1 3 153 4
aventinus_grammatica_1515 29 19 1 129 3
sachs_drey_1553 7 54 2
praetorius_syntagma02_1619_teil2 30 1 5 136 4
herder_geschichte03_1787 5 3 14 1
arnold_ketzerhistorie01_1699 43 6 378 4
hohberg_georgica01_1682_teil1 14 3 66 2
gerstner_mechaniktafeln01_1831 2 1 2 1
witzstat_buchszbaum_1540 13 47 2
osiander_predigt_1553 7 57 2
lessing_menschengeschlecht_1780 8 1 15 1
aepinus_bekentnis_1548 20 3 101 4
hilbert_zahlkoerper_1897 46 4 5
nn_lied_1520 5 1 1 22 1
alberti_pictura_1540 22 1 94 3
rollenhagen_reysen_1603 22 1 81 3
praetorius_verrichtung_1668 38 2 197 5
reinkingk_policey_1653_teil1 20 1 146 3
nn_vertrag_1525 5 35 2
valentinus_occulta_1603 22 1 1 164 6
hohberg_georgica01_1682_teil2 27 159 2
dannhauer_catechismus10_1673 18 151 4
blumenbach_anatomie_1805 20 84 3
ruempler_gartenbau_1882 105 2 3 9 1 6
buerger_gedichte_1778 14 6 52 2
heyden_paedono_1548 19 72 3
vischer_aesthetikregister_1858 1 1
kant_aufklaerung_1784 15 4 55 2
wecker_kochbuch_1598 35 156 4
luther_auszlegunge_1520 10 59 2
glauber_opera01_1658 127 3 2 376 6
oesterreicher_sachsen_1548 8 2 48 2
weigel_gnothi02_1618 22 1 128 4
silesius_seelenlust01_1657 38 1 7 4 137 5
nn_historia_1500 5 1 35 2
arnimb_goethe03_1835 5 1 22 1
boeschenstain_gedicht_1520 9 1 45 1
benner_herrnhuterey04_1748 37 6 144 4
euler_rechenkunst01_1738 94 8 31 234 6
praetorius_syntagma02_1619_teil1 72 1 4 168 4
laube_europa0202_1837 15 2 7 43 5
karlstadt_sermon_1523 5 1 1 65 2
bernd_lebensbeschreibung_1738 15 4 1 71 3
nn_besuch_1780 5 3 1 76 4
bohse_helicon_1696 35 3 2 121 5
meyfart_rhetorica_1634 27 4 113 4
reinkingk_policey_1653_teil2 21 1 108 2
kistler_kraeuter_1500 14 58 2
calvi_beutelschneider01_1627 21 3 87 3

