This meta-repository collects all official OCR-D Ground Truth repositories with structural annotations (i.e. layout only, no text).
Together, these datasets make up the OCR-D Structure GT corpus, which contains page images and their annotations in PAGE format. The annotations capture the structural elements of printed pages at region level (segments are regions, not lines). The corpus comprises 25441 pages in total.
It was established as part of the DFG-funded project OCR-D.
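
To give a feel for what region-level PAGE annotations look like in practice, here is a minimal sketch that counts the region elements in a single PAGE XML document. It assumes that region elements can be recognised by a local tag name ending in `Region` (e.g. `TextRegion`, `ImageRegion`), and it ignores the XML namespace so that it is not tied to one PAGE schema version. The sample document is hand-written for illustration and is not taken from the corpus; the helper names `local_name` and `count_regions` are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag: str) -> str:
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def count_regions(page_xml: str) -> Counter:
    """Count region elements by type in one PAGE XML document given as a string."""
    root = ET.fromstring(page_xml)
    return Counter(
        local_name(el.tag)
        for el in root.iter()
        if local_name(el.tag).endswith("Region")  # assumption: regions are named *Region
    )


# A tiny hand-written sample, not taken from the corpus.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="example.tif" imageWidth="100" imageHeight="100">
    <TextRegion id="r1"><Coords points="0,0 50,0 50,20 0,20"/></TextRegion>
    <TextRegion id="r2"><Coords points="0,30 50,30 50,50 0,50"/></TextRegion>
    <ImageRegion id="r3"><Coords points="60,0 90,0 90,40 60,40"/></ImageRegion>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    print(count_regions(SAMPLE))
```

For a file on disk, `ET.parse(path).getroot()` could be used in place of `ET.fromstring`.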
- datasets/gt_structure_1_1
- datasets/gt_structure_1_2
- datasets/gt_structure_1_3
- datasets/gt_structure_1_4
- datasets/gt_structure_2_1
- datasets/gt_structure_2_2
- datasets/gt_structure_2_3
- datasets/gt_structure_2_4
- datasets/gt_structure_3_1
- datasets/gt_structure_3_2
- datasets/gt_structure_3_3
- datasets/gt_structure_4_1
- datasets/gt_structure_4_2
- datasets/gt_structure_4_3
- datasets/gt_structure_5_1
- datasets/gt_structure_5_2
- datasets/gt_structure_5_3

To clone this repository together with all datasets, run:

git clone --recurse-submodules -j8 https://github.com/OCR-D/gt_structure_all.git
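
After cloning, a quick sanity check can show whether every dataset directory was actually populated. The following is a minimal sketch, assuming the checkout lives in a directory named `gt_structure_all` and that each dataset sits in its own folder under `datasets/`, as in the list above. It counts all `*.xml` files per dataset, which may include files other than PAGE annotations; a count of 0 would suggest that the dataset was not fetched. The function name `xml_files_per_dataset` is hypothetical.

```python
from pathlib import Path


def xml_files_per_dataset(repo_root: str) -> dict[str, int]:
    """Map each directory under <repo_root>/datasets to its number of *.xml files."""
    datasets_dir = Path(repo_root) / "datasets"
    if not datasets_dir.is_dir():
        # No datasets folder at all: wrong path or incomplete checkout.
        return {}
    return {
        d.name: sum(1 for _ in d.rglob("*.xml"))  # search recursively
        for d in sorted(datasets_dir.iterdir())
        if d.is_dir()
    }


if __name__ == "__main__":
    # Assumed checkout location; adjust to wherever the repository was cloned.
    for name, n in xml_files_per_dataset("gt_structure_all").items():
        print(f"{name}: {n} XML files")
```
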
All datasets are also published on Zenodo and therefore each have a DOI. Whenever changes are made and a new release is created, the respective dataset receives a new DOI.
The OCR-D datasets on Zenodo can be accessed via this search.
If you wish to add text data to these structural datasets, please use the data from the gt_structure_dtaText repository.