ocrd_froc

Perform font classification and text recognition (in one step) on historic documents.

> Open and deserialize PAGE input files and their respective images,
> iterating over the element hierarchy down to the text line level.

> Then for each line, retrieve the raw image and feed it to the font
> classifier and/or the  OCR.

> Annotate font predictions by name and score as a comma-separated
> list under ``./TextStyle/@fontFamily``, if any.

> Annotate the text prediction as a string under ``./TextEquiv``.

> If ``method`` is `adaptive`, then use `SelOCR` if font classification is confident
> enough, otherwise use `COCR`.

> Finally, produce a new PAGE output file by serialising the resulting hierarchy.

Installation

Models

Default

The default.froc model is composed of a SelOCR network and a COCR architecture, and is trained to classify and OCR textlines on the following 12 classes:

Antiqua
Bastarda
Fraktur
Textura
Schwabacher
Greek *
Italic
Hebrew *
Gotico-antiqua
Manuscript *
Rotunda
No class/Ignore

* Greek, Hebrew and Manuscript font groups do not currently provide good results due to a lack of training data.

Usage

OCR-D processor interface ocrd-froc

To be used with PAGE-XML documents in an OCR-D annotation workflow.

Parameters:

   "ocr_method" [string - "none"]
    The method to use for text recognition
    Possible values: ["none", "SelOCR", "COCR", "adaptive"]
   "replace_textstyle" [bool - true]
    Whether to replace existing textStyle
   "network" [string]
    The file name of the neural network to use, including sufficient path
    information. Defaults to the model bundled with ocrd_froc.
   "fast_cocr" [boolean - true]
    Whether to use optimization steps on the COCR strategy
   "adaptive_threshold" [number - 95]
    Threshold of certitude needed to use SelOCR when using the adaptive
    strategy
   "font_class_priors" [array - []]
    List of font classes which are known to be present on the data when
    using the adaptive/SelOCR strategies. If this option is specified,
    any font classes not included are ignored. If 'other' is
    included in the list, no font classification is output and
    a generic model is used for transcriptions.

Name		Name	Last commit message	Last commit date
Latest commit History 65 Commits
.github/workflows		.github/workflows
ocrd_froc		ocrd_froc
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
Dockerfile		Dockerfile
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
ocrd-tool.json		ocrd-tool.json
pyproject.toml		pyproject.toml
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ocrd_froc

Installation

Models

Default

Usage

About

Releases 5

Packages

Contributors 5

Languages

License

OCR-D/ocrd_froc

Folders and files

Latest commit

History

Repository files navigation

ocrd_froc

Installation

Models

Default

Usage

About

Resources

License

Stars

Watchers

Forks

Releases 5

Packages 0

Contributors 5

Languages

Packages