Training on Voynich Manuscript #681

bi3mw · 2025-01-24T18:42:18Z

Hello,
I would like to train on the so-called Voynich Manuscript. My current problem is as follows.

I get this error message when I start the training:
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'

Here is my XML file (example):

*PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
*Metadata />
*Page imageFilename="000006.png">
*TextRegion id="block_1" custom="structure {type:region_type;}">
*Coords points="2,34 2,72 897,72 897,34" />
*TextLine id="line_1">
*Baseline points="2,40 177,36 385,34 544,34 745,36" />
*TextEquiv>
*Unicode>ysheees chetchy teodar otcheol tockhy*/Unicode>
*/TextEquiv>
*Word>
*Unicode>ysheees*/Unicode>
*/Word>
*Word>
*Unicode>chetchy*/Unicode>
*/Word>
*Word>
*Unicode>teodar*/Unicode>
*/Word>
*Word>
*Unicode>otcheol*/Unicode>
*/Word>
*Word>
*Unicode>tockhy*/Unicode>
*/Word>
*/TextLine>
*/TextRegion>
*/Page>
*/PcGts>

Note: * is an opening bracket, but it makes the post invisible.

bi3mw · 2025-02-01T16:28:24Z

Is there no solution to this problem?

mittagessen · 2025-02-01T20:05:25Z

Sorry, I forgot to actually send the response.

First, there is probably a fundamental misunderstanding. You *can* use synthetic training data for recognition models, but this has its limits and is unlikely to work with handwritten material. In addition, the line generator only produces data that is compatible with the legacy bounding box segmenter, which doesn't really work with manuscripts. That is why the generator outputs a deprecation warning. A basic explanation of what each of them does is here [0].

Last but not least, the symbols that you see in the console output are probably correct. The glyphs in the Voynich manuscript are not in the Unicode standard. There are two ways to encode them anyway (PUA or transliteration into existing code points), although neither will display correctly on arbitrary systems by default, and they are exceedingly unlikely to show up correctly on the command line even if rendering in the output images is correct.

What I think is happening is that your transcription maps symbols in the manuscript onto code points of the Latin alphabet (roughly a-z). This also seems to be the case for the typeface you're using, at least from a quick glance (Latin alphabet code points have manuscript glyphs in there). You need to make sure that the transliteration of your text file is the same as the mapping in the typeface, otherwise you're going to get garbled output. Basically this:

glyph on page -> Latin transliteration -> glyph in typeface

This is really just a basic Unicode and text rendering matter that has nothing to do with kraken as such.

[0] https://kraken.re/main/advanced.html#page-segmentation
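The requirement that the transliteration and the typeface agree can be checked mechanically. The following is only a minimal sketch: `unmapped_characters` is a hypothetical helper (not part of kraken), and the set of code points covered by the typeface is assumed to be a-z here purely for illustration. In practice that set would have to be read from the font file itself.

```python
def unmapped_characters(transcription: str, font_code_points: set[int]) -> dict[str, int]:
    """Count characters in the transcription that the typeface has no glyph for."""
    missing: dict[str, int] = {}
    for ch in transcription:
        if ch.isspace():
            continue  # whitespace needs no manuscript glyph
        if ord(ch) not in font_code_points:
            missing[ch] = missing.get(ch, 0) + 1
    return missing


# Assumed coverage: a typeface that places manuscript glyphs on a-z.
font_code_points = {ord(c) for c in "abcdefghijklmnopqrstuvwxyz"}

line = "ysheees chetchy teodar otcheol tockhy"
print(unmapped_characters(line, font_code_points))             # {}
print(unmapped_characters("ysheees 4ohan", font_code_points))  # {'4': 1}
```

An empty result means every transliterated character has a glyph slot in the typeface. Anything reported is a character that would render as garbage or as a fallback glyph.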

bi3mw · 2025-02-02T14:16:34Z

Thanks for the feedback, I will think carefully about the comments.

I have now succeeded in creating well-formed XML files. Note that the Page tag needs to look like this, for example:

*Page imageFilename="000000.png" imageWidth="1426" imageHeight="78">

Without “imageWidth” and “imageHeight”, ketos aborts the training with an error message.
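For existing files that lack these attributes, they could be filled in with a small script. This is a sketch under assumptions: the images are PNG files, the namespace is the one from the file posted above, and `png_size` and `add_image_size` are hypothetical helper names, not kraken functions. It uses only the Python standard library.

```python
import struct
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def png_size(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG file."""
    if data[:8] != b"\x89PNG\r\n\x1a\n" or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    return struct.unpack(">II", data[16:24])


def add_image_size(page_xml: str, width: int, height: int) -> str:
    """Set imageWidth/imageHeight on the Page element if they are missing."""
    ET.register_namespace("", PAGE_NS)  # keep the default namespace on output
    root = ET.fromstring(page_xml)
    page = root.find(f"{{{PAGE_NS}}}Page")
    if page is None:
        raise ValueError("no Page element found")
    # setdefault leaves values alone if the file already has them
    page.attrib.setdefault("imageWidth", str(width))
    page.attrib.setdefault("imageHeight", str(height))
    return ET.tostring(root, encoding="unicode")


sample = (
    f'<PcGts xmlns="{PAGE_NS}"><Metadata />'
    '<Page imageFilename="000000.png"></Page></PcGts>'
)
# Hand-built header standing in for the first 24 bytes of 000000.png
header = (
    b"\x89PNG\r\n\x1a\n" + struct.pack(">I", 13) + b"IHDR"
    + struct.pack(">II", 1426, 78)
)
width, height = png_size(header)
fixed = add_image_size(sample, width, height)
print(fixed)
```

With real data, the header would come from reading the first 24 bytes of the image file named in `imageFilename`, and the returned string would be written back to the XML file.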

My question is whether Coords points must be specified for each Word element, like this:

    *Word>
      *Unicode>tchodar*/Unicode>
      *Coords points="1,26 172,26 172,49 1,49" />
    */Word>
    *Word>

or is this information unnecessary?
