Training on Voynich Manuscript #681
Comments
Is there no solution to this problem?
Sorry, I forgot to actually send the response.
First, there is probably a fundamental misunderstanding. You *can* use
synthetic training data for training recognition models, but this has
its limits and is unlikely to work with handwritten material.
Second, the line generator only produces data that is compatible
with the legacy bounding box segmenter, which doesn't really work with
manuscripts. That is why the generator outputs a deprecation warning. A
basic explanation of what each segmenter does is here [0].
Last but not least, the symbols that you see in the console output are
probably correct. The glyphs in the Voynich manuscript are not in the
Unicode standard. There are two ways to encode them anyway (the Private
Use Area, PUA, or transliteration into existing code points), although
neither will display correctly on arbitrary systems by default (and both
are exceedingly unlikely to show up correctly on the command line even if
rendering in the output images is correct).
What I think is happening is that your transcription maps symbols in the
manuscript onto code points of the Latin alphabet (?a-z). This also
seems to be the case for the typeface you're using, at least from a
quick glance (Latin alphabet code points have manuscript glyphs in
there). You need to make sure that the transliteration of your text file
is the same as the mapping in the typeface; otherwise you're going to get
garbled output. Basically this:
glyph on page -> Latin transliteration -> glyph in typeface
This is really just a basic Unicode and text rendering matter that has
nothing to do with kraken as such.
[0] https://kraken.re/main/advanced.html#page-segmentation
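The consistency requirement described above (every character in the transliteration must have a matching glyph in the typeface) can be checked before generating any data. The following is only a minimal sketch: the function name `unmapped_characters` and the assumed coverage of a-z are hypothetical, and in practice the covered code points would have to be read from the actual font file. It checks coverage only, not whether the text file and the typeface follow the same transliteration scheme.

```python
def unmapped_characters(transcription: str, typeface_code_points: set[str]) -> set[str]:
    """Return characters used in the transcription that the typeface does not map."""
    # Whitespace only separates words and needs no glyph of its own in this check.
    used = {ch for ch in transcription if not ch.isspace()}
    return used - typeface_code_points


# Assumption for illustration: the typeface places manuscript glyphs on a-z only.
typeface_code_points = set("abcdefghijklmnopqrstuvwxyz")

line = "ysheees chetchy teodar otcheol tockhy"
missing = unmapped_characters(line, typeface_code_points)
if missing:
    print("characters without a glyph in the typeface:", sorted(missing))
else:
    print("every character in the line is covered by the typeface")
```

An empty result is necessary but not sufficient: two schemes can both use a-z and still assign different manuscript glyphs to the same letter.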
Thanks for the feedback, I will think carefully about the comments. I have now succeeded in creating well-formed XML files. It should be noted that the Page tag should look like this, for example:
*Page imageFilename="000000.png" imageWidth="1426" imageHeight="78">
Without “imageWidth” and “imageHeight”, Ketos aborts the training with an error message.
My question is whether the Coords points must be specified after each Word tag, like this:
or is this information unnecessary?
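For anyone hitting the same abort, here is a rough sketch of filling in the two attributes automatically. It is not part of kraken; `png_size` and `add_page_dimensions` are hypothetical helper names, and the sketch assumes the images are PNG files and the XML uses the 2019-07-15 PAGE namespace shown in this thread. It uses only the Python standard library.

```python
import struct
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def png_size(data: bytes) -> tuple[int, int]:
    """Read (width, height) from the IHDR chunk at the start of a PNG file."""
    if data[:8] != b"\x89PNG\r\n\x1a\n" or data[12:16] != b"IHDR":
        raise ValueError("not a PNG file")
    # Width and height are two big-endian 32-bit integers right after "IHDR".
    return struct.unpack(">II", data[16:24])


def add_page_dimensions(xml_text: str, width: int, height: int) -> str:
    """Set imageWidth/imageHeight on every Page element that lacks them."""
    ET.register_namespace("", PAGE_NS)  # keep the default namespace on output
    root = ET.fromstring(xml_text)
    for page in root.iter(f"{{{PAGE_NS}}}Page"):
        if page.get("imageWidth") is None:
            page.set("imageWidth", str(width))
        if page.get("imageHeight") is None:
            page.set("imageHeight", str(height))
    return ET.tostring(root, encoding="unicode")


# Small in-memory example; the dimensions are the ones quoted in the comment above.
example = (
    f'<PcGts xmlns="{PAGE_NS}">'
    '<Page imageFilename="000000.png"/>'
    "</PcGts>"
)
print(add_page_dimensions(example, 1426, 78))
```

With real files, the dimensions would come from the image itself, for example `png_size(open("000000.png", "rb").read(24))`, rather than being typed in by hand.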
Hello,
I would like to train on the so-called Voynich Manuscript. My current problem is as follows.
I get this error message when I start the training:
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
Here is my XML file (example):
*PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15 http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd">
*Metadata />
*Page imageFilename="000006.png">
*TextRegion id="block_1" custom="structure {type:region_type;}">
*Coords points="2,34 2,72 897,72 897,34" />
*TextLine id="line_1">
*Baseline points="2,40 177,36 385,34 544,34 745,36" />
*TextEquiv>
*Unicode>ysheees chetchy teodar otcheol tockhy*/Unicode>
*/TextEquiv>
*Word>
*Unicode>ysheees*/Unicode>
*/Word>
*Word>
*Unicode>chetchy*/Unicode>
*/Word>
*Word>
*Unicode>teodar*/Unicode>
*/Word>
*Word>
*Unicode>otcheol*/Unicode>
*/Word>
*Word>
*Unicode>tockhy*/Unicode>
*/Word>
*/TextLine>
*/TextRegion>
*/Page>
*/PcGts>
Note: * stands for the opening angle bracket, which otherwise makes the post invisible.