
[Not an Issue] How to keep the trained model as close as possible to groundtruth #646

Open
johnlockejrr opened this issue Sep 27, 2024 · 3 comments


@johnlockejrr

I am trying to train a segmentation model for a modern printed Judaeo-Arabic dataset. The problem I face is that the trained model mainly loses the vowel signs below the line. What can be done? I have tried both training from scratch and fine-tuning.

```
ketos segtrain --line-width 10 -mr Main:textzone --precision 16 -d cuda:0 -f page -t output.txt --resize both -tl -i /home/incognito/kraken-train/teyman_print/biblialong02_se3_2_tl.mlmodel -q early --min-epochs 80 -o /home/incognito/kraken-train/teyman_print/teyman_print_scr_cl/teyman_print_tl_v3
```

Manual segmentation as ground truth:
[image: manual]

Segmentation with the newly trained model (the dataset is small; this is preliminary):
[image: trained]

@johnlockejrr (Author)

Should I try to train it with the baseline at the center line rather than at the top line, as is normal for Hebrew?

@dstoekl commented Sep 28, 2024

I don't think it will help. Use the API to improve the polygons by calculating the average line distance and extrapolating from there.
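The suggestion above could be sketched roughly as follows. This is a minimal, hypothetical illustration (not kraken's actual API): it assumes each line polygon is a list of `(x, y)` points with an associated baseline height, computes the average vertical distance between consecutive baselines, and pushes the bottom edge of each polygon down by a fraction of that distance so that diacritics below the line fall inside the polygon. The function names and the `factor` parameter are invented for the example.

```python
# Hypothetical sketch: widen line polygons downward by a fraction of the
# average inter-baseline distance, so vowel signs below the line are kept.

def average_line_distance(baseline_ys):
    """Mean vertical gap between consecutive baselines (sorted top to bottom)."""
    ys = sorted(baseline_ys)
    if len(ys) < 2:
        return 0.0
    gaps = [b - a for a, b in zip(ys, ys[1:])]
    return sum(gaps) / len(gaps)

def extend_polygon_down(polygon, baseline_y, avg_dist, factor=0.3):
    """Push every polygon point below the baseline further down by
    factor * avg_dist; points on or above the baseline stay unchanged."""
    pad = factor * avg_dist
    return [(x, y + pad) if y > baseline_y else (x, y) for x, y in polygon]

# Three lines roughly 60 px apart; one line's polygon hugs the baseline.
baseline_ys = [100, 160, 220]
avg = average_line_distance(baseline_ys)
poly = [(10, 90), (200, 90), (200, 110), (10, 110)]
wider = extend_polygon_down(poly, baseline_y=100, avg_dist=avg)
print(avg, wider)
```

In practice you would run this as a post-processing step over the segmenter's output before feeding the lines to recognition; the padding factor would need tuning so polygons from adjacent lines don't overlap.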

@johnlockejrr (Author) commented Sep 28, 2024

> I don't think it will help. Use the API to improve the polygons by calculating the average line distance and extrapolating from there.

It's not a problem with the dataset but with the model output. Use the API to do what? The model should perform better.

Are you perhaps aware of a Hebrew segmentation model that can properly handle nikkud and cantillation marks?
