Tokenization for Hindi (e.g. क्या) is weird
#42
Comments
The same is true for Chinese and Korean as well: sacremoses splits all characters. Here's some Chinese:
And some Korean:
Which is a shame, as I'd really like to use sacremoses as the tokenizer with LASER instead of using subprocess and temp files to call the moses perl scripts. |
Expected behavior for zh and ko:
|
Looks like the issue comes from https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L420 where the non-alphanumeric characters are padded with spaces. It looks like the
But when we check
Using the
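For illustration only, here is a minimal sketch of how padding non-alphanumeric characters can break a Devanagari word apart. It is not the sacremoses code: `pad_non_alnum` and its `keep_marks` flag are hypothetical names, and the sketch assumes the padding step classifies characters the way Python's `str.isalnum()` does. Under that assumption, the virama and the vowel sign in क्या are combining marks (Unicode categories `Mn` and `Mc`), so they do not count as alphanumeric and get spaces around them.

```python
import unicodedata


def pad_non_alnum(text, keep_marks=False):
    """Pad characters that are neither alphanumeric nor whitespace with spaces.

    Hypothetical helper for illustration; not part of sacremoses.
    """
    pieces = []
    for ch in text:
        # Combining marks have a general category starting with "M" (Mn, Mc, Me).
        is_mark = unicodedata.category(ch).startswith("M")
        if ch.isalnum() or ch.isspace() or (keep_marks and is_mark):
            pieces.append(ch)
        else:
            pieces.append(f" {ch} ")
    # Collapse the runs of spaces introduced by the padding.
    return " ".join("".join(pieces).split())


# The example word from the issue title, written with escapes: क्या
word = "\u0915\u094d\u092f\u093e"

# Show each code point, its Unicode category, and whether isalnum() accepts it.
for ch in word:
    print(f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isalnum())

# Number of whitespace-separated pieces with and without mark handling.
print(len(pad_non_alnum(word).split()))
print(len(pad_non_alnum(word, keep_marks=True).split()))
```

With `keep_marks=True` the word stays in one piece, because the marks are treated as part of the preceding letter. Whether the eventual fix in sacremoses takes this approach is not stated in this thread.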
|
@johnfarina Thanks for spotting that! The latest PR, #60, should resolve the CJK issues. The Hindi one is a little more complicated, so leaving this issue open.
|
Oh wow, comment on a github issue, go to bed, wake up, bug is fixed! Thanks so much @alvations !! |
@alvations any update on the Hindi tokenization issue? |