Weird results for Tamil and Russian tokenization #73
Comments
Just a note: the Moses tokenizer has the same behavior:
|
Yes, this is actually the same as the Hindi problem in #42. There's a way to resolve it, but it requires a little more digging into, and understanding of, how Indian languages are represented in the Unicode charset =( It is caused by https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L41 padding spaces around characters that aren't considered alphanumeric according to https://github.com/alvations/sacremoses/blob/master/sacremoses/tokenize.py#L24 Adding the |
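To illustrate the kind of behavior described above, here is a minimal sketch. It is not the sacremoses code: `pad_non_alnum` and its `keep_marks` flag are hypothetical names made up for this example. It assumes the root of the problem is that Tamil vowel signs and the virama are Unicode combining marks (general category `M*`) rather than letters or digits, so a rule that pads every non-alphanumeric character with spaces splits words apart.

```python
import unicodedata


def pad_non_alnum(text, keep_marks=False):
    """Hypothetical helper: pad characters that are not letters/digits with spaces.

    With keep_marks=True, combining marks (Unicode categories Mn, Mc, Me)
    are treated as part of the word instead of being padded.
    """
    pieces = []
    for ch in text:
        major = unicodedata.category(ch)[0]  # 'L' letter, 'N' number, 'M' mark, ...
        is_word_char = major in ("L", "N") or (keep_marks and major == "M")
        if is_word_char or ch.isspace():
            pieces.append(ch)
        else:
            pieces.append(f" {ch} ")
    # Collapse the runs of whitespace introduced by the padding.
    return " ".join("".join(pieces).split())


word = "\u0ba4\u0bae\u0bbf\u0bb4\u0bcd"  # the Tamil word for "Tamil"
print(pad_non_alnum(word))                   # vowel sign and virama get split off
print(pad_non_alnum(word, keep_marks=True))  # word stays in one piece
print(pad_non_alnum("Hello, world!"))        # ordinary punctuation is still padded
```

Treating marks as word characters keeps the Tamil word intact while leaving punctuation handling unchanged; whether the same change is the right fix inside sacremoses would need to be checked against its actual regexes.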
And the problem also exists for Russian:
|
Same outputs from default
|
Hi, the results for Tamil tokenization are weird: