-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tokenizer clean_up_tokenization_spaces #65
Comments
I'm a bit confused, why does it say that it is true by default, then immediately after it is false by default? |
As I read it; it is set to When they move to version 4.45 it is going to be set to |
Would True or False be better in this case? Perhaps it is useful to have reversible string, but that would be bad if it affects the final results (regression) |
I do not know the code good enough to answer that. What I found, from some simple tests, is that formatting is not preserved. E.g. Input text:
Output:
Setting As the project is for now, I do not see any feature for reverting and not sure if that is something that would be of need. Perhaps in some sub-component? Preserving formatting on the other hand would be nice, but a rather huge challenge when it comes to the nature of translations. E.g:
Where if anywhere would one add the extra white-space etc.; And a rather big beast to handle - unless this is something already a feature or being worked on in the models. |
Hmm I see. In this case, i'm happy to merge a PR if you are interested in creating one! |
FYI
Changes in transformers tokenizer gives deprecation warning.
Something like this can be used:
Have not found any difference in result by using True vs False, but then again I just started looking at this project.
The text was updated successfully, but these errors were encountered: