Inconsistent behaviour of PreTrainedTokenizerFast
s on diacritics marked texts
#1663
Open
2 of 4 tasks
Labels
bug
Something isn't working
System Info
transformers
version: 4.45.2Who can help?
@ArthurZucker @itazap
Information
Tasks
examples
folder (such as GLUE/SQuAD, ...)Reproduction
BatchEncoding.encodings[0].word_ids
has alignment errors when working with diacritics (i.e. special accents). Here is a minimal working example:Expected behavior
NOTE. I am aware that similar problem have been raised (huggingface/transformers#9637), and that the problem also exists with other models, even with ASCII-only examples (e.g. setting
model_name
tobert-base-uncased
and using the second example only), butFacebookAI/xlm-roberta-base
produces seemless alignment with ASCII chars.I think there should be at least a warning as misalignments can have dramatic downstream consequences (thinking notably of token classification tasks).
The text was updated successfully, but these errors were encountered: