Inconsistent behaviour of `PreTrainedTokenizerFast`s on diacritics marked texts #1663

sven-nm · 2024-10-11T09:30:00Z

System Info

transformers version: 4.45.2
Platform: Linux-5.4.0-193-generic-x86_64-with-glibc2.31
Python version: 3.12.7
Huggingface_hub version: 0.25.2
Safetensors version: 0.4.5
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): not installed (NA)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?:

Who can help?

@ArthurZucker @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

`BatchEncoding.encodings[0].word_ids` is misaligned with the input words when the text contains diacritics (accent and breathing marks, here in polytonic Greek). Here is a minimal working example:

from typing import List
import transformers
import unicodedata

# Instantiate the PreTrainedTokenizerFast
model_name = 'FacebookAI/xlm-roberta-base'
tokenizer = transformers.AutoTokenizer.from_pretrained(model_name, add_prefix_space=False)

# Example sentences
sentences: List[str] = [
    """And Odys- seus in 1. 1367, without caring to resent the sneer, simply reaffirms his right to take a line of his own, and pleads the reasonableness of his trying to win those in authority over to his side. On which Agamemnon (1. 1358) throws the entire responsibility on Odysseus, and Odysseus says (1. 1369), ‘ That makes no differ ence. Your consent, in whatever terms it is granted, will be equally kind.” If this is rejected, 1. 1366 must refer not to Odysseus’ words, but merely to his attitude of dissent. 1. 1367 is thus less pointed. For the meaning given to ἐνθάδ᾽ ἵξομαι, l. 136%, cp. Eur. Androm. 342, ἀλλ’ εἶσιν of xpf,—and for ὡς dv, 1. 1369, cp. O. C. 1361, and note. 1371. σοὶ μέν, ker A] For this un- gracious expression, cp. O. T. 671, 2, τό γὰρ σόν, οὐ τὸ τοῦδ᾽, ἐποικτείρω στόμα | ἐλεινόν, οὗτος δ᾽, ἔνθ᾽ ἂν 7, στυγήσεται. 1372. κἀκεῖ κἀνθάδ᾽ | E.on 1,. 841.}.γ8. 1373. σοὶ δὲ. ἃ Ἐχρή.] ‘You may do what you must:’ an ill-humoured way of saying, ‘Do as you please.” χρή, although rejected by Dindgrf and others in favour of χρῇς, i.e χρήζεις, is not inexpressive,and is possibly right. Cp. El. 606.—Exit Agamemnon. 1375. τοιοῦτον ὄντα] ‘While you act in this way. Cp. Phil. 1049, οὗ γὰρ τοιούτων δεῖ, τοιοῦτός εἰμ᾽ ἐγώ,""",
    # """Hello, this is a long sentence in ASCII with a lot of words, and it should be tokenized correctly 1234 how do you feel ? Hello, my name Foo, I'm a friend of Bar.""",

]

# Convert to NFC to make sure there is no floating combining character
sentences = [unicodedata.normalize('NFC', s) for s in sentences]

# Let's start with a working scenario. Here, I pre-tokenize the inputs myself with a blunt
# whitespace split. After that, we run the tokenizer and compare the maximum index in
# `BatchEncoding.encodings[0].word_ids`. It should be equal to the number of input words minus 1.
sentences_pretokenized: List[List[str]] = [s.split() for s in sentences]

batch_encoding = tokenizer(sentences_pretokenized, # ⚠️ Using the pretokenized sentences (List[List[str]])
                           padding=True,
                           truncation=True,
                           max_length=tokenizer.model_max_length,
                           pad_to_multiple_of=tokenizer.model_max_length,
                           add_special_tokens=True,
                           is_split_into_words=True) # ⚠️ Setting this to True


max_word_id = max([word_id for word_id in batch_encoding.encodings[0].word_ids if word_id is not None])
number_of_words = len(sentences_pretokenized[0])

print(f"Max word_id: {max_word_id}") # 225 ✅
print(f"Real number of words: {number_of_words}") # 226

# Good, this is what we were hoping to see. Alignment is correct. However, let's look at what
# happens if I pass the raw sentences directly, which the tokenizer should also accept:
batch_encoding = tokenizer(sentences, # ⚠️ Using the raw sentences (List[str])
                           padding=True,
                           truncation=True,
                           max_length=tokenizer.model_max_length,
                           pad_to_multiple_of=tokenizer.model_max_length,
                           add_special_tokens=True,
                           is_split_into_words=False) # ⚠️ Setting this to False (default, but explicit for clarity)


max_word_id = max([word_id for word_id in batch_encoding.encodings[0].word_ids if word_id is not None])
number_of_words = len(sentences_pretokenized[0])

print(f"Max word_id: {max_word_id}") # 231 ❌  WRONG! 
print(f"Real number of words: {number_of_words}") # 226


# Now let us see where the alignment starts to mismatch: 
for word_id, token in zip(batch_encoding.encodings[0].word_ids, batch_encoding.encodings[0].tokens):
    if word_id is None:
        print(f"Token: {token}")
        continue
    try:
        print(f"Word: {sentences_pretokenized[0][word_id]},\t\tToken: {token}")
    except IndexError:  # word_id points past the last pre-tokenized word
        print("-------ERROR-------")
        print(f"Token: {token}")

# .....
# Word: ἐνθάδ᾽,		Token: ▁ἐ
# Word: ἐνθάδ᾽,		Token: ν
# Word: ἐνθάδ᾽,		Token: θ
# Word: ἐνθάδ᾽,		Token: άδ
# Word: ἵξομαι,,		Token: ▁
# Word: ἵξομαι,,		Token: ̓  # <--- This combining diacritic seems to be causing the issue
# Word: l.,		Token: ▁
# Word: l.,		Token: ἵ
# Word: l.,		Token: ξ
# Word: l.,		Token: ομαι
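
One possible explanation, offered as a hypothesis rather than a confirmed diagnosis: if the tokenizer's normalizer applies an NFKC-style compatibility mapping (an assumption; nothing in this issue confirms it), the spacing mark at the end of ἐνθάδ᾽ would be rewritten as a space followed by the combining character U+0313, and that new space would start an extra "word". The sketch below uses only the standard library and assumes the mark is U+1FBD (GREEK KORONIS) or U+1FBF (GREEK PSILI). The helper name `introduces_whitespace` is made up for illustration.

```python
import unicodedata

# Assumption: the trailing mark in the example is one of these two code points.
CANDIDATE_MARKS = {"GREEK KORONIS": "\u1fbd", "GREEK PSILI": "\u1fbf"}


def introduces_whitespace(text: str, form: str = "NFKC") -> bool:
    """Return True if normalizing `text` adds whitespace that was not there before."""
    before = sum(ch.isspace() for ch in text)
    after = sum(ch.isspace() for ch in unicodedata.normalize(form, text))
    return after > before


# Show the code points each candidate mark becomes under NFC and NFKC.
for name, mark in CANDIDATE_MARKS.items():
    nfc = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFC", mark)]
    nfkc = [f"U+{ord(c):04X}" for c in unicodedata.normalize("NFKC", mark)]
    print(f"{name}: NFC -> {nfc}, NFKC -> {nfkc}")

# The word preceding ἵξομαι, spelled here with U+1FBD (assumption, see above).
word = "\u1f10\u03bd\u03b8\u03ac\u03b4\u1fbd"
print("NFC adds whitespace: ", introduces_whitespace(word, "NFC"))
print("NFKC adds whitespace:", introduces_whitespace(word, "NFKC"))
```

If this is what happens, it would explain why the NFC step in the reproduction does not help: NFC leaves the mark untouched, and only a compatibility mapping splits it into a space plus a combining character.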

Expected behavior

NOTE. I am aware that similar problems have been raised (huggingface/transformers#9637), and that the problem also exists with other models, even with ASCII-only examples (e.g. setting model_name to bert-base-uncased and using the second example only), but FacebookAI/xlm-roberta-base produces seamless alignment with ASCII chars.

I think there should be at least a warning as misalignments can have dramatic downstream consequences (thinking notably of token classification tasks).
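
As a rough illustration of the kind of warning meant here, the following user-side check compares the word indices reported by the tokenizer against a plain whitespace split. It is only a sketch: `check_word_ids` is a hypothetical helper, not part of transformers or tokenizers, and it assumes that "words" means whitespace-separated chunks, as in the reproduction above.

```python
import warnings
from typing import List, Optional


def check_word_ids(text: str, word_ids: List[Optional[int]]) -> bool:
    """Warn and return False if `word_ids` refer to more words than `text.split()` yields."""
    expected = len(text.split())
    indices = [w for w in word_ids if w is not None]
    observed = max(indices) + 1 if indices else 0
    # Fewer words than expected can be legitimate (truncation), so only
    # an excess of word indices is treated as a sign of misalignment.
    if observed > expected:
        warnings.warn(
            f"word_ids refer to {observed} words, but whitespace splitting "
            f"yields {expected}; alignment is probably off.",
            stacklevel=2,
        )
        return False
    return True


# Hand-written word_ids for illustration (not real tokenizer output).
print(check_word_ids("a b c", [None, 0, 1, 1, 2, None]))  # True
```

With the reproduction above, it could be called as `check_word_ids(sentences[0], batch_encoding.encodings[0].word_ids)`. The check is only meaningful for tokenizers whose pre-tokenizer splits on whitespace alone; a tokenizer that also splits on punctuation would trigger the warning on ordinary ASCII text.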


ArthurZucker · 2024-10-22T13:20:51Z

Hey! Pretty sure this is related to the tokenizers library directly. I don't have time to investigate as of right now, hope someone can help! 🤗

sven-nm added the bug label Oct 11, 2024

ArthurZucker transferred this issue from huggingface/transformers Oct 22, 2024
