Update pattern key for split pretokenizer #38

Open: wants to merge 1 commit into base: main

jackzhxng (Contributor) commented Mar 27, 2025

The Split pretokenizer handling was missing the "Regex" key that sits under "pattern", e.g.

"pretokenizers": [
      {
        "type": "Split",
        "pattern": {
          "Regex": "[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]*[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]+(?i:'s|'t|'re|'ve|'m|'ll|'d)?|[^\\r\\n\\p{L}\\p{N}]?[\\p{Lu}\\p{Lt}\\p{Lm}\\p{Lo}\\p{M}]+[\\p{Ll}\\p{Lm}\\p{Lo}\\p{M}]*(?i:'s|'t|'re|'ve|'m|'ll|'d)?|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n/]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+"
        },
        "behavior": "Removed",
        "invert": true
      },
      ...
]
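
As a rough illustration of the key lookup this PR is about, here is a minimal Python sketch, not this repository's code. It assumes the `pattern` object carries either a `"Regex"` key (as in the snippet above) or a `"String"` key holding a literal, which is how Hugging Face `tokenizer.json` files serialize the Split pattern. The function name `split_pattern_to_regex` is hypothetical.

```python
import json
import re


def split_pattern_to_regex(pretokenizer: dict) -> str:
    """Return the regex source for one Split pretokenizer entry.

    Hypothetical helper: assumes "pattern" is an object holding either
    a "Regex" key (regex source) or a "String" key (literal text).
    """
    pattern = pretokenizer["pattern"]
    if isinstance(pattern, dict):
        if "Regex" in pattern:
            # Already a regex: use it as written.
            return pattern["Regex"]
        if "String" in pattern:
            # Literal text: escape it so it matches verbatim.
            return re.escape(pattern["String"])
    raise ValueError("Split pattern needs a 'Regex' or 'String' key")


# Small stand-in entry; the real pattern in the PR is much longer.
entry = json.loads(
    r'{"type": "Split", "pattern": {"Regex": "\\s+"}, '
    r'"behavior": "Removed", "invert": true}'
)
print(split_pattern_to_regex(entry))
```

Raising on an unrecognized shape, rather than silently falling back, makes a missing key show up at load time instead of as an empty encoding later.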

Test (still looking into the RE2 "invalid perl operator" error, which is triggered by the negative lookahead `(?!` in the pattern):

>> cmake-out/examples/tokenize_tool/tokenize_tool hf_tokenizer ~/hf/models--microsoft--Phi-4-mini-instruct/snapshots/c0fb9e74abda11b496b7907a9c6c9009a7a0488f/tokenizer.json "Hello world!"

WARNING: All log messages before absl::InitializeLog() is called are written to STDERR
E0000 00:00:1743095300.202419 3915421 re2.cc:237] Error parsing '([^\r\n\p{L}\p{N}]?[\p{Lu}\p{Lt}\p{Lm}\p{Lo}\p{M}]*[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(?i:'s|'t|'re|'ve|'m|'l...': invalid perl operator: (?!
Vocab Size: 200029
BOS: 199999
EOS: 199999

PROMPT:
Hello world!

Encoding...
E0000 00:00:1743095300.500576 3915421 re2.cc:921] Invalid RE2: invalid perl operator: (?!
[ ]

Decoding...
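
The empty encoding above follows from RE2 rejecting the pattern: RE2 does not support lookahead or lookbehind assertions, and the Phi-4-mini pattern contains `\s+(?!\S)`. One way to surface this before compiling is to scan the pattern for lookaround openers. The sketch below is in Python and is only an illustration, not the fix planned for this PR. The name `find_lookarounds` is hypothetical, and the scanner is deliberately simple: it skips escaped characters and character classes, but does not handle a literal `]` placed first in a class.

```python
from typing import List, Tuple

# Lookaround openers that RE2 rejects as "invalid perl operator".
LOOKAROUND_PREFIXES = ("(?<=", "(?<!", "(?=", "(?!")


def find_lookarounds(pattern: str) -> List[Tuple[int, str]]:
    """Return (offset, opener) for each lookaround found in `pattern`."""
    hits = []
    i = 0
    in_class = False
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\":
            # Skip the backslash and the character it escapes.
            i += 2
            continue
        if in_class:
            if ch == "]":
                in_class = False
        elif ch == "[":
            in_class = True
        elif ch == "(":
            for prefix in LOOKAROUND_PREFIXES:
                if pattern.startswith(prefix, i):
                    hits.append((i, prefix))
                    break
        i += 1
    return hits


# Tail of the pattern from the PR description.
tail = r"\s*[\r\n]+|\s+(?!\S)|\s+"
for offset, opener in find_lookarounds(tail):
    print(f"unsupported {opener!r} at offset {offset}")
```

A check like this could be used to pick a different regex engine or to fail early with a clearer message, instead of continuing with an invalid RE2 object.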

facebook-github-bot added the CLA Signed label Mar 27, 2025
3 participants