Support newer merges format in tokenizer.json files #392

Merged: 2 commits merged into main on Oct 23, 2024

robertknight (Owner) commented on Oct 23, 2024

`tokenizer.json` files generated by the newest versions of the HuggingFace
tools represent the BPE merge list as an array of string pairs rather than an
array of space-separated strings. Update the `tokenizer.json` parser and the
`Bpe` model to support this format. A `merge_pairs_from_lines` function has been
added to convert the old format into the new one, for code that uses the `Bpe`
model directly.
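
To illustrate the difference between the two formats, here is a minimal sketch of the conversion using only the standard library. The `pairs_from_lines` helper is hypothetical and written for this example; it is not the `merge_pairs_from_lines` function added in this PR, whose exact signature and handling of edge cases may differ.

```rust
// Old format in tokenizer.json:  "merges": ["h e", "he l"]
// New format in tokenizer.json:  "merges": [["h", "e"], ["he", "l"]]

/// Hypothetical helper (not the crate's API): split each legacy
/// "left right" merge line into a `(left, right)` pair at the first space.
/// Lines that contain no space are skipped.
fn pairs_from_lines<'a>(lines: &[&'a str]) -> Vec<(&'a str, &'a str)> {
    lines
        .iter()
        .filter_map(|&line| line.split_once(' '))
        .collect()
}
```

Note that splitting on the first space assumes neither token in a legacy line contains a space itself, which is a limitation the pair-based format does not have.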

Fixes #391.

This was generated using transformers v4.45 and tokenizers v0.20.1.
@robertknight robertknight merged commit 72e4c7a into main Oct 23, 2024
2 checks passed
@robertknight robertknight deleted the tokenizer-json-merge-pairs branch October 23, 2024 07:14
Linked issue: Parsing of merges field in tokenizer.json fails