
tokenizer.json modified after tokenizer.save_pretrained of OLMO models #34744

Open

OdinOps opened this issue Nov 15, 2024 · 9 comments

@OdinOps commented Nov 15, 2024

System Info

  • transformers version: 4.45.0
  • Platform: Linux-6.8.0-48-generic-x86_64-with-glibc2.39
  • Python version: 3.10.15
  • Huggingface_hub version: 0.26.2
  • Safetensors version: 0.4.5
  • Accelerate version: 1.0.1
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.4.0+rocm6.1 (True)
  • Tensorflow version (GPU?): not installed (NA)
  • Flax version (CPU?/GPU?/TPU?): not installed (NA)
  • Jax version: not installed
  • JaxLib version: not installed
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: AMD Instinct MI250X/MI250

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When I load and then save the tokenizer of an OLMo model, the saved tokenizer.json differs from the original, particularly in the merges key.

(screenshot: diff between the original and the re-saved tokenizer.json)

The code to reproduce this is:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("allenai/OLMo-1B-0724-hf")
tokenizer.save_pretrained("saved_tokenizer")

Expected behavior

The original tokenizer.json and the saved tokenizer.json should be the same.
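
A quick way to confirm the mismatch is a byte-for-byte comparison of the file cached from the Hub against the re-saved one (a minimal sketch; the save path matches the snippet above):

from huggingface_hub import hf_hub_download

# Path of the original tokenizer.json as downloaded from the Hub.
original = hf_hub_download("allenai/OLMo-1B-0724-hf", "tokenizer.json")

# Compare against the file written by save_pretrained above.
with open(original, "rb") as f1, open("saved_tokenizer/tokenizer.json", "rb") as f2:
    print("identical:", f1.read() == f2.read())  # expected True, but False per this report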

OdinOps added the bug label Nov 15, 2024
@LysandreJik (Member)

Hey @zzf1130, I believe this change is to make the tokenizer.json more flexible/future-proof, and is therefore a voluntary change.

I will let @ArthurZucker comment on it, thank you!

@DtYXs commented Nov 24, 2024

> Hey @zzf1130, I believe this change is to make the tokenizer.json more flexible/future-proof, and is therefore a voluntary change.
>
> I will let @ArthurZucker comment on it, thank you!

Hello. This change causes an error when I use vLLM to run a server.

  File "/opt/conda/lib/python3.11/site-packages/transformers/tokenization_utils_fast.py", line 115, in __init__
    fast_tokenizer = TokenizerFast.from_file(fast_tokenizer_file)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Exception: data did not match any variant of untagged enum ModelWrapper at line 757443 column 3

I had to replace it with the original tokenizer.json.
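
For reference, the replacement step amounts to something like this (a sketch; paths are illustrative, reusing the OLMo repo from the original report):

import shutil
from huggingface_hub import hf_hub_download

# Overwrite the re-saved tokenizer.json with the original one from the Hub,
# so that an older tokenizers/vLLM install can still parse it.
original = hf_hub_download("allenai/OLMo-1B-0724-hf", "tokenizer.json")
shutil.copy(original, "saved_tokenizer/tokenizer.json")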

@ArthurZucker (Collaborator)

Hey! This is somewhat expected: we had a breaking change in tokenizers that breaks forward compatibility (so newly saved tokenizer files won't be loaded by older versions). Make sure to update transformers.
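
If you hit the ModelWrapper error above, checking which versions the failing environment actually has is a quick first step (a sketch; the exact versions required depend on when the tokenizer.json was written):

import tokenizers
import transformers

# The "untagged enum ModelWrapper" error usually means the installed tokenizers
# library predates the serialization format used by the saved tokenizer.json.
print("transformers:", transformers.__version__)
print("tokenizers:", tokenizers.__version__)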

@davidmezzetti (Contributor)

@ArthurZucker

I ran into this when quantizing a Llama model. It doubles the file size of tokenizer.json, and when I attempted to check in the quantized model, git rejected tokenizer.json (it went from 9MB to 17MB).

This might require a lot of projects to add tokenizer.json to .gitattributes as an LFS file. I had to spend quite a while tracing why quantization would change the tokenizer.

I ended up copying the tokenizer files from the base model, as I didn't think it would be good practice to change the tokenizer for a quantized model.
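
For anyone who does decide to track the larger file with LFS, the .gitattributes entry mentioned above is the standard one-liner (shown only as an illustration, not a recommendation):

tokenizer.json filter=lfs diff=lfs merge=lfs -text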

@ArthurZucker (Collaborator)

Ah yes, that's an issue. This is related to the pretty-print format; I will try to fix it for tokenizers==0.21.1!

@davidmezzetti (Contributor)

That should do the trick!

@Narsil (Contributor) commented Dec 2, 2024

> @ArthurZucker
>
> I ran into this when quantizing a Llama model. It doubles the file size of tokenizer.json, and when I attempted to check in the quantized model, git rejected tokenizer.json (it went from 9MB to 17MB).
>
> This might require a lot of projects to add tokenizer.json to .gitattributes as an LFS file. I had to spend quite a while tracing why quantization would change the tokenizer.
>
> I ended up copying the tokenizer files from the base model, as I didn't think it would be good practice to change the tokenizer for a quantized model.

9MB is really high already; the limit is 10MB for LFS, so any nudge will push it into LFS territory. Removing pretty printing still leaves it at 12MB. Everything zipped is 2.5MB (both the 17MB and 12MB versions), which shows how redundant that data is...
Culprit: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct/raw/main/tokenizer.json

> Copying the tokenizer

Seems like a good strategy too.
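
The size numbers above are easy to reproduce locally (a quick sketch; the path is illustrative):

import gzip

# Compare the on-disk size of tokenizer.json with its gzip-compressed size.
raw = open("tokenizer.json", "rb").read()
print(f"raw: {len(raw) / 1e6:.1f} MB, gzipped: {len(gzip.compress(raw)) / 1e6:.1f} MB")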

@Narsil (Contributor) commented Dec 2, 2024

Linked to huggingface/tokenizers#1656.

Personally, if we're going to disable pretty printing, I think I'd rather move away from JSON altogether and get to that 2.5MB target directly.
Another option would be better merges: store them as they are in memory, i.e. instead of storing each merge as a pair of token strings, store only the token ids. Depending on the size of the tokens this could be a compression in itself (it would definitely save the 4 bytes of " quotes around the two strings in each merge, plus any multi-byte UTF-8 in there).

Edit: I don't think moving away from JSON is good; JSON is meant to be readable and editable, which is great. But making it unreadable by not pretty-printing doesn't seem to yield enough wins, so if we care that much about size there are probably better options.
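
A sketch of the "ids instead of strings" idea, for illustration only (this is not how tokenizers actually serializes merges); it just maps each merge pair through the vocab of an existing tokenizer.json:

import json

with open("tokenizer.json") as f:
    tok = json.load(f)

vocab = tok["model"]["vocab"]    # token string -> integer id
merges = tok["model"]["merges"]  # old format: "A B" strings; new format: ["A", "B"] pairs

def to_pair(merge):
    # Normalize both serialization formats to a (left, right) pair of token strings.
    return tuple(merge.split(" ", 1)) if isinstance(merge, str) else tuple(merge)

# Store each merge as a pair of token ids instead of a pair of quoted strings.
id_merges = [[vocab[a], vocab[b]] for a, b in map(to_pair, merges)]
print(merges[0], "->", id_merges[0])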

@ArthurZucker (Collaborator)

So, TL;DR: do you want to go with using ids instead of strings?
