About eos_token_id in config file (20M, 1B) #757

Open
lllabmaster opened this issue Nov 29, 2024 · 0 comments
Labels
type/question An issue that's a question


❓ The question

In the 20M configuration file (OLMo-20M.yaml), the settings specify:
eos_token_id: 0, pad_token_id: 1, and tokenizer: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json.

However, in the tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json file, I noticed that id 0 corresponds to "|||IP_ADDRESS|||", while <|endoftext|> is assigned id 50279. This seems to contradict the configuration, especially when compared with the tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json file.
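A minimal sketch of how the id-to-token mapping could be checked directly. It assumes the tokenizer file follows the usual Hugging Face `tokenizers` JSON layout, with an `added_tokens` list of `{"id", "content"}` entries and a BPE-style `model.vocab` dictionary; the helper name `token_for_id` is hypothetical and not part of OLMo.

```python
import json


def token_for_id(tokenizer_json: dict, token_id: int):
    """Return the token string mapped to token_id, or None if no entry matches.

    Assumes a Hugging Face `tokenizers`-style layout: special tokens under
    "added_tokens" and the base vocabulary under "model" -> "vocab".
    """
    # Special tokens are listed explicitly with their ids.
    for entry in tokenizer_json.get("added_tokens", []):
        if entry.get("id") == token_id:
            return entry.get("content")
    # Fall back to the base vocabulary (token string -> id).
    vocab = tokenizer_json.get("model", {}).get("vocab", {})
    for token, idx in vocab.items():
        if idx == token_id:
            return token
    return None


def load_tokenizer_json(path: str) -> dict:
    """Load a tokenizer JSON file from disk."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Usage against the file named in this issue (path relative to the repo root):
# tok = load_tokenizer_json("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json")
# print(token_for_id(tok, 0), token_for_id(tok, 50279))
```

Running the same lookup on both tokenizer files would show which one, if either, maps id 0 to an end-of-text token.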

Additionally, I inspected the preprocessed data file (part-1-00000.npy) and ran the following analysis:

list((np.where(data == 50279)[0][1:] - np.where(data == 50279)[0][:-1]) / 4)

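Dividing by 4 assumes an average of 4 tokens per English word, so each value estimates the number of words between consecutive occurrences of token 50279. The results were: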
[27639.0, 16304.25, 23344.5, 26183.25, 35961.75, 6302.0, 42492.0, 4867.0, 7313.5, ...] (length = 7386).
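The same gap analysis can be sketched a little more readably with `np.diff`. This assumes the tokens are already available as a 1-D integer array; how the `.npy` file is loaded is not shown here, and the function name `eos_gaps` is hypothetical. The toy array is made up for illustration and is not taken from the dataset.

```python
import numpy as np


def eos_gaps(tokens, eos_id=50279):
    """Return the distances, in tokens, between consecutive eos_id occurrences."""
    # Indices at which the candidate end-of-text id appears.
    positions = np.flatnonzero(np.asarray(tokens) == eos_id)
    # Differences between neighbouring indices give the gap lengths.
    return np.diff(positions)


# Toy example: the id 50279 appears at indices 2, 6 and 8.
toy = np.array([5, 6, 50279, 7, 8, 9, 50279, 1, 50279])
print(eos_gaps(toy))  # [4 2]
```

If 50279 really is the document separator, the gaps should look like plausible document lengths; repeating the call with `eos_id=0` would give a comparison for the value in the config.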

I have two questions:

  1. What should I set as the eos_token_id in the configuration file when using the allenai_eleuther-ai-gpt-neox-20b-pii-special.json tokenizer?
  2. Was the preprocessed data (gpt-neox-olmo-dolma-v1_5/part-X-00000.npy) encoded with eos_token_id=50279?

Thank you for your assistance!

lllabmaster added the type/question label Nov 29, 2024