About eos_token_id in config file (20M, 1B) #757

lllabmaster · 2024-11-29T03:30:17Z

❓ The question

In the 20M configuration file (OLMo-20M.yaml), the settings specify:
eos_token_id: 0, pad_token_id: 1, and tokenizer: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json.

However, in the tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json file, I noticed that id 0 corresponds to "|||IP_ADDRESS|||", while <|endoftext|> is assigned id 50279. This seems to contradict the configuration, especially when compared with the tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json file.
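One way to double-check which id a token maps to is to read the tokenizer JSON directly. The sketch below assumes the file follows the usual Hugging Face `tokenizers` layout (a `model.vocab` mapping plus an `added_tokens` list of `{"id", "content"}` entries). The helper name `lookup_token_ids` is hypothetical, and the toy dictionary is a stand-in that only mirrors the two ids reported above, not the real file.

```python
import json


def lookup_token_ids(tokenizer_json, names):
    """Return {token string: id or None} for the requested token strings."""
    # Base vocabulary (assumed layout: {"model": {"vocab": {token: id}}}).
    vocab = dict(tokenizer_json.get("model", {}).get("vocab", {}))
    # Special/added tokens are listed separately and take precedence here.
    for entry in tokenizer_json.get("added_tokens", []):
        vocab[entry["content"]] = entry["id"]
    return {name: vocab.get(name) for name in names}


# Toy stand-in for the tokenizer file; with the real file one would use:
#   with open("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json") as f:
#       tokenizer_json = json.load(f)
toy_tokenizer_json = {
    "model": {"vocab": {"hello": 12092}},  # made-up entry for illustration
    "added_tokens": [
        {"id": 0, "content": "|||IP_ADDRESS|||"},
        {"id": 50279, "content": "<|endoftext|>"},
    ],
}

ids = lookup_token_ids(toy_tokenizer_json, ["<|endoftext|>", "|||IP_ADDRESS|||"])
print(ids)
```

Comparing the id returned for `<|endoftext|>` against `eos_token_id` in the YAML config would show directly whether the two agree.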

Additionally, I inspected the preprocessed data file (part-1-00000.npy) and ran the following analysis:

list((np.where(data == 50279)[0][1:] - np.where(data == 50279)[0][:-1]) / 4)

This assumes an average of 4 tokens per English word. The results were:
[27639.0, 16304.25, 23344.5, 26183.25, 35961.75, 6302.0, 42492.0, 4867.0, 7313.5, ...] (length = 7386).
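The same gap analysis can be written a little more compactly with `np.flatnonzero` and `np.diff`. This is a minimal sketch on a synthetic array, not the real data file; the helper name `eos_gaps` is hypothetical, and it reports gaps in tokens rather than dividing by an assumed tokens-per-word ratio.

```python
import numpy as np


def eos_gaps(tokens, eos_id):
    """Distances (in tokens) between consecutive occurrences of eos_id."""
    positions = np.flatnonzero(np.asarray(tokens) == eos_id)
    return np.diff(positions)


# Synthetic token stream with the candidate EOS id at indices 2, 6 and 8.
toy_tokens = np.array([5, 6, 50279, 7, 8, 9, 50279, 1, 50279], dtype=np.uint16)

gaps = eos_gaps(toy_tokens, 50279)
print(gaps.tolist())  # [4, 2]
```

If a candidate id really is the document separator, the gaps should look like plausible document lengths. Running the same function with the other candidate (for example `eos_gaps(data, 0)`) gives a quick comparison.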

I have two questions:

1. What should I set as the eos_token_id in the configuration file when using the allenai_eleuther-ai-gpt-neox-20b-pii-special.json tokenizer?
2. Was the preprocessed data (gpt-neox-olmo-dolma-v1_5/part-X-00000.npy) encoded with eos_token_id=50279?

Thank you for your assistance!


lllabmaster added the type/question label on Nov 29, 2024
