In the 20M configuration file (`OLMo-20M.yaml`), the settings specify `eos_token_id: 0`, `pad_token_id: 1`, and `tokenizer: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json`.
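For reference, the relevant excerpt looks roughly like this (nesting abbreviated; the exact layout of the actual file may differ):

```yaml
model:
  eos_token_id: 0
  pad_token_id: 1

tokenizer:
  identifier: tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json
```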
However, in the `tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json` file, I noticed that id 0 corresponds to `|||IP_ADDRESS|||`, while `<|endoftext|>` is assigned id 50279. This seems to contradict the configuration, especially when compared with the `tokenizers/allenai_eleuther-ai-gpt-neox-20b-pii-special.json` file.
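This is how the ids can be confirmed (a minimal sketch using the Hugging Face `tokenizers` library):

```python
# Sketch: look up both ids directly in the tokenizer file.
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizers/allenai_gpt-neox-olmo-dolma-v1_5.json")
print(tok.token_to_id("|||IP_ADDRESS|||"))  # -> 0
print(tok.token_to_id("<|endoftext|>"))     # -> 50279
```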
Additionally, I inspected the preprocessed data file (`part-1-00000.npy`) and ran the following analysis:
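Roughly, the analysis was the following (a minimal sketch; it assumes the file is a raw flat array of `uint16` token ids and uses `eos_token_id=50279` as the document separator, which is exactly the assumption question 2 below asks about):

```python
import numpy as np

EOS_TOKEN_ID = 50279  # assumed separator id; see question 2 below

# Assumes the preprocessed file is a raw flat array of uint16 token ids.
tokens = np.memmap("part-1-00000.npy", dtype=np.uint16, mode="r")

# Positions of the assumed EOS separator, then per-document token counts.
eos_positions = np.where(tokens == EOS_TOKEN_ID)[0]
doc_lengths = np.diff(eos_positions)

# Rough word estimates at the stated 4 tokens per English word.
word_estimates = doc_lengths / 4.0
print(list(word_estimates[:9]), len(word_estimates))
```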
This assumes an average of 4 tokens per English word. The results were `[27639.0, 16304.25, 23344.5, 26183.25, 35961.75, 6302.0, 42492.0, 4867.0, 7313.5, ...]` (length = 7386).
I have two questions:
1. What should I set as the `eos_token_id` in the configuration file when using the `allenai_eleuther-ai-gpt-neox-20b-pii-special.json` tokenizer?
2. Was the preprocessed data (`gpt-neox-olmo-dolma-v1_5/part-X-00000.npy`) encoded with `eos_token_id=50279`?
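For question 2, a quick empirical check is to count how often each candidate separator id appears in the data (same `uint16` assumption as in the sketch above):

```python
import numpy as np

tokens = np.memmap("gpt-neox-olmo-dolma-v1_5/part-1-00000.npy",
                   dtype=np.uint16, mode="r")

# If the data were encoded with eos_token_id=50279, that id should occur
# roughly once per document, while id 0 should be comparatively rare.
for candidate in (0, 50279):
    print(candidate, int((tokens == candidate).sum()))
```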
Thank you for your assistance!