Using some pretrained tokenizers doesn't yield the same tokenization of "\n" or " \n" as Tiktokenizer or Xenova's playground.
For example, Xenova/llama-3-tokenizer tokenizes "\n" as [198] and " \n" as [720].
In both playgrounds (selecting Llama 3 in Xenova's playground and meta-llama/Meta-Llama-3-8B in Tiktokenizer), the Llama 3 tokenizer is expected to tokenize "\n" as [1734] and " \n" as [1144, 77].
Similarly for Llama 2, Xenova/llama-tokenizer tokenizes "\n" as [1, 29871, 13] while Xenova's playground yields [1, 320, 29876]. Both cases are reproduced in the snippets below.
Reproduction
```js
import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");
const tokens = tokenizer("\n", { return_tensor: false }).input_ids;
console.log(tokens); // prints [198] while [1734] is expected
```
Similar issue with Xenova/llama-tokenizer.
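The other cases mentioned in the description can be checked the same way; this is just a sketch using the same call as above, with the ids quoted in this issue as the observed/expected values (not re-verified here):

```js
import { AutoTokenizer } from "@huggingface/transformers";

// " \n" with the Llama 3 tokenizer
const llama3 = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");
console.log(llama3(" \n", { return_tensor: false }).input_ids); // prints [720] while [1144, 77] is expected

// "\n" with the Llama 2 tokenizer
const llama2 = await AutoTokenizer.from_pretrained("Xenova/llama-tokenizer");
console.log(llama2("\n", { return_tensor: false }).input_ids); // prints [1, 29871, 13] while the playground shows [1, 320, 29876]
```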
Hi! I actually get consistent results with the Python tokenizers library:
```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.encode("\n").ids)  # [1, 29871, 13] same as Xenova/llama-tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tokenizer.encode("\n").ids)  # [128000, 198] Xenova/llama-3-tokenizer doesn't include the special token by default
print(tokenizer.encode("\n", add_special_tokens=False).ids)  # [198] same as Xenova/llama-3-tokenizer
```
So both the JavaScript and Python tokenizers yield a different tokenization from the playgrounds. Am I comparing different tokenization settings?
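For what it's worth, the same with/without-special-tokens comparison could be done on the transformers.js side too; this is a sketch that assumes the tokenizer call accepts an add_special_tokens option analogous to the Python one:

```js
import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-tokenizer");

// Default call, which matches the Python result above
console.log(tokenizer("\n", { return_tensor: false }).input_ids); // [1, 29871, 13]

// Same input without special tokens, to rule out a BOS difference between the setups
console.log(tokenizer("\n", { return_tensor: false, add_special_tokens: false }).input_ids);
```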
System Info
TypeScript 5.5.4
transformers.js 3.0.2
Node.js v20.17.0
Environment/Platform