Pretrained Llama tokenizers don't yield the expected tokenization of "\n" #1019

JulienVig opened this issue Nov 7, 2024 · 2 comments
@JulienVig

System Info

TypeScript 5.5.4
transformers.js 3.0.2
Node.js v20.17.0

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Some pretrained tokenizers don't yield the same tokenization of "\n" or " \n" as Tiktokenizer or Xenova's playground.

For example, Xenova/llama-3-tokenizer tokenizes "\n" as [198] and " \n" as [720].
According to both playgrounds (selecting Llama 3 in Xenova's playground and meta-llama/Meta-Llama-3-8B in Tiktokenizer), the Llama 3 tokenizer should tokenize "\n" as [1734] and " \n" as [1144, 77].

Similarly, for Llama 2, Xenova/llama-tokenizer tokenizes "\n" as [1, 29871, 13] while Xenova's playground yields [1, 320, 29876].

Reproduction

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

const tokens = tokenizer("\n", { return_tensor: false }).input_ids;
console.log(tokens); // prints [198] while [1734] is expected

Similar issue with Xenova/llama-tokenizer.
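
For what it's worth, the playground numbers look like the tokenization of the literal two characters "\" and "n" rather than an actual newline. A minimal sketch of that hypothesis (the IDs in the comments are my assumption, not verified playground output):

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

// An actual newline character:
console.log(tokenizer("\n", { return_tensor: false }).input_ids);   // [198]

// The literal two-character string backslash + n, which is what a playground
// text box would pass along if it doesn't unescape its input:
console.log(tokenizer("\\n", { return_tensor: false }).input_ids);  // presumably [1734]
console.log(tokenizer(" \\n", { return_tensor: false }).input_ids); // presumably [1144, 77]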

JulienVig added the bug label on Nov 7, 2024
xenova (Collaborator) commented Nov 7, 2024

Hi there 👋 Can you reproduce this with the Python transformers library?

JulienVig (Author) commented

Hi! I actually get consistent results with the Python tokenizers library:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.encode("\n").ids)  # [1, 29871, 13], same as Xenova/llama-tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tokenizer.encode("\n").ids)  # [128000, 198]; Xenova/llama-3-tokenizer doesn't add the special token by default
print(tokenizer.encode("\n", add_special_tokens=False).ids)  # [198], same as Xenova/llama-3-tokenizer

So both the JavaScript and Python libraries yield a tokenization that differs from the playgrounds'. Am I comparing different tokenization settings?
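
For completeness, transformers.js seems to expose the same switch on the callable tokenizer. A minimal sketch, assuming the add_special_tokens option behaves like its Python counterpart:

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

// Default call; this checkpoint doesn't prepend a BOS token:
console.log(tokenizer("\n", { return_tensor: false }).input_ids); // [198]

// Disabling special tokens explicitly should give the same result here:
console.log(tokenizer("\n", { add_special_tokens: false, return_tensor: false }).input_ids); // [198]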
