Pretrained Llama tokenizers don't yield the expected tokenization of "\n" #1019

JulienVig opened this issue Nov 7, 2024 · 2 comments
@JulienVig

System Info

TypeScript 5.5.4
transformers.js 3.0.2
Node.js v20.17.0

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

Some pretrained tokenizers don't yield the same tokenization of "\n" or " \n" as Tiktokenizer or Xenova's playground.

For example, Xenova/llama-3-tokenizer tokenizes "\n" as [198] and " \n" as [720].
According to both playgrounds (selecting Llama 3 in Xenova's playground and meta-llama/Meta-Llama-3-8B in Tiktokenizer), the Llama 3 tokenizer should tokenize "\n" as [1734] and " \n" as [1144, 77].

Similarly, for Llama 2, Xenova/llama-tokenizer tokenizes "\n" as [1, 29871, 13] while Xenova's playground yields [1, 320, 29876].

Reproduction

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

const tokens = tokenizer("\n", { return_tensor: false }).input_ids;
console.log(tokens); // prints [198] while [1734] is expected

Similar issue with Xenova/llama-tokenizer.
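
For what it's worth, the playground numbers look like the tokenization of the literal two characters "\" and "n" rather than an actual newline. A minimal sketch of that hypothesis (the IDs in the comments are my assumption, not verified playground output):

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

// An actual newline character:
console.log(tokenizer("\n", { return_tensor: false }).input_ids);   // [198]

// The literal two-character string backslash + n, which is what a playground
// text box would pass along if it doesn't unescape its input:
console.log(tokenizer("\\n", { return_tensor: false }).input_ids);  // presumably [1734]
console.log(tokenizer(" \\n", { return_tensor: false }).input_ids); // presumably [1144, 77]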

JulienVig added the bug label on Nov 7, 2024
xenova (Collaborator) commented Nov 7, 2024

Hi there 👋 Can you reproduce this with the Python transformers library?

JulienVig (Author) commented

Hi! I actually get consistent results with the Python tokenizers library:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
print(tokenizer.encode("\n").ids)  # [1, 29871, 13], same as Xenova/llama-tokenizer

tokenizer = Tokenizer.from_pretrained("meta-llama/Llama-3.1-8B")
print(tokenizer.encode("\n").ids)  # [128000, 198]; Xenova/llama-3-tokenizer doesn't add the special token by default
print(tokenizer.encode("\n", add_special_tokens=False).ids)  # [198], same as Xenova/llama-3-tokenizer

So both the JavaScript and Python libraries yield a tokenization that differs from the playgrounds'. Am I comparing different tokenization settings?
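
For completeness, transformers.js seems to expose the same switch on the callable tokenizer. A minimal sketch, assuming the add_special_tokens option behaves like its Python counterpart:

import { AutoTokenizer } from "@huggingface/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/llama-3-tokenizer");

// Default call; this checkpoint doesn't prepend a BOS token:
console.log(tokenizer("\n", { return_tensor: false }).input_ids); // [198]

// Disabling special tokens explicitly should give the same result here:
console.log(tokenizer("\n", { add_special_tokens: false, return_tensor: false }).input_ids); // [198]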
