feat: PreProcessor split by token (tiktoken & Hugging Face) #5276
Conversation
Hey @benheckmann, tests are failing because we don't allow HTTP requests in our unit tests. You will need to mock the tokenizer in the unit tests, and/or make an integration test to test against a real tokenizer.
Oh right, sorry @ZanSara! My test is loading the tokenizer from Hugging Face's model hub. I will mock it.
(force-pushed from 886f7ff to 9dadbc4)
Pull Request Test Coverage Report for Build 6968825217
Warning: This coverage report may be inaccurate. We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
💛 - Coveralls
(force-pushed from 9dadbc4 to 13a535e)
@ZanSara Fixed the checks.
(force-pushed from 13a535e to 96b3156)
@ZanSara Would be great if you could take a look :)
I haven't checked the tests yet, but there are a few comments.
Besides, I'm not convinced that the behavior of lossy tokenizers can be accepted as it is. PreProcessor should at least do its best not to alter the original text, while with this approach it seems like we would very often lose/add/alter chars all around the text. Do you think we can at least make an attempt at keeping the old text? Can we use the tokenizer output to simply "track" where to slice the original, instead of reconstructing it?
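For illustration, one way to track offsets with a Hugging Face fast tokenizer (a sketch of the idea, not code from this PR; the model name and `split_length` are arbitrary assumptions):

```python
from transformers import AutoTokenizer

# Sketch: slice the original text at token-window boundaries instead of
# reconstructing it from decoded tokens. Assumes a *fast* tokenizer, which
# can return per-token character offsets.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "This is a document. It has two sentences and eleven words."
offsets = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)["offset_mapping"]

split_length = 5
chunks = []
for i in range(0, len(offsets), split_length):
    window = offsets[i : i + split_length]
    start, end = window[0][0], window[-1][1]
    chunks.append(text[start:end])  # original characters: nothing lost or altered
```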
```python
from tqdm import tqdm
from more_itertools import windowed
from transformers import PreTrainedTokenizerBase
```
You can't just import from transformers anymore: it's now an optional dependency. See, for example, how the retrievers handle it:

```python
with LazyImport(message="Run 'pip install farm-haystack[inference]'") as torch_and_transformers_import:
```
So it would just be:

```python
with LazyImport("Run 'pip install transformers'") as tokenizer_import:
    from transformers import PreTrainedTokenizerBase
```

instead of the current import?
```python
if split_respect_sentence_boundary and split_by in ["word", "token"]:

    def split_function(text):
        if split_by == "token":
            return self._split_tokens(text, tokenizer=tokenizer)
        else:
            return text.split()

    text_splits, splits_pages, splits_start_idxs = self._split_into_units_respecting_sent_boundary(
        text=text, split_length=split_length, split_overlap=split_overlap, split_function=split_function
    )
```
Why pass an inner function instead of simply `split_by` and the `tokenizer`? There seems to be no need for this convolution.
The reason is that `split()` is called in multiple places in `_split_into_units_respecting_sent_boundary` and also in a function called by it, so it makes more sense to me to only check `split_by` once and then reuse the `split_function`. Also, one extra parameter is better than two.
But I will of course change it, if you prefer.
```python
try:
    import tiktoken
except ImportError:
    raise ImportError(
        "Tiktoken is needed in order to split documents by tokens. "
        "Please install it with `pip install tiktoken`."
    )
```
Let's handle this optional import at the top, just like with transformers. Here we'll need only `tiktoken_import.check()`.
```python
try:
    from transformers import AutoTokenizer
except ImportError:
    raise ImportError(
        "HuggingFace transformers is needed in order to split documents using HuggingFace tokenizers. "
        "Please install it with `pip install transformers`."
    )
```
Same as above
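For illustration, handling both optional imports at the top might look like this (a sketch; it assumes Haystack's `LazyImport` helper, the same pattern the retrievers use):

```python
from haystack.lazy_imports import LazyImport

with LazyImport("Run 'pip install tiktoken'") as tiktoken_import:
    import tiktoken

with LazyImport("Run 'pip install transformers'") as tokenizer_import:
    from transformers import AutoTokenizer, PreTrainedTokenizerBase
```

Inside `_split_tokens`, a plain `tiktoken_import.check()` or `tokenizer_import.check()` would then raise the informative `ImportError` only when the optional dependency is actually needed.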
if tokenizer == "tiktoken": | ||
try: | ||
import tiktoken | ||
except ImportError: | ||
raise ImportError( | ||
"Tiktoken is needed in order to split documents by tokens. " | ||
"Please install it with `pip install tiktoken`." | ||
) | ||
enc = tiktoken.get_encoding("cl100k_base") | ||
integer_tokens = enc.encode(text, allowed_special="all", disallowed_special=()) | ||
elements = [enc.decode_single_token_bytes(token).decode() for token in integer_tokens] | ||
elif isinstance(tokenizer, str): | ||
try: | ||
from transformers import AutoTokenizer | ||
except ImportError: | ||
raise ImportError( | ||
"HuggingFace transformers is needed in order to split documents using HuggingFace tokenizers. " | ||
"Please install it with `pip install transformers`." | ||
) | ||
try: | ||
tokenizer = AutoTokenizer.from_pretrained(tokenizer) | ||
except Exception: | ||
raise ValueError( | ||
f"Could not load tokenizer '{tokenizer}' from HuggingFace model hub. " | ||
f"Please make sure that the tokenizer is correct and exists." | ||
) | ||
elements = tokenizer.tokenize(text) | ||
elif isinstance(tokenizer, PreTrainedTokenizerBase): | ||
elements = tokenizer.tokenize(text) | ||
else: | ||
raise ValueError( | ||
f"Unsupported tokenizer specification {tokenizer}. " | ||
f"Please provide either the string 'tiktoken' or a HuggingFace tokenizer (PreTrainedTokenizerBase)." | ||
) | ||
return elements |
Nitpick: if you return `elements` in each `if` block, you can avoid the `elif`s and the `else` at the end. Let's simplify this block.
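For illustration, combined with the LazyImport guards suggested above (`tiktoken_import` and `tokenizer_import` are assumed names), the early-return shape could look like:

```python
if tokenizer == "tiktoken":
    tiktoken_import.check()
    enc = tiktoken.get_encoding("cl100k_base")
    integer_tokens = enc.encode(text, allowed_special="all", disallowed_special=())
    return [enc.decode_single_token_bytes(token).decode() for token in integer_tokens]

if isinstance(tokenizer, str):
    tokenizer_import.check()
    try:
        hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer)
    except Exception:
        raise ValueError(
            f"Could not load tokenizer '{tokenizer}' from HuggingFace model hub. "
            f"Please make sure that the tokenizer is correct and exists."
        )
    return hf_tokenizer.tokenize(text)

if isinstance(tokenizer, PreTrainedTokenizerBase):
    return tokenizer.tokenize(text)

raise ValueError(
    f"Unsupported tokenizer specification {tokenizer}. "
    f"Please provide either the string 'tiktoken' or a HuggingFace tokenizer (PreTrainedTokenizerBase)."
)
```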
Done.
(force-pushed from 9705104 to 278a772)
@ZanSara Fixed the lossy tokenizer behavior by using the …
(force-pushed from 278a772 to fa5bcce)
(force-pushed from fa5bcce to 6ae6270)
A few fixes in the tests and this is ready to be merged. Sorry for the late feedback! 🙈
```python
assert len(token_split_docs_not_respecting_sentences) == 4
enc = tiktoken.get_encoding("cl100k_base")
split_documents_encoded = [
    enc.encode(d.content, allowed_special="all", disallowed_special=())
```
Let's not use params that the mock tokenizer doesn't need.

```diff
- enc.encode(d.content, allowed_special="all", disallowed_special=())
+ enc.encode(d.content)
```
```python
    split_overlap=0,
    tokenizer="tiktoken",
).process(docs)
assert len(token_split_docs_not_respecting_sentences) == 4
```
Let's actually check the splits for precision:

```diff
- assert len(token_split_docs_not_respecting_sentences) == 4
+ assert token_split_docs_not_respecting_sentences == [
+     "This is a document. It has two sentences and eleve",
+     "n words.",
+     "This is a document with a long sentence (longer tha",
+     "n my split length), it has seventeen words."
+ ]
```
```python
    split_overlap=0,
    tokenizer="tiktoken",
).process(docs)
assert len(token_split_docs_respecting_sentences) == 3  # should not be more than there are sentences
```
Let's check the splits here too:

```diff
- assert len(token_split_docs_respecting_sentences) == 3  # should not be more than there are sentences
+ assert token_split_docs_respecting_sentences == [
+     "This is a document. ",
+     "It has two sentences and eleven words.",
+     "This is a document with a long sentence (longer than my split length), it has seventeen words."
+ ]
```
tokenizer="tiktoken", | ||
).process(docs) | ||
assert len(token_split_docs_respecting_sentences) == 3 # should not be more than there are sentences | ||
|
Let's check a few extra combinations, just to make sure it plays nicely with them (for both tokenizers); see the sketch after this list:
- One with a sentence longer than `max_chars`
- One with a split overlap other than zero
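A sketch of the overlap case, reusing the style of the existing test (`docs` and the parameter values are assumptions):

```python
token_split_docs_with_overlap = PreProcessor(
    split_by="token",
    split_length=5,
    split_overlap=2,
    split_respect_sentence_boundary=False,
    tokenizer="tiktoken",
).process(docs)

# Every split must still respect the token budget; one could additionally
# check that consecutive splits share their two boundary tokens.
enc = tiktoken.get_encoding("cl100k_base")
encoded = [enc.encode(d.content) for d in token_split_docs_with_overlap]
assert all(len(tokens) <= 5 for tokens in encoded)
```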
```python
    enc.encode(d.content, allowed_special="all", disallowed_special=())
    for d in token_split_docs_not_respecting_sentences
]
assert all([len(d) <= split_length for d in split_documents_encoded])
```
Can we make this more precise by specifying the expected lengths?
```python
    for d in token_split_docs_not_respecting_sentences
]
assert all([len(d) <= split_length for d in split_documents_encoded])
token_split_docs_respecting_sentences = PreProcessor(
```
From here on, I think it's best to separate it out into another test, to make it clear what is failing in case of trouble.
```python
@pytest.mark.unit
def test_preprocess_huggingface_token_split(mock_huggingface_tokenizer):
```
Most of the comments above apply to this test too.
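For context, the mock fixture referenced by this test might look something like this (a sketch; only the fixture name comes from the test signature above, the body is an assumption):

```python
import pytest
from unittest.mock import MagicMock

@pytest.fixture
def mock_huggingface_tokenizer(monkeypatch):
    # Fake tokenizer that splits on whitespace, so the unit test never
    # makes an HTTP request to the Hugging Face model hub.
    tokenizer = MagicMock()
    tokenizer.tokenize.side_effect = lambda text: text.split()
    monkeypatch.setattr(
        "transformers.AutoTokenizer.from_pretrained",
        lambda *args, **kwargs: tokenizer,
    )
    return tokenizer
```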
@benheckmann You will also need to take an extra small step and use Reno to document your changes. No worries, it's really simple: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes. I can also help if you have trouble with it.
Related Issues
Implements #4983. This allows a user to split documents by token count using the `PreProcessor` class.

Proposed Changes:
Interface:
The user can choose any of `["token", "word", "sentence", "passage"]` as the split-by parameter. The user passes the string `"tiktoken"` or any Hugging Face tokenizer (either the model name or a `PreTrainedTokenizerBase` object) as the new `tokenizer` parameter (the default is tiktoken). Just like with word splitting, `split_respect_sentence_boundary` can be set with `split_by="token"` and will prevent a sentence from being split into different documents.
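For illustration, the proposed interface could be used like this (a sketch; `documents` and the parameter values are assumptions):

```python
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_by="token",
    split_length=128,
    split_overlap=0,
    split_respect_sentence_boundary=True,
    tokenizer="tiktoken",  # or a Hugging Face model name / PreTrainedTokenizerBase object
)
split_docs = preprocessor.process(documents)
```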
Implementation:
Kept the changes minimal and used the existing code wherever possible. The main addition is the `_split_tokens` method, which returns a list of tokens (in string representation) given a text and a tokenizer. This is called in `_split_into_units` when sentence boundaries are not respected. I generalized `_split_into_words_respecting_sent_boundary` to `_split_into_units_respecting_sent_boundary` (which is used when sentence boundaries are respected) and added a `split_function` parameter.

How did you test it?
Added two unit tests in `test_preprocessor.py`: `test_preprocess_tiktoken_token_split` and `test_preprocess_huggingface_token_split`. Both of them test:
(1) that no document has more than `split_length` tokens in non-sentence-keeping mode, by re-splitting with the appropriate tokenizer and asserting `len(d) <= split_length`;
(2) that there are no more documents than sentences when in sentence-keeping mode.
Notes for the reviewer
The PreProcessor docs should be extended if this gets merged; I can suggest an edit then.
Checklist
✓ I have read the contributors guidelines and the code of conduct
✓ I have updated the related issue with new insights and changes
✓ I added unit tests and updated the docstrings
✓ I've used one of the conventional commit types for my PR title: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.
✓ I documented my code
✓ I ran pre-commit hooks and fixed any issue