feat: PreProcessor split by token (tiktoken & Hugging Face) #5276

Merged
merged 10 commits into deepset-ai:main on Nov 23, 2023

Conversation

benheckmann (Contributor) commented Jul 5, 2023

Related Issues

Implements #4983. This allows a user to split documents by token count using the PreProcessor class.

Proposed Changes:

Interface:

The user can choose any of ["token", "word", "sentence", "passage"] as the split_by parameter. For the new tokenizer parameter, the user passes either the string "tiktoken" (the default) or any Hugging Face tokenizer, given as a model name or a PreTrainedTokenizerBase object.
Just like with word splitting, split_respect_sentence_boundary can be set together with split_by="token" and will prevent a sentence from being split across documents.
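
For illustration, a minimal usage sketch of the new interface (the split length and input documents here are made up; the parameter names follow this description):

from haystack.nodes import PreProcessor

# Split by token count with the default tiktoken tokenizer,
# keeping whole sentences together within each split.
preprocessor = PreProcessor(
    split_by="token",
    split_length=128,
    split_overlap=0,
    split_respect_sentence_boundary=True,
    tokenizer="tiktoken",  # or a Hugging Face model name / PreTrainedTokenizerBase object
)
split_docs = preprocessor.process(docs)  # docs: a list of Haystack Documents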

Implementation:

Kept the changes minimal and reused the existing code wherever possible. The main addition is the _split_tokens method, which returns a list of tokens (in string representation) for a given text and tokenizer. It is called from _split_into_units when sentence boundaries are not respected. I also generalized _split_into_words_respecting_sent_boundary into _split_into_units_respecting_sent_boundary (used when sentence boundaries are respected) by adding a split_function parameter.
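
To illustrate the generalization, here is a simplified standalone sketch of the pattern (not the PR's actual code, which also tracks pages, overlap, and start indices):

from typing import Callable, List

def split_into_units_respecting_sent_boundary(
    sentences: List[str],
    split_length: int,
    split_function: Callable[[str], List[str]],
) -> List[str]:
    # Greedily pack whole sentences into chunks so that no chunk exceeds
    # split_length units, where a "unit" is whatever split_function yields.
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for sentence in sentences:
        n_units = len(split_function(sentence))
        if current and current_len + n_units > split_length:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n_units
    if current:
        chunks.append(" ".join(current))
    return chunks

# split_by="word" uses split_function=str.split; split_by="token" passes a
# closure over the tokenizer, as described above.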

How did you test it?

Added two unit tests to test_preprocessor.py: test_preprocess_tiktoken_token_split and test_preprocess_huggingface_token_split. Both of them test
(1) that in non-sentence-keeping mode no document has more than split_length tokens, by re-splitting each document with the appropriate tokenizer and asserting len(d) <= split_length, and
(2) that in sentence-keeping mode there are no more documents than sentences.

Notes for the reviewer

The PreProcessor docs should be extended if this gets merged. I can suggest an edit once it is merged.

Checklist

✓ I have read the contributors guidelines and the code of conduct
✓ I have updated the related issue with new insights and changes
✓ I added unit tests and updated the docstrings
✓ I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
✓ I documented my code
✓ I ran pre-commit hooks and fixed any issue

@benheckmann requested a review from a team as a code owner on July 5, 2023 08:40
@benheckmann requested review from ZanSara and removed the request for a team on July 5, 2023 08:40
ZanSara (Contributor) commented Jul 5, 2023

Hey @benheckmann, tests are failing because we don't allow HTTP requests in our unit tests. You will need to mock the tokenizer in the unit tests, and/or add an integration test that runs against a real tokenizer.
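
For reference, one hypothetical way to mock it with the standard library; the patch target and fixture name below are assumptions, not the PR's actual test code:

import pytest
from unittest.mock import MagicMock, patch

@pytest.fixture
def mock_huggingface_tokenizer():
    # Whitespace "tokens" are good enough for a unit test, and no HTTP
    # request to the Hugging Face Hub is ever made.
    tokenizer = MagicMock()
    tokenizer.tokenize.side_effect = lambda text: text.split()
    # The patch target is illustrative; the real path depends on where
    # the PR imports AutoTokenizer.
    with patch("haystack.nodes.preprocessor.preprocessor.AutoTokenizer") as auto_tokenizer:
        auto_tokenizer.from_pretrained.return_value = tokenizer
        yield tokenizer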

benheckmann (Contributor Author)

Oh right, sorry @ZanSara! My test is loading the tokenizer from Hugging Face's model hub. I will mock it.

@benheckmann force-pushed the 4983-preprocessor-split-by-token branch 3 times, most recently from 886f7ff to 9dadbc4, on July 6, 2023 11:52
coveralls (Collaborator) commented Jul 6, 2023

Pull Request Test Coverage Report for Build 6968825217

Warning: This coverage report may be inaccurate.

We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
To ensure accuracy in future PRs, please see these guidelines.
A quick fix for this PR: rebase it; your next report should be accurate.

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 3114 unchanged lines in 81 files lost coverage.
  • Overall coverage decreased (-7.0%) to 40.167%

Files with Coverage Reduction                      New Missed Lines    %
nodes/file_converter/base.py                       1                   72.73%
environment.py                                     2                   90.38%
nodes/prompt/invocation_layer/hugging_face.py      3                   94.19%
preview/components/audio/__init__.py               3                   0.0%
preview/components/retrievers/__init__.py          3                   0.0%
nodes/other/route_documents.py                     4                   86.84%
nodes/preprocessor/base.py                         4                   81.82%
nodes/reader/farm.py                               4                   40.0%
nodes/summarizer/base.py                           4                   60.0%
agents/base.py                                     5                   95.93%
Totals Coverage Status
Change from base Build 5811510771: -7.0%
Covered Lines: 10656
Relevant Lines: 26529

💛 - Coveralls

@benheckmann force-pushed the 4983-preprocessor-split-by-token branch from 9dadbc4 to 13a535e on July 6, 2023 12:22
benheckmann (Contributor Author)

@ZanSara Fixed the checks.

@benheckmann force-pushed the 4983-preprocessor-split-by-token branch from 13a535e to 96b3156 on July 17, 2023 08:55
benheckmann (Contributor Author)

@ZanSara It would be great if you could take a look :)

ZanSara (Contributor) left a comment

I haven't checked the tests yet, but there are a few comments.

Besides, I'm not convinced that the behavior of lossy tokenizers can be accepted as it is. The PreProcessor should at least do its best not to alter the original text, while with this approach it seems we would very often lose/add/alter characters all around the text. Do you think we can at least make an attempt at keeping the old text? Can we use the tokenizer output to simply "track" where to slice the original, instead of reconstructing it?

@@ -11,15 +11,14 @@

from tqdm import tqdm
from more_itertools import windowed
from transformers import PreTrainedTokenizerBase
ZanSara (Contributor):

You can't just import from transformers anymore: it's now an optional dependency. See, for example, how the retrievers handle it:

with LazyImport(message="Run 'pip install farm-haystack[inference]'") as torch_and_transformers_import:

benheckmann (Contributor Author) commented Jul 20, 2023

So it would just be:

with LazyImport("Run 'pip install transformers'") as tokenizer_import:
    from transformers import PreTrainedTokenizerBase

instead of the current import?

Comment on lines 392 to 403
if split_respect_sentence_boundary and split_by in ["word", "token"]:

    def split_function(text):
        if split_by == "token":
            return self._split_tokens(text, tokenizer=tokenizer)
        else:
            return text.split()

    text_splits, splits_pages, splits_start_idxs = self._split_into_units_respecting_sent_boundary(
        text=text, split_length=split_length, split_overlap=split_overlap, split_function=split_function
    )
ZanSara (Contributor):

Why pass an inner function instead of simply split_by and the tokenizer? There seems to be no need for this convolution.

benheckmann (Contributor Author) commented Jul 20, 2023

The reason is that split() is called in multiple places in _split_into_units_respecting_sent_boundary, and also in a function it calls, so it makes more sense to me to check split_by once and then reuse the split_function. Also, one extra parameter is better than two.
But I will of course change it if you prefer.

Comment on lines 860 to 866
try:
    import tiktoken
except ImportError:
    raise ImportError(
        "Tiktoken is needed in order to split documents by tokens. "
        "Please install it with `pip install tiktoken`."
    )
ZanSara (Contributor):

Let's handle this optional import at the top, just like with transformers.

Here we'll need only tiktoken_import.check()
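
Sketched out, the suggested pattern would look roughly like this (a sketch, assuming the LazyImport helper from haystack.lazy_imports as used elsewhere in the codebase):

from haystack.lazy_imports import LazyImport

# At the top of the module:
with LazyImport(message="Run 'pip install tiktoken'") as tiktoken_import:
    import tiktoken

# Inside _split_tokens, just verify the import succeeded:
tiktoken_import.check()  # raises with the install hint if tiktoken is missing
enc = tiktoken.get_encoding("cl100k_base")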

Comment on lines 871 to 877
try:
    from transformers import AutoTokenizer
except ImportError:
    raise ImportError(
        "HuggingFace transformers is needed in order to split documents using HuggingFace tokenizers. "
        "Please install it with `pip install transformers`."
    )
ZanSara (Contributor):

Same as above

Comment on lines 859 to 893
if tokenizer == "tiktoken":
    try:
        import tiktoken
    except ImportError:
        raise ImportError(
            "Tiktoken is needed in order to split documents by tokens. "
            "Please install it with `pip install tiktoken`."
        )
    enc = tiktoken.get_encoding("cl100k_base")
    integer_tokens = enc.encode(text, allowed_special="all", disallowed_special=())
    elements = [enc.decode_single_token_bytes(token).decode() for token in integer_tokens]
elif isinstance(tokenizer, str):
    try:
        from transformers import AutoTokenizer
    except ImportError:
        raise ImportError(
            "HuggingFace transformers is needed in order to split documents using HuggingFace tokenizers. "
            "Please install it with `pip install transformers`."
        )
    try:
        tokenizer = AutoTokenizer.from_pretrained(tokenizer)
    except Exception:
        raise ValueError(
            f"Could not load tokenizer '{tokenizer}' from HuggingFace model hub. "
            f"Please make sure that the tokenizer is correct and exists."
        )
    elements = tokenizer.tokenize(text)
elif isinstance(tokenizer, PreTrainedTokenizerBase):
    elements = tokenizer.tokenize(text)
else:
    raise ValueError(
        f"Unsupported tokenizer specification {tokenizer}. "
        f"Please provide either the string 'tiktoken' or a HuggingFace tokenizer (PreTrainedTokenizerBase)."
    )
return elements
ZanSara (Contributor):

Nitpick: if you return elements inside each if block, you can avoid the elifs and the else at the end. Let's simplify this block.
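
Combined with the lazy-import suggestion above, the simplified block could look roughly like this (a sketch, not the final code; tiktoken_import and transformers_import are assumed names, and the from_pretrained error handling is omitted for brevity):

if tokenizer == "tiktoken":
    tiktoken_import.check()
    enc = tiktoken.get_encoding("cl100k_base")
    integer_tokens = enc.encode(text, allowed_special="all", disallowed_special=())
    return [enc.decode_single_token_bytes(token).decode() for token in integer_tokens]

if isinstance(tokenizer, str):
    transformers_import.check()
    return AutoTokenizer.from_pretrained(tokenizer).tokenize(text)

if isinstance(tokenizer, PreTrainedTokenizerBase):
    return tokenizer.tokenize(text)

raise ValueError(
    f"Unsupported tokenizer specification {tokenizer}. "
    "Please provide either the string 'tiktoken' or a HuggingFace tokenizer (PreTrainedTokenizerBase)."
)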

benheckmann (Contributor Author):

Done.

@benheckmann force-pushed the 4983-preprocessor-split-by-token branch 2 times, most recently from 9705104 to 278a772, on July 20, 2023 16:55
benheckmann (Contributor Author) commented Jul 20, 2023

@ZanSara Fixed the lossy tokenizer behavior by using the return_offsets_mapping kwarg and then splitting the text at the offsets. I didn't know about this kwarg when I first created this PR, but now HF tokenizers also split as expected 👍🏼
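
The idea, roughly (a sketch assuming a fast Hugging Face tokenizer; the model name and text are only examples):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "Sliced at offsets, the original text survives unchanged."

# Slice the original text at the tokenizer's character offsets instead of
# reconstructing it from (possibly lossy) token strings.
encoding = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)
offsets = encoding["offset_mapping"]  # one (start, end) character span per token
# Extend each span up to the start of the next token so no characters between
# tokens are dropped.
token_texts = [
    text[start : offsets[i + 1][0]] if i + 1 < len(offsets) else text[start:]
    for i, (start, _end) in enumerate(offsets)
]
assert "".join(token_texts) == text[offsets[0][0] :]  # original text preserved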

@benheckmann force-pushed the 4983-preprocessor-split-by-token branch from 278a772 to fa5bcce on July 20, 2023 21:35
@benheckmann force-pushed the 4983-preprocessor-split-by-token branch from fa5bcce to 6ae6270 on July 26, 2023 07:59
ZanSara (Contributor) left a comment

A few fixes in the tests and this is ready to be merged. Sorry for the late feedback! 🙈

assert len(token_split_docs_not_respecting_sentences) == 4
enc = tiktoken.get_encoding("cl100k_base")
split_documents_encoded = [
    enc.encode(d.content, allowed_special="all", disallowed_special=())
ZanSara (Contributor):

Let's not use params that the mock tokenizer doesn't need.

Suggested change:
- enc.encode(d.content, allowed_special="all", disallowed_special=())
+ enc.encode(d.content)

    split_overlap=0,
    tokenizer="tiktoken",
).process(docs)
assert len(token_split_docs_not_respecting_sentences) == 4
ZanSara (Contributor):

Let's actually check the splits for precision:

Suggested change:
- assert len(token_split_docs_not_respecting_sentences) == 4
+ assert token_split_docs_not_respecting_sentences == [
+     "This is a document. It has two sentences and eleve",
+     "n words.",
+     "This is a document with a long sentence (longer tha",
+     "n my split length), it has seventeen words."
+ ]

    split_overlap=0,
    tokenizer="tiktoken",
).process(docs)
assert len(token_split_docs_respecting_sentences) == 3 # should not be more than there are sentences
ZanSara (Contributor):

Let's check the splits here too:

Suggested change:
- assert len(token_split_docs_respecting_sentences) == 3  # should not be more than there are sentences
+ assert token_split_docs_respecting_sentences == [
+     "This is a document. ",
+     "It has two sentences and eleven words.",
+     "This is a document with a long sentence (longer than my split length), it has seventeen words."
+ ]

    tokenizer="tiktoken",
).process(docs)
assert len(token_split_docs_respecting_sentences) == 3 # should not be more than there are sentences

ZanSara (Contributor):

Let's check a few extra combinations, just to make sure it plays nicely with them (for both tokenizers); a sketch of the overlap case follows the list:

  • One with a sentence longer than max_chars
  • One with a split overlap other than zero
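
For instance, the overlap case might look roughly like this (a hypothetical sketch reusing the docs fixture from the quoted tests; the exact expected contents would depend on the tokenizer):

token_split_docs_with_overlap = PreProcessor(
    split_by="token",
    split_length=10,
    split_overlap=2,
    split_respect_sentence_boundary=False,
    tokenizer="tiktoken",
).process(docs)
# Consecutive splits should share their two boundary tokens.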

    enc.encode(d.content, allowed_special="all", disallowed_special=())
    for d in token_split_docs_not_respecting_sentences
]
assert all([len(d) <= split_length for d in split_documents_encoded])
ZanSara (Contributor):

Can we make this more precise by specifying the expected lengths?

    for d in token_split_docs_not_respecting_sentences
]
assert all([len(d) <= split_length for d in split_documents_encoded])
token_split_docs_respecting_sentences = PreProcessor(
ZanSara (Contributor):

From here on I think it's best to separate it out into another test, to make clear what is failing in case of trouble.



@pytest.mark.unit
def test_preprocess_huggingface_token_split(mock_huggingface_tokenizer):
ZanSara (Contributor):

Most of the comments above apply to this test too.

ZanSara (Contributor) commented Aug 10, 2023

@benheckmann You will also need to take one extra small step and use Reno to document your changes. No worries, it's really simple: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes I can also help if you have trouble with it.
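
For example, a note for this PR might look roughly like this (a hypothetical sketch; Reno notes are YAML files under releasenotes/notes/, and the actual filename is generated by `reno new`):

features:
  - |
    PreProcessor can now split documents by token count (split_by="token"),
    using either tiktoken or any Hugging Face tokenizer.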

@ZanSara requested a review from a team as a code owner on November 23, 2023 10:50
@ZanSara requested review from dfokina and removed the request for a team on November 23, 2023 10:50
@ZanSara merged commit a492771 into deepset-ai:main on Nov 23, 2023
54 checks passed