feat: PreProcessor split by token (tiktoken & Hugging Face) #5276
Conversation
Hey @benheckmann, tests are failing because we don't allow HTTP requests in our unit tests. You will need to mock the tokenizer in the unit tests, and/or make an integration test to test against a real tokenizer.
Oh right, sorry @ZanSara! My test is loading the tokenizer from Hugging Face's model hub. I will mock it.
(force-pushed from 886f7ff to 9dadbc4)
Pull Request Test Coverage Report for Build 6968825217
Warning: This coverage report may be inaccurate. We've detected an issue with your CI configuration that might affect the accuracy of this pull request's coverage report.
💛 - Coveralls
(force-pushed from 9dadbc4 to 13a535e)
@ZanSara Fixed the checks.
(force-pushed from 13a535e to 96b3156)
@ZanSara Would be great if you could take a look :)
I haven't checked the tests yet, but there are a few comments.
Besides, I'm not convinced that the behavior of lossy tokenizers can be accepted as it is. PreProcessor should at least do its best not to alter the original text, while with this approach it seems like we would very often lose/add/alter chars all around the text. Do you think we can at least make an attempt at keeping the old text? Can we use the tokenizer output to simply "track" where to slice the original, instead of reconstructing it?
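For illustration, one way to track offsets with a Hugging Face fast tokenizer (a sketch of the idea, not code from this PR; the model name and `split_length` are arbitrary assumptions):

```python
from transformers import AutoTokenizer

# Sketch: slice the original text at token-window boundaries instead of
# reconstructing it from decoded tokens. Assumes a *fast* tokenizer, which
# can return per-token character offsets.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
text = "This is a document. It has two sentences and eleven words."
offsets = tokenizer(text, return_offsets_mapping=True, add_special_tokens=False)["offset_mapping"]

split_length = 5
chunks = []
for i in range(0, len(offsets), split_length):
    window = offsets[i : i + split_length]
    start, end = window[0][0], window[-1][1]
    chunks.append(text[start:end])  # original characters: nothing lost or altered
```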
```python
from tqdm import tqdm
from more_itertools import windowed
from transformers import PreTrainedTokenizerBase
```
You can't just import from transformers anymore: it's now an optional dependency. See, for example, how the retrievers handle it:

```python
with LazyImport(message="Run 'pip install farm-haystack[inference]'") as torch_and_transformers_import:
```
So it would just be:

```python
with LazyImport("Run 'pip install transformers'") as tokenizer_import:
    from transformers import PreTrainedTokenizerBase
```

instead of the current import?
```python
if split_respect_sentence_boundary and split_by in ["word", "token"]:

    def split_function(text):
        if split_by == "token":
            return self._split_tokens(text, tokenizer=tokenizer)
        else:
            return text.split()

    text_splits, splits_pages, splits_start_idxs = self._split_into_units_respecting_sent_boundary(
        text=text, split_length=split_length, split_overlap=split_overlap, split_function=split_function
    )
```
Why pass an inner function instead of simply `split_by` and the `tokenizer`? There seems to be no need for this convolution.
The reason is that `split()` is called in multiple places in `_split_into_units_respecting_sent_boundary` and also in a function called by it, so it makes more sense to me to only check `split_by` once and then reuse the `split_function`. Also, one extra parameter is better than two.
But I will of course change it, if you prefer.
```python
try:
    import tiktoken
except ImportError:
    raise ImportError(
        "Tiktoken is needed in order to split documents by tokens. "
        "Please install it with `pip install tiktoken`."
    )
```
Let's handle this optional import at the top, just like with transformers. Here we'll need only `tiktoken_import.check()`.
```python
try:
    from transformers import AutoTokenizer
except ImportError:
    raise ImportError(
        "HuggingFace transformers is needed in order to split documents using HuggingFace tokenizers. "
        "Please install it with `pip install transformers`."
    )
```
Same as above
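For illustration, handling both optional imports at the top might look like this (a sketch; it assumes Haystack's `LazyImport` helper, the same pattern the retrievers use):

```python
from haystack.lazy_imports import LazyImport

with LazyImport("Run 'pip install tiktoken'") as tiktoken_import:
    import tiktoken

with LazyImport("Run 'pip install transformers'") as tokenizer_import:
    from transformers import AutoTokenizer, PreTrainedTokenizerBase
```

Inside `_split_tokens`, a plain `tiktoken_import.check()` or `tokenizer_import.check()` would then raise the informative `ImportError` only when the optional dependency is actually needed.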
if tokenizer == "tiktoken": | ||
try: | ||
import tiktoken | ||
except ImportError: | ||
raise ImportError( | ||
"Tiktoken is needed in order to split documents by tokens. " | ||
"Please install it with `pip install tiktoken`." | ||
) | ||
enc = tiktoken.get_encoding("cl100k_base") | ||
integer_tokens = enc.encode(text, allowed_special="all", disallowed_special=()) | ||
elements = [enc.decode_single_token_bytes(token).decode() for token in integer_tokens] | ||
elif isinstance(tokenizer, str): | ||
try: | ||
from transformers import AutoTokenizer | ||
except ImportError: | ||
raise ImportError( | ||
"HuggingFace transformers is needed in order to split documents using HuggingFace tokenizers. " | ||
"Please install it with `pip install transformers`." | ||
) | ||
try: | ||
tokenizer = AutoTokenizer.from_pretrained(tokenizer) | ||
except Exception: | ||
raise ValueError( | ||
f"Could not load tokenizer '{tokenizer}' from HuggingFace model hub. " | ||
f"Please make sure that the tokenizer is correct and exists." | ||
) | ||
elements = tokenizer.tokenize(text) | ||
elif isinstance(tokenizer, PreTrainedTokenizerBase): | ||
elements = tokenizer.tokenize(text) | ||
else: | ||
raise ValueError( | ||
f"Unsupported tokenizer specification {tokenizer}. " | ||
f"Please provide either the string 'tiktoken' or a HuggingFace tokenizer (PreTrainedTokenizerBase)." | ||
) | ||
return elements |
Nitpick: if you return `elements` in each `if` block, you can avoid the `elif`s and the `else` at the end. Let's simplify this block.
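For illustration, combined with the LazyImport guards suggested above (`tiktoken_import` and `tokenizer_import` are assumed names), the early-return shape could look like:

```python
if tokenizer == "tiktoken":
    tiktoken_import.check()
    enc = tiktoken.get_encoding("cl100k_base")
    integer_tokens = enc.encode(text, allowed_special="all", disallowed_special=())
    return [enc.decode_single_token_bytes(token).decode() for token in integer_tokens]

if isinstance(tokenizer, str):
    tokenizer_import.check()
    try:
        hf_tokenizer = AutoTokenizer.from_pretrained(tokenizer)
    except Exception:
        raise ValueError(
            f"Could not load tokenizer '{tokenizer}' from HuggingFace model hub. "
            f"Please make sure that the tokenizer is correct and exists."
        )
    return hf_tokenizer.tokenize(text)

if isinstance(tokenizer, PreTrainedTokenizerBase):
    return tokenizer.tokenize(text)

raise ValueError(
    f"Unsupported tokenizer specification {tokenizer}. "
    f"Please provide either the string 'tiktoken' or a HuggingFace tokenizer (PreTrainedTokenizerBase)."
)
```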
Done.
(force-pushed from 9705104 to 278a772)
@ZanSara Fixed the lossy tokenizer behavior by using the …
(force-pushed from 278a772 to fa5bcce)
(force-pushed from fa5bcce to 6ae6270)
A few fixes in the tests and this is ready to be merged. Sorry for the late feedback! 🙈
```python
assert len(token_split_docs_not_respecting_sentences) == 4
enc = tiktoken.get_encoding("cl100k_base")
split_documents_encoded = [
    enc.encode(d.content, allowed_special="all", disallowed_special=())
```
Let's not use params that the mock tokenizer doesn't need.

```diff
- enc.encode(d.content, allowed_special="all", disallowed_special=())
+ enc.encode(d.content)
```
```python
    split_overlap=0,
    tokenizer="tiktoken",
).process(docs)
assert len(token_split_docs_not_respecting_sentences) == 4
```
Let's actually check the splits for precision:

```diff
- assert len(token_split_docs_not_respecting_sentences) == 4
+ assert token_split_docs_not_respecting_sentences == [
+     "This is a document. It has two sentences and eleve",
+     "n words.",
+     "This is a document with a long sentence (longer tha",
+     "n my split length), it has seventeen words."
+ ]
```
```python
    split_overlap=0,
    tokenizer="tiktoken",
).process(docs)
assert len(token_split_docs_respecting_sentences) == 3  # should not be more than there are sentences
```
Let's check the splits here too:

```diff
- assert len(token_split_docs_respecting_sentences) == 3  # should not be more than there are sentences
+ assert token_split_docs_respecting_sentences == [
+     "This is a document. ",
+     "It has two sentences and eleven words.",
+     "This is a document with a long sentence (longer than my split length), it has seventeen words."
+ ]
```
tokenizer="tiktoken", | ||
).process(docs) | ||
assert len(token_split_docs_respecting_sentences) == 3 # should not be more than there are sentences | ||
|
Let's check a few extra combinations, just to make sure it plays nicely with them (for both tokenizers); see the sketch after this list:
- One with a sentence longer than `max_chars`
- One with a split overlap other than zero
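A sketch of the overlap case, reusing the style of the existing test (`docs` and the parameter values are assumptions):

```python
token_split_docs_with_overlap = PreProcessor(
    split_by="token",
    split_length=5,
    split_overlap=2,
    split_respect_sentence_boundary=False,
    tokenizer="tiktoken",
).process(docs)

# Every split must still respect the token budget; one could additionally
# check that consecutive splits share their two boundary tokens.
enc = tiktoken.get_encoding("cl100k_base")
encoded = [enc.encode(d.content) for d in token_split_docs_with_overlap]
assert all(len(tokens) <= 5 for tokens in encoded)
```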
```python
    enc.encode(d.content, allowed_special="all", disallowed_special=())
    for d in token_split_docs_not_respecting_sentences
]
assert all([len(d) <= split_length for d in split_documents_encoded])
```
Can we make this more precise by specifying the expected lengths?
```python
    for d in token_split_docs_not_respecting_sentences
]
assert all([len(d) <= split_length for d in split_documents_encoded])
token_split_docs_respecting_sentences = PreProcessor(
```
From here on, I think it's best to separate it out into another test, to make it clear what is failing in case of trouble.
```python
@pytest.mark.unit
def test_preprocess_huggingface_token_split(mock_huggingface_tokenizer):
```
Most of the comments above apply to this test too.
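For context, the mock fixture referenced by this test might look something like this (a sketch; only the fixture name comes from the test signature above, the body is an assumption):

```python
import pytest
from unittest.mock import MagicMock

@pytest.fixture
def mock_huggingface_tokenizer(monkeypatch):
    # Fake tokenizer that splits on whitespace, so the unit test never
    # makes an HTTP request to the Hugging Face model hub.
    tokenizer = MagicMock()
    tokenizer.tokenize.side_effect = lambda text: text.split()
    monkeypatch.setattr(
        "transformers.AutoTokenizer.from_pretrained",
        lambda *args, **kwargs: tokenizer,
    )
    return tokenizer
```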
@benheckmann You will also need to take an extra small step and use Reno to document your changes. No worries, it's really simple: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md#release-notes. I can also help if you have trouble with it.
Related Issues
Implements #4983. This allows a user to split documents by token count using the `PreProcessor` class.

Proposed Changes:
Interface:
The user can choose any of `["token", "word", "sentence", "passage"]` as the split-by parameter. The user passes the string `"tiktoken"` or any Hugging Face tokenizer (either the model name or a `PreTrainedTokenizerBase` object) as the new `tokenizer` parameter (the default is tiktoken). Just like with word splitting, `split_respect_sentence_boundary` can be set with `split_by="token"` and will prevent a sentence from being split into different documents.
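For illustration, the proposed interface could be used like this (a sketch; `documents` and the parameter values are assumptions):

```python
from haystack.nodes import PreProcessor

preprocessor = PreProcessor(
    split_by="token",
    split_length=128,
    split_overlap=0,
    split_respect_sentence_boundary=True,
    tokenizer="tiktoken",  # or a Hugging Face model name / PreTrainedTokenizerBase object
)
split_docs = preprocessor.process(documents)
```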
Implementation:
Kept the changes minimal and used the existing code wherever possible. The main addition is the `_split_tokens` method, which returns a list of tokens (in string representation) given a text and a tokenizer. This is called in `_split_into_units` when sentence boundaries are not respected. I generalized `_split_into_words_respecting_sent_boundary` to `_split_into_units_respecting_sent_boundary` (which is used when sentence boundaries are respected) and added a `split_function` parameter.

How did you test it?
Added two unit tests in `test_preprocessor.py`: `test_preprocess_tiktoken_token_split` and `test_preprocess_huggingface_token_split`. Both of them test:
(1) that no document has more than `split_length` tokens in non-sentence-keeping mode, by re-splitting with the appropriate tokenizer and asserting `len(d) <= split_length`;
(2) that there are no more documents than sentences when in sentence-keeping mode.
Notes for the reviewer
The PreProcessor docs should be extended if this gets merged; I can suggest an edit then.
Checklist
✓ I have read the contributors guidelines and the code of conduct
✓ I have updated the related issue with new insights and changes
✓ I added unit tests and updated the docstrings
✓ I've used one of the conventional commit types for my PR title: `fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:`.
✓ I documented my code
✓ I ran pre-commit hooks and fixed any issue