
Add a Recursive Chunking strategy #8548

Open
davidsbatista opened this issue Nov 15, 2024 · 7 comments · May be fixed by #8605
Labels: P2 Medium priority, add to the next sprint if no P1 available

Comments

@davidsbatista
Contributor

davidsbatista commented Nov 15, 2024

Use a set of predefined separators to split text recursively. The process follows these steps:

  • It starts with a list of separator characters, typically ordered from most to least specific (e.g., ["\n\n", "\n", " ", ""]).
  • The splitter attempts to divide the text using the first separator ("\n\n" in this case).
  • If the resulting chunks are still larger than the specified chunk size, it moves to the next separator in the list ("\n").
  • This process continues recursively, using progressively less specific separators until the chunks meet the desired size criteria.
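The steps above could be sketched roughly as follows. This is illustrative Python only, not the Haystack API; the names (`recursive_split`, `chunk_size`) are hypothetical, and real implementations typically also merge small pieces back together up to the chunk size, which this sketch skips.

```python
def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split `text` into chunks of at most `chunk_size` characters,
    trying separators from most to least specific."""
    if len(text) <= chunk_size:
        return [text]
    # No separator left, or the "" separator reached: fall back to a hard cut.
    if not separators or separators[0] == "":
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    chunks = []
    for part in text.split(sep):
        if len(part) <= chunk_size:
            chunks.append(part)
        else:
            # Still too large: recurse with the next, less specific separator.
            chunks.extend(recursive_split(part, rest, chunk_size))
    return [c for c in chunks if c]

print(recursive_split("aaa bbb\n\nccc ddd eee", ["\n\n", "\n", " ", ""], 7))
# → ['aaa bbb', 'ccc', 'ddd', 'eee']
```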
@davidsbatista davidsbatista self-assigned this Nov 15, 2024
@sjrl
Contributor

sjrl commented Nov 20, 2024

@davidsbatista This sounds great! One idea I had for this is some way to indicate that we'd like to utilize something like NLTK to do sentence splitting. So normally I think the list of separator characters would look like ["\n\n", ".", " "] to accomplish splitting by paragrah, then sentence, and then by word. And I was wondering if we could replace "." with something like "nltk" or some other tag to indicate we'd like to use a separate algorithm to handle the splitting.

What do you think?
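One way the tag idea could be sketched: separators are either literal strings or named tags that dispatch to a dedicated splitter. The names here are hypothetical, and the regex is a naive stand-in for NLTK's sent_tokenize, just to show the dispatch shape.

```python
import re

def split_once(text: str, separator: str) -> list[str]:
    """Split on a literal separator, or dispatch to a named algorithm
    when the separator is a tag like "sentence"."""
    if separator == "sentence":
        # Stand-in for nltk.sent_tokenize: split after sentence-ending
        # punctuation followed by whitespace.
        return re.split(r"(?<=[.!?])\s+", text)
    return text.split(separator)

print(split_once("One sentence. Another one! A third?", "sentence"))
# → ['One sentence.', 'Another one!', 'A third?']
```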

@sjrl
Contributor

sjrl commented Nov 20, 2024

Also, I wanted to ask: will the splitting by separators (e.g. ["\n\n", ".", " "]) be handled using a regex splitter? I think supporting regex would be great so we could provide more complicated separators to better handle complex documents and do things like header detection.
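For illustration only (a sketch, not existing Haystack behavior): a regex separator using a lookahead could detect Markdown headers as split points while keeping each header attached to the chunk that follows it.

```python
import re

# Zero-width lookahead so the header line itself stays in the next chunk.
header_pattern = r"(?=^#{1,6}\s)"

doc = "# Intro\ntext one\n## Details\ntext two"
chunks = re.split(header_pattern, doc, flags=re.MULTILINE)
print([c for c in chunks if c])
# → ['# Intro\ntext one\n', '## Details\ntext two']
```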

@davidsbatista
Contributor Author

That's a good suggestion; I will take it into consideration.

@davidsbatista
Contributor Author

> @davidsbatista This sounds great! One idea I had for this is some way to indicate that we'd like to utilize something like NLTK to do sentence splitting. So normally I think the list of separator characters would look like ["\n\n", ".", " "] to accomplish splitting by paragraph, then sentence, and then by word. And I was wondering if we could replace "." with something like "nltk" or some other tag to indicate we'd like to use a separate algorithm to handle the splitting.
>
> What do you think?

I would suggest using "sentence" and we use NLTK's sent_tokenize(text), but I now noticed that @vblagoje implemented something more robust.

I think we could use the SentenceSplitter here, but maybe we can also move it out of that file into some utils package or file so that it can be reused by any component that wants to implement some splitting/chunking technique.

What do you say?

Also, this NLTKDocumentSplitter seems to be an exact copy of the DocumentSplitter except that it uses NLTK's sentence boundary detection algorithm. Maybe we could also merge these two in the future?

@sjrl
Contributor

sjrl commented Dec 3, 2024

> I would suggest using "sentence" and we use NLTK's sent_tokenize(text), but I now noticed that @vblagoje implemented something more robust.

That sounds good to me!

> I think we could use the SentenceSplitter here, but maybe we can also move it out of that file into some utils package or file so that it can be reused by any component that wants to implement some splitting/chunking technique.
>
> What do you say?

Yes I also agree. Let's reuse that and move it into utils.

> Also, this NLTKDocumentSplitter seems to be an exact copy of the DocumentSplitter except that it uses NLTK's sentence boundary detection algorithm. Maybe we could also merge these two in the future?

This is totally correct! I asked the same question here and it does seem like we would like to merge these two in the future. Sounds like we should open an issue for this.

@davidsbatista davidsbatista added this to the 2.9.0 milestone Dec 9, 2024
@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Dec 16, 2024
@bhavnicksm

Hey @davidsbatista~

The idea seems to be exactly how SemChunk works. Just curious whether it would be preferable to use SemChunk or to implement this from scratch here, given that in my experience semchunk is very well written.

Link: semchunk

Also, can we add support for more chunking methods? Full disclosure: I write a lot of chunking and splitting methods at Chonkie

Thanks! 😊


4 participants