Suggestions for new chunkers #4
Comments
Thank you for this - all of these sound good.
I haven't had time to improve the tool recently but I'd love help on it if
you're up for it.
On Tue, Aug 27, 2024 at 5:59 AM Dominik Weckmüller wrote:
Hey, super useful tool!
There's been some development in the chunking community. If you'd like to
keep your app up to date, here are a few suggestions. Also, considering that
all of the current options struggle to identify sentence boundaries correctly
(I quickly tested with some texts) and tend to chop off parts, it would be
nice to have more choice.
Python
- https://github.com/benbrandt/text-splitter - Python API for a Rust
package, at some point also available in JS via WebAssembly. It's my
personal preference at the moment; it yields "human-like" chunks
- https://github.com/umarbutler/semchunk - claims to be faster; I haven't
tested it enough yet to evaluate
JS
- https://github.com/askorama/chunker - didn't test yet, looks like a
very simplistic tool, no documentation afaik
- https://gist.github.com/hanxiao/3f60354cf6dc5ac698bc9154163b4e6a -
JinaAI tokenizer. See LinkedIn post here
<https://www.linkedin.com/posts/hxiao87_based-%F0%9D%90%92%F0%9D%90%9E%F0%9D%90%A6%F0%9D%90%9A%F0%9D%90%A7%F0%9D%90%AD%F0%9D%90%A2%F0%9D%90%9C-%F0%9D%90%9C%F0%9D%90%A1%F0%9D%90%AE%F0%9D%90%A7%F0%9D%90%A4%F0%9D%90%A2%F0%9D%90%A7%F0%9D%90%A0-activity-7230113200833253376-66b1>
and read the first comment for some exceptions; I didn't test it yet.
Maybe another idea would be to include the option to allow for any regex
like we did in SemanticFinder <https://github.com/do-me/SemanticFinder>.
I tried to come up with a good regex for sentence boundaries but it's
incredibly hard.
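
To make the regex idea concrete, here is a minimal sketch in plain Python (standard library only). It is not how SemanticFinder or any of the tools above implement it: the names `DEFAULT_BOUNDARY`, `split_by_regex`, and `pack_chunks` are made up for illustration, and the default pattern is a deliberately naive assumption. The point is only that the user supplies the boundary regex and the splitter then packs the pieces into chunks without cutting inside a piece.

```python
import re

# Hypothetical default: split on whitespace that follows ., ! or ? and
# precedes an uppercase letter or digit. Deliberately naive; it still
# splits after abbreviations such as "Dr.".
DEFAULT_BOUNDARY = r"(?<=[.!?])\s+(?=[A-Z0-9])"


def split_by_regex(text, pattern=DEFAULT_BOUNDARY):
    """Split text at every match of a user-supplied regex, dropping empty pieces."""
    return [p.strip() for p in re.split(pattern, text) if p and p.strip()]


def pack_chunks(pieces, max_chars=200):
    """Greedily join consecutive pieces into chunks of at most max_chars.

    A single piece longer than max_chars is kept whole rather than cut.
    """
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if current and len(candidate) > max_chars:
            # Adding this piece would overflow, so close the current chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "The price rose to 3.5 percent. Nobody expected that! Did you? Yes."
sentences = split_by_regex(text)
chunks = pack_chunks(sentences, max_chars=40)
print(chunks)
```

The decimal in "3.5" survives because the pattern requires whitespace after the punctuation, but "Dr. Smith arrived." would still be split after "Dr.", which is the kind of case that makes a general sentence regex so hard to get right.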
--
Greg Kamradt
Twitter <https://twitter.com/GregKamradt>, LinkedIn
<https://www.linkedin.com/in/gregkamradt/>