-
Notifications
You must be signed in to change notification settings - Fork 2.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
langchain[patch]: fix: Include separator length when checking chunk size #4849
langchain[patch]: fix: Include separator length when checking chunk size #4849
Conversation
When checking splits for chunk size, only 1 separator was considered, this could lead to chunks exceeding the maximum size.
The latest updates on your projects. Learn more about Vercel for Git ↗︎
|
langchain/src/text_splitter.ts
Outdated
@@ -188,8 +188,7 @@ export abstract class TextSplitter | |||
for (const d of splits) { | |||
const _len = await this.lengthFunction(d); | |||
if ( | |||
total + _len + (currentDoc.length > 0 ? separator.length : 0) > | |||
this.chunkSize | |||
total + _len + (currentDoc.length * separator.length) > this.chunkSize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The python version solves this slightly differently by adding and substracting the separator.length to total which adds a bit more clutter
- https://github.com/langchain-ai/langchain/blob/5d220975fc563a92f41aeb0907e8c3819da073f5/libs/text-splitters/langchain_text_splitters/base.py#L133
- https://github.com/langchain-ai/langchain/blob/5d220975fc563a92f41aeb0907e8c3819da073f5/libs/text-splitters/langchain_text_splitters/base.py#L138
- https://github.com/langchain-ai/langchain/blob/5d220975fc563a92f41aeb0907e8c3819da073f5/libs/text-splitters/langchain_text_splitters/base.py#L142
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be additive rather than multiplicative though?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me think through it a bit more! It's been a while since I've looked at this code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, never mind, the naming is bad. I understand now, makes sense!
Thank you for this, and sorry for the delayed review! |
…fix/separatorLengthInTextSplitter
When checking splits for chunk size, only 1 separator was considered, this could lead to chunks exceeding the maximum size.