langchain[patch]: fix: Include separator length when checking chunk size #4849

Merged

Dschoordsch
Contributor

When checking whether the accumulated splits still fit within the chunk size, only one separator was counted, no matter how many splits were being joined. This could lead to chunks exceeding the maximum size.


dosubot (bot) added the labels size:S (This PR changes 10-29 lines, ignoring generated files.) and auto:bug (Related to a bug, vulnerability, unexpected error with an existing feature) on Mar 21, 2024
@@ -188,8 +188,7 @@ export abstract class TextSplitter
     for (const d of splits) {
       const _len = await this.lengthFunction(d);
       if (
-        total + _len + (currentDoc.length > 0 ? separator.length : 0) >
-        this.chunkSize
+        total + _len + (currentDoc.length * separator.length) > this.chunkSize
Contributor Author

Collaborator

Shouldn't this be additive rather than multiplicative though?

Collaborator

Let me think through it a bit more! It's been a while since I've looked at this code.

Collaborator

Oh, never mind, the naming is bad. I understand now, makes sense!
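
For anyone else tripped up by the naming: the sketch below illustrates why the multiplication is the intended form. It assumes, based on the diff, that `currentDoc` is the array of pieces buffered so far (so `currentDoc.length` is a piece count, not a string length) and that `total` holds only the summed piece lengths without separators. The helper names `exceedsOld` and `exceedsNew` are hypothetical and not part of the library.

```typescript
// Hypothetical helpers mirroring the two conditions in the diff.

/** Old condition: counts at most one separator, however many pieces are buffered. */
function exceedsOld(
  total: number,
  len: number,
  currentDoc: string[],
  separator: string,
  chunkSize: number
): boolean {
  return total + len + (currentDoc.length > 0 ? separator.length : 0) > chunkSize;
}

/** New condition: one separator per already-buffered piece. */
function exceedsNew(
  total: number,
  len: number,
  currentDoc: string[],
  separator: string,
  chunkSize: number
): boolean {
  return total + len + currentDoc.length * separator.length > chunkSize;
}

const separator = ", ";
const chunkSize = 9;
const currentDoc = ["aa", "bb"]; // two pieces already buffered
const total = 4; // assumed: sum of piece lengths only, no separators
const next = "cc";

// Appending `next` and joining gives "aa, bb, cc", which is 10 characters.
const actualLength = [...currentDoc, next].join(separator).length;

// Old check estimates 4 + 2 + 2 = 8, so it accepts a chunk that is really 10 long.
const oldSaysTooBig = exceedsOld(total, next.length, currentDoc, separator, chunkSize);

// New check estimates 4 + 2 + 2 * 2 = 10, which matches the joined length.
const newSaysTooBig = exceedsNew(total, next.length, currentDoc, separator, chunkSize);

console.log({ actualLength, oldSaysTooBig, newSaysTooBig });
```

Joining k + 1 pieces takes k separators, so with k pieces already buffered the candidate chunk grows by `k * separator.length`, which is what the new condition adds.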

jacoblee93 changed the title from "fix: Include separator length when checking chunk size" to "langchain[patch]: fix: Include separator length when checking chunk size" on Mar 24, 2024
@jacoblee93
Collaborator

Thank you for this, and sorry for the delayed review!

jacoblee93 added the lgtm label (PRs that are ready to be merged as-is) on Mar 25, 2024
jacoblee93 merged commit 5662147 into langchain-ai:main on Mar 25, 2024
17 checks passed