docs[experimental]: Make docs clearer and add min_chunk_size #26398

tibor-reiss · 2024-09-12T18:33:20Z

Added clarifying text for the keyword argument breakpoint_threshold_amount
Added min_chunk_size: together with breakpoint_threshold_amount, it helps avoid chunks that are too small or too large

Note: langchain-experimental was moved to a separate repo, so only the doc change stays here.

efriis

could you add some unit tests? I believe this has a bug where it will drop chunks if some of the final chunks are shorter than min_chunk_size

This is also relatively achievable as a postprocessing step, so not confident this needs to be part of this implementation in particular.

e.g.

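from langchain_core.documents import Document

def merge_short_docs(docs, min_chunk_size=20):
    rtn = []
    doc = None  # chunk accumulated so far that is still too short
    for d in docs:
        if doc is None:
            doc = d
        else:
            # use first doc's metadata
            doc = Document(page_content=doc.page_content + d.page_content, metadata=doc.metadata)

        # keep accumulating until the merged chunk is long enough
        if len(doc.page_content) < min_chunk_size:
            continue
        rtn.append(doc)
        doc = None
    # keep whatever is left, even if it is shorter than min_chunk_size
    if doc is not None:
        rtn.append(doc)
    return rtn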

note this allows you to configure metadata handling as well if you want to merge in a custom way.
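To illustrate the point about custom metadata handling, here is a minimal sketch of the same merge loop with a pluggable metadata policy. The names `Doc`, `merge_short_docs_with`, `keep_first`, and `union_pages` are hypothetical and introduced only for this example; `Doc` is a stand-in for the real Document class so the snippet runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for Document, to keep the sketch self-contained."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def keep_first(first: dict, second: dict) -> dict:
    """Default policy: keep the first document's metadata."""
    return first


def merge_short_docs_with(
    docs,
    min_chunk_size: int = 20,
    merge_metadata: Callable[[dict, dict], dict] = keep_first,
):
    merged = []
    pending: Optional[Doc] = None
    for d in docs:
        if pending is None:
            pending = d
        else:
            # Concatenate the text and let the caller decide how metadata combines.
            pending = Doc(
                pending.page_content + d.page_content,
                merge_metadata(pending.metadata, d.metadata),
            )
        if len(pending.page_content) >= min_chunk_size:
            merged.append(pending)
            pending = None
    # Keep any remainder, even if it is still shorter than min_chunk_size.
    if pending is not None:
        merged.append(pending)
    return merged


def union_pages(first: dict, second: dict) -> dict:
    """Example policy (an assumption): collect the page numbers of all merged docs."""
    return {"pages": sorted(set(first.get("pages", [])) | set(second.get("pages", [])))}


docs = [
    Doc("short ", {"pages": [1]}),
    Doc("also short ", {"pages": [2]}),
    Doc("x" * 30, {"pages": [3]}),
]
result = merge_short_docs_with(docs, min_chunk_size=15, merge_metadata=union_pages)
```

Here the first two short docs are merged into one chunk whose metadata lists both pages, while the third doc is already long enough and passes through unchanged.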

Thoughts on documenting this instead of adding the param?

tibor-reiss · 2024-09-18T04:35:49Z

@efriis sure, I will try to add some tests. However, since start_index is not incremented, the second if will catch this:

# The last group, if any sentences remain
if start_index < len(sentences):
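
The argument above can be checked with a simplified sketch of the grouping loop. This is not the library's actual code: `group_sentences` and `breakpoints` are hypothetical names, and sentences are plain strings here. It only shows why skipping a too-short group without advancing start_index does not drop text.

```python
def group_sentences(sentences, breakpoints, min_chunk_size=None):
    """Join sentences into chunks, splitting after each breakpoint index."""
    chunks = []
    start_index = 0
    for index in breakpoints:
        combined = " ".join(sentences[start_index : index + 1])
        if min_chunk_size is not None and len(combined) < min_chunk_size:
            # Too short: skip this breakpoint WITHOUT advancing start_index,
            # so these sentences are carried into the next candidate chunk.
            continue
        chunks.append(combined)
        start_index = index + 1
    # The last group, if any sentences remain
    if start_index < len(sentences):
        chunks.append(" ".join(sentences[start_index:]))
    return chunks


sentences = ["Alpha beta gamma delta.", "Tiny.", "Epsilon zeta eta theta.", "End."]
chunks = group_sentences(sentences, breakpoints=[0, 1, 2], min_chunk_size=25)
```

With these inputs the groups ending at breakpoints 0 and 2 are too short and get skipped, but their sentences still end up in a later chunk, so joining the chunks reproduces the full input.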

tibor-reiss · 2024-09-18T18:42:32Z

@efriis Test added. The current implementation does not miss the last sentence (or what is left) and does the same thing as your snippet, just in fewer lines of code :)

However, I like the idea of adding a postprocessing step. On the other hand, it's not that simple to cover all edge cases: e.g. if the last chunk is smaller than min_chunk_size, it still remains as such. But if I added it to the previous chunk, then that chunk could grow too much, and so on. That was also the reason I did not add a max_chunk_size, because this can already be controlled with breakpoint_threshold_amount.
Let me know what you think.
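
A small sketch of the edge case described above, assuming chunks are plain strings. `fold_short_tail` is a hypothetical helper, not part of the PR: it appends a too-short final chunk to its predecessor, which shows how the predecessor grows as a side effect.

```python
def fold_short_tail(chunks, min_chunk_size):
    """Append a too-short final chunk to the previous one (hypothetical helper)."""
    if len(chunks) >= 2 and len(chunks[-1]) < min_chunk_size:
        return chunks[:-2] + [chunks[-2] + " " + chunks[-1]]
    # Nothing to fold into (single chunk) or the tail is already long enough.
    return list(chunks)


chunks = ["a" * 40, "b" * 38, "c" * 5]
folded = fold_short_tail(chunks, min_chunk_size=20)
lone = fold_short_tail(["tiny"], min_chunk_size=20)
```

The second chunk grows from 38 to 44 characters after folding, and a lone short chunk has no predecessor to merge into, so it stays below min_chunk_size either way.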

tibor-reiss · 2024-09-20T17:49:54Z

@efriis is there a way to restart vercel, please? Or how can I access the logs/build from this workflow? I am getting 404.

efriis · 2024-09-26T02:08:35Z

hey sorry for the delay!

feel free to reopen the code changes against the langchain-experimental repo (this package moved)! https://github.com/langchain-ai/langchain-experimental

when you merge in master, the vercel build should work better. (some issues with the vercel build last week that have been fixed by upgrading to docusaurus 3)

tibor-reiss · 2024-09-26T19:30:19Z

Hi @efriis, as requested, I moved the code change to the other repo, leaving just the doc change here.

Moving code changes here from langchain-ai/langchain#26398

tibor-reiss · 2024-11-01T06:04:09Z

Friendly ping @baskaryan @eyurtsev
The corresponding code change has been live in experimental for some time now

dosubot bot added the size:S and 🤖:docs labels Sep 12, 2024

tibor-reiss force-pushed the docs-26171-semantic-chunker branch from ee0bcf5 to 3014a29 Compare September 12, 2024 18:34

vercel bot deployed to Preview September 12, 2024 18:59 View deployment

tibor-reiss force-pushed the docs-26171-semantic-chunker branch from 3014a29 to 1c5b9f8 Compare September 15, 2024 08:59

tibor-reiss changed the title ~~docs[experimental]: Make docs clearer~~ docs[experimental]: Make docs clearer and add min_chunk_size Sep 15, 2024

tibor-reiss force-pushed the docs-26171-semantic-chunker branch 3 times, most recently from 05f22dc to 6f42986 Compare September 15, 2024 09:09

vercel bot deployed to Preview September 15, 2024 09:22 View deployment

efriis reviewed Sep 17, 2024

dosubot bot added size:M and removed size:S labels Sep 18, 2024

tibor-reiss force-pushed the docs-26171-semantic-chunker branch 3 times, most recently from 4ebc335 to c4f4ef9 Compare September 18, 2024 18:39

tibor-reiss force-pushed the docs-26171-semantic-chunker branch from c4f4ef9 to 50d2e59 Compare September 18, 2024 18:45

vercel bot had a problem deploying to Preview September 18, 2024 18:49 Failure

tibor-reiss force-pushed the docs-26171-semantic-chunker branch from 50d2e59 to a25257c Compare September 21, 2024 12:42

vercel bot had a problem deploying to Preview September 21, 2024 12:59 Failure

tibor-reiss mentioned this pull request Sep 26, 2024

semantic-chunker: add min_chunk_size langchain-ai/langchain-experimental#4

Merged

tibor-reiss force-pushed the docs-26171-semantic-chunker branch from a25257c to 82443ce Compare September 26, 2024 19:29

dosubot bot added size:S and removed size:M labels Sep 26, 2024

vercel bot deployed to Preview September 26, 2024 19:38 View deployment

efriis pushed a commit to langchain-ai/langchain-experimental that referenced this pull request Sep 26, 2024

semantic-chunker: add min_chunk_size (#4)

305bde3

Moving code changes here from langchain-ai/langchain#26398

Make docs clearer and add min_cunk_size

fb82be9

tibor-reiss force-pushed the docs-26171-semantic-chunker branch from 82443ce to fb82be9 Compare November 1, 2024 06:03

vercel bot deployed to Preview November 1, 2024 06:11 View deployment

tibor-reiss requested a review from efriis December 13, 2024 21:42

efriis approved these changes Dec 15, 2024

dosubot bot added the lgtm label Dec 15, 2024

efriis merged commit 690aa02 into langchain-ai:master Dec 15, 2024
13 checks passed
