docs[experimental]: Make docs clearer and add min_chunk_size #26398
Could you add some unit tests? I believe this has a bug where it will drop chunks if some of the final chunks are shorter than min_chunk_size.
This is also relatively achievable as a postprocessing step, so I'm not confident this needs to be part of this implementation in particular.
e.g.
from langchain_core.documents import Document

def merge_short_docs(docs, min_chunk_size=20):
    rtn = []
    doc = None
    for d in docs:
        if doc is None:
            doc = d
        else:
            # merge, keeping the first doc's metadata
            doc = Document(doc.page_content + d.page_content, metadata=doc.metadata)
        if len(doc.page_content) < min_chunk_size:
            continue
        rtn.append(doc)
        doc = None
    # don't drop a trailing doc that is still too short
    if doc is not None:
        rtn.append(doc)
    return rtn
note this allows you to configure metadata handling as well if you want to merge in a custom way.
Thoughts on documenting this instead of adding the param?
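For illustration, here is a minimal, self-contained usage sketch of the merging approach above. It uses a simplified stand-in for the `Document` class (the real one lives in `langchain_core.documents`); the sample docs and metadata are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Document:
    # simplified stand-in for langchain_core.documents.Document
    page_content: str
    metadata: dict = field(default_factory=dict)

def merge_short_docs(docs, min_chunk_size=20):
    rtn = []
    doc = None
    for d in docs:
        if doc is None:
            doc = d
        else:
            # merge, keeping the first doc's metadata
            doc = Document(doc.page_content + d.page_content, metadata=doc.metadata)
        if len(doc.page_content) < min_chunk_size:
            # still too short: keep accumulating into the next iteration
            continue
        rtn.append(doc)
        doc = None
    # don't drop a trailing doc that is still too short
    if doc is not None:
        rtn.append(doc)
    return rtn

docs = [
    Document("short", {"source": "a"}),
    Document("another short piece", {"source": "b"}),
    Document("this chunk is already long enough", {"source": "c"}),
    Document("tail", {"source": "d"}),
]
merged = merge_short_docs(docs, min_chunk_size=20)
# the two short leading docs are merged (keeping doc "a"'s metadata);
# the trailing short doc "tail" is kept rather than dropped
```

Note that the trailing chunk is appended even when it is below min_chunk_size, which is exactly the edge case discussed below.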
@efriis sure, I will try to add some tests. However, since start_index is not incremented, the second if will catch this:
@efriis Test added. The current implementation does not miss the last sentence (or what is left of it) and does the same thing as your snippet, just in fewer lines of code :) However, I like the idea of adding a postprocessing step. On the other hand, it's not that simple to cover all edge cases - e.g. if the last chunk is smaller than min_chunk_size.
@efriis is there a way to restart Vercel, please? Or how can I access the logs/build from this workflow? I am getting a 404.
Hey, sorry for the delay! Feel free to reopen the code changes against the langchain-experimental repo (this package moved): https://github.com/langchain-ai/langchain-experimental. When you merge in master, the Vercel build should work better. (Some issues with the Vercel build last week were fixed by upgrading to Docusaurus 3.)
Hi @efriis, as requested, I moved the code change to the other repo, leaving just the doc change here.
Moving code changes here from langchain-ai/langchain#26398
Friendly ping @baskaryan @eyurtsev
Fixes #26171: with breakpoint_threshold_amount, too small/big chunk sizes can be avoided.
Note: langchain-experimental was moved to a separate repo, so only the doc change stays here.