Replies: 2 comments
-
Answer generated by a 🤖

I understand that you want to determine the optimal chunk size and overlap when processing documents with LangChain, specifically when chunking files from a GitHub repository.

In LangChain, the default chunk size and overlap are defined in the TextSplitter class. Here's the relevant section:

    class TextSplitter(BaseDocumentTransformer, ABC):
        def __init__(
            self,
            chunk_size: int = 4000,
            chunk_overlap: int = 200,
            length_function: Callable[[str], int] = len,
            keep_separator: bool = False,
            add_start_index: bool = False,
        ) -> None:
            if chunk_overlap > chunk_size:
                raise ValueError(
                    f"Got a larger chunk overlap ({chunk_overlap}) than chunk size "
                    f"({chunk_size}), should be smaller."
                )
            self._chunk_size = chunk_size
            self._chunk_overlap = chunk_overlap
            self._length_function = length_function
            self._keep_separator = keep_separator
            self._add_start_index = add_start_index

In your code, you can set these values in your own call, for example:

    texts = document_chunking(docs, size=1000, overlap=0)

Please remember that the overlap cannot be larger than the chunk size; as the check above shows, that raises a ValueError.

This is just one potential solution. The optimal chunk size and overlap depend on your specific use case and the nature of the documents you're processing.

If you find this information helpful and believe it could benefit other users, I encourage you to make a pull request to update the LangChain documentation, so that others with the same question can easily find it.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
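To make the interaction between chunk size and overlap concrete, here is a minimal sketch of fixed-size chunking with overlap in plain Python. The function name `chunk_text` and its parameters are hypothetical and chosen for illustration; this is not LangChain's implementation, which splits on separators rather than at fixed character offsets. Note that this sketch requires the overlap to be strictly smaller than the size, which is slightly stricter than the check quoted above.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 0) -> list[str]:
    """Split text into windows of at most `size` characters.

    Each window starts `size - overlap` characters after the previous one,
    so consecutive windows share `overlap` characters.
    Hypothetical helper for illustration, not a LangChain API.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        # Stricter than the TextSplitter check: overlap == size would
        # give a step of zero and the window would never advance.
        raise ValueError("overlap must be >= 0 and smaller than size")

    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            # This window already reaches the end of the text.
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_text("abcdefghij", size=4, overlap=1):
        print(repr(chunk))
```

A larger overlap repeats more context across chunk boundaries at the cost of producing more chunks for the same text, which is one reason there is no single best setting.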
-
@holly-21 There isn't a one-size-fits-all solution for chunk size. It's use-case dependent. For code, I've found larger chunks (~2000 characters) are better, but more important is how you do the chunking. I highly recommend using LangChain's language-specific text splitters for chunking code.
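To illustrate why how you chunk matters more than the exact size, here is a simplified sketch that cuts Python source only at top-level `def` and `class` boundaries and then packs the pieces into chunks of roughly 2000 characters. The function `split_python_source` is hypothetical and is only a stand-in for the idea behind language-aware splitting; it is not LangChain's implementation. In LangChain itself, I believe the equivalent entry point is `RecursiveCharacterTextSplitter.from_language`, but check the documentation for your installed version.

```python
import re

# Zero-width match at the start of any line that begins a top-level
# definition (no leading indentation).
_BOUNDARY = re.compile(r"^(?=(?:async\s+def|def|class)\s)", re.MULTILINE)


def split_python_source(source: str, max_chars: int = 2000) -> list[str]:
    """Pack top-level definitions into chunks of about `max_chars` characters.

    Hypothetical helper for illustration. A single definition longer than
    `max_chars` is kept whole rather than cut in the middle.
    """
    # Cut the source just before every top-level def/class.
    pieces = [p for p in _BOUNDARY.split(source) if p]

    chunks = []
    current = ""
    for piece in pieces:
        # Start a new chunk if adding this piece would exceed the budget.
        if current and len(current) + len(piece) > max_chars:
            chunks.append(current)
            current = ""
        current += piece
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    example = (
        "import os\n\n"
        "def a():\n    return 1\n\n"
        "class B:\n    def m(self):\n        pass\n"
    )
    for chunk in split_python_source(example, max_chars=30):
        print(repr(chunk))
```

Because cuts only happen between definitions, each chunk contains complete functions or classes, so overlap matters less than it does with fixed-offset chunking.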
-
I want to chunk the files of a GitHub repo.
What are the best chunk size and overlap?
Here is the code.