🤖

Hey @tigerinus! 👋 Welcome to the LangChain repository. I'm Dosu, a friendly bot here to lend a hand while you're waiting for a human maintainer. I'm here to help squash bugs, answer questions, and help you become a contributor. Don't worry though, I promise not to overload your RAM with too much information! 🤖💾🚀

You're correct in your understanding that the BM25Retriever loads every document into memory before it can be used for retrieval. The from_documents method unpacks all page contents and metadata up front. Here's the relevant code for reference:

@classmethod
def from_documents(
    cls,
    documents: Iterable[Document],
    *,
    bm25_params: Optional[Dict[str, Any]] = None,
    preprocess_func: Callable[[str], List[str]] = default_preprocessing_func,
    **kwargs: Any,
) -> BM25Retriever:
    """
    Create a BM25Retriever from a list of Documents.
    Args:
        documents: A list of Documents to vectorize.
        bm25_params: Parameters to pass to the BM25 vectorizer.
        preprocess_func: A function to preprocess each text before vectorization.
        **kwargs: Any other arguments to pass to the retriever.
    Returns:
        A BM25Retriever instance.
    """
    texts, metadatas = zip(*((d.page_content, d.metadata) for d in documents))
    return cls.from_texts(
        texts=texts,
        bm25_params=bm25_params,
        metadatas=metadatas,
        preprocess_func=preprocess_func,
        **kwargs,
    )

However, it is possible to modify the from_documents method to use a generator:

@classmethod
def from_documents(
    cls,
    documents: Iterable[Document],
    *,
    bm25_params: Optional[Dict[str, Any]] = None,
    preprocess_func: Callable[[str], List[str]] = default_preprocessing_func,
    **kwargs: Any,
) -> BM25Retriever:
    """
    Create a BM25Retriever from a list of Documents.
    Args:
        documents: A list of Documents to vectorize.
        bm25_params: Parameters to pass to the BM25 vectorizer.
        preprocess_func: A function to preprocess each text before vectorization.
        **kwargs: Any other arguments to pass to the retriever.
    Returns:
        A BM25Retriever instance.
    """
    def generator():
        for d in documents:
            yield (d.page_content, d.metadata)
    texts, metadatas = zip(*generator())
    return cls.from_texts(
        texts=texts,
        bm25_params=bm25_params,
        metadatas=metadatas,
        preprocess_func=preprocess_func,
        **kwargs,
    )

In this modification, we define a generator function inside from_documents that yields each document's page content and metadata. Note that zip(*generator()) still consumes the whole generator before from_texts is called, so this version holds the same data in memory as the original.

You can find the original code in the BM25Retriever class in the LangChain repository.

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.
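
To check whether the generator variant above changes peak memory, here is a minimal, self-contained sketch. `Doc`, `lazy_documents`, and `unzip_like_from_documents` are hypothetical stand-ins introduced for illustration (they are not LangChain classes or functions); only the `zip(*generator())` step is taken from the code above. The sketch records how many documents have been pulled from a lazy source by the time the unpacking step returns.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, List, Tuple


@dataclass
class Doc:
    """Hypothetical stand-in for a document with text and metadata."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def lazy_documents(n: int, consumed: List[int]) -> Iterator[Doc]:
    """Yield n documents one at a time, recording each index as it is produced."""
    for i in range(n):
        consumed.append(i)
        yield Doc(page_content=f"text {i}", metadata={"id": i})


def unzip_like_from_documents(documents: Iterable[Doc]) -> Tuple[tuple, tuple]:
    """Mirror the unpacking step used in the generator variant above."""
    def generator():
        for d in documents:
            yield (d.page_content, d.metadata)

    # The * operator has to exhaust generator() to build zip's argument list,
    # so every (text, metadata) pair exists in memory at this point.
    texts, metadatas = zip(*generator())
    return texts, metadatas


consumed: List[int] = []
texts, metadatas = unzip_like_from_documents(lazy_documents(5, consumed))
print(len(consumed), len(texts), type(texts).__name__)  # prints: 5 5 tuple
```

All five documents are consumed and `texts` is a fully built tuple before anything downstream runs, so replacing the generator expression with a named generator affects readability only, not memory use.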
It seems that BM25Retriever would have to load every single document into memory before it can be used for retrieval. This is an issue when the number of documents is huge. Or am I wrong?
Thanks.