How do I retrieve the latest documents based on metadata? #16111

XariZaru · 2024-01-16T22:37:04Z

I know this is a tricky one, but basically I have documents that are similar to each other. Each is only current for specific dates. For example, one document may only be effective for March 2023, and another for April 2023.

They are not identical documents; there are changes between them. What is the best practice in LangChain for looking at these documents based on metadata and returning the latest of the bunch? Here is an example.

I have a list of files:

095 Rate Sheet 01.01.2023.pdf
010 Rate Sheet 02.03.2023.pdf
28468 Rate Sheet Update 12.12.2023.pdf

They all contain similar types of information (but not identical). If a user query was something like:
"Grab me the latest rates for an auto loan."

It should find the latest rate document and pull the information out of it.

What is the best way to do this in LangChain?

@dosu-bot

@XariZaru · 2024-01-16T22:42:34Z

dosubot[bot]
bot Jan 16, 2024

🤖

Hi @XariZaru! I'm Dosu, a friendly bot here to assist while we're waiting for a human maintainer. I'm capable of helping you solve bugs, answering your questions, and guiding you to become a contributor. Let's work on your issue together!

In LangChain, you can use the EnsembleRetriever class to retrieve documents from multiple retrievers and use a rank fusion method to get the final result. The rank fusion method retrieves the results of all retrievers, enforces that retrieved docs are Documents for each list in retriever_docs, and then applies rank fusion using the weighted_reciprocal_rank method. This method performs weighted Reciprocal Rank Fusion on multiple rank lists, calculates RRF scores for each document, and sorts documents by their RRF scores in descending order.
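To make the scoring step concrete, here is a minimal standalone sketch of weighted Reciprocal Rank Fusion. It is not the library's code: the function name `weighted_rrf`, its parameters, and the constant `c = 60` (a value commonly used for RRF) are assumptions chosen for illustration, and items are plain strings rather than Documents.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def weighted_rrf(
    rank_lists: Sequence[Sequence[str]],
    weights: Sequence[float],
    c: int = 60,  # assumed smoothing constant, not taken from the thread
) -> List[str]:
    """Fuse several ranked lists into one using weighted RRF."""
    if len(rank_lists) != len(weights):
        raise ValueError("Need exactly one weight per rank list")

    scores: Dict[str, float] = defaultdict(float)
    for ranking, weight in zip(rank_lists, weights):
        # Ranks start at 1, so the top item contributes weight / (1 + c).
        for rank, item in enumerate(ranking, start=1):
            scores[item] += weight / (rank + c)

    # Highest fused score first.
    return sorted(scores, key=lambda item: scores[item], reverse=True)


fused = weighted_rrf([["a", "b", "c"], ["b", "c", "d"]], weights=[0.5, 0.5])
print(fused)
```

Items that appear near the top of several lists accumulate the largest scores, which is why "b" and "c" outrank "a" here even though "a" is first in one list.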

However, EnsembleRetriever does not provide a built-in way to filter documents based on their metadata and date of effectiveness. You would need to implement this functionality yourself. You could do this by adding a filtering step to the rank_fusion and arank_fusion methods (for example, by overriding them in a subclass of EnsembleRetriever) before the rank fusion is applied. This filtering step could remove documents from the retriever_docs lists that do not meet the required metadata and date of effectiveness criteria.

Here is a simplified example of how you could implement this:

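from datetime import datetime
from typing import List, Optional

from langchain_core.callbacks import CallbackManagerForRetrieverRun
from langchain_core.documents import Document
from langchain_core.runnables import RunnableConfig
from langchain_core.runnables.config import patch_config


# Intended as an override of EnsembleRetriever.rank_fusion, e.g. in a subclass.
def rank_fusion(
    self,
    query: str,
    run_manager: CallbackManagerForRetrieverRun,
    *,
    config: Optional[RunnableConfig] = None,
) -> List[Document]:
    # Get the results of all retrievers.
    retriever_docs = [
        retriever.invoke(
            query,
            patch_config(
                config, callbacks=run_manager.get_child(tag=f"retriever_{i+1}")
            ),
        )
        for i, retriever in enumerate(self.retrievers)
    ]

    # Enforce that retrieved docs are Documents for each list in retriever_docs.
    # This must run before filtering so that every item has a .metadata dict.
    for i in range(len(retriever_docs)):
        retriever_docs[i] = [
            Document(page_content=doc) if not isinstance(doc, Document) else doc
            for doc in retriever_docs[i]
        ]

    # Filter documents based on metadata and date of effectiveness.
    # Assumes 'date_of_effectiveness' is stored as a datetime; documents
    # that lack either key are dropped instead of raising KeyError.
    now = datetime.now()
    for i in range(len(retriever_docs)):
        retriever_docs[i] = [
            doc
            for doc in retriever_docs[i]
            if doc.metadata.get("date_of_effectiveness") is not None
            and doc.metadata["date_of_effectiveness"] <= now
            and doc.metadata.get("other_required_metadata") == "required_value"
        ]

    # Apply rank fusion.
    fused_documents = self.weighted_reciprocal_rank(retriever_docs)

    return fused_documents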

Please note that this is a simplified example and you would need to adjust it to fit your specific requirements. You can find more information about the EnsembleRetriever class in the LangChain repository.

As for parsing and extracting metadata from documents, such as the date of effectiveness from the filename, LangChain does not seem to have built-in support for this. You would need to implement this functionality yourself. I recommend using a regular expression or a similar method to extract the date of effectiveness from the filename, and then storing this information in the document's metadata.
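As a rough sketch of that regular-expression approach, the helper below pulls a date out of the filename styles shown in this thread (`01.01.2023` and `012523`). The function name `effective_date_from_filename` and the metadata key `effective_date` are hypothetical, and the pattern assumes a month-day-year order; a six-digit document number in a filename could be mistaken for a date, so treat this as a starting point.

```python
import re
from datetime import date, datetime
from typing import Optional

# Two digits (month), two digits (day), then a 4- or 2-digit year,
# optionally separated by dots and not embedded in a longer digit run.
_DATE_IN_NAME = re.compile(r"(?<!\d)(\d{2})\.?(\d{2})\.?(\d{4}|\d{2})(?!\d)")


def effective_date_from_filename(filename: str) -> Optional[date]:
    """Return the date embedded in a filename, or None if none is found."""
    match = _DATE_IN_NAME.search(filename)
    if match is None:
        return None
    month, day, year = match.groups()
    fmt = "%m%d%Y" if len(year) == 4 else "%m%d%y"
    try:
        return datetime.strptime(month + day + year, fmt).date()
    except ValueError:
        # Digits matched the pattern but are not a valid calendar date.
        return None


name = "28259 Rate Sheet Update 030123_v2.pdf"
found = effective_date_from_filename(name)
# Store the date in ISO form so it sorts correctly even as a string.
metadata = {"source": name, "effective_date": found.isoformat() if found else None}
print(metadata)
```

Storing the value as an ISO 8601 string is one option that keeps the metadata JSON-serializable; keeping a `date` object works as well if your store accepts it.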

I hope this helps! If you have any further questions, please let me know.

Sources

libs/langchain/langchain/retrievers/ensemble.py

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

XariZaru · 2024-01-16T23:34:05Z

Could a contributor share additional insight into how they have retrieved the latest document in a chain? I am not sure if @dosu-bot can answer this. I also have a restriction: I cannot edit any of the original functions in LangChain.

For example, this is what the document data looks like:

FileName                                 DocumentName      UploadDate
28164 Flyr Rate Sheet Update 012523.pdf  2023 Rate Sheets  2023-01-24 20:51:43.537
28259 Rate Sheet Update 030123_v2.pdf    2023 Rate Sheets  2023-03-03 13:45:44.677
28363 Rate Sheet Update 040123.pdf       2023 Rate Sheets  2023-03-31 16:09:04.033
28460 Rate Sheet Update 050323.pdf       2023 Rate Sheets  2023-05-01 11:47:19.827
Rate Sheet Update_052423.pdf             2023 Rate Sheets  2023-05-24 08:10:47.507
061 Rate Sheet Update 060123.pdf         2023 Rate Sheets  2023-07-10 13:56:36.977

dosubot[bot] bot Jan 16, 2024

🤖

Based on the information provided, LangChain document loaders attach metadata (such as the source path) to each Document they produce. However, there does not seem to be built-in functionality for parsing dates out of filenames or for comparing dates in metadata.

To achieve your goal, you might need to extend the functionality of LangChain. Here's a general approach you could take:

Extract Metadata: In your case, the metadata is the date in the filename. As an illustration, the load method of the FacebookChatLoader class (which reads JSON chat exports, not PDFs) can be adapted to extract the date from the filename and add it to the metadata dictionary. The same pattern applies to whichever loader you use for your PDFs.

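import json
import re
from datetime import datetime
from pathlib import Path
from typing import List

# Document and concatenate_rows are the names already used in facebook_chat.py.


def load(self) -> List[Document]:
    """Load documents."""
    p = Path(self.file_path)

    # Extract the date from the filename. Matches both "01.01.2023"
    # (MM.DD.YYYY) and "012523" (MMDDYY), the two styles in this thread.
    match = re.search(r"(?<!\d)(\d{2})\.?(\d{2})\.?(\d{4}|\d{2})(?!\d)", p.stem)
    if match is None:
        raise ValueError(f"No date found in filename: {p.name}")
    month, day, year = match.groups()
    fmt = "%m%d%Y" if len(year) == 4 else "%m%d%y"
    date = datetime.strptime(month + day + year, fmt)

    with open(p, encoding="utf8") as f:
        d = json.load(f)

    text = "".join(
        concatenate_rows(message)
        for message in d["messages"]
        if message.get("content") and isinstance(message["content"], str)
    )
    metadata = {"source": str(p), "date": date}

    return [Document(page_content=text, metadata=metadata)]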

Compare Dates: After extracting the dates, you can compare them to find the latest document. You can do this by sorting the list of documents based on the date in the metadata.

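# `loaders` is a hypothetical list holding one loader per file.
documents = [doc for loader in loaders for doc in loader.load()]
# Sort newest first by the date stored in the metadata.
documents.sort(key=lambda doc: doc.metadata["date"], reverse=True)
latest_document = documents[0]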

Query Document: Finally, you can use LangChain's language model to query the latest document for the information you need.

Please note that this is a general approach and might need to be adjusted based on your specific use case and the structure of your documents. Also, remember to handle potential errors, such as when the date cannot be extracted from the filename.
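The steps above can also be sketched without editing any LangChain function, by post-processing whatever a retriever returns. In this sketch the `Doc` dataclass is only a stand-in for LangChain's Document, and `keep_latest` and the `effective_date` metadata key are hypothetical names; it assumes every date value is a `datetime.date`.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Any, Dict, List


@dataclass
class Doc:
    """Stand-in for LangChain's Document (page_content plus metadata)."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def keep_latest(docs: List[Doc], date_key: str = "effective_date") -> List[Doc]:
    """Keep only the documents that carry the most recent date."""
    # Skip documents whose date is missing or not a date object.
    dated = [d for d in docs if isinstance(d.metadata.get(date_key), date)]
    if not dated:
        return []
    latest = max(d.metadata[date_key] for d in dated)
    return [d for d in dated if d.metadata[date_key] == latest]


# In practice `retrieved` would be the list returned by retriever.invoke(query).
retrieved = [
    Doc("rates from the January sheet", {"effective_date": date(2023, 1, 1)}),
    Doc("rates from the December sheet", {"effective_date": date(2023, 12, 12)}),
    Doc("undated memo"),
]
context_docs = keep_latest(retrieved)
print([d.page_content for d in context_docs])
```

Only the surviving documents would then be passed to the language model as context, so the answer is drawn from the latest rate sheet rather than from an older one that happens to score higher on similarity.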

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

Sources

libs/langchain/langchain/document_loaders/facebook_chat.py

dayosalam Jul 17, 2024

@dosu-bot How can we retrieve the latest information first based on date after embedding?
