DocumentsBuilder #5700

Closed
Tracked by #5330
sjrl opened this issue Sep 1, 2023 · 28 comments · Fixed by deepset-ai/haystack-experimental#92
Assignees
Labels
2.x Related to Haystack v2.0 P2 Medium priority, add to the next sprint if no P1 available type:feature New feature or request

Comments

@sjrl
Contributor

sjrl commented Sep 1, 2023

See the proposal: #5540 and see AnswersBuilder


LLM clients output strings, but many components expect other object types, and LLMs may produce output in a parsable format that can be directly converted into objects. Output parsers transform these strings into objects of the user’s choosing.

DocumentsBuilder: it takes the string replies and metadata output of an LLM and produces Document objects.

For example, a PromptNode could be used to summarize a longer doc and the user would like to have the result output as a Document object. This document object could then be shown to the end-user or it could be used in another PromptNode to answer a question for example.

Draft I/O for DocumentsBuilder:

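from typing import Any, Dict, List, Optional

from haystack import Document, component


@component
class DocumentsBuilder:

    @component.output_types(documents=List[List[Document]])
    def run(self, replies: List[List[str]], metadata: List[List[Dict[str, Any]]], documents: Optional[List[List[Document]]] = None):
        all_documents = []
        # Fall back to empty source lists when no documents are passed.
        sources = documents if documents is not None else [[] for _ in replies]
        for replies_list, meta_list, document_list in zip(replies, metadata, sources):
            # One new Document per reply, carrying that reply's metadata and the source documents.
            new_documents = [Document(content=reply, meta={**meta, "documents": document_list}) for reply, meta in zip(replies_list, meta_list)]
            all_documents.append(new_documents)
        return {"documents": all_documents}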
@sjrl sjrl mentioned this issue Sep 1, 2023
@sjrl sjrl added the 2.x Related to Haystack v2.0 label Sep 1, 2023
@sjrl sjrl changed the title DocumentsBuilder DocumentsBuilder Sep 1, 2023
@Timoeller Timoeller modified the milestone: 2.0-beta Oct 9, 2023
@Timoeller Timoeller added the P3 Low priority, leave it in the backlog label Oct 12, 2023
@mathislucka mathislucka added the type:feature New feature or request label Dec 22, 2023
@mrm1001

mrm1001 commented Jun 21, 2024

This can also be viewed in the light of #7861 where one could embed the summary of a table, but use the actual table values at generation to get the answer. Similar to https://docs.llamaindex.ai/en/stable/examples/query_engine/pdf_tables/recursive_retriever/.

@mrm1001 mrm1001 added P1 High priority, add to the next sprint and removed P3 Low priority, leave it in the backlog labels Jun 28, 2024
@julian-risch
Member

Hi @sjrl we're considering this for the next sprint. Any update from you? This is still relevant, right?

@sjrl
Contributor Author

sjrl commented Jun 28, 2024

Hey @julian-risch that is great to hear! And yes this is definitely still relevant. Basically a component that allows for adding LLM extracted information from a Document into metadata would be very helpful to boost things like retrieval.

@davidsbatista
Contributor

I'm confused and need a bit more clarification regarding this ticket/issue, mainly:

why is the input a list of lists?

replies: List[List[str]], metadata: List[List[Dict[str, Any]]], documents: Optional[List[List[Document]]]

wouldn't it make more sense to be

replies: List[str], metadata: List[Dict[str, Any]], documents: Optional[List[Document]]

or are the internal lists supposed to be chunks of the "main" Document?

Is the idea to update the metadata of already existing documents or to create new ones, or should it support both cases?

I think an example or a test would be helpful for me to understand the behaviour this component should have.

One more thing, isn't the MetadataBuilder a more specific case of this proposed DocumentBuilder?

@sjrl
Contributor Author

sjrl commented Jul 3, 2024

Hey @davidsbatista thanks for taking a look at this. This issue was created well before Haystack v2 finalized a lot of decisions, so I agree with you on both of these points.

Point 1.

wouldn't it make more sense to be

replies: List[str], metadata: List[Dict[str, Any]], documents: Optional[List[Document]]

Yes, I agree this makes more sense. Back when I wrote this, it wasn't clear whether we were going to support batching right away.

Point 2.

One more thing, isn't the #5702 a more specific case of this proposed DocumentBuilder?

Yes I think you're right. I can't exactly remember why I opened both, but I think at the time it wasn't clear which component would be better/easier to implement.

Last question
And in regards to this

Is the idea to update the metadata of already existing documents or to create new ones, or should it support both cases?

The actual idea (even though poorly explained) was to update the metadata of existing documents.

@davidsbatista
Contributor

davidsbatista commented Jul 3, 2024

I have a new proposal regarding the MetaDataBuilder/DocumentBuilder. We can merge both this issue and #5702 into a single issue, with the goal of creating a MetadataBuilder, whose purpose is to update a Document's metadata based on its content.

This could be a component that relies on another component to extract the metadata, e.g. a Generator, a NamedEntityExtractor or a CustomComponent.

Looking roughly like this:

from typing import Any, Dict, List, Literal

from haystack import Document, component


@component
class MetaDataBuilder:
    """
    A component that extracts metadata from a list of documents.

    The extractor should take a string and return a list of dictionaries, and be able to extract metadata from
    the content of the document.

    The extractor can be:
        - a CustomComponent that receives a string and returns a list of dictionaries (e.g.: a custom component based on regexes)
        - a NamedEntityExtractor
        - a Generator
    """
    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document], extractor: Literal["Component", "NamedEntityExtractor", "Generator"]):
        metadata: List[Dict[str, Any]] = []

        # here build all the logic to use the extractor to extract metadata from the content of the documents

        updated_documents = []
        for meta, document in zip(metadata, documents):
            # Extracted values override existing keys with the same name.
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}

What do you think?

(also tagging @julian-risch @mrm1001 for feedback)

@sjrl
Contributor Author

sjrl commented Jul 3, 2024

@davidsbatista yeah that sounds good to me! The only thing I wanted to clarify is if we've decided that it's okay to pass components to other components as input? I haven't participated in that discussion in a while, so I'm just unaware of the status.

@davidsbatista
Contributor

That's a very good point - maybe there's a better way to handle this, but my idea was to have a component that uses any other component to extract metadata from a document's content.

@shadeMe what do you think about this? any suggestions?

@shadeMe
Contributor

shadeMe commented Jul 3, 2024

I don't see any reason to nest the extractor component inside the metadata builder? Furthermore, how can the wrapping component reason about the inputs of a generic extractor component? We'd have to expose a catch-all extractor_args: Dict[str, Any] parameter which might as well be left to the pipeline, i.e., the extractor component's outputs are connected to the input of the metadata builder.

@mrm1001

mrm1001 commented Jul 3, 2024

Hey @davidsbatista, what about a use case where you might want to embed the document summaries? In this scenario, we would need a component that takes the output of an LLM and returns the output of the LLM as the content of the document.
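
To make this use case concrete, here is a minimal sketch, not a Haystack implementation. `Doc` is a hypothetical stand-in for Haystack's `Document` (only `content` and `meta`), and `build_summary_docs` and the `source_content` key are illustrative names. It creates one new document per LLM reply and keeps the original text in the metadata, so the summary can be embedded while the source is still available at generation time.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for Haystack's Document (content + meta only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def build_summary_docs(replies: List[str], sources: List[Doc]) -> List[Doc]:
    """Create one new Doc per reply, linking it back to its source document."""
    if len(replies) != len(sources):
        raise ValueError("expected exactly one reply per source document")
    return [
        # The reply becomes the content; the source text travels along in meta.
        Doc(content=reply, meta={**src.meta, "source_content": src.content})
        for reply, src in zip(replies, sources)
    ]


table = Doc(content="| year | revenue |\n| 2023 | 10 |", meta={"file": "report.pdf"})
summaries = build_summary_docs(["Revenue table for 2023."], [table])
```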

@davidsbatista
Contributor

I don't see any reason to nest the extractor component inside the metadata builder?

Doesn't need to be nested, I'm just after a way to leverage existing components to do the extraction of metadata.

Furthermore, how can the wrapping component reason about the inputs of a generic extractor component? We'd have to expose a catch-all extractor_args: Dict[str, Any] parameter which might as well be left to the pipeline, i.e., the extractor component's outputs are connected to the input of the metadata builder.

It would be up to the user to define how the generic extractor behaves and what it extracts; the only concern is that it needs to return a Dict[str, Any], which is the metadata structure that will be added to the Document.

@davidsbatista
Contributor

a component that takes the output of an LLM and returns the output of the LLM as the content of the document

NOTE: the content of the document is never changed, only the metadata is updated.

You can use a Generator with a prompt such as "give me a summary of this document", adjusted so that the output is a dictionary, {"summary": <document_summary_text>}, which would then be added to the original Document's metadata.
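
As a rough illustration of the reply-to-dictionary step, the sketch below assumes the LLM was asked to answer with a JSON object. `reply_to_meta` is a hypothetical helper, not part of Haystack, and the fallback to a plain `summary` key is an assumption about how one might handle replies that are not valid JSON.

```python
import json
from typing import Any, Dict


def reply_to_meta(reply: str) -> Dict[str, Any]:
    """Parse an LLM reply into a metadata dict, falling back to a plain summary."""
    try:
        parsed = json.loads(reply)
    except json.JSONDecodeError:
        # Not JSON at all: treat the whole reply as the summary text.
        return {"summary": reply.strip()}
    # Only a JSON object maps cleanly onto document metadata.
    return parsed if isinstance(parsed, dict) else {"summary": reply.strip()}


meta_from_json = reply_to_meta('{"summary": "A short summary."}')
meta_from_text = reply_to_meta("  A short summary.  ")
```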

@shadeMe
Contributor

shadeMe commented Jul 3, 2024

Doesn't need to be nested, I'm just after a way to leverage existing components to do the extraction of metadata.

Then I'd say let's leave that decision to the user - They can pick the best component for their needs.

It would be up to the user to define how the generic extractor behaves and what it extracts; the only concern is that it needs to return a Dict[str, Any], which is the metadata structure that will be added to the Document.

I think we can just go with the following:

def run(self, documents: List[Document], metadata: List[Dict[str, Any]]):

The metadata builder just merges those two together.
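
The merge itself can be sketched in plain Python. This is only an illustration under assumptions: `Doc` is a hypothetical stand-in for Haystack's `Document`, and `merge_metadata` is an illustrative name. The length check is a design choice added here, because `zip` would otherwise silently drop unmatched items.

```python
from dataclasses import dataclass, field, replace
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in for Haystack's Document (content + meta only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def merge_metadata(documents: List[Doc], metadata: List[Dict[str, Any]]) -> List[Doc]:
    """Return copies of the documents with the new metadata merged in."""
    if len(documents) != len(metadata):
        raise ValueError("documents and metadata must have the same length")
    # New keys override existing ones; the input documents are left untouched.
    return [replace(doc, meta={**doc.meta, **meta}) for doc, meta in zip(documents, metadata)]


docs = [Doc("first", {"lang": "en"}), Doc("second")]
merged = merge_metadata(docs, [{"topic": "a"}, {"topic": "b"}])
```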

@mrm1001

mrm1001 commented Jul 3, 2024

Ok, I’m just a bit confused why we decided to drop the use case in this issue (originally it’s about adding LLM output to document content) and only do metadataBuilder instead.

@davidsbatista
Contributor

I think in many cases you don't need LLM to do metadata extraction, and rather want to use other components from Haystack, like the NER module, or a custom component that only applies regexes for instance.

But I'm also OK about narrowing this down to simply having a component (haystack.extractors.metadata_extractor) that only does metadata extraction from documents based on an LLM prompt.

def run(self, documents: List[Document], prompt: str) -> List[Dict[str, Any]]:

@davidsbatista
Contributor

Ok, I’m just a bit confused why we decided to drop the use case in this issue (originally it’s about adding LLM output to document content) and only do metadataBuilder instead.

It was not dropped, I was just trying to come up with a more generic component that can use anything you want to do metadata extraction, including an LLM.

@mrm1001

mrm1001 commented Jul 3, 2024

Yes, but the original issue is not about metadata extraction, it's about creating documents with the output of LLMs. I'm ok to drop this use case for now, but it would be nice to add a note on why we're not doing it anymore.

@davidsbatista
Contributor

@sjrl would this suit your needs? A component that given documents and new metadata updates the metadata of the documents?

from typing import Any, Dict, List

from haystack import Document, component


@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """
    @component.output_types(documents=List[Document])
    def run(self, metadata: List[List[Dict[str, Any]]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}

@sjrl
Contributor Author

sjrl commented Jul 3, 2024

Hey @davidsbatista I think yes, just to clarify the type of metadata should be List[Dict[str, Any]] right?

And I say "I think" because I think there are a few things that would need to be clarified:

  • is it possible to convert the output of a Generator into a List[Dict[str, Any]] (maybe solvable with the output adapter)?
  • is it possible to run an LLM on a list of Documents and get out a List of replies (again maybe a silly question, but I just want to check that this is possible)?

@davidsbatista
Contributor

Hey @davidsbatista I think yes, just to clarify the type of metadata should be List[Dict[str, Any]] right?

sorry, yes, my mistake - it should only have one List[]

is it possible to convert the output of a Generator into a List[Dict[str, Any]] (maybe solvable with the output adapter)?

yes, from the docs it should be (never used it before) - but in any case, you can also create a custom component

is it possible to run an LLM on a list of Documents and get out a List of replies (again maybe a silly question, but I just want to check that this is possible)?

yes, I did something before with a Looper custom component, for the same use case

so in recap a component that will update the Metadata would be enough for this issue?

from typing import Any, Dict, List

from haystack import Document, component


@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """
    @component.output_types(documents=List[Document])
    def run(self, metadata: List[Dict[str, Any]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            # New metadata overrides existing keys with the same name.
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}

@sjrl
Contributor Author

sjrl commented Jul 3, 2024

but in any case, you can also create a custom component

Yes definitely, but for right now in dC it's not a straightforward experience bringing in custom components so it'd be great to have all necessary components for this use case within Haystack :)

so in recap a component that will update the Metadata would be enough for this issue?

So technically yes, but if other custom components (e.g. maybe two) are also needed, then it might make more sense to skip this and then just make our own custom component that can fully handle the use case? Otherwise, I'm not sure what this new component on its own would enable.

@davidsbatista
Contributor

just make our own custom component that can fully handle the use case

I'm happy to do that, and I can adapt some of the ideas from the MetaDataExtractor proposal/advanced use case.

My only doubt is that currently no component in Haystack uses other components, and I think this MetaDataExtractor would definitely benefit from reusing other components. I don't know how much this would break Haystack's philosophy/architectural design. (@shadeMe ?)

We could also do it in haystack-experimental, which dC can also install and use, and later decide how to port it to the main haystack repository.

@shadeMe
Contributor

shadeMe commented Jul 3, 2024

@sjrl At this point in time, the design pattern of initializing components with other components is still somewhat contentious, and we have yet to reach a decision about endorsing it "officially" (by introducing a new component that follows such a pattern into Haystack core).

So, I think it might make more sense to aim for a custom component to address your usecase.

PS: The experimental package offers no guarantees w.r.t compatibility, so it wouldn't be suitable for use in dC.

@davidsbatista
Contributor

@sjrl I think if I had the component:

from typing import Any, Dict, List

from haystack import Document, component


@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """
    @component.output_types(documents=List[Document])
    def run(self, metadata: List[Dict[str, Any]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            # New metadata overrides existing keys with the same name.
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}

You can make an (indexing) pipeline that uses a Generator to extract metadata from Document(s), the OutputAdapter to transform the Generator's output into the metadata structure (Dict[str, Any]), and then this MetadataBuilder to update the metadata.

In this way you can solve it all with a pipeline and existing components, without the need to go for custom components.

Does that sound like a solution?

@sjrl
Contributor Author

sjrl commented Jul 4, 2024

Hey @davidsbatista I see what you're saying, but I was under the impression that a Generator can only work on one document at a time, whereas a normal indexing pipeline handles List[Document]. So something like

import os
from typing import List

from haystack import Pipeline
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.converters.output_adapter import OutputAdapter
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.preprocessors.document_splitter import DocumentSplitter

os.environ["OPENAI_API_KEY"] = "API_KEY"

# Placeholder templates: PromptBuilder and OutputAdapter each need one to expose their inputs.
prompt_template = "{% for doc in documents %}{{ doc.content }}{% endfor %}"

pipe = Pipeline()
pipe.add_component(name="TextConverter", instance=TextFileToDocument())
pipe.add_component(name="DocumentSplitter", instance=DocumentSplitter())
pipe.add_component(name="PromptBuilder", instance=PromptBuilder(template=prompt_template))
pipe.add_component(name="OpenAI", instance=OpenAIGenerator())
pipe.add_component(name="OutputAdapter", instance=OutputAdapter(template="{{ replies }}", output_type=List[str]))

pipe.connect("TextConverter.documents", "DocumentSplitter.documents")
# I want to loop over documents here
pipe.connect("DocumentSplitter.documents", "PromptBuilder.documents")
pipe.connect("PromptBuilder.prompt", "OpenAI.prompt")
pipe.connect("OpenAI.replies", "OutputAdapter.replies")

The above doesn't appear to work since once we hit the PromptBuilder we lose the ability to run each document individually. Is there a way to overcome this with loops?
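
For reference, the per-document behaviour being asked for looks roughly like the sketch below. It sits outside any Haystack pipeline and does not answer whether pipeline loops can do this. `run_per_document` and `fake_generate` are hypothetical names, and `fake_generate` merely stands in for an LLM call.

```python
from typing import Callable, List


def run_per_document(contents: List[str], template: str, generate: Callable[[str], str]) -> List[str]:
    """Render one prompt per document and collect exactly one reply per document."""
    replies = []
    for content in contents:
        prompt = template.format(document=content)
        replies.append(generate(prompt))
    return replies


def fake_generate(prompt: str) -> str:
    # Hypothetical stand-in for an LLM call: just upper-cases the prompt.
    return prompt.upper()


replies = run_per_document(["abc", "defgh"], "Summarize: {document}", fake_generate)
```

Because the replies come back in the same order as the input, they can be zipped with the documents afterwards.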

@davidsbatista
Contributor

going to the backlog, as dC will use another solution for now

@mrm1001 mrm1001 removed the P1 High priority, add to the next sprint label Jul 10, 2024
@JasperLS

JasperLS commented Sep 4, 2024

hey this comes up in discussions, so it would be great to have this functionality. To describe a potential indexing workflow:

  • client uploads data
  • indexing pipeline first does per-file metadata generation
  • this metadata is then attached to all documents when chunking later
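
The workflow above could be sketched as follows. Everything here is illustrative: `index_file`, `word_count_meta` and the fixed-size character chunking are assumptions made for the example, and plain dicts stand in for Haystack `Document` objects.

```python
from typing import Any, Callable, Dict, List


def index_file(text: str, extract_meta: Callable[[str], Dict[str, Any]], chunk_size: int) -> List[Dict[str, Any]]:
    """Generate metadata once per file, then attach it to every chunk."""
    file_meta = extract_meta(text)  # runs once per file, not once per chunk
    pieces = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [
        {"content": piece, "meta": {**file_meta, "chunk_index": i}}
        for i, piece in enumerate(pieces)
    ]


def word_count_meta(text: str) -> Dict[str, Any]:
    # Hypothetical extractor; in practice this could be an LLM call.
    return {"n_words": len(text.split())}


source_text = "one two three four"
chunks = index_file(source_text, word_count_meta, chunk_size=8)
```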

@julian-risch julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Sep 9, 2024
@davidsbatista
Contributor

duplicate of #5702

8 participants