DocumentsBuilder
#5700
Comments
This can also be viewed in the light of #7861 where one could embed the summary of a table, but use the actual table values at generation to get the answer. Similar to https://docs.llamaindex.ai/en/stable/examples/query_engine/pdf_tables/recursive_retriever/.
Hi @sjrl we're considering this for the next sprint. Any update from you? This is still relevant, right?
Hey @julian-risch that is great to hear! And yes this is definitely still relevant. Basically a component that allows for adding LLM-extracted information from a Document into metadata would be very helpful to boost things like retrieval.
I'm confused and need a bit more clarification regarding this ticket/issue, mainly: why is the input a list of lists? `replies: List[List[str]], metadata: List[List[Dict[str, Any]]], documents: Optional[List[List[Document]]]` - wouldn't it make more sense to be `replies: List[str], metadata: List[Dict[str, Any]], documents: Optional[List[Document]]`, or are the internal lists supposed to be chunks of the "main" Document? Is the idea to update the metadata of already existing documents, to create new ones, or should it support both cases? I think an example or a test would be helpful for me to understand the behaviour this component should have. One more thing: isn't the MetadataBuilder a more specific case of this proposed DocumentsBuilder?
Hey @davidsbatista thanks for taking a look at this. This issue was created well before Haystack v2 finalized a lot of decisions, so I agree with you on both of these points.

Point 1. `replies: List[str], metadata: List[Dict[str, Any]], documents: Optional[List[Document]]` - yes, I agree this makes more sense; back when I wrote this it wasn't clear if we were going to support batching right away.

Point 2. Yes, I think you're right. I can't exactly remember why I opened both, but at the time it wasn't clear which component would be better/easier to implement.

Last question: the actual idea (even though poorly explained) was to update the metadata of existing documents.
I have a new proposal regarding the MetaDataBuilder/DocumentBuilder. We can merge both this issue and #5702 into a single issue, with the goal of creating a MetadataBuilder, whose purpose is to update a Document's metadata based on its content. This could be a component that relies on another component to extract the metadata, i.e. a Generator, a NamedEntityExtractor or a CustomComponent. Looking roughly like this:

```python
from typing import Any, Dict, List, Literal

from haystack import Document, component


@component
class MetaDataBuilder:
    """
    A component that allows extracting metadata from a list of documents.

    The extractor should take a string and return a list of dictionaries, extracting metadata from
    the content of the document. The extractor can be:
    - a CustomComponent that receives a string and returns a list of dictionaries (e.g. a custom component based on regexes)
    - a NamedEntityExtractor
    - a Generator
    """

    def run(self, documents: List[Document], extractor: Literal["Component", "NamedEntityExtractor", "Generator"]):
        metadata: List[Dict[str, Any]] = []
        # build all the logic here to use the extractor to extract metadata from the content of the documents
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```

What do you think? (also tagging @julian-risch @mrm1001 for feedback)
@davidsbatista yeah that sounds good to me! The only thing I wanted to clarify is if we've decided that it's okay to pass components to other components as input? I haven't participated in that discussion in a while, so I'm just unaware of the status.
That's a very good point - maybe there's a better way to handle this, but my idea was to have a component that uses any other component to extract metadata from a document's content. @shadeMe what do you think about this? Any suggestions?
I don't see any reason to nest the extractor component inside the metadata builder? Furthermore, how can the wrapping component reason about the inputs of a generic extractor component? We'd have to expose a catch-all ...
Hey @davidsbatista, what about a use case where you might want to embed the document summaries? In this scenario, we would need a component that takes the output of an LLM and returns it as the content of the document.
Doesn't need to be nested, I'm just after a way to leverage existing components to do the extraction of metadata. It would be up to the user to define how the generic extractor behaves and what it extracts; the only concern is that it needs to return a dictionary.

NOTE: the content of the document is never changed, only the metadata. You can use a Generator with a prompt like "give me a summary of this document", and it should be adjusted so that the output is a dictionary.
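To make that Generator-with-a-dictionary-output idea concrete, here is a minimal sketch, assuming an `OpenAIGenerator` and a prompt that asks for JSON (both illustrative choices, not part of the proposal):

```python
# Minimal sketch: prompt an LLM for JSON and parse the reply into a metadata dict.
# Assumes OPENAI_API_KEY is set in the environment; the prompt wording and the "summary" key are illustrative.
import json

from haystack import Document
from haystack.components.generators import OpenAIGenerator

doc = Document(content="Haystack is an open source framework for building LLM applications.")

generator = OpenAIGenerator()
result = generator.run(
    prompt=f'Summarize this document. Reply only with JSON of the form {{"summary": "..."}}.\n\n{doc.content}'
)

# The reply is a string; parsing it yields the dictionary that would be merged into doc.meta.
meta = json.loads(result["replies"][0])
```

In practice the reply may need more robust parsing (e.g. stripping code fences), but the shape of the output is the point: a plain dict that a metadata-updating component can merge.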
Then I'd say let's leave that decision to the user - they can pick the best component for their needs.
I think we can just go with the following: `def run(self, documents: List[Document], metadata: List[Dict[str, Any]])` - the metadata builder just merges those two together.
Ok, I’m just a bit confused why we decided to drop the use case in this issue (originally it's about adding LLM output to document content) and only do MetadataBuilder instead.
I think in many cases you don't need an LLM to do metadata extraction, and rather want to use other components from Haystack, like the NER module, or a custom component that only applies regexes, for instance. But I'm also OK with narrowing this down to simply having a component (`haystack.extractors.metadata_extractor`) that only does metadata extraction from documents based on an LLM prompt.
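For concreteness, a minimal sketch of such a regex-based extractor is below; the component name and the regex are hypothetical, and its output is shaped so it could feed a metadata-merging component like the one discussed next:

```python
# Hypothetical custom component: extract a metadata dict from each document's content with a regex.
import re
from typing import Any, Dict, List

from haystack import Document, component


@component
class RegexMetadataExtractor:
    """Extracts a metadata dictionary from each document's content using a regular expression."""

    @component.output_types(metadata=List[Dict[str, Any]])
    def run(self, documents: List[Document]):
        metadata = []
        for doc in documents:
            # e.g. collect all ISO-style dates mentioned in the text
            dates = re.findall(r"\d{4}-\d{2}-\d{2}", doc.content or "")
            metadata.append({"dates": dates})
        return {"metadata": metadata}
```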
It was not dropped, I was just trying to come up with a more generic component that can use anything you want to do metadata extraction, including an LLM.
Yes, but the original issue is not about metadata extraction, it's about creating documents with the output of LLMs. I'm ok to drop this use case for now, but it would be nice to add a note on why we're not doing it anymore.
@sjrl would this suit your needs? A component that, given documents and new metadata, updates the metadata of the documents?

```python
@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """

    @component.output_types(documents=List[Document])
    def run(self, metadata: List[List[Dict[str, Any]]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```
Hey @davidsbatista I think yes, just to clarify: the type of metadata should be `List[Dict[str, Any]]`. And I say "I think" because there are a few things that would need to be clarified:
Sorry, yes, my mistake - it should only have one list.

Yes, from the docs it should be ... (never used it before) - but in any case, you can also create a custom component.

Yes, I did something before with a Looper custom component, for the same use case.

So, in recap: a component that will update the metadata would be enough for this issue?

```python
@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """

    @component.output_types(documents=List[Document])
    def run(self, metadata: List[Dict[str, Any]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```
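To make the recap concrete, a hypothetical call to the component sketched above (document contents and metadata values are made up):

```python
# Hypothetical usage of the MetadataBuilder sketched above: one metadata dict per document.
from haystack import Document

docs = [
    Document(content="Quarterly revenue grew by 10%."),
    Document(content="The new model was released in March."),
]
new_meta = [{"topic": "finance"}, {"topic": "release"}]

builder = MetadataBuilder()
result = builder.run(metadata=new_meta, documents=docs)
# result["documents"][0].meta now includes {"topic": "finance"} merged with any existing meta
```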
Yes definitely, but right now in dC it's not a straightforward experience bringing in custom components, so it'd be great to have all necessary components for this use case within Haystack :)

So technically yes, but if other custom components (e.g. maybe two) are also needed, then it might make more sense to skip this and just make our own custom component that can fully handle the use case? Otherwise, I'm not sure what this new component on its own would enable.
I'm happy to do that, and I can adapt some of the ideas from the MetaDataExtractor proposal/advanced use case. My only doubt is that currently no component in Haystack uses other components - and I think this ... We could also do it in the experimental package.
@sjrl At this point in time, the design pattern of initializing components with other components is still somewhat contentious, and we have yet to reach a decision about endorsing it "officially" (by introducing a new component that follows such a pattern into Haystack core). So, I think it might make more sense to aim for a custom component to address your use case.

PS: The experimental package offers no guarantees w.r.t. compatibility, so it wouldn't be suitable for use in dC.
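For reference, one possible shape of that custom-component route (the class name, prompt, and the choice to store the summary under a "summary" meta key are all assumptions, not an agreed design):

```python
# Sketch of a user-defined component that wraps a generator and writes an LLM summary into each document's meta.
from typing import List

from haystack import Document, component
from haystack.components.generators import OpenAIGenerator


@component
class DocumentSummarizer:
    def __init__(self):
        # the custom component creates and holds the generator internally
        self.generator = OpenAIGenerator()

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        updated_documents = []
        for doc in documents:
            reply = self.generator.run(prompt=f"Give me a summary of this document:\n{doc.content}")
            summary = reply["replies"][0]
            updated_documents.append(Document(content=doc.content, meta={**doc.meta, "summary": summary}))
        return {"documents": updated_documents}
```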
@sjrl I think if I had the component:

```python
@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """

    @component.output_types(documents=List[Document])
    def run(self, metadata: List[Dict[str, Any]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```

You can make an (indexing) pipeline that uses a ... In this way you can solve it all with a pipeline and existing components, without the need to go for custom components. Does that sound like a solution?
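A hedged sketch of how such a pipeline could be wired, using the hypothetical `RegexMetadataExtractor` from earlier to produce the metadata (any component emitting a `List[Dict[str, Any]]` would do) together with the `MetadataBuilder` above:

```python
# Sketch: an indexing pipeline where an extractor produces metadata and MetadataBuilder merges it back.
# RegexMetadataExtractor and MetadataBuilder are the sketches from earlier comments; "my_file.txt" is a placeholder.
from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors.document_splitter import DocumentSplitter

pipe = Pipeline()
pipe.add_component(name="TextConverter", instance=TextFileToDocument())
pipe.add_component(name="DocumentSplitter", instance=DocumentSplitter())
pipe.add_component(name="MetadataExtractor", instance=RegexMetadataExtractor())
pipe.add_component(name="MetadataBuilder", instance=MetadataBuilder())

pipe.connect("TextConverter.documents", "DocumentSplitter.documents")
# the split documents fan out to both the extractor and the builder
pipe.connect("DocumentSplitter.documents", "MetadataExtractor.documents")
pipe.connect("DocumentSplitter.documents", "MetadataBuilder.documents")
pipe.connect("MetadataExtractor.metadata", "MetadataBuilder.metadata")

result = pipe.run(data={"TextConverter": {"sources": ["my_file.txt"]}})
```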
Hey @davidsbatista I see what you're saying, but I was under the impression that a Generator can only work on one document at a time, whereas a normal indexing pipeline handles a list of documents.

```python
import os

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.converters.output_adapter import OutputAdapter

os.environ["OPENAI_API_KEY"] = "API_KEY"

pipe = Pipeline()
pipe.add_component(name="TextConverter", instance=TextFileToDocument())
pipe.add_component(name="DocumentSplitter", instance=DocumentSplitter())
pipe.add_component(name="PromptBuilder", instance=PromptBuilder())
pipe.add_component(name="OpenAI", instance=OpenAIGenerator())
pipe.add_component(name="OutputAdapter", instance=OutputAdapter())

pipe.connect("TextConverter.documents", "DocumentSplitter.documents")
# I want to loop over documents here
pipe.connect("DocumentSplitter.documents", "PromptBuilder.documents")
pipe.connect("PromptBuilder.prompt", "OpenAI.prompt")
pipe.connect("OpenAI.replies", "OutputAdapter.replies")
```

The above doesn't appear to work since once we hit the ...
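For comparison, a minimal sketch of the per-document fan-out being asked for here, done in plain Python outside a pipeline (roughly what a "Looper"-style custom component would automate); the prompt and the "summary" meta key are illustrative:

```python
# Sketch: run the prompt builder and generator once per document, then attach the reply as metadata.
from haystack import Document
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator

prompt_builder = PromptBuilder(template="Give me a summary of this document:\n{{ document.content }}")
generator = OpenAIGenerator()

documents = [Document(content="First chunk of text."), Document(content="Second chunk of text.")]

updated_documents = []
for doc in documents:
    prompt = prompt_builder.run(document=doc)["prompt"]
    reply = generator.run(prompt=prompt)["replies"][0]
    updated_documents.append(Document(content=doc.content, meta={**doc.meta, "summary": reply}))
```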
going to the backlog, as dC will use another solution for now
Hey, this comes up in discussions, so it would be great to have this functionality. To describe a potential indexing workflow: ...
duplicate of #5702
See the proposal: #5540 and see AnswersBuilder.

LLM clients output strings, but many components expect other object types, and LLMs may produce output in a parsable format that can be directly converted into objects. Output parsers transform these strings into objects of the user's choosing. DocumentsBuilder takes the string replies and metadata output of an LLM and produces Document objects.

For example, a PromptNode could be used to summarize a longer doc, and the user would like to have the result output as a Document object. This Document object could then be shown to the end user, or it could be used in another PromptNode to answer a question, for example.

Draft I/O for DocumentsBuilder:
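A sketch of the draft run signature, reconstructed from the input types quoted in the discussion above; the class decorator and output type are assumptions following the 2.x style used elsewhere in the thread, not part of the original draft:

```python
# Draft I/O for DocumentsBuilder as quoted in the discussion: batched, list-of-lists inputs.
from typing import Any, Dict, List, Optional

from haystack import Document, component


@component
class DocumentsBuilder:
    @component.output_types(documents=List[List[Document]])  # output type assumed, not specified in the draft
    def run(
        self,
        replies: List[List[str]],
        metadata: List[List[Dict[str, Any]]],
        documents: Optional[List[List[Document]]] = None,
    ):
        ...
```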