DocumentsBuilder
#5700
Comments
This can also be viewed in the light of #7861 where one could embed the summary of a table, but use the actual table values at generation to get the answer. Similar to https://docs.llamaindex.ai/en/stable/examples/query_engine/pdf_tables/recursive_retriever/.
Hi @sjrl we're considering this for the next sprint. Any update from you? This is still relevant, right?
Hey @julian-risch that is great to hear! And yes this is definitely still relevant. Basically a component that allows for adding LLM-extracted information from a Document into metadata would be very helpful to boost things like retrieval.
I'm confused and need a bit more clarification regarding this ticket/issue, mainly: why is the input a list of lists? `replies: List[List[str]], metadata: List[List[Dict[str, Any]]], documents: Optional[List[List[Document]]]` - wouldn't it make more sense to be `replies: List[str], metadata: List[Dict[str, Any]], documents: Optional[List[Document]]`, or are the internal lists supposed to be chunks of the "main" Document? Is the idea to update the metadata of already existing documents, to create new ones, or should it support both cases? I think an example or a test would be helpful for me to understand the behaviour this component should have. One more thing: isn't the MetadataBuilder a more specific case of this proposed DocumentsBuilder?
Hey @davidsbatista thanks for taking a look at this. This issue was created well before Haystack v2 finalized a lot of decisions, so I agree with you on both of these points.

Point 1. `replies: List[str], metadata: List[Dict[str, Any]], documents: Optional[List[Document]]` - yes, I agree this makes more sense; back when I wrote this it wasn't clear if we were going to support batching right away.

Point 2. Yes, I think you're right. I can't exactly remember why I opened both, but at the time it wasn't clear which component would be better/easier to implement.

Last question: the actual idea (even though poorly explained) was to update the metadata of existing documents.
I have a new proposal regarding the MetaDataBuilder/DocumentBuilder. We can merge both this issue and #5702 into a single issue, with the goal of creating a MetadataBuilder, whose purpose is to update a Document's metadata based on its content. This could be a component that relies on another component to extract the metadata, i.e. a Generator, a NamedEntityExtractor or a CustomComponent. Looking roughly like this:

```python
from typing import Any, Dict, List, Literal

from haystack import Document, component


@component
class MetaDataBuilder:
    """
    A component that allows extracting metadata from a list of documents.

    The extractor should take a string and return a list of dictionaries, extracting metadata from
    the content of the document. The extractor can be:
    - a CustomComponent that receives a string and returns a list of dictionaries (e.g. a custom component based on regexes)
    - a NamedEntityExtractor
    - a Generator
    """

    def run(self, documents: List[Document], extractor: Literal["Component", "NamedEntityExtractor", "Generator"]):
        metadata: List[Dict[str, Any]] = []
        # build all the logic here to use the extractor to extract metadata from the content of the documents
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```

What do you think? (also tagging @julian-risch @mrm1001 for feedback)
@davidsbatista yeah that sounds good to me! The only thing I wanted to clarify is if we've decided that it's okay to pass components to other components as input? I haven't participated in that discussion in a while, so I'm just unaware of the status.
That's a very good point - maybe there's a better way to handle this, but my idea was to have a component that uses any other component to extract metadata from a document's content. @shadeMe what do you think about this? Any suggestions?
I don't see any reason to nest the extractor component inside the metadata builder? Furthermore, how can the wrapping component reason about the inputs of a generic extractor component? We'd have to expose a catch-all ...
Hey @davidsbatista, what about a use case where you might want to embed the document summaries? In this scenario, we would need a component that takes the output of an LLM and returns it as the content of the document.
Doesn't need to be nested, I'm just after a way to leverage existing components to do the extraction of metadata. It would be up to the user to define how the generic extractor behaves and what it extracts; the only concern is that it needs to return a dictionary.

NOTE: the content of the document is never changed, only the metadata. You can use a Generator with a prompt like "give me a summary of this document", and it should be adjusted so that the output is a dictionary.
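To make that Generator-with-a-dictionary-output idea concrete, here is a minimal sketch, assuming an `OpenAIGenerator` and a prompt that asks for JSON (both illustrative choices, not part of the proposal):

```python
# Minimal sketch: prompt an LLM for JSON and parse the reply into a metadata dict.
# Assumes OPENAI_API_KEY is set in the environment; the prompt wording and the "summary" key are illustrative.
import json

from haystack import Document
from haystack.components.generators import OpenAIGenerator

doc = Document(content="Haystack is an open source framework for building LLM applications.")

generator = OpenAIGenerator()
result = generator.run(
    prompt=f'Summarize this document. Reply only with JSON of the form {{"summary": "..."}}.\n\n{doc.content}'
)

# The reply is a string; parsing it yields the dictionary that would be merged into doc.meta.
meta = json.loads(result["replies"][0])
```

In practice the reply may need more robust parsing (e.g. stripping code fences), but the shape of the output is the point: a plain dict that a metadata-updating component can merge.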
Then I'd say let's leave that decision to the user - they can pick the best component for their needs.
I think we can just go with the following: `def run(self, documents: List[Document], metadata: List[Dict[str, Any]])` - the metadata builder just merges those two together.
Ok, I’m just a bit confused why we decided to drop the use case in this issue (originally it's about adding LLM output to document content) and only do MetadataBuilder instead.
I think in many cases you don't need an LLM to do metadata extraction, and rather want to use other components from Haystack, like the NER module, or a custom component that only applies regexes, for instance. But I'm also OK with narrowing this down to simply having a component (`haystack.extractors.metadata_extractor`) that only does metadata extraction from documents based on an LLM prompt.
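For concreteness, a minimal sketch of such a regex-based extractor is below; the component name and the regex are hypothetical, and its output is shaped so it could feed a metadata-merging component like the one discussed next:

```python
# Hypothetical custom component: extract a metadata dict from each document's content with a regex.
import re
from typing import Any, Dict, List

from haystack import Document, component


@component
class RegexMetadataExtractor:
    """Extracts a metadata dictionary from each document's content using a regular expression."""

    @component.output_types(metadata=List[Dict[str, Any]])
    def run(self, documents: List[Document]):
        metadata = []
        for doc in documents:
            # e.g. collect all ISO-style dates mentioned in the text
            dates = re.findall(r"\d{4}-\d{2}-\d{2}", doc.content or "")
            metadata.append({"dates": dates})
        return {"metadata": metadata}
```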
It was not dropped, I was just trying to come up with a more generic component that can use anything you want to do metadata extraction, including an LLM.
Yes, but the original issue is not about metadata extraction, it's about creating documents with the output of LLMs. I'm ok to drop this use case for now, but it would be nice to add a note on why we're not doing it anymore.
@sjrl would this suit your needs? A component that, given documents and new metadata, updates the metadata of the documents?

```python
@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """

    @component.output_types(documents=List[Document])
    def run(self, metadata: List[List[Dict[str, Any]]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```
Hey @davidsbatista I think yes, just to clarify: the type of metadata should be `List[Dict[str, Any]]`. And I say "I think" because there are a few things that would need to be clarified:
Sorry, yes, my mistake - it should only have one list.

Yes, from the docs it should be ... (never used it before) - but in any case, you can also create a custom component.

Yes, I did something before with a Looper custom component, for the same use case.

So, in recap: a component that will update the metadata would be enough for this issue?

```python
@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """

    @component.output_types(documents=List[Document])
    def run(self, metadata: List[Dict[str, Any]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```
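To make the recap concrete, a hypothetical call to the component sketched above (document contents and metadata values are made up):

```python
# Hypothetical usage of the MetadataBuilder sketched above: one metadata dict per document.
from haystack import Document

docs = [
    Document(content="Quarterly revenue grew by 10%."),
    Document(content="The new model was released in March."),
]
new_meta = [{"topic": "finance"}, {"topic": "release"}]

builder = MetadataBuilder()
result = builder.run(metadata=new_meta, documents=docs)
# result["documents"][0].meta now includes {"topic": "finance"} merged with any existing meta
```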
Yes definitely, but right now in dC it's not a straightforward experience bringing in custom components, so it'd be great to have all necessary components for this use case within Haystack :)

So technically yes, but if other custom components (e.g. maybe two) are also needed, then it might make more sense to skip this and just make our own custom component that can fully handle the use case? Otherwise, I'm not sure what this new component on its own would enable.
I'm happy to do that, and I can adapt some of the ideas from the MetaDataExtractor proposal/advanced use case. My only doubt is that currently no component in Haystack uses other components - and I think this ... We could also do it in the experimental package.
@sjrl At this point in time, the design pattern of initializing components with other components is still somewhat contentious, and we have yet to reach a decision about endorsing it "officially" (by introducing a new component that follows such a pattern into Haystack core). So, I think it might make more sense to aim for a custom component to address your use case.

PS: The experimental package offers no guarantees w.r.t. compatibility, so it wouldn't be suitable for use in dC.
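For reference, one possible shape of that custom-component route (the class name, prompt, and the choice to store the summary under a "summary" meta key are all assumptions, not an agreed design):

```python
# Sketch of a user-defined component that wraps a generator and writes an LLM summary into each document's meta.
from typing import List

from haystack import Document, component
from haystack.components.generators import OpenAIGenerator


@component
class DocumentSummarizer:
    def __init__(self):
        # the custom component creates and holds the generator internally
        self.generator = OpenAIGenerator()

    @component.output_types(documents=List[Document])
    def run(self, documents: List[Document]):
        updated_documents = []
        for doc in documents:
            reply = self.generator.run(prompt=f"Give me a summary of this document:\n{doc.content}")
            summary = reply["replies"][0]
            updated_documents.append(Document(content=doc.content, meta={**doc.meta, "summary": summary}))
        return {"documents": updated_documents}
```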
@sjrl I think if I had the component:

```python
@component
class MetadataBuilder:
    """
    A component that allows updating a Document's metadata.
    """

    @component.output_types(documents=List[Document])
    def run(self, metadata: List[Dict[str, Any]], documents: List[Document]):
        updated_documents = []
        for meta, document in zip(metadata, documents):
            updated_documents.append(Document(content=document.content, meta={**document.meta, **meta}))
        return {"documents": updated_documents}
```

You can make an (indexing) pipeline that uses a ... In this way you can solve it all with a pipeline and existing components, without the need to go for custom components. Does that sound like a solution?
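A hedged sketch of how such a pipeline could be wired, using the hypothetical `RegexMetadataExtractor` from earlier to produce the metadata (any component emitting a `List[Dict[str, Any]]` would do) together with the `MetadataBuilder` above:

```python
# Sketch: an indexing pipeline where an extractor produces metadata and MetadataBuilder merges it back.
# RegexMetadataExtractor and MetadataBuilder are the sketches from earlier comments; "my_file.txt" is a placeholder.
from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors.document_splitter import DocumentSplitter

pipe = Pipeline()
pipe.add_component(name="TextConverter", instance=TextFileToDocument())
pipe.add_component(name="DocumentSplitter", instance=DocumentSplitter())
pipe.add_component(name="MetadataExtractor", instance=RegexMetadataExtractor())
pipe.add_component(name="MetadataBuilder", instance=MetadataBuilder())

pipe.connect("TextConverter.documents", "DocumentSplitter.documents")
# the split documents fan out to both the extractor and the builder
pipe.connect("DocumentSplitter.documents", "MetadataExtractor.documents")
pipe.connect("DocumentSplitter.documents", "MetadataBuilder.documents")
pipe.connect("MetadataExtractor.metadata", "MetadataBuilder.metadata")

result = pipe.run(data={"TextConverter": {"sources": ["my_file.txt"]}})
```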
Hey @davidsbatista I see what you're saying, but I was under the impression that a Generator can only work on one document at a time, whereas a normal indexing pipeline handles a list of documents.

```python
import os

from haystack import Pipeline
from haystack.components.converters.txt import TextFileToDocument
from haystack.components.preprocessors.document_splitter import DocumentSplitter
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator
from haystack.components.converters.output_adapter import OutputAdapter

os.environ["OPENAI_API_KEY"] = "API_KEY"

pipe = Pipeline()
pipe.add_component(name="TextConverter", instance=TextFileToDocument())
pipe.add_component(name="DocumentSplitter", instance=DocumentSplitter())
pipe.add_component(name="PromptBuilder", instance=PromptBuilder())
pipe.add_component(name="OpenAI", instance=OpenAIGenerator())
pipe.add_component(name="OutputAdapter", instance=OutputAdapter())

pipe.connect("TextConverter.documents", "DocumentSplitter.documents")
# I want to loop over documents here
pipe.connect("DocumentSplitter.documents", "PromptBuilder.documents")
pipe.connect("PromptBuilder.prompt", "OpenAI.prompt")
pipe.connect("OpenAI.replies", "OutputAdapter.replies")
```

The above doesn't appear to work since once we hit the ...
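For comparison, a minimal sketch of the per-document fan-out being asked for here, done in plain Python outside a pipeline (roughly what a "Looper"-style custom component would automate); the prompt and the "summary" meta key are illustrative:

```python
# Sketch: run the prompt builder and generator once per document, then attach the reply as metadata.
from haystack import Document
from haystack.components.builders.prompt_builder import PromptBuilder
from haystack.components.generators.openai import OpenAIGenerator

prompt_builder = PromptBuilder(template="Give me a summary of this document:\n{{ document.content }}")
generator = OpenAIGenerator()

documents = [Document(content="First chunk of text."), Document(content="Second chunk of text.")]

updated_documents = []
for doc in documents:
    prompt = prompt_builder.run(document=doc)["prompt"]
    reply = generator.run(prompt=prompt)["replies"][0]
    updated_documents.append(Document(content=doc.content, meta={**doc.meta, "summary": reply}))
```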
going to the backlog, as dC will use another solution for now
Hey, this comes up in discussions, so it would be great to have this functionality. To describe a potential indexing workflow: ...
duplicate of #5702
See the proposal: #5540 and see AnswersBuilder.

LLM clients output strings, but many components expect other object types, and LLMs may produce output in a parsable format that can be directly converted into objects. Output parsers transform these strings into objects of the user's choosing. DocumentsBuilder takes the string replies and metadata output of an LLM and produces Document objects.

For example, a PromptNode could be used to summarize a longer doc, and the user would like to have the result output as a Document object. This Document object could then be shown to the end user, or it could be used in another PromptNode to answer a question, for example.

Draft I/O for DocumentsBuilder:
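A sketch of the draft run signature, reconstructed from the input types quoted in the discussion above; the class decorator and output type are assumptions following the 2.x style used elsewhere in the thread, not part of the original draft:

```python
# Draft I/O for DocumentsBuilder as quoted in the discussion: batched, list-of-lists inputs.
from typing import Any, Dict, List, Optional

from haystack import Document, component


@component
class DocumentsBuilder:
    @component.output_types(documents=List[List[Document]])  # output type assumed, not specified in the draft
    def run(
        self,
        replies: List[List[str]],
        metadata: List[List[Dict[str, Any]]],
        documents: Optional[List[List[Document]]] = None,
    ):
        ...
```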