feat: Add MetadataBuilder
#6636
@vrunm thanks for this contribution... I will take an in-depth look after Christmas!
I think we should discuss this component. @vrunm I see that there are some conflicts, but in any case, I think there will be a few days to wait...
I like this a lot! The only addition I would like to see (based on just reading the example) is the ability to specify the key name of the meta field where the reply generated by the LLM is stored. So something like
{
'documents': [
{
'id': '08005f665ae33e6b3d8fd4a33fc9f09157ff89e8a2f25698ea1f32127748aeeb',
'content': 'document_0',
'meta': {'user_specified_key': 'reply_0', 'key_0': 'value_0'}
},
{
'id': '60576c63fb17ae9d5dc0cbcc6d7ddfe299bc44a12ff8b7233e73257a3152e9b2',
'content': 'document_1',
'meta': {'user_specified_key': 'reply_1', 'key_1': 'value_1'}
},
{
'id': '038f78b217389bac27f9a4d690dba2b74d3139cea7ccfa955bcd5b332f3166aa',
'content': 'document_2',
'meta': {'user_specified_key': 'reply_2', 'key_2': 'value_2'}
}
]
}
since I could imagine scenarios of wanting to add multiple meta fields to Documents by calling LLMs multiple times (e.g. one call to summarize and maybe another call to extract entities).
Thank you so much for the contribution @vrunm! I agree with @sjrl here: the component would be more useful if the user could specify the key that should be used for meta. I would also change the naming of the parameters. I could see tonnes of applications here. You could even use it for embeddings. So the advanced embedding generation could look like
Also for the case of multiple values that should be added, do people use separate MetadataBuilder instances for that? Or could we maybe do something like this:
builder = MetadataBuilder(meta_keys=["entities", "summary"])
builder.run(data={"entities": [[...], [...]], "summary": ["...", "..."]}, documents=[doc1, doc2])
# would result in {"meta": {"entities": [...], "summary": "..."}} for each document
Thinking about it that way, we could maybe rename the component to
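A minimal sketch of the per-document behaviour proposed in the comment above, assuming each list in data holds exactly one value per Document. The Document dataclass and the build_meta function are hypothetical stand-ins written for illustration; they are not the code from this PR or the Haystack API.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def build_meta(
    documents: List[Document],
    data: Dict[str, List[Any]],
    meta_keys: List[str],
) -> Dict[str, List[Document]]:
    """Store data[key][i] under meta[key] of the i-th document, for every key."""
    for key in meta_keys:
        if key not in data:
            raise KeyError(f"no values were passed for meta key '{key}'")
        if len(data[key]) != len(documents):
            raise ValueError(f"'{key}' needs exactly one value per document")

    enriched = []
    for i, doc in enumerate(documents):
        new_meta = dict(doc.meta)  # copy, so the input documents stay unchanged
        for key in meta_keys:
            new_meta[key] = data[key][i]
        enriched.append(Document(content=doc.content, meta=new_meta))
    return {"documents": enriched}


doc1 = Document(content="document_0")
doc2 = Document(content="document_1")
result = build_meta(
    documents=[doc1, doc2],
    data={"entities": [["entity1"], ["entity2", "entity3"]], "summary": ["Summary 1", "Summary 2"]},
    meta_keys=["entities", "summary"],
)
for doc in result["documents"]:
    print(doc.content, doc.meta)
```

With this alignment rule a length mismatch fails loudly instead of silently copying the whole list onto every Document.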
I have updated the component so that the user can now specify the keys to be used for the meta.
metadata_builder = MetadataBuilder(meta_keys=["entities", "summary"])
documents = [Document(content="document_0"), Document(content="document_1")]
data = {
"entities": ["entity1", "entity2", "entity3"],
"summary": ["Summary 1", "Summary 2", "Summary3"],
}
metadata = [{"": ""}, {"": ""}]
result = metadata_builder.run(documents=documents, data=data, meta=metadata)
print(result)
data = {
'documents': [
{
'id': '08005f665ae33e6b3d8fd4a33fc9f09157ff89e8a2f25698ea1f32127748aeeb',
'content': 'document_0',
'meta': {
'entities': ['entity1', 'entity2', 'entity3'],
'summary': ['Summary 1', 'Summary 2', 'Summary3']
}
},
{
'id': '60576c63fb17ae9d5dc0cbcc6d7ddfe299bc44a12ff8b7233e73257a3152e9b2',
'content': 'document_1',
'meta': {
'entities': ['entity1', 'entity2', 'entity3'],
'summary': ['Summary 1', 'Summary 2', 'Summary3']
}
}
]
}
Hey, @vrunm... I am sorry to have kept you waiting so long. I would put work on this feature on hold until we have better defined what we expect and have made sure that this component fits neatly into a pipeline.
Closing for now.
Related Issues
fixes #5702
fixes #5700
Proposed Changes:
Adds a new component, MetadataBuilder, which takes a list of Documents together with the output of the Generator those Documents were passed to, and adds that output to the Documents as metadata.
The Generator takes a list of Documents and returns replies and metadata.
The MetadataBuilder component takes these replies and metadata and stores them in the metadata of the corresponding Documents.
Best explained through an example:
In this example, three documents are passed to the Generator.
The Generator has generated three replies and metadata for these.
The MetadataBuilder adds the replies and metadata of the Generator as metadata to the three Document objects.
The MetadataBuilder then returns this List of Documents.
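The flow described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the component from this PR: the Document dataclass is a stand-in for the Haystack class, the function name add_generator_output_to_meta is hypothetical, and the "reply" key name is an assumption.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Document:
    """Hypothetical stand-in for a Haystack Document."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def add_generator_output_to_meta(
    documents: List[Document],
    replies: List[str],
    meta: List[Dict[str, Any]],
    reply_key: str = "reply",  # assumed key name, not taken from the PR
) -> Dict[str, List[Document]]:
    """Attach the i-th reply and the i-th metadata dict to the i-th document."""
    if not (len(documents) == len(replies) == len(meta)):
        raise ValueError("documents, replies and meta must have the same length")

    enriched = []
    for doc, reply, extra in zip(documents, replies, meta):
        # Merge into a new dict so the input documents are left untouched.
        merged = {**doc.meta, **extra, reply_key: reply}
        enriched.append(Document(content=doc.content, meta=merged))
    return {"documents": enriched}


# Three documents, three replies and three metadata dicts, as in the description.
docs = [Document(content=f"document_{i}") for i in range(3)]
replies = [f"reply_{i}" for i in range(3)]
generator_meta = [{f"key_{i}": f"value_{i}"} for i in range(3)]

result = add_generator_output_to_meta(docs, replies, generator_meta)
for doc in result["documents"]:
    print(doc.content, doc.meta)
```

Returning new Document objects rather than mutating the inputs is a design choice of this sketch; the actual component may behave differently.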
How did you test it?
Added unit tests to check when the component:
Tests on Pipelines:
Added a test for a summarization Pipeline using a HuggingFaceLocalGenerator.
Added four tests for a RAG pipeline with the following Generators:
The test checks:
Checklist
fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.