Enhance Data Storage and Filtering by Additional Metadata Dimensions #548

precisioninfinity · 2023-04-19T22:49:00Z

I'd like to propose a new feature that expands the current data storage and search capabilities in the repository by introducing additional dimensions for metadata. At present, we can store and retrieve data by collection name and ExternalSourceName for reference data, but I believe there are use cases that would benefit from even more flexible options.

Feature Request:

Add the ability to store and search data by more dimensions, such as TenantID and Permission Level.

Use Cases:

TenantID: This field could represent a customer code, tenant, or any other logical grouping, allowing users to better organize and filter data based on their specific needs. For example, consider a knowledge base containing data from multiple tenants or customers. With TenantID, users could easily filter search results to only display information relevant to a particular tenant.

Permission Level: This field could be used to store and filter data based on the allowed permissions for a particular group, such as admins and non-admins. This would provide better access control and security, ensuring that users only see the information they are authorized to view.

It might be tempting to assume the collection name could be used for this purpose, but there are scenarios where this is not sufficient. For instance, a developer might need to perform a global search across all tenants, which would not be possible if the collection name were used to represent both the tenant and the permission level. By introducing additional dimensions, we would have more flexibility and efficiency for users with diverse requirements.

Ideally, this optional metadata could be passed as part of the query so that filtering happens on the datastore side, which would be more efficient.
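To make the proposal concrete, here is a minimal in-memory sketch of a search that accepts optional metadata alongside the query. It is not Semantic Kernel code: `MemoryRecord`, `search`, and the `tenant_id` / `permission_level` keys are hypothetical names chosen for illustration, and a real datastore would apply the filter on its own side rather than in a Python loop.

```python
import math
from dataclasses import dataclass, field


@dataclass
class MemoryRecord:
    text: str
    embedding: list[float]
    # Hypothetical metadata dimensions from the proposal (not an existing SK type).
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(records, query_embedding, limit=3, metadata_filter=None):
    """Rank records by cosine similarity, keeping only those whose metadata
    equals every key/value pair in metadata_filter (None means global search)."""
    wanted = metadata_filter or {}
    candidates = [
        r for r in records
        if all(r.metadata.get(key) == value for key, value in wanted.items())
    ]
    ranked = sorted(
        candidates,
        key=lambda r: cosine(r.embedding, query_embedding),
        reverse=True,
    )
    return ranked[:limit]


records = [
    MemoryRecord("Contoso onboarding guide", [1.0, 0.0],
                 {"tenant_id": "contoso", "permission_level": "user"}),
    MemoryRecord("Contoso admin runbook", [0.9, 0.1],
                 {"tenant_id": "contoso", "permission_level": "admin"}),
    MemoryRecord("Fabrikam onboarding guide", [1.0, 0.05],
                 {"tenant_id": "fabrikam", "permission_level": "user"}),
]

# Tenant- and permission-scoped search.
tenant_hits = search(
    records, [1.0, 0.0],
    metadata_filter={"tenant_id": "contoso", "permission_level": "user"},
)
# Global search across all tenants: same collection, no filter.
global_hits = search(records, [1.0, 0.0])
```

Because the tenant lives in metadata rather than in the collection name, the global search needs no extra work: it is the same call without a filter.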

In summary, I believe that extending the data storage and search functionality by adding the ability to filter by TenantID and Permission Level would greatly improve the user experience and enable a wider range of use cases. Let me know your thoughts on this proposal and if there are any additional considerations we should take into account.

precisioninfinity · 2023-04-20T00:21:53Z

Author

Thinking about this more, there really should be more metadata for storage and retrieval, and it should be expandable. There's so much possible metadata that you'd want to prefilter on. It looks like Pinecone has metadata filtering, and I'm assuming there will eventually be a Pinecone implementation of the memory store?

evchaki · 2023-04-20T20:50:05Z

Collaborator

@shawncal can you take a look at this?

precisioninfinity · 2023-04-20T23:23:27Z

Author

I found some documentation on Pinecone's metadata filtering. It looks powerful and useful for all the scenarios I have in mind: https://docs.pinecone.io/docs/metadata-filtering
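For readers who have not seen that style of filter, the sketch below evaluates a small MongoDB-like filter expression against a metadata dictionary. It is a local illustration only: it does not call the Pinecone client, the `matches` function is a hypothetical helper, and the handling of missing fields is an assumption of this sketch rather than documented Pinecone behavior.

```python
from typing import Any

# Comparison operators in the MongoDB-like style used by Pinecone's filter docs.
# Missing fields arrive here as None; treating them as "no match" for range
# comparisons is an assumption of this sketch.
_OPERATORS = {
    "$eq": lambda value, arg: value == arg,
    "$ne": lambda value, arg: value != arg,
    "$gt": lambda value, arg: value is not None and value > arg,
    "$gte": lambda value, arg: value is not None and value >= arg,
    "$lt": lambda value, arg: value is not None and value < arg,
    "$lte": lambda value, arg: value is not None and value <= arg,
    "$in": lambda value, arg: value in arg,
    "$nin": lambda value, arg: value not in arg,
}


def matches(metadata: dict[str, Any], flt: dict[str, Any]) -> bool:
    """Return True when metadata satisfies every condition in flt (implicit AND)."""
    for field_name, condition in flt.items():
        value = metadata.get(field_name)
        if not isinstance(condition, dict):
            # In this sketch a bare value is shorthand for {"$eq": value}.
            condition = {"$eq": condition}
        for op, arg in condition.items():
            if op not in _OPERATORS:
                raise ValueError(f"unsupported operator: {op}")
            if not _OPERATORS[op](value, arg):
                return False
    return True


doc = {"tenant_id": "contoso", "permission_level": 2}

allowed = matches(doc, {"tenant_id": {"$eq": "contoso"},
                        "permission_level": {"$lte": 2}})
denied = matches(doc, {"tenant_id": "contoso",
                       "permission_level": {"$in": [0, 1]}})
```

A dictionary-based syntax like this is easy to serialize and pass through to a backend, which is one reason it could serve as a simple starting point for a filter format.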

I haven't used it myself yet, but I came across this really interesting article where they talk about how hard it actually is to do this filtering at the database level, and how they do it by combining the vector and metadata indexes. I would like to use Pinecone for this, to quickly trim the data down and provide filtered, relevant context in my Semantic Kernel usage: https://www.pinecone.io/learn/vector-search-filtering/

dluc · 2023-04-22T04:25:34Z

Maintainer

My first reaction is that further query patterns, such as filtering and joining, are not part of Semantic Memory, where we deal with unstructured data. On the other hand, one could build more complex scenarios by going directly to the storage features, without using SK Memory, or develop bespoke versions of Memory for personal scenarios. For instance, Azure Cognitive Search has a very advanced set of features that one could build on.

2 replies

precisioninfinity Apr 22, 2023
Author

Thanks for the reply, dluc. I see no need for joins, just metadata filtering. You're seriously limiting performance by not allowing metadata filters to pass through to the underlying datastore provider for pre-filtering. And isn't the point of using SK to make it easier for developers to work at a higher level while leveraging the underlying tools?

(More technical details: ideally, each datastore provider that implements IMemoryStore would leverage the native filtering of the underlying datastore, so that the datastore can quickly search, filter, and return trimmed nearest-match results. Any provider that doesn't support filtering, or whose native filtering work is deferred until later, would have to post-filter the results, which would be slower but still very useful.

I realize it's not a simple request, because you'll have to come up with a metadata filter syntax, ideally something LINQ-like, so that's a decent amount of work. But I appreciate the consideration for the future, and maybe the metadata filter could be simple at first but set up for expansion later. Pinecone's implementation has very limited metadata filtering, but it looks super useful.)
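The pre-filter versus post-filter distinction can be sketched as follows. Everything here is hypothetical: `NativeFilterStore`, `PlainStore`, `filtered_search`, and the `overfetch` parameter are illustrative names, and similarity scores are precomputed to keep the example short.

```python
from typing import Callable

Predicate = Callable[[dict], bool]


class NativeFilterStore:
    """Hypothetical provider whose backend filters before ranking."""
    supports_native_filter = True

    def __init__(self, records):
        self._records = records

    def nearest(self, limit, predicate=None):
        pool = [r for r in self._records
                if predicate is None or predicate(r["metadata"])]
        return sorted(pool, key=lambda r: r["score"], reverse=True)[:limit]


class PlainStore:
    """Hypothetical provider with no filtering support."""
    supports_native_filter = False

    def __init__(self, records):
        self._records = records

    def nearest(self, limit):
        return sorted(self._records, key=lambda r: r["score"], reverse=True)[:limit]


def filtered_search(store, limit, predicate: Predicate, overfetch=4):
    if store.supports_native_filter:
        # Pre-filter: the datastore trims the candidates itself.
        return store.nearest(limit, predicate)
    # Post-filter fallback: fetch extra candidates, then drop non-matching ones.
    candidates = store.nearest(limit * overfetch)
    return [r for r in candidates if predicate(r["metadata"])][:limit]


records = [
    {"id": "a1", "score": 0.9, "metadata": {"tenant": "a"}},
    {"id": "a2", "score": 0.8, "metadata": {"tenant": "a"}},
    {"id": "a3", "score": 0.7, "metadata": {"tenant": "a"}},
    {"id": "b1", "score": 0.6, "metadata": {"tenant": "b"}},
    {"id": "b2", "score": 0.5, "metadata": {"tenant": "b"}},
]


def only_b(metadata):
    return metadata["tenant"] == "b"


native = filtered_search(NativeFilterStore(records), 2, only_b)
fallback = filtered_search(PlainStore(records), 2, only_b)
starved = filtered_search(PlainStore(records), 2, only_b, overfetch=1)
```

With `overfetch=1` the fallback returns nothing, because the two nearest records belong to the other tenant. That is the cost of post-filtering: the caller has to over-fetch and may still come up short, which native pre-filtering avoids.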

lukasz-appstream Jun 13, 2023

@dluc For a client, we're currently extending a real estate marketplace with semantic search capabilities. Each record has both useful structured data and a long unstructured description. I guess it's not uncommon for a record to have a mix of both worlds, and I would say it's quite useful to be able to first filter memories by structured fields, metadata, or payload (whatever we call it) and then use vector search to get the best-matching results.

We use the Qdrant database, and this week we'll be creating a PR that adds a new interface, IMemoryStore<TFilter>, along with everything needed for filterable memory. Hopefully, we can get your feedback there and, after some discussion, make it part of Semantic Kernel.
Of course, it would still require some work to add the same filtering capabilities to other memory stores, but it would at least provide a unified way to do it.
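As a rough illustration of what a store interface parameterized by a filter type might look like, here is a Python analogue. The PR's actual design is not shown in this thread, so every name below (`FilterableMemoryStore`, `ListingFilter`, `InMemoryListingStore`, `get_nearest_matches`, `where`) is a hypothetical stand-in, and ranking uses a plain dot product for brevity.

```python
from dataclasses import dataclass
from typing import Optional, Protocol, TypeVar

TFilter = TypeVar("TFilter", contravariant=True)


class FilterableMemoryStore(Protocol[TFilter]):
    """Hypothetical analogue of a store interface generic over its filter type."""

    def get_nearest_matches(self, collection: str, embedding: list[float],
                            limit: int, where: Optional[TFilter] = None) -> list[dict]:
        ...


@dataclass(frozen=True)
class ListingFilter:
    """Structured conditions for the real estate example (illustrative only)."""
    city: Optional[str] = None
    max_price: Optional[int] = None

    def accepts(self, payload: dict) -> bool:
        if self.city is not None and payload.get("city") != self.city:
            return False
        # Payloads without a price are treated as price 0 in this sketch.
        if self.max_price is not None and payload.get("price", 0) > self.max_price:
            return False
        return True


class InMemoryListingStore:
    def __init__(self):
        self._collections: dict[str, list[tuple[list[float], dict]]] = {}

    def upsert(self, collection, embedding, payload):
        self._collections.setdefault(collection, []).append((embedding, payload))

    def get_nearest_matches(self, collection, embedding, limit, where=None):
        rows = self._collections.get(collection, [])
        if where is not None:
            # Structured filter first, vector ranking second.
            rows = [row for row in rows if where.accepts(row[1])]
        ranked = sorted(
            rows,
            key=lambda row: sum(a * b for a, b in zip(row[0], embedding)),
            reverse=True,
        )
        return [payload for _, payload in ranked[:limit]]


store: FilterableMemoryStore[ListingFilter] = InMemoryListingStore()
store.upsert("listings", [0.9, 0.1], {"id": "A", "city": "Gdansk", "price": 300000})
store.upsert("listings", [0.8, 0.2], {"id": "B", "city": "Warsaw", "price": 250000})
store.upsert("listings", [0.2, 0.9], {"id": "C", "city": "Gdansk", "price": 500000})

filtered = store.get_nearest_matches(
    "listings", [1.0, 0.0], limit=2,
    where=ListingFilter(city="Gdansk", max_price=400000),
)
unfiltered = store.get_nearest_matches("listings", [1.0, 0.0], limit=2)
```

Making the filter a type parameter lets each backend define its own filter shape while callers still program against one interface.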
