Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Array inclusion filter operation #143

Open
sjjpo2002 opened this issue Dec 4, 2024 · 0 comments
Open

Array inclusion filter operation #143

sjjpo2002 opened this issue Dec 4, 2024 · 0 comments

Comments

@sjjpo2002
Copy link

I want to be able to filter vector search results by records that include a specific item in an array type field. The current filtering operators do not support doing the element-wise inclusion test I need. The supported operators are listed here:

Operator Meaning/Category
$eq Equality (==)
$ne Inequality (!=)
$lt Less than (<)
$lte Less than or equal (<=)
$gt Greater than (>)
$gte Greater than or equal (>=)
$in Special Cased (in)
$nin Special Cased (not in)
$between Special Cased (between)
$like Text (like)
$ilike Text (case-insensitive like)
$and Logical (and)
$or Logical (or)

None of the operators from that list does exactly the inclusion test for a list of text elements. The closest is the $like operator which only works if the field is text. It doesn't work for array type because it tries to do an exact match for the entire array rather than looking into the elements inside.

To reproduce this issue you can use this sample code.

First let's define the schema and push some sample data in:

import os
from langchain_openai import AzureOpenAIEmbeddings
from langchain_text_splitters import CharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_postgres.vectorstores import PGVector
from datetime import datetime, timezone

# Create a sample text file for testing
sample_text = """
This is a sample document to test our Postgres setup.
We'll use this to verify our vector search functionality with filtering.
The text can contain any information you want to search through later. 
"""

with open("sample_test.txt", "w") as f:
    f.write(sample_text)

# Load and process the document
loader = TextLoader("sample_test.txt")
documents = loader.load()
text_splitter = CharacterTextSplitter(chunk_size=1000, chunk_overlap=0)
docs = text_splitter.split_documents(documents)

# Initialize Azure OpenAI embeddings
embeddings = AzureOpenAIEmbeddings(
    azure_deployment="myAzureDeployment",  # Replace with your deployment name
    model="text-embedding-ada-002",  # This is typically the model name
    azure_endpoint="https://yourEndpoint.openai.azure.com",
    api_key="yourApiKey",
    chunk_size=1,
)

connection_string = (
    "postgresql+psycopg://postgres:postgres@your_url:5432/test"
)
collection_name = "test_filtering"

vectorstore = PGVector(
    embeddings=embeddings,
    collection_name=collection_name,
    connection=connection_string,
    use_jsonb=True,
)

def prepare_metadata(doc):

    return {
        "source": doc.metadata["source"],
        "category": ["A"],
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
    }

content_list = []
metadata_list = []

for doc in docs:
    content_list.append(doc.page_content)
    metadata_list.append(prepare_metadata(doc))

ids_added = vectorstore.add_texts(content_list, metadata_list)
print(f"Added {len(ids_added)} documents to the vector store.")

Now I use the following code to retrive data from the collection we just ingested:

filter = {"category": {"$like": ["A"]}}  # Exact match of the array instead of looking to the elements inside
retriever = vectorstore.as_retriever(search_kwargs={"filter": filter})
retrieved_docs = retriever .invoke("Opensearch", k=5)

print(retrieved_docs) # No results are returned

I have implemented another operator for array element-wise inclusion test which will fix this issue. But, I want to report this issue first for your review before creating a PR.

Thanks for reading.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant