Trouble Accessing Document ID in WeaviateHybridSearchRetriever Results #13238

Closed
3 of 14 tasks
blockfer-rp opened this issue Nov 11, 2023 · 5 comments · Fixed by langchain-ai/langchain-weaviate#87
Labels: 🤖:bug, Ɑ: vector store

@blockfer-rp

System Info

langchain: 0.0.334
python: 3.11.6
weaviate-client: 3.25.3

Who can help?

No response

Information

  • The official example notebooks/scripts
  • My own modified scripts

Related Components

  • LLMs/Chat Models
  • Embedding Models
  • Prompts / Prompt Templates / Prompt Selectors
  • Output Parsers
  • Document Loaders
  • Vector Stores / Retrievers
  • Memory
  • Agents / Agent Executors
  • Tools / Toolkits
  • Chains
  • Callbacks/Tracing
  • Async

Reproduction

I am trying to implement the WeaviateHybridSearchRetriever to retrieve documents from Weaviate. My schema indicates the document ID is stored in the _id field based on the shardingConfig.

When setting up the retriever, I included _id in the attributes list:

hybrid_retriever = WeaviateHybridSearchRetriever(
  attributes=["_id", "aliases", "categoryid", "name", "page_content", "ticker"]
)

However, when I try to access _id on the returned Document objects, I get an error that _id is not found.

For example:

results = hybrid_retriever.get_relevant_documents(query="some query")
print(results[0]._id)  # Error: _id not found

I have tried variations like id, document_id instead of _id but still cannot seem to access the document ID field.

Any suggestions on what I am missing or doing wrong when trying to retrieve the document ID from the Weaviate results using the _id field specified in the schema?

Let me know if any other details would be helpful in troubleshooting this issue!

Schema Details

{
   "classes":[
      {
         "class":"Category_taxonomy",
         "invertedIndexConfig":{
            "bm25":{
               "b":0.75,
               "k1":1.2
            },
            "cleanupIntervalSeconds":60,
            "stopwords":{
               "additions":"None",
               "preset":"en",
               "removals":"None"
            }
         },
         "moduleConfig":{
            "text2vec-openai":{
               "baseURL":"https://api.openai.com",
               "model":"ada",
               "modelVersion":"002",
               "type":"text",
               "vectorizeClassName":true
            }
         },
         "multiTenancyConfig":{
            "enabled":false
         },
         "properties":[
            {
               "dataType":[
                  "text"
               ],
               "description":"Content of the page",
               "indexFilterable":true,
               "indexSearchable":true,
               "moduleConfig":{
                  "text2vec-openai":{
                     "skip":false,
                     "vectorizePropertyName":false
                  }
               },
               "name":"page_content",
               "tokenization":"word"
            },
            {
               "dataType":[
                  "number"
               ],
               "description":"Identifier for the category",
               "indexFilterable":true,
               "indexSearchable":false,
               "moduleConfig":{
                  "text2vec-openai":{
                     "skip":false,
                     "vectorizePropertyName":false
                  }
               },
               "name":"categoryid"
            },
            {
               "dataType":[
                  "text"
               ],
               "description":"Ticker symbol",
               "indexFilterable":true,
               "indexSearchable":true,
               "moduleConfig":{
                  "text2vec-openai":{
                     "skip":false,
                     "vectorizePropertyName":false
                  }
               },
               "name":"ticker",
               "tokenization":"word"
            },
            {
               "dataType":[
                  "text"
               ],
               "description":"Name of the entity",
               "indexFilterable":true,
               "indexSearchable":true,
               "moduleConfig":{
                  "text2vec-openai":{
                     "skip":false,
                     "vectorizePropertyName":false
                  }
               },
               "name":"name",
               "tokenization":"word"
            },
            {
               "dataType":[
                  "text"
               ],
               "description":"Aliases for the entity",
               "indexFilterable":true,
               "indexSearchable":true,
               "moduleConfig":{
                  "text2vec-openai":{
                     "skip":false,
                     "vectorizePropertyName":false
                  }
               },
               "name":"aliases",
               "tokenization":"word"
            }
         ],
         "replicationConfig":{
            "factor":1
         },
         "shardingConfig":{
            "virtualPerPhysical":128,
            "desiredCount":1,
            "actualCount":1,
            "desiredVirtualCount":128,
            "actualVirtualCount":128,
            "key":"_id",
            "strategy":"hash",
            "function":"murmur3"
         },
         "vectorIndexConfig":{
            "skip":false,
            "cleanupIntervalSeconds":300,
            "maxConnections":64,
            "efConstruction":128,
            "ef":-1,
            "dynamicEfMin":100,
            "dynamicEfMax":500,
            "dynamicEfFactor":8,
            "vectorCacheMaxObjects":1000000000000,
            "flatSearchCutoff":40000,
            "distance":"cosine",
            "pq":{
               "enabled":false,
               "bitCompression":false,
               "segments":0,
               "centroids":256,
               "trainingLimit":100000,
               "encoder":{
                  "type":"kmeans",
                  "distribution":"log-normal"
               }
            }
         },
         "vectorIndexType":"hnsw",
         "vectorizer":"text2vec-openai"
      }
   ]
}

Example Document

{
    "class": "Category_taxonomy",
    "creationTimeUnix": 1699553747601,
    "id": "ad092eb1-e4a6-4d93-a7d2-c507c33c3837",
    "lastUpdateTimeUnix": 1699553747601,
    "properties": {
        "aliases": "Binance Coin, Binance Smart Chain",
        "categoryid": 569,
        "name": "BNB",
        "page_content": "ticker: bnb\nname: BNB\naliases: Binance Coin, Binance Smart Chain",
        "ticker": "bnb"
    },
    "vectorWeights": null
}
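
A small illustration of the example document above: the object's UUID sits in the top-level `id` field, not inside `properties`, which would explain why it cannot be requested like a regular attribute. This is only a sketch that parses an abridged copy of that JSON; it does not talk to Weaviate.

```python
import json

# Abridged copy of the example document shown above.
raw = """
{
    "class": "Category_taxonomy",
    "id": "ad092eb1-e4a6-4d93-a7d2-c507c33c3837",
    "properties": {
        "categoryid": 569,
        "name": "BNB",
        "ticker": "bnb"
    }
}
"""

obj = json.loads(raw)

# The UUID is a top-level field of the stored object...
object_id = obj["id"]
# ...and is not one of the user-defined properties.
has_id_property = any(key in obj["properties"] for key in ("id", "_id"))

print(object_id)
print(has_id_property)
```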

Example Search Result

{
   "status":"success",
   "results":[
      {
         "page_content":"ticker: bnb\nname: BNB\naliases: Binance Coin, Binance Smart Chain",
         "metadata":{
            "_additional":{
               "explainScore":"(vector) [-0.0067740963 -0.03091735 0.00511335 0.0016186031 -0.016120477 0.017543973 -0.0072548385 -0.023063144 0.015246399 -0.0020884196]...  \n(hybrid) Document ad092eb1-e4a6-4d93-a7d2-c507c33c3837 contributed 0.00819672131147541 to the score",
               "score":"0.008196721"
            },
            "aliases":"Binance Coin, Binance Smart Chain",
            "categoryid":569,
            "name":"BNB",
            "ticker":"bnb"
         },
         "type":"Document"
      }
   ]
}
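
Note that the search result above already carries the UUID inside `explainScore`. As a stopgap, it could be pulled out with a regular expression. This is a fragile sketch: the helper name `id_from_explain_score` is hypothetical, and the `Document <uuid> contributed` wording is assumed only from the single sample above, not from any documented format.

```python
import re
from typing import Optional

# Pattern assumed from the one explainScore sample in this issue.
UUID_RE = re.compile(
    r"Document ([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}) contributed"
)


def id_from_explain_score(explain_score: str) -> Optional[str]:
    """Return the first UUID mentioned in an explainScore string, if any."""
    match = UUID_RE.search(explain_score)
    return match.group(1) if match else None


sample = (
    "(vector) [-0.0067740963 -0.03091735]...  \n"
    "(hybrid) Document ad092eb1-e4a6-4d93-a7d2-c507c33c3837 "
    "contributed 0.00819672131147541 to the score"
)
print(id_from_explain_score(sample))
```

This only works when `score=True` is passed, since `explainScore` is otherwise absent.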

App Code

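# Imports assumed for langchain 0.0.334 / weaviate-client 3.x (not shown in the original post)
import logging
import os
from collections import defaultdict
from typing import Any, Dict, List, Optional

import weaviate
from fastapi import APIRouter, Depends, HTTPException, Request
from langchain.embeddings import OpenAIEmbeddings
from langchain.retrievers import WeaviateHybridSearchRetriever
from langchain.vectorstores import Weaviate

router = APIRouter()

# Prepare global variables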
WEAVIATE_URL = os.getenv('WEAVIATE_URL')
WEAVIATE_API_KEY = os.getenv('WEAVIATE_API_KEY')
OPENAI_API_KEY = os.getenv('OPENAI_API_KEY')
INDEX_NAME = "Category_taxonomy"
TEXT_KEY = "page_content"

# Dependency provider function for Weaviate client
def get_weaviate_vectorstore():
    # Initialize the Weaviate client with API key authentication
    client = weaviate.Client(
        url=WEAVIATE_URL, 
        auth_client_secret=weaviate.AuthApiKey(WEAVIATE_API_KEY),
        additional_headers={
            "X-Openai-Api-Key": OPENAI_API_KEY,
        }
    )

    # Initialize embeddings with a specified model
    embeddings = OpenAIEmbeddings(openai_api_key=OPENAI_API_KEY, model='text-embedding-ada-002')

    # Initialize vector store with attributes and schema
    vectorstore = Weaviate(
        client=client,
        index_name=INDEX_NAME,
        text_key=TEXT_KEY,
        embedding=embeddings,
        attributes=["aliases", "categoryid", "name", "page_content", "ticker"],
        by_text=False
    )
    return client, vectorstore

def get_weaviate_hybrid_retriever(k: int = 5):
    # Directly call the function to get the client and vectorstore
    client, vectorstore = get_weaviate_vectorstore()
    
    # Instantiate the retriever with the settings from the vectorstore
    hybrid_retriever = WeaviateHybridSearchRetriever(
        client=client,
        index_name=INDEX_NAME,
        text_key=TEXT_KEY,
        attributes=["aliases", "categoryid", "name", "page_content", "ticker"],
        k=k,
        create_schema_if_missing=True
    )
    return hybrid_retriever

async def parse_query_params(request: Request) -> Dict[str, List[Any]]:
    parsed_values = defaultdict(list)

    for key, value in request.query_params.multi_items():
        # Append the value for any key directly
        parsed_values[key].append(value)

    return parsed_values

@router.get("/hybrid_search_category_taxonomy/")
async def hybrid_search_category_taxonomy(parsed_values: Dict[str, List[Any]] = Depends(parse_query_params), query: Optional[str] = None, k: int = 5):
    categoryids = parsed_values.get('categoryid', [])
    tickers = parsed_values.get('ticker', [])
    names = parsed_values.get('name', [])
    aliasess = parsed_values.get('aliases', [])

    # Build the retriever with the requested 'k'
    retriever = get_weaviate_hybrid_retriever(k=k)
    
    # Log the incoming query parameters
    logging.info(
        f"query: {query}, "
        f"categoryID: {categoryids}, "
        f"ticker: {tickers}, "
        f"name: {names}, "
        f"aliases: {aliasess}, "
        f"k: {k}"
    )

    # Start an 'And' filter only if at least one filter value was provided
    where_filter = {"operator": "And", "operands": []} if any([categoryids, tickers, names, aliasess]) else None
    
    # Add an 'Or' group per field, matching each value with the 'Equal' operator.
    # Query-string values arrive as str, so convert categoryid for 'valueNumber'.
    if categoryids:
        category_operands = [{"path": ["categoryid"], "operator": "Equal", "valueNumber": float(cid)} for cid in categoryids]
        if category_operands:
            where_filter["operands"].append({"operator": "Or", "operands": category_operands})
        
    if tickers:
        ticker_operands = [{"path": ["ticker"], "operator": "Equal", "valueText": ticker} for ticker in tickers]
        if ticker_operands:
            where_filter["operands"].append({"operator": "Or", "operands": ticker_operands})
    
    if names:
        name_operands = [{"path": ["name"], "operator": "Equal", "valueText": name} for name in names]
        if name_operands:
            where_filter["operands"].append({"operator": "Or", "operands": name_operands})
    
    if aliasess:
        aliases_operands = [{"path": ["aliases"], "operator": "Equal", "valueText": aliases} for aliases in aliasess]
        if aliases_operands:
            where_filter["operands"].append({"operator": "Or", "operands": aliases_operands})
    
    try:
        # Fall back to a single space when no query text is given
        effective_query = " " if not query or not query.strip() else query

        # Log the where_filter before fetching documents
        logging.info(f"where_filter being used: {where_filter}")
        
        # Fetch the relevant documents using the hybrid retriever instance
        results = retriever.get_relevant_documents(effective_query, where_filter=where_filter, score=True)

        # Format the results for the response
        response_data = [vars(doc) for doc in results]
        
        return {"status": "success", "results": response_data}
    except Exception as e:
        logging.error(f"Error while processing request: {str(e)}", exc_info=True)
        raise HTTPException(detail=str(e), status_code=500)
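
The four near-identical filter branches above could be folded into one helper. The following is a sketch under assumptions: `build_where_filter` is a hypothetical name, and the `float()` conversion assumes `categoryid` values arrive as query-string text. It produces the same `And` of `Or` groups as the endpoint code.

```python
from typing import Any, Dict, List, Optional


def build_where_filter(
    categoryids: List[Any],
    tickers: List[str],
    names: List[str],
    aliases: List[str],
) -> Optional[Dict[str, Any]]:
    """Build an 'And' of per-field 'Or' groups, or None if nothing was given."""
    specs = [
        ("categoryid", "valueNumber", [float(c) for c in categoryids]),
        ("ticker", "valueText", tickers),
        ("name", "valueText", names),
        ("aliases", "valueText", aliases),
    ]
    operands = []
    for path, value_key, values in specs:
        if values:
            operands.append({
                "operator": "Or",
                "operands": [
                    {"path": [path], "operator": "Equal", value_key: value}
                    for value in values
                ],
            })
    return {"operator": "And", "operands": operands} if operands else None


print(build_where_filter(["569"], ["bnb"], [], []))
```

Keeping the filter construction in a pure function also makes it testable without a running Weaviate instance.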

Expected behavior

When using the WeaviateHybridSearchRetriever for document retrieval, I expect that including the _id attribute in the attributes list will allow me to access the document ID of each retrieved document without any issues. Specifically, after setting up the WeaviateHybridSearchRetriever like so:

hybrid_retriever = WeaviateHybridSearchRetriever(
    attributes=["_id", "aliases", "categoryid", "name", "page_content", "ticker"]
)

I anticipate that executing a query and attempting to print the _id of the first result should successfully return the unique identifier of the document, as per the below code snippet:

results = hybrid_retriever.get_relevant_documents(query="some query")
print(results[0]._id)  # Expecting this to print the _id of the first result

In this scenario, my expectation is that the _id field, being specified in the attributes parameter, should be readily accessible in each Document object returned by the get_relevant_documents method. This behavior is crucial for my application as it relies on the unique document IDs for further processing and analysis of the retrieved data.
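
For context, the example search result above suggests that requested attributes are placed in the `metadata` dict of each returned `Document`, not exposed as Python attributes on it. The sketch below uses a stand-in dataclass (not LangChain's real `Document` class) populated from that example, to show which access pattern would be expected to work.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Document:
    """Stand-in for LangChain's Document: page_content plus a metadata dict."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# Values copied from the example search result in this issue.
doc = Document(
    page_content="ticker: bnb\nname: BNB\naliases: Binance Coin, Binance Smart Chain",
    metadata={
        "aliases": "Binance Coin, Binance Smart Chain",
        "categoryid": 569,
        "name": "BNB",
        "ticker": "bnb",
    },
)

ticker = doc.metadata["ticker"]      # attributes are read from metadata
has_id_attr = hasattr(doc, "_id")    # there is no `_id` attribute on the object

print(ticker, has_id_attr)
```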

Copy link

dosubot bot commented Nov 11, 2023

🤖

Hi @blockfer-rp,

Great to see you again! I appreciate your detailed issue report. I'm currently looking into it and will get back to you with a thorough response shortly.


dosubot bot commented Nov 11, 2023

🤖

The issue you're experiencing seems to be related to the _id attribute not being included in the results returned by the WeaviateHybridSearchRetriever. This is because the _id attribute is not being added to the attributes list when the query.get method is called on the Weaviate client.

In the WeaviateHybridSearchRetriever class, the query.get method is called with self.index_name and self.attributes as arguments. The self.attributes list should contain the names of the attributes that you want to be included in the results. However, the _id attribute is not being added to this list, so it is not included in the results.

To resolve this issue, you need to add the _id attribute to the attributes list. You can do this by modifying the validate_client method to append _id to values["attributes"]:

@root_validator(pre=True)
def validate_client(
    cls,
    values: Dict[str, Any],
) -> Dict[str, Any]:
    # existing code...

    # add _id to the attributes list
    cast(List, values["attributes"]).extend([values["text_key"], "_id"])

    # existing code...

This modification will ensure that the _id attribute is included in the results returned by the WeaviateHybridSearchRetriever.

You can find more details about the WeaviateHybridSearchRetriever class in the LangChain codebase.

Please try this solution and let me know if it resolves your issue or if you need further assistance.


@blockfer-rp
Author

I appreciate your earlier guidance on including the _id attribute in the WeaviateHybridSearchRetriever. Following your suggestion, I implemented a subclass, CustomWeaviateHybridSearchRetriever, to append _id to the attributes list. However, I encountered an error indicating that _id is not a recognized attribute in the query.

Error

raise ValueError(f"Error during query: {result['errors']}")

ValueError: Error during query: [{'locations': [{'column': 184, 'line': 1}], 'message': 'Cannot query field "_id" on type "Category_taxonomy".', 'path': None}]

I've attached the updated implementation of my subclass for your reference. Could you please advise on how to resolve this issue or suggest an alternative approach to include the _id attribute in the search results?

class CustomWeaviateHybridSearchRetriever(WeaviateHybridSearchRetriever):
    def __init__(self, **data: Any):
        super().__init__(**data)
        self.custom_validate_client()

    def custom_validate_client(self):
        if "_id" not in self.attributes:
            self.attributes.append("_id")

def get_weaviate_hybrid_retriever(k: int = 5):
    # Directly call the function to get the client and vectorstore
    client, vectorstore = get_weaviate_vectorstore()
    
    # Instantiate the retriever with the settings from the vectorstore
    hybrid_retriever = CustomWeaviateHybridSearchRetriever(
        client=client,
        index_name=INDEX_NAME,
        text_key=TEXT_KEY,
        attributes=["_id", "aliases", "categoryid", "name", "page_content", "ticker"],
        k=k,
        create_schema_if_missing=True
    )
    return hybrid_retriever

I'm quite surprised that there isn't a more straightforward way to retrieve the Weaviate document_id from search results, akin to how the document type is always returned. I'm wondering if I might be overlooking a simpler solution or an essential step in the process.
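
The GraphQL error above is consistent with `_id` not being one of the properties declared for the class: in the schema posted earlier, `_id` appears only as the `shardingConfig` key. A quick pre-flight check could catch such names before the query is sent. This is a sketch: `unknown_attributes` is a hypothetical helper, and the schema dict is an abridged copy of the one in this issue.

```python
from typing import Any, Dict, List


def unknown_attributes(
    schema: Dict[str, Any], class_name: str, attributes: List[str]
) -> List[str]:
    """Return the requested attributes that are not declared properties of the class."""
    for cls in schema.get("classes", []):
        if cls.get("class") == class_name:
            known = {prop["name"] for prop in cls.get("properties", [])}
            return [attr for attr in attributes if attr not in known]
    raise KeyError(f"class {class_name!r} not found in schema")


# Abridged copy of the schema from this issue (property names only).
schema = {
    "classes": [
        {
            "class": "Category_taxonomy",
            "properties": [
                {"name": "page_content"},
                {"name": "categoryid"},
                {"name": "ticker"},
                {"name": "name"},
                {"name": "aliases"},
            ],
            "shardingConfig": {"key": "_id"},
        }
    ]
}

requested = ["_id", "aliases", "categoryid", "name", "page_content", "ticker"]
print(unknown_attributes(schema, "Category_taxonomy", requested))
```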

@blockfer-rp
Author

I successfully implemented a solution for retrieving the document ID from Weaviate search results by creating a subclass of WeaviateHybridSearchRetriever. This subclass overrides the _get_relevant_documents method.

The key modification in this method is the inclusion of "id" in the query_obj.with_additional call whenever the score parameter is set to True.

class ModifiedWeaviateHybridSearchRetriever(WeaviateHybridSearchRetriever):
    def _get_relevant_documents(
        self, query: str, *, run_manager: CallbackManagerForRetrieverRun,
        where_filter: Optional[Dict[str, object]] = None, 
        score: bool = False,
        hybrid_search_kwargs: Optional[Dict[str, object]] = None
    ) -> List[Document]:
        ...
        if score:
            query_obj = query_obj.with_additional(["score", "explainScore", "id"])
        ...

While this implementation works as intended and solves the problem at hand, I am still exploring if there's a more straightforward or less invasive method to achieve this. Overriding an entire class to modify a single method seems somewhat excessive for what appears to be a relatively simple requirement.

Ideally, a more direct way to include the document ID in the search results, without the need for subclassing, would be preferable. I am open to suggestions and hoping for an easier solution that aligns with best practices and maintains the integrity and simplicity of the code.
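
With `"id"` added to `with_additional` as described above, the UUID would presumably appear next to `score` and `explainScore` under `metadata["_additional"]`, mirroring the example search result. A small accessor keeps that lookup in one place. This is a sketch: `weaviate_id` is a hypothetical helper and the metadata layout is an assumption based on that example.

```python
from typing import Any, Mapping, Optional


def weaviate_id(metadata: Mapping[str, Any]) -> Optional[str]:
    """Return metadata['_additional']['id'] if present, else None."""
    additional = metadata.get("_additional") or {}
    return additional.get("id")


# Assumed shape of a result's metadata once "id" is requested.
with_id = {
    "_additional": {
        "score": "0.008196721",
        "id": "ad092eb1-e4a6-4d93-a7d2-c507c33c3837",
    },
    "ticker": "bnb",
}
without_id = {"ticker": "bnb"}

print(weaviate_id(with_id))
print(weaviate_id(without_id))
```

Returning `None` when the key is missing avoids a `KeyError` for results fetched with `score=False`, where the override does not request `_additional` fields.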

@hsm207
Contributor

hsm207 commented Feb 7, 2024

@blockfer-rp fyi, this is solved in the new integration. See langchain-ai/langchain-weaviate#87

dosubot bot marked this issue as stale on May 8, 2024 and closed it as not planned on May 15, 2024.