Batch Retrieval with FAISS IndexIDMap(IndexFlatIP) Returns Incorrect IDs and Zero Similarities #4028

howru0321 · 2024-11-15T06:25:52Z

Summary

I am experiencing an issue with FAISS where batch retrieval of multiple embeddings using IndexIDMap(IndexFlatIP) behaves incorrectly. Specifically, while single-vector retrieval works flawlessly, retrieving multiple vectors simultaneously results in all queries returning the same ID with similarity scores converging to zero as the batch size increases.

OS: Ubuntu 20.04.6 LTS (Focal Fossa)
Faiss version: 1.7.2
Installed from: pip (faiss-gpu package)
Faiss compilation options: GPU enabled, running on Python 3.8.10

Running on:

CPU
GPU

Interface:

C++
Python

Reproduction instructions

Setup FAISS Index:

import faiss
import numpy as np

def build_faiss_index(goal_embeddings, index_to_id, use_gpu=True, res=None):
    # Normalize embeddings
    faiss.normalize_L2(goal_embeddings)
    print(f"Goal embeddings shape: {goal_embeddings.shape}")
    
    d = goal_embeddings.shape[1]  # Dimension of embeddings

    # Create IndexIDMap with Inner Product
    index = faiss.IndexIDMap(faiss.IndexFlatIP(d))

    if use_gpu:
        if not res:
            print("GPU resources not provided.")
            return
        index = faiss.index_cpu_to_gpu(res, 0, index)

    # Add embeddings with corresponding IDs
    print("Adding goal embeddings to index...")
    index.add_with_ids(goal_embeddings, np.array(index_to_id, dtype=np.int64))
    print(f"FAISS index generated with {index.ntotal} entries.")
    return index

Retrieve IDs for Single and Multiple Embeddings:

def faiss_predict_ids(result_embeddings_np, index, top_k=1):
    faiss.normalize_L2(result_embeddings_np)
    
    predicted_ids = []
    for each_result in result_embeddings_np:
        each_result = each_result.reshape(1, -1)
        similarity, predicted_id = index.search(each_result, top_k)
        predictions = predicted_id.flatten().tolist()
        prediction = predictions[0]
        print(f"predicted_id: {prediction}")
        print(f"similarity: {similarity}")
        predicted_ids.append(prediction)
    
    # similarities, predicted_ids = index.search(result_embeddings_np, top_k)
    # print(f"predicted_ids: {predicted_ids}")
    # print(f"similarities: {similarities}")

    if top_k == 1:
        return predicted_ids
    #     predictions = predicted_ids.flatten().tolist()
    #     return predictions
    # else:##fix
    #     predictions = predicted_ids.flatten().tolist()
    #     return predictions

Observation:

Single Retrieval:

Correct ID is returned with high similarity (~0.99).

...
predicted_id: 1536
similarity: [[0.9985621]]
predicted_id: 1538
similarity: [[0.99936527]]
predicted_id: 1538
similarity: [[0.9997897]]
predicted_id: 1538
similarity: [[0.994013]]
predicted_id: 1538
similarity: [[0.8375373]]
predicted_id: 1537
...

Batch Retrieval (e.g., 50 embeddings):

All queries return the same ID.
Similarity scores approach zero.

def faiss_predict_ids(result_embeddings_np, index, top_k=1):
    faiss.normalize_L2(result_embeddings_np)
    
    # predicted_ids = []
    # for each_result in result_embeddings_np:
    #     each_result = each_result.reshape(1, -1)
    #     similarity, predicted_id = index.search(each_result, top_k)
    #     predictions = predicted_id.flatten().tolist()
    #     prediction = predictions[0]
    #     print(f"predicted_id: {prediction}")
    #     print(f"similarity: {similarity}")
    #     predicted_ids.append(prediction)
    
    similarities, predicted_ids = index.search(result_embeddings_np, top_k)
    print(f"predicted_ids: {predicted_ids}")
    print(f"similarities: {similarities}")

    if top_k == 1:
        # return predicted_ids
        predictions = predicted_ids.flatten().tolist()
        return predictions
    # else:##fix
    #     predictions = predicted_ids.flatten().tolist()
    #     return predictions

predicted_ids: [[33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
...
similarities: [[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
...

Expected Behavior:

When performing batch retrieval with multiple embeddings, each query should independently return the most similar ID with high similarity scores, similar to single-vector retrieval.
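To make the expectation concrete, here is a minimal NumPy-only sketch of what a flat inner-product search should return for L2-normalized vectors. It is a brute-force reference, not FAISS's implementation. The helper name `reference_top1` and the synthetic data (sizes, ID range, noise level) are assumptions made for illustration only.

```python
import numpy as np

def reference_top1(queries, goals, ids):
    """Brute-force top-1 inner-product search in plain NumPy."""
    # Normalize copies so the inner product equals cosine similarity.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = goals / np.linalg.norm(goals, axis=1, keepdims=True)
    scores = q @ g.T                  # shape: (n_queries, n_goals)
    best = scores.argmax(axis=1)      # best goal row for each query
    return ids[best], scores[np.arange(len(q)), best]

# Synthetic stand-in data (not the embeddings from this report).
rng = np.random.default_rng(0)
goals = rng.standard_normal((200, 64)).astype(np.float32)
ids = np.arange(1000, 1200, dtype=np.int64)
# Queries are slightly perturbed copies of the first 50 goal vectors.
queries = goals[:50] + 0.01 * rng.standard_normal((50, 64)).astype(np.float32)

# One batched call versus one call per query.
batch_ids, batch_sims = reference_top1(queries, goals, ids)
single_ids = np.array(
    [reference_top1(q[None, :], goals, ids)[0][0] for q in queries]
)
```

In this reference, the batched call and the per-query calls agree, and each similarity stays close to 1. The same relationship would be expected from `index.search` on a flat inner-product index.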

Actual Behavior:

Single Retrieval: Functions correctly, returning accurate IDs with high similarity scores.
Batch Retrieval: As the number of queries increases, the results become erratic and one particular ID appears more and more often. Once the batch exceeds 50 queries, every query returns the same ID, 31. Similarity scores fall toward zero as the number of queries increases.
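A small harness can pin down which rows diverge between the two call patterns. The sketch below assumes only that the index object exposes `search(x, k)` returning `(similarities, ids)`, as used in the functions above. `find_batch_mismatches` and the `BruteForceIP` stand-in are hypothetical names introduced here so that the example runs without FAISS installed.

```python
import numpy as np

def find_batch_mismatches(index, queries, top_k=1):
    """Return row numbers where batched and one-at-a-time search disagree on IDs."""
    _, batch_ids = index.search(queries, top_k)
    mismatches = []
    for row, query in enumerate(queries):
        _, single_ids = index.search(query.reshape(1, -1), top_k)
        if not np.array_equal(single_ids[0], batch_ids[row]):
            mismatches.append(row)
    return mismatches

class BruteForceIP:
    """Hypothetical stand-in with the same search(x, k) shape as a FAISS index."""

    def __init__(self, vectors, ids):
        self.vectors = vectors
        self.ids = ids

    def search(self, x, k):
        scores = x @ self.vectors.T
        order = np.argsort(-scores, axis=1)[:, :k]   # highest scores first
        return np.take_along_axis(scores, order, axis=1), self.ids[order]

# Synthetic, L2-normalized vectors used as both the index contents and the queries.
rng = np.random.default_rng(0)
vectors = rng.standard_normal((100, 32)).astype(np.float32)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)

index = BruteForceIP(vectors, np.arange(500, 600, dtype=np.int64))
mismatches = find_batch_mismatches(index, vectors[:20])
```

Passing the real index and the real query array instead of the stand-in would list the rows affected by the behavior described here. With the stand-in, the list is empty.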

Additional Information:

Index Configuration

Using IndexIDMap with IndexFlatIP for inner product similarity.

Normalization

Both goal embeddings and query embeddings are L2-normalized using faiss.normalize_L2.
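Since `faiss.normalize_L2` modifies its argument in place and FAISS's Python bindings operate on float32 arrays, it may be worth ruling out input problems before the search call. Nothing in this report confirms that the inputs are at fault. The following is only a diagnostic sketch in plain NumPy, and `check_embeddings` is a hypothetical helper, not part of FAISS.

```python
import numpy as np

def check_embeddings(x, name="embeddings", atol=1e-3):
    """Return a list of problems found in a 2-D embedding array (empty if none)."""
    if x.ndim != 2:
        return [f"{name}: expected a 2-D array, got ndim={x.ndim}"]
    problems = []
    if x.dtype != np.float32:
        problems.append(f"{name}: dtype is {x.dtype}, expected float32")
    if not x.flags["C_CONTIGUOUS"]:
        problems.append(f"{name}: array is not C-contiguous")
    if not np.isfinite(x).all():
        problems.append(f"{name}: contains NaN or infinite values")
    # Compute norms in float64 so the check itself does not add rounding noise.
    norms = np.linalg.norm(x.astype(np.float64), axis=1)
    if not np.allclose(norms, 1.0, atol=atol):
        problems.append(
            f"{name}: row norms range from {norms.min():.4f} to {norms.max():.4f}"
        )
    return problems

# Synthetic examples: a well-formed array and a deliberately broken one.
rng = np.random.default_rng(0)
good = rng.standard_normal((8, 16)).astype(np.float32)
good /= np.linalg.norm(good, axis=1, keepdims=True)
bad = np.zeros((8, 16), dtype=np.float64)   # wrong dtype and zero-norm rows

good_report = check_embeddings(good, "good")
bad_report = check_embeddings(bad, "bad")
```

Running the check on both the goal embeddings and the full query batch, after normalization and immediately before `index.search`, would show whether the batch array differs from the single rows in dtype, layout, or norm.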

Data Characteristics:

Embedding dimension: 3072
Number of goal embeddings in index: ~2000
Each query embedding is a 1x3072 vector.


asadoughi · 2024-11-20T19:39:14Z

> Installed from: pip (faiss-gpu package)

Please try again after installing the faiss-gpu package from conda, following the directions here. The faiss-gpu PyPI package is not supported by this repository.

github-actions · 2024-11-28T02:05:38Z

This issue is stale because it has been open for 7 days with no activity.

github-actions · 2024-12-05T02:07:23Z

This issue was closed because it has been inactive for 7 days since being marked as stale.

junjieqi assigned junjieqi and unassigned junjieqi Nov 20, 2024

junjieqi added Implementation unconfirmed-bug labels Nov 20, 2024

asadoughi added the autoclose label Nov 20, 2024

github-actions bot added the stale label Nov 28, 2024

github-actions bot closed this as completed Dec 5, 2024
