Batch Retrieval with FAISS IndexIDMap(IndexFlatIP) Returns Incorrect IDs and Zero Similarities #4028

Closed
3 of 4 tasks
howru0321 opened this issue Nov 15, 2024 · 3 comments

@howru0321

Summary

Batch retrieval with IndexIDMap(IndexFlatIP) returns incorrect results. Searching one query vector at a time works as expected, but searching several vectors in a single call makes every query return the same ID, and the similarity scores approach zero as the batch size grows.

OS: Ubuntu 20.04.6 LTS (Focal Fossa)
Faiss version: 1.7.2
Installed from: pip (faiss-gpu package)
Faiss compilation options: GPU enabled, running on Python 3.8.10

Running on:

  • CPU
  • GPU

Interface:

  • C++
  • Python

Reproduction instructions

Setup FAISS Index:

import faiss
import numpy as np

def build_faiss_index(goal_embeddings, index_to_id, use_gpu=True, res=None):
    # Normalize embeddings
    faiss.normalize_L2(goal_embeddings)
    print(f"Goal embeddings shape: {goal_embeddings.shape}")
    
    d = goal_embeddings.shape[1]  # Dimension of embeddings

    # Create IndexIDMap with Inner Product
    index = faiss.IndexIDMap(faiss.IndexFlatIP(d))

    if use_gpu:
        if not res:
            print("GPU resources not provided.")
            return
        index = faiss.index_cpu_to_gpu(res, 0, index)

    # Add embeddings with corresponding IDs
    print("Adding goal embeddings to index...")
    index.add_with_ids(goal_embeddings, np.array(index_to_id, dtype=np.int64))
    print(f"FAISS index generated with {index.ntotal} entries.")
    return index

Retrieve IDs for Single and Multiple Embeddings:

def faiss_predict_ids(result_embeddings_np, index, top_k=1):
    # Normalize the queries in place so inner product equals cosine similarity
    faiss.normalize_L2(result_embeddings_np)

    # Search one query vector at a time
    predicted_ids = []
    for each_result in result_embeddings_np:
        each_result = each_result.reshape(1, -1)
        similarity, predicted_id = index.search(each_result, top_k)
        predictions = predicted_id.flatten().tolist()
        prediction = predictions[0]
        print(f"predicted_id: {prediction}")
        print(f"similarity: {similarity}")
        predicted_ids.append(prediction)

    # Only top_k == 1 is handled in this reproduction
    if top_k == 1:
        return predicted_ids

Observation:

Single Retrieval:

  • Correct ID is returned with high similarity (~0.99).
...
predicted_id: 1536
similarity: [[0.9985621]]
predicted_id: 1538
similarity: [[0.99936527]]
predicted_id: 1538
similarity: [[0.9997897]]
predicted_id: 1538
similarity: [[0.994013]]
predicted_id: 1538
similarity: [[0.8375373]]
predicted_id: 1537
...

Batch Retrieval (e.g., 50 embeddings):

  • All queries return the same ID.
  • Similarity scores approach zero.

def faiss_predict_ids(result_embeddings_np, index, top_k=1):
    # Normalize the queries in place so inner product equals cosine similarity
    faiss.normalize_L2(result_embeddings_np)

    # Search all query vectors in a single call
    similarities, predicted_ids = index.search(result_embeddings_np, top_k)
    print(f"predicted_ids: {predicted_ids}")
    print(f"similarities: {similarities}")

    # Only top_k == 1 is handled in this reproduction
    if top_k == 1:
        predictions = predicted_ids.flatten().tolist()
        return predictions

predicted_ids: [[33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
 [33]
...
similarities: [[0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
 [0.]
...

Expected Behavior:

  • When performing batch retrieval with multiple embeddings, each query should independently return the most similar ID with high similarity scores, similar to single-vector retrieval.
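
The expected result can be computed independently of FAISS with a brute-force NumPy reference, which gives something to compare the batch output against. This is only a sketch: the helper name brute_force_top_k and its arguments are hypothetical and not part of FAISS, and it assumes no vector has zero norm.

```python
import numpy as np

def brute_force_top_k(index_vectors, ids, queries, top_k=1):
    """NumPy reference for inner-product search over L2-normalized vectors.

    Hypothetical helper for cross-checking; not a FAISS function.
    """
    # Work on float32 copies so the caller's arrays are not modified.
    xb = np.array(index_vectors, dtype=np.float32)
    xq = np.array(queries, dtype=np.float32)

    # L2-normalize each row (assumes no all-zero rows).
    xb /= np.linalg.norm(xb, axis=1, keepdims=True)
    xq /= np.linalg.norm(xq, axis=1, keepdims=True)

    # Inner product of every query with every indexed vector.
    scores = xq @ xb.T  # shape: (n_queries, n_index)

    # Positions of the top_k highest scores per query, best first.
    order = np.argsort(-scores, axis=1)[:, :top_k]
    top_scores = np.take_along_axis(scores, order, axis=1)

    # Map positions back to the user-supplied IDs.
    top_ids = np.asarray(ids, dtype=np.int64)[order]
    return top_scores, top_ids
```

Calling brute_force_top_k(goal_embeddings, index_to_id, result_embeddings_np) returns arrays with the same layout as index.search, so the two results can be compared row by row.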

Actual Behavior:

  • Single Retrieval: Functions correctly, returning accurate IDs with high similarity scores.
  • Batch Retrieval: As the number of queries increases, the results become erratic and one particular ID appears more and more often. Once the batch exceeds 50 queries, the same ID (31) is returned for every query. Similarity scores decrease towards zero as the number of queries increases.
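
To pin down at which rows the two modes diverge, the single and batch searches can be run on the same queries and compared. A minimal sketch, assuming search is any callable with the same (queries, k) -> (scores, ids) shape as index.search; the name compare_single_vs_batch is hypothetical.

```python
import numpy as np

def compare_single_vs_batch(search, queries, top_k=1):
    """Return the row numbers where batched and one-by-one search disagree on IDs.

    `search` is assumed to behave like index.search: it takes a 2-D float32
    array and k, and returns (scores, ids), each of shape (n_queries, k).
    """
    queries = np.ascontiguousarray(queries, dtype=np.float32)

    # One call with the whole batch.
    _, batch_ids = search(queries, top_k)

    mismatches = []
    for row in range(queries.shape[0]):
        # One call per query, keeping the 2-D shape (1, d).
        _, single_ids = search(queries[row:row + 1], top_k)
        if not np.array_equal(single_ids[0], batch_ids[row]):
            mismatches.append(row)
    return mismatches
```

With the index from this report, compare_single_vs_batch(index.search, result_embeddings_np) would list the query rows affected. An empty list means both modes agree.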

Additional Information:

Index Configuration

  • Using IndexIDMap with IndexFlatIP for inner product similarity.

Normalization

  • Both goal embeddings and query embeddings are L2-normalized using faiss.normalize_L2.
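
Since faiss.normalize_L2 works in place and the FAISS Python interface operates on float32 arrays, it may help to record the dtype, memory layout, and norms of the arrays just before they are passed to the index. The helper below is a hypothetical diagnostic sketch, not a FAISS function, and it is not a claim about the cause of this issue.

```python
import numpy as np

def describe_batch(x, name="queries"):
    """Collect basic facts about a 2-D array about to be passed to an index."""
    x = np.asarray(x)
    # Compute row norms in float64 to avoid rounding noise in the report.
    norms = np.linalg.norm(x.astype(np.float64), axis=1)
    return {
        "name": name,
        "shape": x.shape,
        "is_float32": bool(x.dtype == np.float32),
        "is_c_contiguous": bool(x.flags["C_CONTIGUOUS"]),
        "all_finite": bool(np.isfinite(x).all()),
        "min_norm": float(norms.min()),
        "max_norm": float(norms.max()),
    }
```

After normalization, min_norm and max_norm should both be close to 1.0 for the goal embeddings and for the query batch.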

Data Characteristics:

  • Embedding dimension: 3072
  • Number of goal embeddings in index: ~2000
  • Each query embedding is a 1x3072 vector.
@asadoughi
Contributor

Installed from: pip (faiss-gpu package)

Please try again after installing the faiss-gpu package from conda, following the directions here. The faiss-gpu PyPI package is not supported by this repository.


This issue is stale because it has been open for 7 days with no activity.

@github-actions github-actions bot added the stale label Nov 28, 2024

github-actions bot commented Dec 5, 2024

This issue was closed because it has been inactive for 7 days since being marked as stale.

@github-actions github-actions bot closed this as completed Dec 5, 2024