core: improve performance of InMemoryVectorStore #27538

Merged

Changes from 2 commits

49 changes: 34 additions & 15 deletions libs/core/langchain_core/vectorstores/in_memory.py
@@ -326,25 +326,44 @@ def _similarity_search_with_score_by_vector(
         self,
         embedding: list[float],
         k: int = 4,
+        prefilter_k_multiplier: Optional[int] = 10,
         filter: Optional[Callable[[Document], bool]] = None,
         **kwargs: Any,
     ) -> list[tuple[Document, float, list[float]]]:
-        result = []
-        for doc in self.store.values():
-            vector = doc["vector"]
-            similarity = float(cosine_similarity([embedding], [vector]).item(0))
-            result.append(
-                (
-                    Document(
-                        id=doc["id"], page_content=doc["text"], metadata=doc["metadata"]
-                    ),
-                    similarity,
-                    vector,
-                )
-            )
-        result.sort(key=lambda x: x[1], reverse=True)
+        # get all docs with fixed order in list
+        docs = list(self.store.values())
+        if not docs:
+            return []
+
+        similarity = cosine_similarity([embedding], [doc["vector"] for doc in docs])[0]
+
+        # get the indices ordered by similarity score
+        top_k_idx = similarity.argsort()[::-1]
+
+        # prefilter to speed up for list comprehension below
         if filter is not None:
-            result = [r for r in result if filter(r[0])]
+            # we can safely filter to top k if no filter is set
+            top_k_idx = top_k_idx[:k]
+        elif prefilter_k_multiplier is not None:
+            # Filter to top k * prefilter_k_multiplier
+            # We keep more than k to avoid returning less than k after filtering
+            prefilter_k = k * prefilter_k_multiplier
+            top_k_idx = top_k_idx[:prefilter_k]
+
+        result = [
+            (doc, float(similarity[idx].item()), doc_dict["vector"])
+            for idx in top_k_idx
+            for doc_dict in [docs[idx]]
+            for doc in [
+                Document(
+                    id=doc_dict["id"],
+                    page_content=doc_dict["text"],
+                    metadata=doc_dict["metadata"],
+                )
+            ]
+            if filter is None or filter(doc)
+        ]
+
         return result[:k]

     def similarity_search_with_score_by_vector(
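As an aside, here is a minimal standalone sketch (not code from this PR) of the over-fetch idea behind the new prefilter_k_multiplier parameter: truncating the ranked candidates to k before applying a filter callable can leave fewer than k results, so the draft keeps k * prefilter_k_multiplier candidates and only truncates to k after filtering. Everything below is illustrative; scores and keep_even_ids are hypothetical stand-ins, and only numpy is assumed.

```python
import numpy as np

k = 4
prefilter_k_multiplier = 10

scores = np.linspace(1.0, 0.0, num=1000)   # pretend similarity scores, already computed
order = scores.argsort()[::-1]             # candidate indices, best match first

def keep_even_ids(idx: int) -> bool:
    # Stand-in for an arbitrary, possibly slow, user-supplied filter callable.
    return idx % 2 == 0

# Truncate to k first, then filter: can return fewer than k documents.
naive = [int(i) for i in order[:k] if keep_even_ids(int(i))]

# Over-fetch k * multiplier candidates, filter, then truncate to k.
prefetched = [int(i) for i in order[: k * prefilter_k_multiplier] if keep_even_ids(int(i))]
result = prefetched[:k]

print(len(naive), len(result))  # 2 4
```

The multiplier is a heuristic: the stricter the filter, the larger the over-fetch has to be to still yield k hits.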
Collaborator (commenting on the added line "if filter is None or filter(doc)"):

Should we filter prior to applying any computation?

Contributor (author):

Whether one should depends on whether cosine_similarity or the filter takes longer. Since cosine_similarity can be vectorized, I assumed that it would generally (although not always) be quicker, and that it is therefore preferable to apply the filter to the prefetched subset. Note that filter can be any callable, so we have no control over how fast the filter is.
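To make this exchange concrete, here is a rough sketch in plain numpy, under stated assumptions rather than the library's actual code: vectors, slow_filter, and the sizes are invented. Option A mirrors the draft, computing one vectorized similarity pass over every stored vector and running the filter only on a small prefetched candidate set; Option B is the ordering the reviewer asks about, filtering every document first and computing similarities only for the survivors.

```python
import time

import numpy as np

rng = np.random.default_rng(0)
vectors = rng.normal(size=(20_000, 384))               # stored embeddings (invented)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)
query = rng.normal(size=384)
query /= np.linalg.norm(query)

def slow_filter(i: int) -> bool:
    # Stand-in for an arbitrary user-supplied callable; it could be far slower.
    return i % 3 == 0

k = 4

# Option A (the draft): one vectorized cosine-similarity pass over everything,
# then apply the filter only to a small prefetched candidate set.
t0 = time.perf_counter()
scores = vectors @ query                               # all similarities at once
candidates = scores.argsort()[::-1][: k * 10]          # over-fetch, cf. prefilter_k_multiplier
result_a = [int(i) for i in candidates if slow_filter(int(i))][:k]
t_a = time.perf_counter() - t0

# Option B (the reviewer's question): filter every document first, then compute
# similarities only for the survivors.
t0 = time.perf_counter()
keep = [i for i in range(len(vectors)) if slow_filter(i)]
scores_b = vectors[keep] @ query
result_b = [keep[int(i)] for i in scores_b.argsort()[::-1][:k]]
t_b = time.perf_counter() - t0

# Which ordering is faster depends on how expensive and how selective the filter
# is: A pays one BLAS call plus a Python loop over ~40 candidates, while B pays a
# Python-level call of the filter on every stored document.
print(f"A: {t_a:.4f}s ({result_a}), B: {t_b:.4f}s ({result_b})")
```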
