-
Notifications
You must be signed in to change notification settings - Fork 15.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
core: improve performance of InMemoryVectorStore #27538
core: improve performance of InMemoryVectorStore #27538
Conversation
The latest updates on your projects. Learn more about Vercel for Git ↗︎ 1 Skipped Deployment
|
@VMinB12 this makes the code more complex and the in memory implementations are mainly meant as simple reference implementations. Could you please provide some benchmark so it's possible to get a sense of what kind of an improvement this makes? Does it make a substantial difference with 1,000 or 10,000 docs? |
metadata=doc_dict["metadata"], | ||
) | ||
] | ||
if filter is None or filter(doc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we filter prior to applying any computation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Whether one should depends on if the cosine_similarity or filter takes longer. Since cosine_similarity can be vectorized, I assumed that generally (although not always) cosine_similarity would be quicker and that it is preferable to filter on the prefetched subset. Note that filter
can be any callable, so we have no control over how fast filter
is.
Thanks for the feedback, I see your point. I will try to find some time to do benchmarks. We can postpone the merge until then. Given the focus for simplicity, I will likely remove the prefilter_k_multiplier argument and just filter up front. |
@eyurtsev Ok, I found time now. I simplified the PR, I think it is as simple to read and understand as the previous implementation. Results from benchmark:
Benchmark script ran with
|
) | ||
for idx in top_k_idx | ||
# Assign using walrus operator to avoid multiple lookups | ||
if (doc_dict := docs[idx]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@VMinB12 hope you're OK swapped this to use a walrus operator which I think is less surprising than a double comprehension
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perfect 👍
Description: We improve the performance of the InMemoryVectorStore.
Isue: Originally, similarity was computed document by document:
This is inefficient and does not make use of numpy vectorization.
This PR computes the similarity in one vectorized go:
Dependencies: None
Twitter handle: @b12_consulting, @Vincent_Min