creating an alternative in-memory bm25 store and retriever #218
Comments
Dear @Guest400123064, thank you for reaching out! Great repo, we saw your LinkedIn post. 👏 🙂 We have written down contribution guidelines (https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md) to make the start as easy as possible. I suggest that you read the guidelines and then open a PR in the Haystack repository (not in haystack-integrations) that improves the InMemoryDocumentStore so that it does not rebuild the inverted index for BM25 on every search. That PR should be relatively small but would have great impact. Does that sound like a good start to contributing to an open source project? 🙂
Thanks for the reply! I can start working on that! If I understand correctly, I will only change the indexing logic and leave the filtering and tokenization methods unchanged? Another question would be, do we want to keep using
@Guest400123064 Yes, I would suggest leaving changes to the filtering and tokenization logic out of the first PR. Smaller changes are easier to review and merge. We probably don't want to change the filtering. The tokenization we can discuss.
Got it! I will do some initial work to see how things go.
Closing this thread after merge.
Dear maintainers, I am a newcomer to the Haystack project and have been enjoying the framework thus far! However, when I read the source code of the in-memory document store, the BM25 retriever implementation looked suboptimal to me: if I understood it correctly, it recreates the inverted index on every new search. I therefore tried to implement an alternative solution in this repo, following the custom document store template. I am not sure whether this should be an integration (it is not actually related to any other technology), and this is my first time trying to contribute to an open-source project, so I am not sure how to move forward. I have not yet published a package to PyPI, but it is installable from the GitHub repo. I also have some different thoughts on the filters. To sum up, any suggestion would be greatly appreciated!
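To make the point about index rebuilding concrete, here is a minimal sketch of the general idea: keep the inverted index and the length statistics up to date whenever a document is written, so that a search only reads them. This is not Haystack's code and not the code from the linked repo. The class name `IncrementalBM25Index`, its methods, the whitespace tokenizer, and the defaults `k1=1.5`, `b=0.75` are all assumptions chosen for illustration, and the scoring uses one common BM25 variant.

```python
import math
from collections import Counter, defaultdict


class IncrementalBM25Index:
    """Hypothetical sketch: BM25 statistics are updated on write, not rebuilt on search."""

    def __init__(self, k1=1.5, b=0.75):
        self.k1 = k1                       # term-frequency saturation (assumed default)
        self.b = b                         # length normalization (assumed default)
        self.postings = defaultdict(dict)  # term -> {doc_id: term frequency}
        self.doc_len = {}                  # doc_id -> number of tokens
        self.total_len = 0                 # running sum of all document lengths

    @staticmethod
    def tokenize(text):
        # Placeholder tokenizer: lowercase and split on whitespace.
        return text.lower().split()

    def add(self, doc_id, text):
        # Overwrite semantics: drop the old version of the document first.
        if doc_id in self.doc_len:
            self.remove(doc_id)
        tokens = self.tokenize(text)
        for term, tf in Counter(tokens).items():
            self.postings[term][doc_id] = tf
        self.doc_len[doc_id] = len(tokens)
        self.total_len += len(tokens)

    def remove(self, doc_id):
        # Scans the whole vocabulary; acceptable for a sketch, slow for large stores.
        self.total_len -= self.doc_len.pop(doc_id)
        for term in list(self.postings):
            self.postings[term].pop(doc_id, None)
            if not self.postings[term]:
                del self.postings[term]

    def search(self, query, top_k=3):
        n_docs = len(self.doc_len)
        if n_docs == 0:
            return []
        avg_len = self.total_len / n_docs
        scores = defaultdict(float)
        for term in set(self.tokenize(query)):
            docs = self.postings.get(term)
            if not docs:
                continue
            # Only documents that contain the term are visited.
            idf = math.log(1 + (n_docs - len(docs) + 0.5) / (len(docs) + 0.5))
            for doc_id, tf in docs.items():
                length_norm = 1 - self.b + self.b * self.doc_len[doc_id] / avg_len
                scores[doc_id] += idf * tf * (self.k1 + 1) / (tf + self.k1 * length_norm)
        return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


index = IncrementalBM25Index()
index.add("a", "the quick brown fox")
index.add("b", "the lazy dog")
index.add("c", "quick quick fox jumps")
results = index.search("quick fox")
```

With this layout, the cost of a query depends on the posting lists of the query terms rather than on the size of the whole store. Filtering and tokenization are deliberately left out, in line with the suggestion to keep the first PR small.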