Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

creating an alternative in-memory bm25 store and retriever #218

Closed
Guest400123064 opened this issue Apr 10, 2024 · 5 comments
Closed

creating an alternative in-memory bm25 store and retriever #218

Guest400123064 opened this issue Apr 10, 2024 · 5 comments

Comments

@Guest400123064
Copy link

Dear maintainers, I am a newcomer to the Haystack project and have been enjoying the framework thus far! However, when I went to the source code of the in-memory document store, if I understood it correctly, the bm25 retriever implementation is suboptimal as it recreates an inverse index on every new search. Therefore, I tried to implement an alternative solution in this repo following the custom document store template. I am not sure if this should be an integration (because it is not actually related to any other technologies), and this is my first time trying to contribute to an open-source project, so I am not sure how I should move forward. I have not yet published a package to PyPI (but it is installable from the GitHub repo). Moreover, I have some different thoughts on the filters as well. So, to sum up, any suggestion would be greatly appreciated!

@julian-risch
Copy link
Member

julian-risch commented Apr 12, 2024

Dear @Guest400123064 thank you for reaching out! Great repo, we saw your LinkedIn post. 👏 🙂
You're right that the current bm25 implementation in Haystack is suboptimal. It really recreates the index on every search. It would be great if you could open a PR in the Haystack repository and contribute improvements of that specific part of the implementation of the InMemoryDocumentStore. Other improvements might be interesting too but we should briefly discuss them before you put in efforts to open another PR. For example, changing the filtering logic would be a breaking change that we might not want to merge into Haystack.

We have written down contribution guidelines here to make the start as easy as possible: https://github.com/deepset-ai/haystack/blob/main/CONTRIBUTING.md
You can also open a Draft PR early on while you are working on the PR so that you can get some early feedback from our team.

I suggest that you read the guidelines and then open a PR in the Haystack repository (not in haystack-integrations) that improves the InMemoryDocumentStore in Haystack so that it does not recreate the reverse index for BM25 for every search. That PR should be relatively small but would have great impact. Does that sound like a good start to contributing to an open source project? 🙂

@Guest400123064
Copy link
Author

Thanks for the reply! I can start working on that! If I am understanding it correctly, I will only change the indexing logic and leaving the filtering and tokenization method unchanged? Another question would be, do we want to keep using rank_bm25 as an backend? The reason for asking is that the indexing logic is built within the package, which I have no control over.

@julian-risch
Copy link
Member

@Guest400123064 It would suggest to leave out the changes of the filtering and tokenization logic for the first PR, yes. Smaller changes make it easier to review and merge. The filtering we probably don't want to change. The tokenization we can discuss.
If you find that the implementation works better without rank_bm25, I don't see a reason to keep it. By the way, you probably saw that Haystack uses the rank_bm25 from here: https://github.com/deepset-ai/haystack-bm25

@Guest400123064
Copy link
Author

Got it! I will do some initial work to see how things go

@Guest400123064
Copy link
Author

Closing this thread after merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants