Add one object storage based document store #758

Open
lalitpagaria opened this issue Jan 20, 2022 · 3 comments
Labels
feature request Ideas to improve an integration

Comments

@lalitpagaria

Currently, Haystack supports storing data in Elasticsearch, in memory, and in an RDBMS.
It would be nice to add support for object storage like S3, which is very cheap and takes less effort to maintain.

As a first step, AWS S3 could be supported, since it recently added the S3 Select option, which can retrieve only a subset of the data in an object (it currently supports CSV file objects in compressed or uncompressed format).
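To make the idea concrete, here is a minimal sketch of how a subset of rows could be read from a compressed CSV object with S3 Select. It assumes a client object exposing boto3's `select_object_content` call (for example `boto3.client("s3")`); the helper name `select_documents`, the gzip compression, and the header row in the CSV are assumptions made for illustration, not part of Haystack.

```python
import json
from typing import Any, Dict, Iterator


def select_documents(
    client: Any, bucket: str, key: str, expression: str
) -> Iterator[Dict[str, Any]]:
    """Yield the rows of a gzip-compressed CSV object that match a SQL expression.

    `client` is assumed to behave like boto3.client("s3").
    """
    response = client.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        # Assumption: the CSV has a header row and is gzip-compressed.
        InputSerialization={
            "CSV": {"FileHeaderInfo": "USE"},
            "CompressionType": "GZIP",
        },
        # Ask for one JSON document per line so rows are easy to parse.
        OutputSerialization={"JSON": {"RecordDelimiter": "\n"}},
    )

    buffer = b""
    for event in response["Payload"]:
        if "Records" not in event:
            continue  # Stats, Progress, and End events carry no rows.
        buffer += event["Records"]["Payload"]
        # A row can be split across events, so only parse complete lines.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line:
                yield json.loads(line)

    # Parse whatever is left if the stream did not end with a delimiter.
    if buffer.strip():
        yield json.loads(buffer)
```

With a real client this could be called as `select_documents(boto3.client("s3"), "my-bucket", "docs.csv.gz", "SELECT * FROM S3Object s")`, where the bucket and key names are placeholders.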

Ideally, we can add a Metadata service as well which may help to use Haystack along with Data Lakes.

@ZanSara
Contributor

ZanSara commented Jan 20, 2022

Hello @lalitpagaria! I'm not entirely sure, but I think we already had this discussion previously (I'll link the issue if I find it), and we came to the conclusion that object stores like S3 simply did not have the features required to implement a Haystack document store. The addition of S3 Select might change things, but I imagine that's not the only feature that was missing, unfortunately. However, if it turns out we had no issue open on this topic yet, let's keep this one to keep track of the status of these object stores in the future 🙂

@tstadel
Member

tstadel commented Jan 20, 2022

@ZanSara and @lalitpagaria
I think it was this discussion here. It seems you could use SQL syntax to query tabular files like CSV, so this could be a feature of SQLDocumentStore. However, I doubt that the performance of the underlying system can compete with relational databases like Postgres. For retrieval you would need something like FAISS or Milvus on top anyway.
I admit it would be interesting to see how fast S3 Select is compared to RDBMSs (I expect it to be fastest with Parquet files underneath instead of CSVs, due to Parquet's columnar format).
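As a rough illustration of the SQL-syntax point, the sketch below turns a simple metadata filter into an S3 Select expression. The filter shape (a dictionary mapping a field name to a list of allowed values) and the function name `build_select_expression` are assumptions for this example, and escaping quotes by doubling them follows standard SQL rather than anything confirmed in this thread.

```python
import re
from typing import Dict, List

# Only plain column names are accepted, to avoid building unsafe SQL.
_FIELD_NAME = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_select_expression(filters: Dict[str, List[str]]) -> str:
    """Build an S3 Select query from {"field": ["allowed", "values"]}."""
    clauses = []
    for field, values in filters.items():
        if not _FIELD_NAME.fullmatch(field):
            raise ValueError(f"unsupported field name: {field!r}")
        if not values:
            raise ValueError(f"no values given for field: {field!r}")
        # Standard SQL escaping: a single quote inside a literal is doubled.
        literals = ", ".join("'" + v.replace("'", "''") + "'" for v in values)
        clauses.append(f's."{field}" IN ({literals})')

    expression = "SELECT * FROM S3Object s"
    if clauses:
        # All field conditions must hold at the same time.
        expression += " WHERE " + " AND ".join(clauses)
    return expression
```

This only covers exact-match filtering on metadata columns. Embedding-based retrieval would still need a vector index such as FAISS or Milvus on top, as noted above.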

@lalitpagaria
Author

@tstadel I agree with you on the performance aspect. But there are a few use cases where S3 can be a good alternative:

  • Very low infra cost
  • Very low operational overhead
  • Good candidate for searching in archives (cold documents which are rarely searched)
  • Now it is strongly consistent
  • Cross-region replication is very smooth in comparison to an RDBMS, which makes it easy to reduce network latency and improve availability
  • S3 data can also be kept in encrypted format, hence less chance of data leaks
  • It can be a good candidate for storing multimodal data like images, audio, video, etc.

There are a few articles about using S3 as a database:
https://petewarden.com/2010/10/01/how-i-ended-up-using-s3-as-my-database/

https://dev.to/aws-builders/using-aws-s3-as-a-database-17l0

https://www.percona.com/blog/querying-archived-rds-data-directly-from-an-s3-bucket/

@ZanSara ZanSara added the contributions wanted! Looking for external contributions label Jul 19, 2022
@masci masci removed the contributions wanted! Looking for external contributions label Dec 13, 2023
@masci masci changed the title from "Add support for Object storage" to "Add one object storage based document store" May 25, 2024
@masci masci transferred this issue from deepset-ai/haystack May 25, 2024
@masci masci added the feature request Ideas to improve an integration label May 25, 2024

4 participants