Add one object storage based document store #758

lalitpagaria · 2022-01-20T12:40:25Z

Currently, Haystack supports storing data in Elasticsearch, InMemory, and RDBMS document stores.
It would be nice to add support for object storage like S3, which is very cheap and has little maintenance hassle.

As a first step, AWS S3 could be supported, since it recently added the S3 Select option, which can retrieve only a subset of the data from an object (it currently supports CSV file objects in compressed or uncompressed format).
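To make the S3 Select idea concrete, here is a minimal sketch of how a subset of rows might be pulled from a CSV object. It assumes the boto3 `select_object_content` call is available on the account and that documents are stored one per CSV row with a header line. The helper names (`build_select_request`, `collect_records`, `query_documents`) and the bucket, key, and column names are hypothetical, and none of this is an existing Haystack API.

```python
import json
from typing import Any, Dict, Iterable, List


def build_select_request(bucket: str, key: str, field: str, value: str,
                         compression: str = "NONE") -> Dict[str, Any]:
    """Build keyword arguments for an S3 Select query over a CSV object with a header row."""
    # S3 Select has no bind parameters, so escape quotes by doubling them.
    safe_value = value.replace("'", "''")
    safe_field = field.replace('"', '""')
    expression = "SELECT * FROM S3Object s WHERE s.\"%s\" = '%s'" % (safe_field, safe_value)
    return {
        "Bucket": bucket,
        "Key": key,
        "ExpressionType": "SQL",
        "Expression": expression,
        "InputSerialization": {
            "CSV": {"FileHeaderInfo": "USE"},   # first row holds the column names
            "CompressionType": compression,      # "NONE", "GZIP" or "BZIP2"
        },
        "OutputSerialization": {"JSON": {"RecordDelimiter": "\n"}},
    }


def collect_records(events: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Join the 'Records' chunks of the event stream and parse one JSON object per line."""
    # A single record can be split across two events, so concatenate before parsing.
    buffer = b"".join(e["Records"]["Payload"] for e in events if "Records" in e)
    return [json.loads(line) for line in buffer.decode("utf-8").splitlines() if line.strip()]


def query_documents(client: Any, bucket: str, key: str, field: str, value: str,
                    compression: str = "NONE") -> List[Dict[str, Any]]:
    """Run the query; `client` is assumed to be something like boto3.client("s3")."""
    response = client.select_object_content(
        **build_select_request(bucket, key, field, value, compression)
    )
    return collect_records(response["Payload"])
```

With a real client this would be called as, for example, `query_documents(boto3.client("s3"), "my-bucket", "docs.csv", "id", "42")`. Note that this only covers metadata-style filtering; it does nothing for text or embedding retrieval.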

Ideally, we could also add a metadata service, which might help in using Haystack alongside data lakes.

ZanSara · 2022-01-20T13:19:10Z

Hello @lalitpagaria! I'm not entirely sure, but I think we have already had this discussion (I'll link the issue if I find it again), and we concluded that object stores like S3 simply did not have the features required to implement a Haystack document store. The addition of S3 Select might change things, but I imagine that's not the only feature that was missing, unfortunately. However, if it turns out we have no issue open on this topic yet, let's keep this one to track the status of these object stores in the future 🙂

tstadel · 2022-01-20T16:04:07Z

@ZanSara and @lalitpagaria
I think it was this discussion here. It seems you could use SQL syntax to query table files like CSV, so this could be a feature of SQLDocumentStore. However, I doubt that the performance of the underlying system can compete with relational databases like Postgres. For retrieval you would need something like FAISS or Milvus on top anyway.
I admit it would be interesting to see how fast S3 Select is compared to RDBMSs (I expect it to be fastest with Parquet files underneath instead of CSVs, due to Parquet's columnar format).

lalitpagaria · 2022-01-20T16:27:58Z

@tstadel I agree with you on the performance aspect. But there are a few use cases where S3 can be a good alternative:

Very low infrastructure cost
Very low operational overhead
Good candidate for searching in archives (cold documents which are rarely searched)
S3 is now strongly consistent
Cross-region replication is very smooth compared to an RDBMS, which makes it easy to reduce network latency and improve availability
S3 data can also be kept encrypted, hence less chance of data leaks
It can be a good candidate for storing multimodal data like images, audio, video, etc.

There are a few articles about using S3 as a database:
https://petewarden.com/2010/10/01/how-i-ended-up-using-s3-as-my-database/

https://dev.to/aws-builders/using-aws-s3-as-a-database-17l0

https://www.percona.com/blog/querying-archived-rds-data-directly-from-an-s3-bucket/

ZanSara added the contributions wanted! Looking for external contributions label Jul 19, 2022

masci removed the contributions wanted! Looking for external contributions label Dec 13, 2023

masci changed the title ~~Add support for Object storage~~ Add one object storage based document store May 25, 2024

masci transferred this issue from deepset-ai/haystack May 25, 2024

masci added the feature request Ideas to improve an integration label May 25, 2024
