S3 or other file IO backends #217

Open
janheinrichmerker opened this issue Nov 28, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@janheinrichmerker
Contributor

Is your feature request related to a problem? Please describe.
As datasets grow larger, "conventional" file storage is sometimes not the ideal way to store them.
For example, the Common Crawl data is distributed via S3 buckets.

Describe the solution you'd like
To be able to integrate such datasets with ir_datasets, it would be great to be able to swap the IO "backend".
So for example, instead of opening a file from location /path/to/file, we could open files from s3.example.com/bucket/path/to/file.
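
A minimal sketch of what a swappable IO "backend" could look like, assuming a registry keyed by URL scheme. The names `register_opener` and `open_binary` are hypothetical and not part of ir_datasets; only local paths and `file://` URLs are wired up here, using the standard library.

```python
from urllib.parse import urlparse
from urllib.request import url2pathname

# Hypothetical registry: URL scheme -> function that returns a binary file object.
_OPENERS = {}


def register_opener(scheme, opener):
    """Register a backend for locations that start with `scheme://`."""
    _OPENERS[scheme] = opener


def open_binary(location):
    """Open `location` with whichever backend is registered for its scheme."""
    # Plain paths (including Windows drive letters) have no "://" and map to "".
    scheme = urlparse(location).scheme if "://" in location else ""
    if scheme not in _OPENERS:
        raise ValueError(f"no IO backend registered for scheme {scheme!r}")
    return _OPENERS[scheme](location)


def _open_local(location):
    """Backend for plain paths and file:// URLs."""
    if location.startswith("file://"):
        location = url2pathname(urlparse(location).path)
    return open(location, "rb")


register_opener("", _open_local)
register_opener("file", _open_local)
```

An `s3` backend would then be one more `register_opener("s3", ...)` call whose function wraps an S3 client library of choice; that part is left out because it depends on the client and credentials setup.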

Describe alternatives you've considered
Mounting S3 as a file system can be done, but that would probably slow down file access. Also, this is not officially endorsed by S3 storage providers like Amazon.

Additional context
I think that Hadoop, YARN, and Beam all support multiple underlying protocols for file access.

@janheinrichmerker janheinrichmerker added the enhancement New feature or request label Nov 28, 2022
@seanmacavaney
Collaborator

The current structure actually already supports this! All you'd need to do is build a RequestsDownload object that isn't wrapped in Cache.
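
To illustrate the distinction being drawn, here is a rough sketch with hypothetical stand-in classes. `StreamSource` and `CachedSource` are not ir_datasets' actual `RequestsDownload` or `Cache`, and their real signatures are not assumed here. The idea shown is only that an unwrapped source re-reads from the remote on every access, while a cache wrapper writes a local copy first.

```python
import contextlib
import os
import shutil


class StreamSource:
    """Hypothetical stand-in for a download object: each stream() fetches again."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable returning a fresh binary file object
        self.fetch_count = 0

    @contextlib.contextmanager
    def stream(self):
        self.fetch_count += 1
        with self._fetch() as f:
            yield f


class CachedSource:
    """Hypothetical stand-in for a cache wrapper: first stream() saves to disk."""

    def __init__(self, source, path):
        self._source = source
        self._path = path

    @contextlib.contextmanager
    def stream(self):
        # Copy the remote content to a local file once, then always read locally.
        if not os.path.exists(self._path):
            with self._source.stream() as src, open(self._path, "wb") as dst:
                shutil.copyfileobj(src, dst)
        with open(self._path, "rb") as f:
            yield f
```

Using the unwrapped source directly keeps nothing on local disk, which suits data that should stay in remote storage, at the cost of a remote read on every access.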
