Is your feature request related to a problem? Please describe.
With increasingly large datasets, "conventional" file storage is sometimes not the ideal way to store them.
For example, the Common Crawls are distributed on S3 buckets.
Describe the solution you'd like
To be able to integrate such datasets with ir_datasets, it would be great to be able to swap the IO "backend".
So for example, instead of opening a file from location /path/to/file, we could open files from s3.example.com/bucket/path/to/file.
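To make the idea concrete, here is a minimal sketch of what a swappable IO backend could look like. It is not ir_datasets' actual API: the names `register_backend`, `open_stream`, and the `mem://` scheme are hypothetical, and the in-memory backend only stands in for a remote store such as S3 so that the example runs without network access. It also assumes POSIX-style local paths.

```python
import io
from typing import BinaryIO, Callable, Dict
from urllib.parse import urlparse

# Hypothetical registry: URL scheme -> function that opens a binary stream.
_BACKENDS: Dict[str, Callable[[str], BinaryIO]] = {}


def register_backend(scheme: str, opener: Callable[[str], BinaryIO]) -> None:
    """Register (or replace) the opener used for a given URL scheme."""
    _BACKENDS[scheme] = opener


def open_stream(location: str) -> BinaryIO:
    """Open `location` for reading with the backend matching its scheme.

    Locations without a scheme (e.g. /path/to/file) go to the "file" backend.
    """
    scheme = urlparse(location).scheme or "file"
    try:
        opener = _BACKENDS[scheme]
    except KeyError:
        raise ValueError(f"no IO backend registered for scheme {scheme!r}") from None
    return opener(location)


def _open_local(location: str) -> BinaryIO:
    # Accept both plain paths and file:// URLs.
    parsed = urlparse(location)
    path = parsed.path if parsed.scheme == "file" else location
    return open(path, "rb")


register_backend("file", _open_local)

# Stand-in for a remote object store (assumption, for illustration only).
_FAKE_OBJECTS = {"mem://bucket/path/to/file": b"hello"}
register_backend("mem", lambda location: io.BytesIO(_FAKE_OBJECTS[location]))
```

With this shape, dataset code would call `open_stream(location)` everywhere instead of `open(path, "rb")`, and an S3 backend could be added by registering an opener for the `s3` scheme without touching the callers.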
Describe alternatives you've considered
Mounting S3 as a file system is possible, but it would probably slow down file access. It is also not officially endorsed by S3 storage providers such as Amazon.
Additional context
I think that Hadoop/Yarn/Beam all support multiple underlying protocols for file access.