-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: Fast irods querying #184
Conversation
…e for nearly all functions that download or read data from irods) to use [session.]query instead of [collection.]walk to identify files, since this is much faster
Limitations
|
… fully functional (i.e. to open files directly)
Just pushed my most recent version of implementing the query in irods/check.get_data_objects (branch 155-cubi-tk-snappy-pull-raw-data-is-slow-for-large-studies). As discussed, probably best if you can merge that in, since the only other necessary changes will be the ones to remove opening the md5 files to get the checksums (the returned objects have a checksum attribute). I also think there is no reason to create and use the custom class |
@Nicolai-vKuegelgen I merged your PR and pull raw and processed data should work again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see the reason to change the output type of the get_data_objs
function from dictionary to list, since this is likely to break depedencies.
Codecov ReportPatch coverage:
Additional details and impacted files@@ Coverage Diff @@
## main #184 +/- ##
==========================================
+ Coverage 75.84% 75.87% +0.02%
==========================================
Files 94 95 +1
Lines 7651 7668 +17
==========================================
+ Hits 5803 5818 +15
- Misses 1848 1850 +2
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this mostly looks good now - a few minor things I'm not 100% sure about.
@Nicolai-vKuegelgen Thanks for the additional comments. All changes have been applied. Re-review would be appreciated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Still work in progress but already reduces irods pull commands to <30s. Querying all files from a single sodar assay takes ~2.5s. The slow part is reading each
md5
checksum.Todo
Since irods already offers native checksumming, its unclear whether the current md5 checksums are really necessary. This should be cleared up.