feat: Fast irods querying #184

xiamaz · 2023-06-28T07:40:42Z

Still a work in progress, but this already reduces the runtime of the irods pull commands to <30s. Querying all files from a single sodar assay takes ~2.5s. The remaining slow part is reading each md5 checksum.

Todo

clean up codebase
make filtering step cleaner
clean up checksum extraction

Since irods already offers native checksumming, it's unclear whether the current md5 checksums are really necessary. This should be clarified.
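To illustrate why a query is faster than walking the collection tree, here is a minimal sketch, assuming python-irodsclient and an already opened `iRODSSession`. The names `like_pattern` and `list_data_objects` are hypothetical and not part of cubi-tk; the actual implementation in this PR may differ. The idea is that one or two catalog queries return every data object below a collection, instead of one round trip per sub-collection.

```python
from collections import defaultdict


def like_pattern(root: str) -> str:
    """Build a LIKE pattern that matches every sub-collection below `root`."""
    return root.rstrip("/") + "/%"


def list_data_objects(session, root: str) -> dict:
    """Return {collection path: [(file name, checksum), ...]} for `root`.

    `session` is assumed to be an open irods.session.iRODSSession.
    Hypothetical helper for illustration, not the cubi-tk code.
    """
    # Imported lazily so that like_pattern() works without python-irodsclient.
    from irods.column import Criterion, Like
    from irods.models import Collection, DataObject

    found = defaultdict(list)
    # One filter for the root collection itself, one for everything below it.
    criteria = (
        Criterion("=", Collection.name, root.rstrip("/")),
        Like(Collection.name, like_pattern(root)),
    )
    for criterion in criteria:
        query = session.query(
            Collection.name, DataObject.name, DataObject.checksum
        ).filter(criterion)
        # Note: the catalog holds one row per replica, so a file with several
        # replicas can show up more than once here.
        for row in query.get_results():
            found[row[Collection.name]].append(
                (row[DataObject.name], row[DataObject.checksum])
            )
    return dict(found)
```

Querying the root with `=` and the descendants with `LIKE root/%` avoids matching sibling collections that merely share a name prefix with the root.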

xiamaz · 2023-06-28T07:44:25Z

Limitations

support for replicates has been removed for now, but that feature had not been used anywhere in cubi-tk
some hard-coded filters have not been adapted yet

Nicolai-vKuegelgen · 2023-06-28T14:18:27Z

Just pushed my most recent version of implementing the query in irods/check.get_data_objects (branch 155-cubi-tk-snappy-pull-raw-data-is-slow-for-large-studies).

As discussed, probably best if you can merge that in, since the only other necessary changes will be the ones to remove opening the md5 files to get the checksums (the returned objects have a checksum attribute).
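As a rough sketch of what using the `checksum` attribute instead of the md5 files could look like: the function names below are hypothetical, and the checksum formats are an assumption (a plain hex string for MD5, and a `sha2:` prefix followed by a base64 digest for SHA-256, which is how iRODS usually stores them).

```python
import base64
import hashlib


def local_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a local file in chunks so large files are not loaded at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest


def matches_remote(path, remote_checksum):
    """Compare a local file against a checksum string taken from iRODS.

    Assumption: `remote_checksum` is the value of the data object's
    `checksum` attribute, either an MD5 hex string or "sha2:<base64>".
    """
    if not remote_checksum:
        return False  # no checksum registered for this object
    if remote_checksum.startswith("sha2:"):
        local = base64.b64encode(local_digest(path, "sha256").digest()).decode()
        return local == remote_checksum[len("sha2:"):]
    return local_digest(path, "md5").hexdigest() == remote_checksum.lower()
```

With this, a download could be verified as `matches_remote(local_path, data_object.checksum)` without opening a separate `.md5` file.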

I also think there is no reason to create and use the custom classes IrodsDataObject or IrodsRawDataObject, since the iRODSDataObject returned by get_data_objects (both previously with walk and now with query) has all of their attributes and functions, and I see no reason why downsizing them should result in a noticeable speedup.

xiamaz · 2023-06-28T16:11:34Z

@Nicolai-vKuegelgen I merged your PR and pull raw and processed data should work again.

Nicolai-vKuegelgen

I don't see the reason to change the output type of the get_data_objs function from dictionary to list, since this is likely to break dependencies.

cubi_tk/irods/check.py

cubi_tk/snappy/retrieve_irods_collection.py

codecov · 2023-06-30T15:05:06Z

Codecov Report

Patch coverage: 64.93% and project coverage change: +0.02 🎉

Comparison is base (944a54c) 75.84% compared to head (3866a54) 75.87%.

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #184      +/-   ##
==========================================
+ Coverage   75.84%   75.87%   +0.02%     
==========================================
  Files          94       95       +1     
  Lines        7651     7668      +17     
==========================================
+ Hits         5803     5818      +15     
- Misses       1848     1850       +2

Impacted Files                                 Coverage Δ
cubi_tk/snappy/check_remote.py                 50.99% <25.00%> (ø)
cubi_tk/irods/check.py                         35.55% <26.66%> (+0.12%)
cubi_tk/snappy/retrieve_irods_collection.py    25.39% <33.33%> (-4.75%)
cubi_tk/snappy/pull_processed_data.py          58.16% <75.00%> (-1.03%)
cubi_tk/snappy/pull_data_common.py             58.24% <100.00%> (ø)
cubi_tk/snappy/pull_raw_data.py                57.95% <100.00%> (-0.24%)
cubi_tk/sodar/check_remote.py                  64.00% <100.00%> (ø)
tests/helpers.py                               100.00% <100.00%> (ø)
tests/test_snappy_check_remote.py              91.93% <100.00%> (ø)
tests/test_snappy_pull_data_common.py          100.00% <100.00%> (ø)
... and 3 more

Nicolai-vKuegelgen

I think this mostly looks good now - a few minor things I'm not 100% sure about.

cubi_tk/irods/check.py

cubi_tk/snappy/pull_data_common.py

cubi_tk/snappy/pull_processed_data.py

cubi_tk/snappy/retrieve_irods_collection.py

cubi_tk/sodar/check_remote.py

xiamaz · 2023-07-04T15:53:31Z

@Nicolai-vKuegelgen Thanks for the additional comments. All changes have been applied. Re-review would be appreciated.

Nicolai-vKuegelgen

LGTM

Change the IrodsCheckCommand.get_data_objs function (which is the bas…

6adb75e

…e for nearly all functions that download or read data from irods) to use [session.]query instead of [collection.]walk to identify files, since this is much faster

xiamaz marked this pull request as draft June 28, 2023 07:41

Changed generation of iRODSDataObject from query result so that it is…

684647f

… fully functional (i.e. to open files directly)

Initial implementation of fast irods querying

d550682

xiamaz force-pushed the fast-pull branch from 5d460ea to d550682 Compare June 28, 2023 15:17

xiamaz added 2 commits June 28, 2023 18:00

Cleaning up pull processed data

3514394

Fix pull_raw_data

f59eaf0

xiamaz marked this pull request as ready for review June 28, 2023 16:13

xiamaz changed the title ~~[WIP] Fast irods querying~~ Fast irods querying Jun 28, 2023

xiamaz changed the title ~~Fast irods querying~~ feat: Fast irods querying Jun 28, 2023

Nicolai-vKuegelgen requested changes Jun 29, 2023

xiamaz added 2 commits June 30, 2023 16:51

Make tests work

0386896

Fix linting issues

93e4dc5

xiamaz requested a review from Nicolai-vKuegelgen June 30, 2023 14:55

Removing lost print

b40ea82

Nicolai-vKuegelgen reviewed Jul 3, 2023

xiamaz added 4 commits July 4, 2023 17:45

Reverting minor changes to irods check

892574b

Make naming of iRODSDataObject consistent with currently used native …

9dcb4ba

…object

Make naming of iRODSDataObject consistent with currently used native …

a36a5ae

…object

Revert naming of dict type

3866a54

xiamaz requested a review from Nicolai-vKuegelgen July 4, 2023 15:53

Nicolai-vKuegelgen approved these changes Jul 4, 2023

xiamaz merged commit d614fcc into bihealth:main Jul 4, 2023

xiamaz deleted the fast-pull branch July 4, 2023 18:37

xiamaz mentioned this pull request Jul 6, 2023

cubi-tk snappy pull-raw-data is slow for large studies #155

Closed

xiamaz mentioned this pull request Jul 6, 2023

Slow iRODS collection retrieval #136

Closed

Nicolai-vKuegelgen mentioned this pull request Jul 11, 2023

fix: Fixes for changes introdueced when switching pull functions to query #188

Merged
