Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Fast irods querying #184

Merged
merged 12 commits into from
Jul 4, 2023
Merged

feat: Fast irods querying #184

merged 12 commits into from
Jul 4, 2023

Conversation

xiamaz
Copy link
Member

@xiamaz xiamaz commented Jun 28, 2023

Still work in progress but already reduces irods pull commands to <30s. Querying all files from a single sodar assay takes ~2.5s. The slow part is reading each md5 checksum.

Todo

  • clean up codebase
  • make filtering step cleaner
  • clean up checksum extraction

Since irods already offers native checksumming, its unclear whether the current md5 checksums are really necessary. This should be cleared up.

…e for nearly all functions that download or read data from irods) to use [session.]query instead of [collection.]walk to identify files, since this is much faster
@xiamaz xiamaz marked this pull request as draft June 28, 2023 07:41
@xiamaz
Copy link
Member Author

xiamaz commented Jun 28, 2023

Limitations

  • support for replicates has been removed for now, but that feature had not been used anywhere in cubi-tk
  • some hard-coded filters have not been adapted yet

… fully functional (i.e. to open files directly)
@Nicolai-vKuegelgen
Copy link
Contributor

Just pushed my most recent version of implementing the query in irods/check.get_data_objects (branch 155-cubi-tk-snappy-pull-raw-data-is-slow-for-large-studies).

As discussed, probably best if you can merge that in, since the only other necessary changes will be the ones to remove opening the md5 files to get the checksums (the returned objects have a checksum attribute).

I also think there is no reason to create and use the custom class IrodsDataObject's or IrodsRawDataObject's, since the iRODSDataObject returned by get_data_objects (both previously with walk and now with query) has all of their attribute and functions and I see no reason why downsizing them should result in a noticable speedup.

@xiamaz
Copy link
Member Author

xiamaz commented Jun 28, 2023

@Nicolai-vKuegelgen I merged your PR and pull raw and processed data should work again.

@xiamaz xiamaz marked this pull request as ready for review June 28, 2023 16:13
@xiamaz xiamaz changed the title [WIP] Fast irods querying Fast irods querying Jun 28, 2023
@xiamaz xiamaz changed the title Fast irods querying feat: Fast irods querying Jun 28, 2023
Copy link
Contributor

@Nicolai-vKuegelgen Nicolai-vKuegelgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see the reason to change the output type of the get_data_objsfunction from dictionary to list, since this is likely to break depedencies.

@codecov
Copy link

codecov bot commented Jun 30, 2023

Codecov Report

Patch coverage: 64.93% and project coverage change: +0.02 🎉

Comparison is base (944a54c) 75.84% compared to head (3866a54) 75.87%.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #184      +/-   ##
==========================================
+ Coverage   75.84%   75.87%   +0.02%     
==========================================
  Files          94       95       +1     
  Lines        7651     7668      +17     
==========================================
+ Hits         5803     5818      +15     
- Misses       1848     1850       +2     
Impacted Files Coverage Δ
cubi_tk/snappy/check_remote.py 50.99% <25.00%> (ø)
cubi_tk/irods/check.py 35.55% <26.66%> (+0.12%) ⬆️
cubi_tk/snappy/retrieve_irods_collection.py 25.39% <33.33%> (-4.75%) ⬇️
cubi_tk/snappy/pull_processed_data.py 58.16% <75.00%> (-1.03%) ⬇️
cubi_tk/snappy/pull_data_common.py 58.24% <100.00%> (ø)
cubi_tk/snappy/pull_raw_data.py 57.95% <100.00%> (-0.24%) ⬇️
cubi_tk/sodar/check_remote.py 64.00% <100.00%> (ø)
tests/helpers.py 100.00% <100.00%> (ø)
tests/test_snappy_check_remote.py 91.93% <100.00%> (ø)
tests/test_snappy_pull_data_common.py 100.00% <100.00%> (ø)
... and 3 more

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Contributor

@Nicolai-vKuegelgen Nicolai-vKuegelgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this mostly looks good now - a few minor things I'm not 100% sure about.

@xiamaz
Copy link
Member Author

xiamaz commented Jul 4, 2023

@Nicolai-vKuegelgen Thanks for the additional comments. All changes have been applied. Re-review would be appreciated.

@xiamaz xiamaz requested a review from Nicolai-vKuegelgen July 4, 2023 15:53
Copy link
Contributor

@Nicolai-vKuegelgen Nicolai-vKuegelgen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@xiamaz xiamaz merged commit d614fcc into bihealth:main Jul 4, 2023
@xiamaz xiamaz deleted the fast-pull branch July 4, 2023 18:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants