fix: Allow concurrent and large refgenie downloads #1172

Open · wants to merge 21 commits into base: master

Conversation

Contributor

@Austin-s-h Austin-s-h commented Mar 28, 2023

Description

After running into a few instances of the refgenie wrapper downloading individual assets concurrently, I was met with a lock-timeout error when attempting to pull two large assets at the same time (I'm looking at you, hg38_cdna/salmon_index at ~25 GB). The default timeout is 60 seconds, but I wanted to handle this error.

So I slightly modified the refgenie wrapper to catch the RefgenconfError raised when the lock cannot be obtained and, in that case, skip the lock requirement. This may or may not be desirable behavior across all pipelines, but it resolved the issue in mine and passed the testing requirements.

I added a rule that mimics obtaining a large asset, but I am not familiar enough yet with the wrapper system to know if simply adding a rule means that it is tested.

Additionally, force_large=True was necessary to download assets larger than refgenie's "large" threshold of 5 GB.
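
For reference, the change described above boils down to roughly the following (a sketch of the idea, not the exact wrapper code; conf_path and the asset names are placeholders, and the exact import path for the exception may differ):

import refgenconf
from refgenconf.exceptions import RefgenconfError  # assumed import path

conf_path = "/tmp/genome_folder/genome_config.yaml"  # placeholder config location

try:
    rgc = refgenconf.RefGenConf(conf_path, writable=True)
except RefgenconfError:
    # another process holds the lock (e.g. a concurrent asset download):
    # retry without waiting for the read lock, as proposed in this PR
    rgc = refgenconf.RefGenConf(
        conf_path, writable=True, skip_read_lock=True, genome_exact=False
    )

# force_large=True is needed for assets above refgenie's 5 GB "large" threshold
rgc.pull(
    genome="hg38_cdna", asset="salmon_index", tag="default",
    force=False, force_large=True,
)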

QC

  • I confirm that:

For all wrappers added by this PR,

  • there is a test case which covers any introduced changes,
  • input: and output: file paths in the resulting rule can be changed arbitrarily,
  • either the wrapper can only use a single core, or the example rule contains a threads: x statement with x being a reasonable default,
  • rule names in the test case are in snake_case and somehow tell what the rule is about or match the tool's purpose or name (e.g., map_reads for a step that maps reads),
  • all environment.yaml specifications follow the respective best practices,
  • wherever possible, command line arguments are inferred and set automatically (e.g. based on file extensions in input: or output:),
  • all fields of the example rules in the Snakefiles and their entries are explained via comments (input:/output:/params: etc.),
  • stderr and/or stdout are logged correctly (log:), depending on the wrapped tool,
  • temporary files are either written to a unique hidden folder in the working directory, or (better) stored where the Python function tempfile.gettempdir() points to (this also means that using any Python tempfile default behavior works; see the sketch after this list),
  • the meta.yaml contains a link to the documentation of the respective tool or command,
  • Snakefiles pass the linting (snakemake --lint),
  • Snakefiles are formatted with snakefmt,
  • Python wrapper scripts are formatted with black.
  • Conda environments use a minimal number of channels, in the recommended ordering. E.g. for bioconda, use conda-forge, bioconda, nodefaults (conda-forge should have the highest priority, and the defaults channels are usually not needed because most packages are in conda-forge nowadays).
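
For the temporary-file item above, a minimal sketch of the recommended pattern (not code from this PR; the file name is hypothetical):

import tempfile
from pathlib import Path

# tempfile honours TMPDIR, so tempfile.gettempdir() already points at the intended
# scratch location and no wrapper-specific handling is needed
with tempfile.TemporaryDirectory() as tmpdir:
    scratch = Path(tmpdir) / "intermediate.txt"  # hypothetical intermediate file
    scratch.write_text("intermediate results")
    # ... run the wrapped tool against `scratch` here ...
# the directory and its contents are removed automatically when the block exits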

Summary by CodeRabbit

  • Chores
    • Updated CI and quality control workflows for improved runner configurations and streamlined validation.
  • New Features
    • Introduced enhanced asset processing with support for large items and more detailed logging.
  • Bug Fixes
    • Improved error handling during configuration loading to reduce concurrency issues and address input path mismatches.
  • Refactor
    • Refined file path management and command formatting in the expression calculation module for increased clarity and reliability.

Contributor

@dlaehnemann dlaehnemann left a comment


Thanks for looking into this!

To get your new rule tested, have a look at the existing refgenie test (in test.py):

snakemake-wrappers/test.py, lines 4322 to 4330 at 0d2c92a:

@skip_if_not_modified
def test_refgenie():
    try:
        shutil.copytree("bio/refgenie/test/genome_folder", "/tmp/genome_folder")
    except FileExistsError:
        # no worries, the directory is already there
        pass
    os.environ["REFGENIE"] = "/tmp/genome_folder/genome_config.yaml"
    run("bio/refgenie", ["snakemake", "--cores", "1", "--use-conda", "-F"])

You would need to include another test that requires your new rule's output. However, before doing so, it might make sense to change this to an example that is only slightly over the threshold of 5 GB, because otherwise the test would do an excessive download (for example the 25 GB you mentioned) and fill up the GitHub Actions VM's disk space every time it is run (even though it is only run if anything changes in this wrapper).
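
For illustration, such an additional test might look roughly like this (a sketch; test_refgenie_large_asset is a hypothetical name, and the target path must match the new rule's declared output, ideally after switching it to an asset only slightly above the 5 GB threshold):

@skip_if_not_modified
def test_refgenie_large_asset():
    try:
        shutil.copytree("bio/refgenie/test/genome_folder", "/tmp/genome_folder")
    except FileExistsError:
        # no worries, the directory is already there
        pass
    os.environ["REFGENIE"] = "/tmp/genome_folder/genome_config.yaml"
    # request the new rule's output explicitly so the large-asset rule actually runs
    run(
        "bio/refgenie",
        ["snakemake", "--cores", "1", "--use-conda", "-F", "refs/star_index/hg38/star_index"],
    )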

except RefgenconfError:
    # If read lock timeout, attempt to skip the read lock
    rgc = refgenconf.RefGenConf(
        conf_path, writable=True, skip_read_lock=True, genome_exact=False
    )
Contributor


I couldn't really find out what the exact implications of skip_read_lock=True are, but it seems dangerous to me to use. Have you also tried increasing wait_max= as an alternative?

Contributor Author


I didn't attempt to, but I suspect that this might not be a great choice either. If someone is downloading an asset over a slow connection, even raising wait_max from its default of 60 to 600 might not be enough and would still result in a hard-to-diagnose timeout error.

I'm not sure whether this was also some sort of conflict with Snakemake's own locking system. If we rely on that to protect other files, then the wrapper either produces the output file, or the rule fails with a RefgenconfError and recommends setting the skip_read_lock=True param to try to fix the issue.

Contributor


From what I gathered by poking around a little, I think that the lock only happens while something is being written to the conf file. So I would think that this lock is not in place for the whole duration of the download and that wait_max= should already help. But the documentation on this is not very clear and I didn't immediately find the mechanism in the code, so I might be misunderstanding this lock.

Could you try wait_max= in your use case and test whether this actually helps?

@mergify mergify bot mentioned this pull request Apr 17, 2023
Contributor

github-actions bot commented Nov 1, 2023

This PR was marked as stale because it has been open for 6 months with no activity.

@github-actions github-actions bot added the Stale label Nov 1, 2023
Collaborator

fgvieira commented Feb 7, 2025

@Austin-s-h can you fix the conflicts?

Contributor

coderabbitai bot commented Feb 7, 2025

📝 Walkthrough

This pull request updates multiple workflow configurations and internal modules. The GitHub workflows now run on self‐hosted runners with revised labels and action versions, including a renamed PR job with an added input parameter and a new “Fetch master” step in the QC job. In addition, the refgenie Snakefile has been modified to improve asset processing (with a new rule for large assets) and the refgenie wrapper now includes error handling for configuration instantiation along with an updated pull method. Finally, the rsem expression wrapper now leverages pathlib for improved file path handling.

Changes

  • .github/workflows/conventional-prs.yml, .github/workflows/qc.yml — Renamed the PR workflow to "Lint PR Title", updated the runner configuration (from ubuntu-latest to self-hosted with specific labels), bumped the action version (v3.4.0 → v3.6.0), added a new input parameter (validateSingleCommit: true), and introduced a new "Fetch master" step in the QC linting job.
  • bio/refgenie/.../Snakefile, bio/refgenie/.../wrapper.py — Updated the refgenie module: modified the obtain_asset rule (added a comma, introduced log and wrapper parameters), added a new obtain_large_asset rule with a force_large parameter, and enhanced error handling during RefGenConf instantiation (retry with skip_read_lock and genome_exact settings) along with updating the pull method signature to support force_large.
  • bio/rsem/.../wrapper.py — Improved file path handling by importing pathlib.Path and redefining reference_prefix via a reference_path; reformatted the shell command for better readability while preserving existing error handling for output file checks.

Sequence Diagram(s)

sequenceDiagram
    participant Dev as Developer
    participant GH as GitHub
    participant WR as Workflow Runner
    participant SPA as Semantic Pull Action

    Dev->>GH: Create Pull Request
    GH->>WR: Trigger "Lint PR Title" workflow
    WR->>SPA: Execute semantic pull request check (v3.6.0) with validateSingleCommit=true
    SPA-->>WR: Return lint result
    WR-->>GH: Report job outcome
sequenceDiagram
    participant S as Script
    participant RC as RefGenConf Constructor
    participant EH as Exception Handler

    S->>RC: Attempt to instantiate RefGenConf
    alt Successful instantiation
      RC-->>S: Return RefGenConf instance
    else On RefgenconfError
      RC-->>EH: Throw RefgenconfError
      EH->>RC: Retry with skip_read_lock=True, genome_exact=False
      RC-->>S: Return configured RefGenConf instance
    end
    S->>RC: Call pull(genome, asset, tag, force, force_large)

Suggested reviewers

  • fgvieira

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🧹 Nitpick comments (1)
.github/workflows/qc.yml (1)

41-41: Consider upgrading the checkout action version in the linting job.
The linting job uses "actions/checkout@v1" (line 41), while the formatting job uses "actions/checkout@v3". For consistency, security improvements, and access to newer features, it is recommended to upgrade the checkout action in the linting job to v3.

Suggested change:

-      - uses: actions/checkout@v1
+      - uses: actions/checkout@v3
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a2fe8b9 and decd1a5.

⛔ Files ignored due to path filters (1)
  • CHANGELOG.md is excluded by !CHANGELOG.md
📒 Files selected for processing (5)
  • .github/workflows/conventional-prs.yml (2 hunks)
  • .github/workflows/qc.yml (2 hunks)
  • bio/refgenie/test/Snakefile (1 hunks)
  • bio/refgenie/wrapper.py (1 hunks)
  • bio/rsem/calculate-expression/wrapper.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)

**/*.py:
  • Do not try to improve formatting.
  • Do not suggest type annotations for functions that are defined inside of functions or methods.
  • Do not suggest type annotation of the self argument of methods.
  • Do not suggest type annotation of the cls argument of classmethods.
  • Do not suggest return type annotation if a function or method does not contain a return statement.
  (applied to bio/rsem/calculate-expression/wrapper.py and bio/refgenie/wrapper.py)

**/wrapper.py:
  • Do not complain about use of undefined variable called snakemake.
  (applied to bio/rsem/calculate-expression/wrapper.py and bio/refgenie/wrapper.py)

🪛 Ruff (0.8.2)

bio/rsem/calculate-expression/wrapper.py
  • lines 66, 69, 70, 71: Undefined name snakemake (F821)

bio/refgenie/wrapper.py
  • lines 10, 11, 12, 27: Undefined name snakemake (F821)

🪛 actionlint (1.7.4)

.github/workflows/conventional-prs.yml
  • line 12: label "v1.0.2" is unknown. Available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". If it is a custom label for a self-hosted runner, set the list of labels in the actionlint.yaml config file. (runner-label)

.github/workflows/qc.yml
  • lines 8 and 39: label "python-3.10" is unknown (same list of available labels as above). If it is a custom label for a self-hosted runner, set the list of labels in the actionlint.yaml config file. (runner-label)

🔇 Additional comments (9)
bio/refgenie/test/Snakefile (1)

1-13: LGTM! Good addition of logging.

The updates to the obtain_asset rule improve observability by adding proper logging.

bio/refgenie/wrapper.py (2)

26-28: LGTM! Good implementation of force_large parameter.

The force_large parameter is correctly implemented with a safe default of False, allowing users to explicitly opt-in for large asset downloads.
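
For context, the opt-in presumably looks something like the following in the wrapper (a sketch; the actual parameter and force handling in wrapper.py may differ):

# read the optional param with a safe default, then pass it through to pull()
force_large = snakemake.params.get("force_large", False)

rgc.pull(
    genome=snakemake.params.genome,
    asset=snakemake.params.asset,
    tag=snakemake.params.tag,
    force=None,  # assumed default; the wrapper may handle force differently
    force_large=force_large,
)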


16-24: Consider increasing wait_max instead of skip_read_lock.

Based on previous discussions, using skip_read_lock=True might be risky. While it solves the immediate issue, consider trying to increase wait_max first.

Try this alternative approach:

     rgc = refgenconf.RefGenConf(
-        conf_path, writable=True, skip_read_lock=True, genome_exact=False
+        conf_path, writable=True, wait_max=600, genome_exact=False
     )
bio/rsem/calculate-expression/wrapper.py (1)

73-80: LGTM! Improved command readability.

The shell command formatting changes improve readability by placing each argument on a new line.

.github/workflows/conventional-prs.yml (2)

14-18: LGTM! Good updates to action configuration.

The action version update and addition of validateSingleCommit parameter improve PR validation.


12-12: Verify self-hosted runner label.

The label v1.0.2 is not a standard GitHub Actions runner label. Please verify that this label is correctly configured in your self-hosted runner setup.

Check your self-hosted runner configuration to ensure the label exists.


.github/workflows/qc.yml (3)

8-8: Runner label "python-3.10" might not be recognized.
The workflow now specifies the runner as "[self-hosted, Linux, python-3.10]". However, static analysis reports (actionlint) indicate that "python-3.10" is not a recognized label and may require configuration in your actionlint.yaml or adjustment to a supported label. Please verify that "python-3.10" is a valid custom label in your runner setup or consider using one of the available labels.



39-39: Runner label "python-3.10" in linting job needs verification.
Similar to the formatting job, the linting job now uses "[self-hosted, Linux, python-3.10]". Please ensure that this custom label is correctly registered in your runner configuration or in the actionlint configuration.



57-60: Addition of "Fetch master" step is a positive enhancement.
The new step running git fetch origin master helps ensure that the latest changes from the master branch are available before linting, which should improve the accuracy of lint checks.

Comment on lines +15 to 26
rule obtain_large_asset:
    output:
        star_index=directory("refs/star_index/hg38/star_index"),
    params:
        genome="hg38",
        asset="star_index",
        tag="default",
        force_large=True,
    log:
        "logs/refgenie/obtain_large_asset.log",
    wrapper:
        "master/bio/refgenie"
Contributor


🛠️ Refactor suggestion

Consider making force_large configurable.

Based on previous discussions, it would be better to make force_large configurable rather than hardcoding it to True. This allows users to make an informed decision about downloading large assets.

Apply this diff to make it configurable:

     params:
         genome="hg38",
         asset="star_index",
         tag="default",
-        force_large=True,
+        force_large=False,  # Default to False, let users explicitly enable it

Comment on lines +64 to +67
# BUG input_string is 'r' given the input but it should be the reference base path?
# subprocess.CalledProcessError: Command 'set -euo pipefail; rsem-calculate-expression --num-threads 24 --estimate-rspd --calc-ci --strandedness reverse --time --paired-end --alignments results/star/D-1/Aligned.toTranscriptome.out.bam r results/rsem/D-1/D-1 > logs/rsem/calculate_expression/D-1.log 2>&1' returned non-zero exit status 255.
reference_path = Path(snakemake.input.reference[0])
reference_prefix = str(reference_path.parents[0]/reference_path.stem)
Contributor


💡 Verification agent

🧩 Analysis chain

Verify the input_string bug fix.

The comment indicates a bug where input_string is incorrectly set to 'r'. Please verify if this is still an issue with the current implementation.

Run this script to check the input_string value:


🏁 Script executed:

#!/bin/bash
# Search for places where input_string is set
rg -A 5 "input_string = " 

Length of output: 1648


Investigation on the input_string assignment issue

The search shows that in bio/rsem/calculate-expression/wrapper.py we still have an assignment of input_string using the variable bam (and later based on fq_one), while a computed reference_prefix exists but is never used. This indicates that the command still ends up receiving (likely) the wrong parameter (i.e. “r”) in place of the reference base path as per the observed error.

• In bio/rsem/calculate-expression/wrapper.py the value of input_string remains based on bam and fastq input, rather than using the computed reference_prefix.
• The bug comment and the failed command output confirmed that “r” is passed where the reference should be.

It appears that the bug fix requested in the review comment has not yet been applied.
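
For illustration, the fix the agent is pointing at would presumably look something like this (a hypothetical sketch, not the current wrapper code; everything except reference_prefix is an assumption):

from pathlib import Path

# derive the reference prefix (directory + index basename) from the first reference file
reference_path = Path(snakemake.input.reference[0])
reference_prefix = str(reference_path.parents[0] / reference_path.stem)

# input_string would still be built from the BAM/FASTQ inputs as before, and the
# shell command would then interpolate reference_prefix rather than a stray single
# character, e.g. (schematically):
#   rsem-calculate-expression {extra} {input_string} {reference_prefix} {output_prefix}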

