-
Notifications
You must be signed in to change notification settings - Fork 188
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Allow concurrent and large refgenie downloads #1172
base: master
Are you sure you want to change the base?
Conversation
feat: Lofreq indelqual wrapper (snakemake#1166)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this!
To get your new rule tested, have a look at the existing refgenie
test (in test.py
):
Lines 4322 to 4330 in 0d2c92a
@skip_if_not_modified | |
def test_refgenie(): | |
try: | |
shutil.copytree("bio/refgenie/test/genome_folder", "/tmp/genome_folder") | |
except FileExistsError: | |
# no worries, the directory is already there | |
pass | |
os.environ["REFGENIE"] = "/tmp/genome_folder/genome_config.yaml" | |
run("bio/refgenie", ["snakemake", "--cores", "1", "--use-conda", "-F"]) |
You would need to include another test that requires your new rule's output. However, before doing so, it might make sense to change this to an example that is only slighlty over the threshold of 5GB, because otherwise the test would do an excessive download (for example the 25GB you mentioned) and fill up the GitHub Action VM's disk space every time it is run (even though it is only run if anything changes in this wrapper).
except RefgenconfError: | ||
# If read lock timeout, attempt to skip the read lock | ||
rgc = refgenconf.RefGenConf( | ||
conf_path, writable=True, skip_read_lock=True, genome_exact=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't really find out what the exact implications of skip_read_lock=TRUE
are, but it seems dangerous to use, to me. Have you also tried increasing wait_max=
as an alternative?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't attempt to, but I suspect that this might not be a great choice either. If someone is downloading an asset over a slow connection, even setting wait_max from its default of 60 to 600 might not make a difference and result in a hard-to-diagnose timeout error.
I'm not sure if this was some sort of conflict with the snakemake locking system as well. If we rely on that to protect other files, then the result of the wrapper is it either produces the output file, or the rule fails with a RefgenconfError error and recommends setting the skip_read_lock=TRUE param to try to fix the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From what I gathered by poking around a little, I think that the lock only happens while something is written to the conf file. So I would think that this lock is not in place the whole time you are doing the download and that the wait_max=
should already help. But the documentation on this is not very clear and I didn't immediately find the mechanism in the code, so I might be misunderstanding this lock.
Do you have the possibility to try wait_max=
in your use case and test whether this actually helps?
sync upstream
chore: sync fork
chore: release 1.26.0
This PR was marked as stale because it has been open for 6 months with no activity. |
@Austin-s-h can you fix the conflicts? |
📝 WalkthroughWalkthroughThis pull request updates multiple workflow configurations and internal modules. The GitHub workflows now run on self‐hosted runners with revised labels and action versions, including a renamed PR job with an added input parameter and a new “Fetch master” step in the QC job. In addition, the refgenie Snakefile has been modified to improve asset processing (with a new rule for large assets) and the refgenie wrapper now includes error handling for configuration instantiation along with an updated pull method. Finally, the rsem expression wrapper now leverages pathlib for improved file path handling. Changes
Sequence Diagram(s)sequenceDiagram
participant Dev as Developer
participant GH as GitHub
participant WR as Workflow Runner
participant SPA as Semantic Pull Action
Dev->>GH: Create Pull Request
GH->>WR: Trigger "Lint PR Title" workflow
WR->>SPA: Execute semantic pull request check (v3.6.0) with validateSingleCommit=true
SPA-->>WR: Return lint result
WR-->>GH: Report job outcome
sequenceDiagram
participant S as Script
participant RC as RefGenConf Constructor
participant EH as Exception Handler
S->>RC: Attempt to instantiate RefGenConf
alt Successful instantiation
RC-->>S: Return RefGenConf instance
else On RefgenconfError
RC-->>EH: Throw RefgenconfError
EH->>RC: Retry with skip_read_lock=True, genome_exact=False
RC-->>S: Return configured RefGenConf instance
end
S->>RC: Call pull(genome, asset, tag, force, force_large)
Suggested reviewers
✨ Finishing Touches
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (1)
.github/workflows/qc.yml (1)
41-41
: Consider upgrading the checkout action version in the linting job.
The linting job uses "actions/checkout@v1" (line 41), while the formatting job uses "actions/checkout@v3". For consistency, security improvements, and access to newer features, it is recommended to upgrade the checkout action in the linting job to v3.Suggested change:
- - uses: actions/checkout@v1 + - uses: actions/checkout@v3
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (1)
CHANGELOG.md
is excluded by!CHANGELOG.md
📒 Files selected for processing (5)
.github/workflows/conventional-prs.yml
(2 hunks).github/workflows/qc.yml
(2 hunks)bio/refgenie/test/Snakefile
(1 hunks)bio/refgenie/wrapper.py
(1 hunks)bio/rsem/calculate-expression/wrapper.py
(2 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
`**/*.py`: Do not try to improve formatting. Do not suggest ...
**/*.py
: Do not try to improve formatting.
Do not suggest type annotations for functions that are defined inside of functions or methods.
Do not suggest type annotation of theself
argument of methods.
Do not suggest type annotation of thecls
argument of classmethods.
Do not suggest return type annotation if a function or method does not contain areturn
statement.
bio/rsem/calculate-expression/wrapper.py
bio/refgenie/wrapper.py
`**/wrapper.py`: Do not complain about use of undefined vari...
**/wrapper.py
: Do not complain about use of undefined variable calledsnakemake
.
bio/rsem/calculate-expression/wrapper.py
bio/refgenie/wrapper.py
🪛 Ruff (0.8.2)
bio/rsem/calculate-expression/wrapper.py
66-66: Undefined name snakemake
(F821)
69-69: Undefined name snakemake
(F821)
70-70: Undefined name snakemake
(F821)
71-71: Undefined name snakemake
(F821)
bio/refgenie/wrapper.py
10-10: Undefined name snakemake
(F821)
11-11: Undefined name snakemake
(F821)
12-12: Undefined name snakemake
(F821)
27-27: Undefined name snakemake
(F821)
🪛 actionlint (1.7.4)
.github/workflows/conventional-prs.yml
12-12: label "v1.0.2" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
.github/workflows/qc.yml
8-8: label "python-3.10" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
39-39: label "python-3.10" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
🔇 Additional comments (9)
bio/refgenie/test/Snakefile (1)
1-13
: LGTM! Good addition of logging.The updates to the
obtain_asset
rule improve observability by adding proper logging.bio/refgenie/wrapper.py (2)
26-28
: LGTM! Good implementation of force_large parameter.The
force_large
parameter is correctly implemented with a safe default ofFalse
, allowing users to explicitly opt-in for large asset downloads.🧰 Tools
🪛 Ruff (0.8.2)
27-27: Undefined name
snakemake
(F821)
16-24
: Consider increasing wait_max instead of skip_read_lock.Based on previous discussions, using
skip_read_lock=True
might be risky. While it solves the immediate issue, consider trying to increasewait_max
first.Try this alternative approach:
rgc = refgenconf.RefGenConf( - conf_path, writable=True, skip_read_lock=True, genome_exact=False + conf_path, writable=True, wait_max=600, genome_exact=False )bio/rsem/calculate-expression/wrapper.py (1)
73-80
: LGTM! Improved command readability.The shell command formatting changes improve readability by placing each argument on a new line.
.github/workflows/conventional-prs.yml (2)
14-18
: LGTM! Good updates to action configuration.The action version update and addition of
validateSingleCommit
parameter improve PR validation.
12-12
: Verify self-hosted runner label.The label
v1.0.2
is not a standard GitHub Actions runner label. Please verify that this label is correctly configured in your self-hosted runner setup.Check your self-hosted runner configuration to ensure the label exists:
🧰 Tools
🪛 actionlint (1.7.4)
12-12: label "v1.0.2" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
.github/workflows/qc.yml (3)
8-8
: Runner label "python-3.10" might not be recognized.
The workflow now specifies the runner as "[self-hosted, Linux, python-3.10]". However, static analysis reports (actionlint) indicate that "python-3.10" is not a recognized label and may require configuration in your actionlint.yaml or adjustment to a supported label. Please verify that "python-3.10" is a valid custom label in your runner setup or consider using one of the available labels.🧰 Tools
🪛 actionlint (1.7.4)
8-8: label "python-3.10" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
39-39
: Runner label "python-3.10" in linting job needs verification.
Similar to the formatting job, the linting job now uses "[self-hosted, Linux, python-3.10]". Please ensure that this custom label is correctly registered in your runner configuration or in the actionlint configuration.🧰 Tools
🪛 actionlint (1.7.4)
39-39: label "python-3.10" is unknown. available labels are "windows-latest", "windows-latest-8-cores", "windows-2022", "windows-2019", "ubuntu-latest", "ubuntu-latest-4-cores", "ubuntu-latest-8-cores", "ubuntu-latest-16-cores", "ubuntu-24.04", "ubuntu-22.04", "ubuntu-20.04", "macos-latest", "macos-latest-xl", "macos-latest-xlarge", "macos-latest-large", "macos-15-xlarge", "macos-15-large", "macos-15", "macos-14-xl", "macos-14-xlarge", "macos-14-large", "macos-14", "macos-13-xl", "macos-13-xlarge", "macos-13-large", "macos-13", "macos-12-xl", "macos-12-xlarge", "macos-12-large", "macos-12", "self-hosted", "x64", "arm", "arm64", "linux", "macos", "windows". if it is a custom label for self-hosted runner, set list of labels in actionlint.yaml config file
(runner-label)
57-60
: Addition of "Fetch master" step is a positive enhancement.
The new step runninggit fetch origin master
helps ensure that the latest changes from the master branch are available before linting, which should improve the accuracy of lint checks.
rule obtain_large_asset: | ||
output: | ||
star_index=directory("refs/star_index/hg38/star_index"), | ||
params: | ||
genome="hg38", | ||
asset="star_index", | ||
tag="default", | ||
force_large=True, | ||
log: | ||
"logs/refgenie/obtain_large_asset.log", | ||
wrapper: | ||
"master/bio/refgenie" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Consider making force_large configurable.
Based on previous discussions, it would be better to make force_large
configurable rather than hardcoding it to True
. This allows users to make an informed decision about downloading large assets.
Apply this diff to make it configurable:
params:
genome="hg38",
asset="star_index",
tag="default",
- force_large=True,
+ force_large=False, # Default to False, let users explicitly enable it
# BUG input_string is 'r' given the input but is should be the reference base path? | ||
# subprocess.CalledProcessError: Command 'set -euo pipefail; rsem-calculate-expression --num-threads 24 --estimate-rspd --calc-ci --strandedness reverse --time --paired-end --alignments results/star/D-1/Aligned.toTranscriptome.out.bam r results/rsem/D-1/D-1 > logs/rsem/calculate_expression/D-1.log 2>&1' returned non-zero exit status 255. | ||
reference_path = Path(snakemake.input.reference[0]) | ||
reference_prefix = str(reference_path.parents[0]/reference_path.stem) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Verify the input_string bug fix.
The comment indicates a bug where input_string
is incorrectly set to 'r'. Please verify if this is still an issue with the current implementation.
Run this script to check the input_string value:
🏁 Script executed:
#!/bin/bash
# Search for places where input_string is set
rg -A 5 "input_string = "
Length of output: 1648
Investigation on the input_string assignment issue
The search shows that in bio/rsem/calculate-expression/wrapper.py we still have an assignment of input_string using the variable bam (and later based on fq_one), while a computed reference_prefix exists but is never used. This indicates that the command still ends up receiving (likely) the wrong parameter (i.e. “r”) in place of the reference base path as per the observed error.
• In bio/rsem/calculate-expression/wrapper.py the value of input_string remains based on bam and fastq input, rather than using the computed reference_prefix.
• The bug comment and the failed command output confirmed that “r” is passed where the reference should be.
It appears that the bug fix requested in the review comment has not yet been applied.
🧰 Tools
🪛 Ruff (0.8.2)
66-66: Undefined name snakemake
(F821)
Description
After running into a few instances of the refgenie wrapper downloading individual assets concurrently, I was met with a lock-timeout error when attempting to pull two large assets (I'm looking at you hg38_cdna/salmon_index at ~25Gb). The default timeout is 60 seconds, but I wanted to attempt to handle this error.
So, I slightly modified the refgenie wrapper to attempt to handle the
RefgenconfError
generated when the lock is unable to be obtained to skip the lock requirement. This may or may not be desirable behavior across all pipelines, but it did resolve the issues with mine as well as pass the testing requirements.I added a rule that mimics obtaining a large asset, but I am not familiar enough yet with the wrapper system to know if simply adding a rule means that it is tested.
Additionally, the inclusion of
force_large=True
was necessary to download assets larger than thelarge
threshold of 5GbQC
For all wrappers added by this PR,
input:
andoutput:
file paths in the resulting rule can be changed arbitrarily,threads: x
statement withx
being a reasonable default,map_reads
for a step that maps reads),environment.yaml
specifications follow the respective best practices,input:
oroutput:
),Snakefile
s and their entries are explained via comments (input:
/output:
/params:
etc.),stderr
and/orstdout
are logged correctly (log:
), depending on the wrapped tool,tempfile.gettempdir()
points to (see here; this also means that using any Pythontempfile
default behavior works),meta.yaml
contains a link to the documentation of the respective tool or command,Snakefile
s pass the linting (snakemake --lint
),Snakefile
s are formatted with snakefmt,Summary by CodeRabbit