add disk_mb resource to config for tools that consume a large amount of scratch space at runtime #149
A few of our modules (the STAR + bam2fastq combination, GRIDSS, battenberg) create fairly large temporary files that eat up a lot of scratch space. To help reduce the risk of a pipeline filling up all of the scratch space, we should give users a way to throttle their runs so that new jobs are not started if there is not enough scratch space theoretically available. This can be done per rule, so Snakemake knows at all times the maximum amount of scratch space the pipeline is theoretically using. Given the growing pressure on scratch space, I think we need to implement this for large projects such as GAMBL.
Is this different from setting e.g. […]?
I think so. The idea here is that the user is no longer required/expected to know what the data footprint of one bam-equivalent would be for each pipeline. Instead, the user would specify their resources as available disk_mb. For example, if I knew I had only 2 TB to work with in scratch, I could set my resource limit to 1.5 TB at most. This would prevent any new jobs from launching if they could exceed that. Once jobs complete, assuming they clean up after themselves, new jobs would be submitted that could consume the scratch space that had been freed up.
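Concretely, something like this, just as a sketch (the rule name, paths, wrapper script, and the 200 GB estimate are placeholders, not actual module code); Snakemake's standard resource mechanism already supports arbitrary per-rule resources such as `disk_mb`:

```python
# Hypothetical rule sketch: declare an estimate of the scratch space each job uses.
rule _gridss_call:
    input:
        bam = "data/{sample}.bam",
    output:
        vcf = "results/{sample}.gridss.vcf",
    resources:
        disk_mb = 200000  # placeholder estimate of temporary files written to scratch
    shell:
        "run_gridss.sh {input.bam} {output.vcf}"  # run_gridss.sh is a made-up wrapper
```

Launching the workflow with `snakemake --resources disk_mb=1500000 ...` (a 1.5 TB budget) then stops Snakemake from starting any job whose declared `disk_mb`, added to that of the jobs already running, would exceed the budget. One caveat: Snakemake releases the declared `disk_mb` as soon as a job finishes, so the accounting only stays honest if rules actually clean up (or `temp()`-flag) their scratch files.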
That makes sense. I think the only issue then is that the `disk_mb` value can vary pretty widely. For GRIDSS on fresh frozen samples the temporary footprint is around 20 GB; on FFPE it's sometimes 150-200 GB. But picking an intermediate/conservative value and putting some comments in the default config to guide users on when/how to change it would address that.
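A sketch of that idea (all names and numbers here are made up for illustration; in the real module `CFG` would come from the default YAML config, which is also where the guiding comment would live):

```python
# Conservative placeholder default; a comment in the default config would tell
# users to raise it (e.g. towards 200 GB) for FFPE samples.
CFG = {"resources": {"gridss_disk_mb": 50000}}

rule _gridss_call:
    input:
        bam = "data/{sample}.bam",
    output:
        vcf = "results/{sample}.gridss.vcf",
    resources:
        disk_mb = CFG["resources"]["gridss_disk_mb"]
    shell:
        "run_gridss.sh {input.bam} {output.vcf}"  # made-up wrapper
```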
I agree with that being an issue. I wonder if we could configure a few values for this that would be selected at runtime depending on the sample type? The same idea would apply for running battenberg on a genome vs. an exome.
Ah yes, a `switch_on_wildcard` could cover that.
Might it be more useful if we had FF/FFPE status as a wildcard, though? Or is there an alternative we can use that doesn't require a wildcard?
Using `switch_on_wildcard` to change the value based on FF/FFPE would require it to be a wildcard. It could also be driven by a column in the samples table, but that column would then have to exist in every samples table. I was thinking we'd just use `switch_on_wildcard` to change the value based on seq_type, and provide guidance to the user on increasing those values for FFPE samples. We also need to think about how this plays with the resource unpacking function we discussed with Bruno.
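Roughly what that could look like, sketch only: this assumes `op.switch_on_wildcard` takes the wildcard name plus a dict keyed by the wildcard's possible values, and that the callable it returns is accepted under `resources` the same way it is under `params`; the keys, numbers, paths, and wrapper script are placeholders.

```python
import oncopipe as op

rule _battenberg_run:
    input:
        bam = "data/{seq_type}/{sample}.bam",
    output:
        txt = "results/{seq_type}/{sample}.battenberg.txt",
    resources:
        # Pick the scratch estimate based on the seq_type wildcard; FFPE samples
        # would still need these placeholder values bumped manually.
        disk_mb = op.switch_on_wildcard("seq_type", {
            "genome": 150000,
            "capture": 20000,
        })
    shell:
        "run_battenberg.sh {input.bam} {output.txt}"  # made-up wrapper
```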
This is probably addressed by now since we have the gridss module working? Unsure whether this can be closed or whether there is still something left to address.