Best_Practises
Workflow and tool files should follow this path convention (shown here for a workflow):
workflows/<workflow_name>/<workflow_version>/<workflow_name>-<workflow_version>.cwl
The same convention should apply to schemas and expressions. Versioning every file this way reduces the amount of 'unnecessary workflow changes', which is particularly important for production workflows.
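As a rough illustration of the convention, the path of any component can be derived from its kind, name and version. The sketch below is in Python; the helper name versioned_cwl_path and its parameters are hypothetical and not part of the repository's tooling.

```python
from pathlib import PurePosixPath


def versioned_cwl_path(kind: str, name: str, version: str) -> str:
    """Build <kind>/<name>/<version>/<name>-<version>.cwl.

    `kind` is a top-level directory such as "workflows" or "expressions".
    This helper is hypothetical and only illustrates the naming convention.
    """
    return str(PurePosixPath(kind, name, version, f"{name}-{version}.cwl"))


print(versioned_cwl_path("workflows", "x", "1.0.0"))
# workflows/x/1.0.0/x-1.0.0.cwl
```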
Example:
Alice has been working on a workflow under workflows/x/1.0.0/x-1.0.0.cwl. Her colleague Bob decides to generalise one of the expressions used in Alice's workflow, get_secondary_file, so that it can take multiple secondary files and output an array instead. This seems like a good decision from Bob: if he sets it up correctly, Alice can keep using the expression the way she had set it up. However, the expression was placed under expressions/get_secondary_file.cwl, without any versioning folders or naming conventions. Although Bob's change has no effect on the behaviour of Alice's workflow, it has nonetheless changed the 'md5sum' of the workflow without her knowing. As a result, every workflow in a production project that uses this expression now has an updated md5sum tag.
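Why the checksum moves can be sketched in a few lines of Python. This assumes the md5sum is computed over the workflow together with the contents of every file it references; the function name packed_md5 and the file contents are made up for illustration.

```python
import hashlib
from typing import Sequence


def packed_md5(workflow_text: str, dependency_texts: Sequence[str]) -> str:
    """MD5 over a workflow plus the text of everything it references.

    Assumption: the checksum covers referenced files as well as the
    workflow file itself.
    """
    digest = hashlib.md5()
    digest.update(workflow_text.encode("utf-8"))
    for text in dependency_texts:
        digest.update(text.encode("utf-8"))
    return digest.hexdigest()


# Alice's workflow text never changes; only the unversioned expression does.
workflow = "steps: {get_secondary: {run: expressions/get_secondary_file.cwl}}"
expression_before = "expression: return one secondary file"
expression_after = "expression: return an array of secondary files"

before = packed_md5(workflow, [expression_before])
after = packed_md5(workflow, [expression_after])
print(before == after)
```

The comparison reports a mismatch even though Alice did not touch her own file, which is the situation described above.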
To avoid this, the repository should be set up so that Alice places the expression under expressions/get_secondary_file/1.0.0/get_secondary_file-1.0.0.cwl. Bob then creates a new expression under expressions/get_secondary_file/1.0.1/get_secondary_file-1.0.1.cwl with the generalisation changes. Migration from the old expression to the new one in a workflow can now be done explicitly.
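With versioned paths, adopting Bob's change becomes a visible one-line edit in Alice's workflow rather than a silent change underneath it. A minimal Python sketch, assuming the workflow refers to the expression by its relative path in a run field (the helper migrate_reference and the step name are hypothetical):

```python
def migrate_reference(workflow_text: str, name: str, old: str, new: str) -> str:
    """Point a workflow at a different version of an expression.

    Only the exact versioned path is replaced, so nothing changes unless
    the workflow author asks for it.
    """
    old_path = f"expressions/{name}/{old}/{name}-{old}.cwl"
    new_path = f"expressions/{name}/{new}/{name}-{new}.cwl"
    return workflow_text.replace(old_path, new_path)


# Hypothetical excerpt of workflows/x/1.0.0/x-1.0.0.cwl
workflow = (
    "steps:\n"
    "  get_secondary_file_step:\n"
    "    run: ../../../expressions/get_secondary_file/1.0.0/get_secondary_file-1.0.0.cwl\n"
)

migrated = migrate_reference(workflow, "get_secondary_file", "1.0.0", "1.0.1")
print(migrated)
```

Workflows that are not migrated keep pointing at 1.0.0 and are left untouched.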
Such a policy should apply not just to expressions but to schemas, tools AND workflows (because any workflow can be used as a subworkflow). For tools and workflows, the 'used-by' category shows which workflows use a given tool and thus what will be affected upstream by changing that tool's definition.
As shown above, bad things can happen when one person updates an expression, tool or schema without knowing how widely it is used. It is therefore paramount to make sure that a tool or workflow is ready to go for almost all use cases. Here are some tips I've discovered to make sure that a tool is 'ready-on-deployment'.
If a tool has a setting such as 'threads', one can use the 'runtime.cores' attribute to default it to the number of cores allocated to the tool instance, while still exposing it as an optional input for the user.
    # Set up the function get_num_threads
    requirements:
      InlineJavascriptRequirement:
        expressionLib:
          - var get_num_threads = function(set_threads){
              /*
              Returns runtime.cores if inputs.threads is not set
              */
              if (set_threads !== null){
                return set_threads;
              } else {
                return runtime.cores;
              }
            }
    ...
    inputs:
      threads:
        label: threads
        doc: |
          Number of threads for the tool to use.
          Set automatically to the number of cores available at runtime if not specified
        type: int?
        inputBinding:
          # Call get_num_threads function
          prefix: "--threads"
          valueFrom: "$(get_num_threads(self))"
Rather than using "ShellCommandRequirement" to hack together some ugly arguments, wrap the command in a script defined in an InitialWorkDirRequirement listing and call it with eval, so that every input's value and prefix can still be set in the inputs section. Make sure to use the '"\${@}"' syntax, which ensures all of your input parameters are still quoted as expected (the backslash keeps CWL from evaluating ${@} as an expression).
See the example of sambamba sort below:
    requirements:
      InlineJavascriptRequirement:
        expressionLib:
          - var boolean_to_string = function(input_obj){
              /*
              Returns "false" if null else returns .toString of object
              */
              if (input_obj === null){
                return "false";
              } else {
                return input_obj.toString();
              }
            }
      InitialWorkDirRequirement:
        listing:
          - entryname: run-sambamba-sort.sh
            entry: |
              #!/usr/bin/env bash
              # This tool runs sambamba sort and then
              # indexes if output-format is set to "bam" and
              # we're not sorting output by name
              # Set to fail script on non-zero exitcode of subcommand
              set -euo pipefail
              # Run sambamba
              eval sambamba sort '"\${@}"'
              # Index only if inputs.output_filename extension is bam
              # and inputs.sort_by_name is false
              if [[ "$(boolean_to_string(inputs.sort_by_name))" == "false" && "$(inputs.output_filename.nameext)" == ".bam" ]]; then
                sambamba index "$(inputs.output_filename)"
              fi
    ...
    baseCommand: [ "bash", "run-sambamba-sort.sh" ]
    inputs:
      ...
    outputs:
      output_bam:
        label: output bam file
        doc: |
          The output bam file
        # Must be optional since we could be outputting a different file format
        type: File?
        secondaryFiles:
          # Present only when inputs.sort_by_name is not true.
          - pattern: ".bai"
            required: false
    successCodes:
      - 0