Best Practices

Structuring your workflow and tool collections
Options Etiquette
- Auto-setting the memory and threads parameters.
- When pre and post processing steps are sometimes required within a tool

Structuring your workflow and tool collections

Your workflow and tool files should follow this path pattern (shown here for a workflow):

workflows/<workflow_name>/<workflow_version>/<workflow_name>-<workflow_version>.cwl

The same layout applies to schemas and expressions. Versioning every file this way avoids unintended changes to existing workflows, which is particularly important for production workflows.

Example:

Alice has been working on a workflow under workflows/x/1.0.0/x-1.0.0.cwl. Her colleague Bob decides to generalise one of the expressions used in Alice's workflow, get_secondary_file, so that it can also take multiple secondary files and output an array instead. This seems like a good decision from Bob: if he sets it up correctly, Alice should be able to keep using the expression the way she had set it up. However, the expression was placed under expressions/get_secondary_file.cwl without any versioning folders or naming conventions, so Bob edits it in place. Although this has no effect on the behaviour of Alice's workflow, it has nonetheless changed the 'md5sum' of her workflow without her knowing. As a result, every workflow in a production project that uses this expression now has a different md5sum.

Instead, the repository should be set up so that Alice places the expression under expressions/get_secondary_file/1.0.0/get_secondary_file-1.0.0.cwl.
Bob then creates a new expression under expressions/get_secondary_file/1.0.1/get_secondary_file-1.0.1.cwl with the generalisation changes. Migration from the old expression to the new expression in a workflow can then be done explicitly.

This policy should be applied not just to expressions but also to schemas, tools AND workflows (because any workflow can be used as a subworkflow). For tools and workflows, the 'used-by' category shows which workflows use a given tool, and therefore which workflows will be affected by a change to that tool's definition.
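
The explicit migration described above can be sketched as a fragment of Alice's workflow. This is a minimal illustration rather than the repository's actual workflow: the step name and the port names (input_file, output_file, bam_file) are hypothetical, and only the expression path comes from the example above.

```yaml
# Fragment of workflows/x/1.0.0/x-1.0.0.cwl
# (step and port names are hypothetical)
steps:
  get_secondary_file_step:
    # Pinned to 1.0.0. Bob's 1.0.1 lives in a sibling directory, so the
    # file referenced here is not modified when he adds his changes.
    run: ../../../expressions/get_secondary_file/1.0.0/get_secondary_file-1.0.0.cwl
    in:
      input_file: bam_file
    out:
      - output_file
```

Moving to Bob's version is then a deliberate one-line change to the run path, which Alice can make in a new version of her workflow whenever she is ready.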

Options Etiquette

As shown above, bad things can happen when one person updates an expression, tool or schema without knowing how widely it is used. It is therefore important to make sure that a tool or workflow covers almost all use cases before it is deployed. Here are some tips I've discovered for making a tool 'ready-on-deployment'.

Auto-setting the memory and threads parameters.

If a tool has a setting such as 'threads', one can use the 'runtime.cores' attribute to default it to the number of cores available to the tool instance, while still exposing it as an optional input for the user.

# Set up the function get_num_threads
requirements:
  InlineJavascriptRequirement:
    expressionLib:
      - var get_num_threads = function(set_threads){
            /*
            Returns runtime.cores if inputs.threads is not set
            */
            if (set_threads !== null){
              return set_threads;
            } else {
              return runtime.cores;
            }
        }
...

inputs:
  threads:
    label: threads
    doc: |
      Number of threads for the tool to use.
      Set automatically to the number of cores available at runtime if not specified
    type: int?
    inputBinding:
      # Call get_num_threads function
      prefix: "--threads"
      valueFrom: "$(get_num_threads(self))"
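
The heading above also mentions memory. Below is a minimal sketch of the same fallback for memory, assuming a hypothetical memory_limit input and a hypothetical --memory-limit flag that takes a value in mebibytes (runtime.ram is reported in mebibytes). The expression sits under arguments rather than in the input's own inputBinding, because some CWL runners skip an input's valueFrom when that input is null. Whether this matters depends on the platform you run on.

```yaml
requirements:
  InlineJavascriptRequirement: {}
  ResourceRequirement:
    # Assumption: reserve about 4 GiB for the job
    ramMin: 4096

inputs:
  memory_limit:
    label: memory limit
    doc: |
      Memory limit for the tool, in mebibytes.
      Set automatically to runtime.ram if not specified
    type: int?
    # No inputBinding here; the value is bound under arguments below

arguments:
  # Hypothetical flag name; substitute the flag your tool actually uses
  - prefix: "--memory-limit"
    valueFrom: |
      ${
        // Fall back to the RAM reserved for this job when the input is unset
        return (inputs.memory_limit !== null) ? inputs.memory_limit : runtime.ram;
      }
```

If the tool expects a different unit (for example gigabytes), the conversion can be done inside the same expression.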

When pre and post processing steps are sometimes required within a tool

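Rather than using "ShellCommandRequirement" to hack together some ugly arguments, write a small wrapper script in an InitialWorkDirRequirement listing and call the tool through eval, so that all of your inputs' values and prefixes can still be set in the inputs section. Make sure to use the '"\${@}"' syntax, which ensures all of your input parameters are still quoted as expected. The backslash stops CWL from treating ${...} as a JavaScript expression, so the script receives a literal "${@}".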

See the example of sambamba sort below:

requirements:
  InlineJavascriptRequirement:
    expressionLib:
      - var boolean_to_string = function(input_obj){
          /*
          Returns false if null else returns .toString of object
          */
          if (input_obj === null){
            return "false";
          } else {
            return input_obj.toString();
          }
        }
  InitialWorkDirRequirement:
    listing:
      - entryname: run-sambamba-sort.sh
        entry: |
          #!/usr/bin/env bash

          # This tool runs sambamba sort and then
          # indexes if output-format is set to "bam" and 
          # we're not sorting output by name

          # Set to fail script on non-zero exitcode of subcommand
          set -euo pipefail

          # Run sambamba
          eval sambamba sort '"\${@}"'

          # Index only if inputs.output_filename extension is bam
          # and inputs.sort_by_name is false
          if [[ "$(boolean_to_string(inputs.sort_by_name))" == "false" && "$(inputs.output_filename.nameext)" == ".bam" ]]; then
            sambamba index "$(inputs.output_filename)"
          fi

...

baseCommand: [ "bash", "run-sambamba-sort.sh" ]

inputs:
  ...

outputs:
  output_bam:
    label: output bam file
    doc: |
      The output bam file
    # Must be optional since we could be outputting a different file format
    type: File? 
    secondaryFiles:
      # The index exists only when inputs.sort_by_name is not true
      - pattern: ".bai"
        required: false  

successCodes:
 - 0
