Getting_Started

Alexis Lucattini edited this page Jul 5, 2021 · 29 revisions

Getting Started

Installation

This software is designed to be installed on your local workstation. ICA tokens are stored in user-level, read-only files inside the conda env. The repository itself may therefore live on a shared path without compromising any user's credentials. In general, collaborative work should be done through version control over a remote repository such as GitHub.

Install the latest release by heading to the releases page and downloading the latest zip file.

You will need the following prerequisites:

  • conda
  • jq
  • yq (optional; if installed, it must be v4)
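
Before running the installer, it may help to confirm which prerequisites are visible on your PATH. The following is a minimal sketch rather than part of cwl-ica: the helper names have_command and major_version are hypothetical, and it assumes that yq --version prints a dotted version number somewhere in its output.

```shell
#!/usr/bin/env bash
# Sketch only: report which prerequisites are visible on PATH.

# Succeeds if the named command can be found on PATH.
have_command() {
  command -v "$1" >/dev/null 2>&1
}

# Print the major version found in a version string,
# e.g. "4" from "yq (https://github.com/mikefarah/yq/) version v4.30.8".
major_version() {
  printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+' | head -n 1 | cut -d. -f1
}

for tool in conda jq; do
  if have_command "${tool}"; then
    echo "found: ${tool}"
  else
    echo "missing: ${tool}"
  fi
done

# yq is optional, but must be v4 if it is installed
if have_command yq; then
  yq_major="$(major_version "$(yq --version 2>&1)")"
  if [ "${yq_major}" = "4" ]; then
    echo "found: yq v4"
  else
    echo "found: yq, but its major version is '${yq_major}' (v4 is required)"
  fi
else
  echo "missing (optional): yq"
fi
```

The script only reports what it finds; it does not install or change anything.
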
# Unzip the zip file
unzip "release-${version}.zip"
# Change into the extracted directory
cd "release-${version}"
# Run the installation script, press '1' when prompted to update / create the conda environment
bash install.sh

The installation script will then create a conda environment called cwl-ica.

You will need to activate this environment with the following command:

conda activate cwl-ica

Configuration

You will also need to clone this repo to your computer.

If you are a member of the UMCCR GitHub organisation use one of the following options:

  1. Cloning the entire repo (simple)
# Substitute your own repository for the umccr one if you are working from a fork
git clone -b beta-release git@github.com:umccr/cwl-ica.git
  2. Using a sparse checkout (recommended)

Requires Git 2.25 or higher

A sparse checkout means that only the directories of interest are cloned.
No source files or GH Actions files are cloned.

git clone \
  --no-checkout \
  -b beta-release \
  git@github.com:umccr/cwl-ica.git
( 
  cd cwl-ica
  git sparse-checkout init --cone
  git sparse-checkout set \
    config/ \
    schemas/ \
    expressions/ \
    tools/ \
    workflows/
)
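
Because the sparse checkout needs Git 2.25 or higher, a quick version check can help you choose between the two options. This is a minimal sketch: git_version_at_least is a hypothetical helper, and it assumes the output of git --version contains a dotted version such as 2.39.2.

```shell
#!/usr/bin/env bash
# Sketch only: check that a "git --version" string is at least a given major.minor.

# Usage: git_version_at_least <major> <minor> "<output of git --version>"
git_version_at_least() {
  want_major="$1"
  want_minor="$2"
  # Pull the first "X.Y" out of e.g. "git version 2.39.2 (Apple Git-143)"
  found="$(printf '%s\n' "$3" | grep -oE '[0-9]+\.[0-9]+' | head -n 1)"
  [ -n "${found}" ] || return 1
  major="${found%%.*}"
  minor="${found#*.}"
  [ "${major}" -gt "${want_major}" ] ||
    { [ "${major}" -eq "${want_major}" ] && [ "${minor}" -ge "${want_minor}" ]; }
}

if command -v git >/dev/null 2>&1; then
  if git_version_at_least 2 25 "$(git --version)"; then
    echo "git is new enough for a sparse checkout"
  else
    echo "git is older than 2.25; clone the entire repo instead"
  fi
fi
```

After a sparse checkout, running git sparse-checkout list inside the clone prints the directories that were selected.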

You will now need to run the configuration command so that your conda environment knows where your local clone of cwl-ica repo is:

# Ensure your cwl-ica conda env has first been activated with "conda activate cwl-ica"
cwl-ica configure-repo \
  --repo-path /path/to/local-repository-clone

Great job! You've now configured your project.

You will need to reactivate your environment in order to complete the configuration with:

# Reactivate your environment with the following two commands
conda deactivate
conda activate cwl-ica
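
To confirm that the reactivation picked up your configuration, you could run a small check like the one below. It is a sketch under two assumptions: that the environment was activated by name (so conda sets CONDA_DEFAULT_ENV to cwl-ica), and that CWL_ICA_REPO_PATH, the variable used later on this page, is exported on activation once configure-repo has run. The function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: confirm the environment looks configured after reactivation.

# Succeeds if the cwl-ica conda environment is the active one.
# "conda activate <name>" sets CONDA_DEFAULT_ENV to the environment's name.
cwl_ica_env_is_active() {
  [ "${CONDA_DEFAULT_ENV:-}" = "cwl-ica" ]
}

# Succeeds if the given path is a directory that contains config/.
# Assumption: CWL_ICA_REPO_PATH is exported once configure-repo has been run.
repo_path_looks_valid() {
  [ -n "$1" ] && [ -d "$1/config" ]
}

if cwl_ica_env_is_active; then
  echo "cwl-ica environment is active"
else
  echo "cwl-ica environment is not active; run: conda activate cwl-ica"
fi

if repo_path_looks_valid "${CWL_ICA_REPO_PATH:-}"; then
  echo "repository path: ${CWL_ICA_REPO_PATH}"
else
  echo "CWL_ICA_REPO_PATH is unset or does not contain config/"
fi
```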

Add your username

Add your username to user.yaml. This will ensure that you're acknowledged for CWL files you create / maintain.
It also makes it much easier for future users to know who to contact when they need clarification on a CWL workflow / tool.

Add the --set-as-default flag to avoid having to pass the --username parameter later on when building your first tool. You will need to deactivate then reactivate your conda environment for this to take effect.

cwl-ica configure-user \
  --username "Firstname Lastname" \
  --email "<your-email-address>" \
  --set-as-default

Configure a tenant

First check out the list of registered tenants in the repo with:

cwl-ica list-tenants

Then run the cwl-ica configure-tenant command to create a mapping of tenant names and tenant ids. You can then define projects to be in given tenants through the --tenant-name option in cwl-ica project-init.

While this may only seem useful if your ICA organisation spans multiple tenants, it will future-proof your workflows.
For now, cwl-ica configure-tenant is a mandatory step before you initialise a project.

You can see all registered tenants through cwl-ica list-tenants.

Initialise a project

First check out the list of registered projects in the repo with:

cwl-ica list-projects

If you would like to add a project run the following command:

cwl-ica project-init \
  --project-id "xxxx-yyyy..." \
  --project-name "my-registered-project" \
  --access-token "<project-access-token>" \
  --tenant-name "<name-of-tenant>"

To determine the project id and project name, run ica projects list.

Setting the api-key access script

We take inspiration from the GIT_SSH environment variable with our own CWL_ICA_API_KEY_SH variable. This variable should point to an executable file (like a bash script) that uses an environment variable ${PROJECT_API_KEY_PATH} that is set to the project's project_api_key_name attribute.

Confusing?? Let's go through an example.

I set my CWL_ICA_API_KEY_SH variable to a file under ${CONDA_PREFIX}/etc/get_api_key.sh (a bash script with executable permissions) with the following contents.

#!/usr/bin/env bash
gpg \
  --decrypt \
  --passphrase-file "${HOME}/.gpg/umccr.txt" \
  "${HOME}/.password-store/ica/api-keys/${PROJECT_API_KEY_PATH}.gpg"

This allows me to use the pass binary to store and manage my api-keys for each project. Note that pass and Python's subprocess module don't work very well together, which is why this script is not simply pass /ica/api-keys/${PROJECT_API_KEY_PATH}. When any cwl-ica subcommand tries to access my api-key, I must first enter my gpg password; the token is then stored under ${CONDA_PREFIX}/etc/ica/tokens. If the token expires, this script is called again to refresh it.

This method has the following benefits:

  1. API keys last indefinitely but tokens do not (and should not). This way, one doesn't have to update tokens manually or worry about whether a token has expired.
  2. The security level is up to the user. One could simply keep the api-key in a file called api-key.txt and have the script print the contents of that file, or one could set up multi-factor authentication for access to the api-key.
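
As an illustration of the low-security end of that range, a plain-file version of the script could look like the sketch below. API_KEY_DIR, the one-file-per-project "<name>.txt" layout and the print_api_key function are all hypothetical; only PROJECT_API_KEY_PATH and CWL_ICA_API_KEY_SH come from cwl-ica.

```shell
#!/usr/bin/env bash
# Sketch only: the low-security variant, where each api-key lives in a plain-text file.
# API_KEY_DIR and the "<name>.txt" layout are hypothetical; cwl-ica does not define them.

API_KEY_DIR="${API_KEY_DIR:-${HOME}/.ica/api-keys}"

# Print the api-key for the project named by PROJECT_API_KEY_PATH.
print_api_key() {
  key_file="${API_KEY_DIR}/${PROJECT_API_KEY_PATH:-}.txt"
  if [ -z "${PROJECT_API_KEY_PATH:-}" ] || [ ! -r "${key_file}" ]; then
    echo "No readable api-key file for '${PROJECT_API_KEY_PATH:-}'" >&2
    return 1
  fi
  cat "${key_file}"
}

# cwl-ica is expected to set PROJECT_API_KEY_PATH before calling this script
if [ -n "${PROJECT_API_KEY_PATH:-}" ]; then
  print_api_key
fi
```

To use a script like this, make it executable with chmod +x and point CWL_ICA_API_KEY_SH at it. Anyone who can read the key files can use the keys, so restrict their permissions (for example with chmod 600).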

Initialising a category

Save yourself having to trawl through a plethora of workflows to find the one you're after.
You may assign a workflow to multiple categories. A category does NOT have to be registered before registering your tool or workflow.
Categories are registered on ICA, but a given category may span multiple projects.

Like tenants and projects, you can see existing categories with:

cwl-ica list-categories

To create your own, run:

cwl-ica category-init \
  --name "name of category" \
  --description "optional, can use a large text field instead"

Building your first tool

First we use the cwl-ica create-tool-from-template command to create a file that we can expand on to build our first tool.

This will automatically create an id, label and doc, along with the author metadata namespaces, for us to fill in.

The following command will create a tool under tools/tabix/0.2.6/tabix-0.2.6.cwl

cwl-ica create-tool-from-template \
  --tool-name tabix \
  --tool-version 0.2.6
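
To check that the template landed where expected, you could derive the path from the tool name and version. This sketch assumes that the tools/<name>/<version>/<name>-<version>.cwl layout of the tabix example applies to other tools too, and that CWL_ICA_REPO_PATH points at your clone; tool_cwl_path is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch only: derive where a templated tool is expected to land.
# Assumption: the layout of the tabix example generalises to other tools.

# Usage: tool_cwl_path <tool-name> <tool-version>
tool_cwl_path() {
  printf 'tools/%s/%s/%s-%s.cwl\n' "$1" "$2" "$1" "$2"
}

# Fall back to the current directory when CWL_ICA_REPO_PATH is not set
tool_file="${CWL_ICA_REPO_PATH:-.}/$(tool_cwl_path tabix 0.2.6)"
if [ -f "${tool_file}" ]; then
  echo "template created: ${tool_file}"
else
  echo "not found yet: ${tool_file}"
fi
```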

Fill out the rest of the tool and then validate it. You should also test the tool locally (if possible).

cwl-ica tool-validate can be the most laborious part of the process but for good reason. No one else will use your tool if it's not documented properly.

Check out contributions or our examples 🚧 section for more help on mastering your first cwl tool.

Registering your tool

Now that you've validated your tool, it's time to "register" it. This will:

  1. Create an entry in tool.yaml for this cwl tool.
  2. Create a workflow ID, and workflow version for the tool on ICA.
  3. Keep the tool up-to-date on ICA.
  4. Create a user-friendly markdown document when pushed to the main branch.

You can register your tool with cwl-ica tool-init.
If you decide later on that an already initialised tool would be useful in another project, use the subcommand add-tool-to-project to add the tool to that project.

Building your first workflow

If you've made it to this stage, congratulations! You've built a suite of tools and are ready to stitch them together as a workflow.

Initialise the workflow through cwl-ica create-workflow-from-template. You will also need to 'validate' your workflow with cwl-ica workflow-validate.
This may be pretty tedious, and is easier if you've first 'validated' all of your tools.

Once you've successfully run cwl-ica workflow-validate, it's time to register your workflow.

Registering your workflow

As with registering a tool, registering your workflow will keep it in sync on ICA and create a user-friendly markdown document for the workflow when pushed to the main branch.

You can register your workflow with cwl-ica workflow-init.
Likewise, if you have an existing workflow with a new project, you may connect this workflow to your new ICA project with cwl-ica add-workflow-to-project.

Syncing your tools / workflows with ICA.

For non-production projects, tools and workflows will sync with the registered workflow id and workflow version on each push to the main branch. You may also sync a tool or workflow manually with the following commands:

cwl-ica tool-sync or cwl-ica workflow-sync.

For production tools / workflows, you will first need to push to the main branch, which will create a new version suffix based on the git commit of the merge.

Since all pushes to the main branch must go through a pull request, it is recommended that the workflow has first been fully tested in a non-production project.

Committing files to the repo

YAML files are precious creatures; make sure that you don't commit an invalid YAML file under config/ to the git repo.
One of the best ways to avoid this is to run a 'pre-commit' git hook.

Place the following lines under ${CWL_ICA_REPO_PATH}/.git/hooks/pre-commit and make the file executable. Your config files will then be validated automatically each time you run 'git commit'.

#!/usr/bin/env bash

: '
Ensure that the configuration files are legit before committing
'

# Exit on errors, unset variables and failures anywhere in a pipeline
set -euo pipefail

# Run validate yamls subcommand
conda run \
  --name 'cwl-ica' \
  cwl-ica validate-config-yamls
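
Git ignores hooks that are not executable, so the hook needs execute permission once it is in place. The following sketch copies a hook script into a clone and sets that permission. install_pre_commit_hook and the pre-commit.sh file name are hypothetical, and it assumes a regular clone where .git is a directory.

```shell
#!/usr/bin/env bash
# Sketch only: copy a hook script into a clone and make it executable.

# Usage: install_pre_commit_hook <hook-script> <repo-path>
install_pre_commit_hook() {
  hook_src="$1"
  hooks_dir="$2/.git/hooks"
  if [ ! -d "${hooks_dir}" ]; then
    echo "No .git/hooks directory under '$2'" >&2
    return 1
  fi
  # Overwrites any existing pre-commit hook in the clone
  cp "${hook_src}" "${hooks_dir}/pre-commit" &&
    chmod +x "${hooks_dir}/pre-commit"
}

# Example (pre-commit.sh is a hypothetical file holding the hook shown above):
#   install_pre_commit_hook pre-commit.sh "${CWL_ICA_REPO_PATH}"
```

Hooks live inside .git and are not shared through the remote, so each collaborator has to install the hook in their own clone.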

Show others how to run your workflow / tool 🚧

Registering a run instance of your workflow / tool will guide others how your workflow should be set up to run.
For tools, this means a plot showing the cpu and mem usage over time, along with the duration of the tool run.
For workflows, this means a stacked bar chart of the cpu / mem usage over time, along with the duration of the workflow.

To register a run instance use either:

cwl-ica register-tool-run-instance-id or cwl-ica register-workflow-run-instance-id.

Optimising your workflow 🚧

One can use the overrides setting to optimise the cpu and mem usage (or even change the docker container used by a step in a workflow or tool).

In order to view the step ids of a workflow, run cwl-ica get-workflow-step-ids.

Use these in the overrides settings to adjust the engine parameters for this step of a workflow.

Running a workflow 🚧

There are two recommended ways of running your workflow.

If you are unsure how to run a workflow, check out the ica catalogue page 🚧, which should have all of the documentation that you need.

  1. Through postman

    On each push to the main branch, GitHub Actions builds a postman-json for each project.
    This can be imported into postman, where a user can navigate through all registered workflows and launch new workflows from the templates of registered runs for each workflow.
    Once imported, you will need to update the ICA_ACCESS_TOKEN variable in the top directory with a token that belongs to the relevant project.

  2. Copying a tool / workflow submission template

    One may also use cwl-ica register-tool-run-instance-id or cwl-ica register-workflow-run-instance-id to set up a template to submit from.
    Then edit the submission json and launch it with the following command:
    ica workflows versions launch <workflow-id> <workflow-version> input.json
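
Since the submission json is edited by hand, it can be worth checking that it still parses before launching. The sketch below wraps the launch command from this page in such a check, using jq (already a prerequisite). is_valid_json and launch_workflow are hypothetical helper names, not part of cwl-ica or the ica CLI.

```shell
#!/usr/bin/env bash
# Sketch only: refuse to launch when the edited submission file is not valid JSON.

# Succeeds if the file parses as JSON ("jq empty" parses its input and prints nothing).
is_valid_json() {
  jq empty "$1" >/dev/null 2>&1
}

# Usage: launch_workflow <workflow-id> <workflow-version> <input.json>
launch_workflow() {
  if ! is_valid_json "$3"; then
    echo "'$3' is not valid JSON; not launching" >&2
    return 1
  fi
  ica workflows versions launch "$1" "$2" "$3"
}

# Example, with the placeholders replaced by your own values:
#   launch_workflow "<workflow-id>" "<workflow-version>" input.json
```

This only catches syntax errors; it does not check that the inputs match what the workflow expects.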
