From a0b0f0ba0c0091772e6490027b1c2cf1d7e0670e Mon Sep 17 00:00:00 2001 From: Michael Franklin Date: Mon, 16 Mar 2020 23:31:49 +1100 Subject: [PATCH 1/3] Update janis installation and test run instructions to 0.9.x --- docs/tutorials/tutorial0.md | 119 +++++++++++++++++++++++------------- 1 file changed, 78 insertions(+), 41 deletions(-) diff --git a/docs/tutorials/tutorial0.md b/docs/tutorials/tutorial0.md index 0e04a50e..480d951e 100644 --- a/docs/tutorials/tutorial0.md +++ b/docs/tutorials/tutorial0.md @@ -1,17 +1,15 @@ # Tutorial 0 - Introduction to Janis -Welcome to the introduction for Janis! This tutorial introduces Janis and installs it on your local computer, ready for building your first workflow. +Janis is a workflow framework that uses Python to construct a declarative workflow. It has a simple workflow API within Python that you use to build your workflow. Janis converts your pipeline to the Common Workflow Language (CWL) and Workflow Description Language (WDL) for execution, and it’s also great for publishing and archiving. -Janis is workflow framework that uses Python to construct a declarative workflow. It has a simple workflow API within Python that you use to declare your workflow. Janis can convert your pipeline to the Common Workflow Language (CWL) and Workflow Description Language (WDL) for execution, but it's also great for publishing and archiving. +Janis was designed with a few points in mind: -Janis was designed with a few priorities: - -- Workflows should be easy to build/ -- Workflows and tools must be easily shared (portable). -- Execution must be able to occur on HPCs and cloud environments. +- Workflows should be easy to build, +- Workflows and tools must be easily shared (portable), +- Workflows should be able to execute on HPCs and cloud environments. - Workflows should be reproducible and re-runnable. 
-Janis uses an *abstracted execution environment*, which removes the shared file system in favour of you specifiying all the files you need up front and passing them around as a File object. This allows the same workflow to be executable on your local machine, HPCs and cloud, and we let the `execution engine` handle moving our files. This also means that we can use file systems like ``S3``, ``GCS``, ``FTP`` and more without any code changes. +Janis uses an *abstracted execution environment*, which removes the shared file system in favour of you specifying all the files you need up front and passing them around as a File object. This allows the same workflow to be executable on your local machine, HPCs and cloud, and we let the `execution engine` handle moving our files. This also means that we can use file systems like ``S3``, ``GCS``, ``FTP`` and more without any changes to our workflow. > Instructions for setting up Janis on a compute cluster are under construction. @@ -46,14 +44,16 @@ We'll install Janis in a virtual environment as it preserves versioning of Janis ```bash janis -v - # -------------------- ------- - # janis-core v0.8.0 - # janis-assistant v0.8.0 - # janis-unix v0.8.0 - # janis-bioinformatics v0.8.0 - # -------------------- ------- + # -------------------- ------ + # janis-core v0.9.7 + # janis-assistant v0.9.9 + # janis-unix v0.9.0 + # janis-bioinformatics v0.9.5 + # janis-pipelines v0.9.2 + # janis-templates v0.9.4 + # -------------------- ------ ``` - å + ### Installing CWLTool [CWLTool](https://github.com/common-workflow-language/cwltool) is a reference workflow engine for the Common Workflow Language. Janis can run your workflow using CWLTool and collect the results. For more information about which engines Janis supports, visit the [Engine Support](https://janis.readthedocs.io/en/latest/references/engines.html) page. 
@@ -72,56 +72,93 @@ cwltool --version ## Running an example workflow with Janis +First off, let's create a directory to store our Janis workflows. This could be anywhere you want, but for now we'll put it at `$HOME/janis/` + +```bash +mkdir ~/janis +cd ~/janis +``` + You can test run an example workflow with Janis and CWLTool with the following command: ```bash -janis run --engine cwltool --stay-connected hello +janis run --engine cwltool -o tutorial0 hello ``` -Usually Janis starts a separate process to run and manage the workflow. By including the `--stay-connected` parameter, Janis and the engine are connected, so you'll see any errors that occur. When you exit the Janis process, this will also exit the engine. +You'll see the `INFO` statements from CWLTool in the terminal. -If this works successfully, you can omit the `--stay-connected` param and you'll be presented with the progress screen as your workflow completes. Some things to note: +> To see all logs, add `-d` so the command becomes: +> ```bash +> janis -d run --engine cwltool -o tutorial0 hello +> ``` -- `WID` - the janis identifier of your workflow. -- `Task Dir` - Where your workflow, output files and logs are. -- +At the start, we see these two lines in our output: ``` -WID: df5daa -EngId: df5daa -Name: hello -Engine: cwltool - -Task Dir: $HOME/janis/hello/20191115_105042_df5daa/ -Exec Dir: None +2020-03-16T18:49:08 [INFO]: Starting task with id = 'd909df' +d909df +``` -Status: Completed -Duration: 6s -Start: 2019-11-14T23:50:42.940196+00:00 -Finish: 2019-11-14T23:50:48.453133+00:00 -Updated: Just now (2019-11-14T23:50:53+00:00) +This is our workflow ID (wid) and is one way we can refer to our workflow. 
-Jobs: - [✓] hello (2s) +After the workflow has completed (or in a different window), you can see the progress of this workflow with: -Outputs: - - out: $HOME/janis/execution/hello/20191115_105042_df5daa/output/out +```bash +janis watch d909df + +# WID: d909df +# EngId: d909df +# Name: hello +# Engine: cwltool +# +# Task Dir: $HOME/janis/tutorial0 +# Exec Dir: None +# +# Status: Completed +# Duration: 4s +# Start: 2020-03-16T07:49:08.367981+00:00 +# Finish: 2020-03-16T07:49:11.881006+00:00 +# Updated: 3h:51m:54s ago (2020-03-16T07:49:11+00:00) +# +# Jobs: +# [✓] hello (1s) +# +# Outputs: +# - out: $HOME/janis/tutorial0/out ``` There is a single output `out` from the workflow, cat-ing this result we get: ```bash -cat $HOME/janis/execution/hello/20191115_105042_df5daa/output/out +cat $HOME/janis/tutorial0/out # Hello, World ``` ### Overriding an input -The workflow `hello` has one input `inp`. We can override this input by passing `--inp $value` onto the end of our run statement, eg: +The workflow `hello` has one input `inp`. We can override this input by passing `--inp $value` onto the end of our run statement. Note the structure for workflow parameters and parameter overriding: + +``` +janis run workflowname +``` + +We can run the following command: +```bash +janis run --engine cwltool -o tutorial0-override hello --inp "Hello, $(whoami)" + +# out: Hello, mfranklin +``` + +### Running Janis in the background + +You may want to run Janis in the background as its own process. 
You could do this with `nohup [command] &`, however we can also run Janis with the `--background` flag and capture the workflow ID to watch, eg: ```bash -janis run --engine cwltool hello --inp "Hello, yourname" -# out: Hello, yourname +wid=$(janis run \ + --background --engine cwltool -o tutorial0-background \ + hello \ + --inp "Run in background") +janis watch $wid ``` From 0ff8c8647de8cbceaec11151779f2e472effe477 Mon Sep 17 00:00:00 2001 From: Michael Franklin Date: Mon, 16 Mar 2020 23:32:30 +1100 Subject: [PATCH 2/3] Update tutorial 1 with sample data & better instructions from workshop2 --- docs/tutorials/tutorial1.md | 235 ++++++++++++++++++++++++++++-------- 1 file changed, 183 insertions(+), 52 deletions(-) diff --git a/docs/tutorials/tutorial1.md b/docs/tutorials/tutorial1.md index dd205bbf..a1aa6469 100644 --- a/docs/tutorials/tutorial1.md +++ b/docs/tutorials/tutorial1.md @@ -1,87 +1,176 @@ # Tutorial 1 - Building a Workflow -In this stage, we have an installation of Janis, CWLTool, our data and now we're going to construct our workflow. In [Next Generation Sequencing (NGS)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3841808/), short reads of DNA are sequenced in parallel to speed up sequencing time, one of the consequences of many short reads is the need for [alignment](https://en.wikibooks.org/wiki/Next_Generation_Sequencing_(NGS)/Alignment). +In this stage, we're going to build a simple workflow to align short reads of DNA. -After the NGS reads have been pre-processed, we have a `FASTQ` pair, we align the reads inside our FASTQ file into an uncompressed `SAM` file (the _de facto_ standard for short read alignments) using `BWA MEM`, compress this into the binary equivalent `BAM` file using `samtools`, and finally sort the reads using `GATK4 SortSam`. +1. Start with a pair of compressed `FASTQ` files, +2. Align these reads using `BWA MEM` into an uncompressed `SAM` file (the _de facto_ standard for short read alignments), +3. 
Compress this into the binary equivalent `BAM` file using `samtools`, and finally +4. Sort the reads using `GATK4 SortSam`. -These tools already exist within the Janis Tool Registry, you can see their documentation right on this website: +These tools already exist within the Janis Tool Registry; you can see their documentation online: - [BWA MEM](https://janis.readthedocs.io/en/latest/tools/bioinformatics/bwa/bwamem.html) - [Samtols View](https://janis.readthedocs.io/en/latest/tools/bioinformatics/samtools/samtoolsview.html) - [GATK4 SortSam](https://janis.readthedocs.io/en/latest/tools/bioinformatics/gatk4/gatk4sortsam.html) +## Preparation + +To prepare for this tutorial, we're going to create a folder and download some data: + +```bash +mkdir janis-tutorials && cd janis-tutorials +wget -q -O- "https://github.com/PMCC-BioinformaticsCore/janis-workshops/raw/master/janis-data.tar" | tar -xz +``` + + ## Creating our file -A Janis workflow file is a regular Python file, so we can start by creating a file called `alignment.py` and importing Janis. +A Janis workflow is a Python script, so we can start by creating a file called `alignment.py` and importing Janis. + +```bash +mkdir tools +vim tools/alignment.py # or emacs, sublime, vscode +``` + +From the `janis_core` library, we're going to import `WorkflowBuilder` and a `String`: ```python -import janis +from janis_core import WorkflowBuilder, String ``` -## Importing our tools and datatypes +## Imports -Python requires that you import the tools and types that you use, these import statements are available on the documentation. We'll have one import per tool, and one import for every data-types we use. +We have four inputs we want to expose on this workflow: -We have three inputs we want to expose on this workflow: +1. Sequencing Reads (`FastqGzPair` - paired end sequence) +2. Sample name (`String`) +3. Read group header (`String`) +4. Reference files (`Fasta` + index files (`FastaWithDict`)) 
-1. Sample name (`String`) -2. Sequencing Reads (`FastqGzPair` - paired end sequence) -3. Reference files (`Fasta` + index files) +We've already imported the `String` type, and we can import `FastqGzPair` and `FastaWithDict` from the `janis_bioinformatics` registry: +```python +from janis_bioinformatics.data_types import FastqGzPair, FastaWithDict +``` + +### Tools + +We've discussed the tools we're going to use. The documentation for each tool has a row in the table called "Python" that gives you the import statement. This is how we'll import our tools: -We can use the `janis.String` datatype (imported with Janis) for the first, and we can import the remaining bioinformatics types: ```python -from janis_bioinformatics.tools.bwa.mem.latest import BwaMemLatest -from janis_bioinformatics.tools.samtools.view.view import SamToolsView_1_9 +from janis_bioinformatics.tools.bwa import BwaMemLatest +from janis_bioinformatics.tools.samtools import SamToolsView_1_9 from janis_bioinformatics.tools.gatk4 import Gatk4SortSam_4_1_2 -from janis_bioinformatics.data_types import FastqGzPair, FastaWithDict ``` -## Declaring our workflow and exposing inputs -We'll create an instance of the [`janis.WorkflowBuilder`](https://janis.readthedocs.io/en/latest/references/workflow.html#janis.Workflow) class, this requires a workflow identifier. We discussed which imports we want in the previous section which we can expose on this workflow with the `janis.Workflow.input` method: + +## Declaring our workflow + +We'll create an instance of the [`WorkflowBuilder`](https://janis.readthedocs.io/en/latest/references/workflow.html#janis.Workflow) class, this just requires a name for your workflow (can contain alphanumeric characters and underscores). 
```python -w = janis.WorkflowBuilder("alignmentWorkflow") +w = WorkflowBuilder("alignmentWorkflow") +``` -w.input("readGroup", janis.String) +A workflow has 3 methods for building workflows: + +- `workflow.input` - Used for creating inputs, +- `workflow.step` - Creates a step on a workflow, +- `workflow.output` - Exposes an output on a workflow. + +We give each input / step / output a unique identifier, which then becomes a node in our workflow graph. We can refer to the created node using _dot-notation_ (eg: `w.input_name`). We'll see how this works in the later sections. + +More information about each step will be linked from this page about the [`Workflow` and `WorkflowBuilder` class](https://janis.readthedocs.io/en/latest/references/workflow.html). + + +### Creating inputs on a workflow + +> Further reading: [Creating an input](https://janis.readthedocs.io/en/latest/references/workflow.html#creating-an-input) + +To create an input on a workflow, you can use the `Workflow.input` method, which has the following structure: + +```python +Workflow.input( + identifier: str, + datatype: DataType, + default: any = None, + doc: str = None +) +``` + +An input requires a unique identifier (string) and a DataType (String, FastqGzPair, etc). Let's prepare the inputs for our workflow: + + +```python +w.input("sample_name", String) +w.input("read_group", String) w.input("fastq", FastqGzPair) w.input("reference", FastaWithDict) ``` -## Declaring our steps and connections +### Declaring our steps and connections + +> Further reading: [Creating a step](https://janis.readthedocs.io/en/latest/references/workflow.html#creating-a-step) + +Similar to exposing inputs, we create steps with the `Workflow.step` method. 
It has the following structure: + +```python +Workflow.step( + identifier: str, + tool: janis_core.tool.tool.Tool, + scatter: Union[str, List[str], ScatterDescription] = None, +) +``` + +We provide an identifier for the step (unique amongst the other nodes in the workflow), and initialise our tool, passing the inputs of the step as parameters. + +We can refer to an input (or previous result) using the dot notation. For example, to refer to the `fastq` input, we can use `w.fastq`. -Steps are easy to create, however you may need to refer to the documentation when writing your own workflows to know which named parameters a tool takes (and their types). Similar to exposing inputs, we create steps with the `janis.Workflow.step` method. +#### BWA MEM -We can refer to any node on the workflow graph (such as an input) by accessing the property of the same name (dot-notation). Eg, to access the `readGroup` on our workflow, we can use `w.readGroup`. +We use [bwa mem's documentation](https://janis.readthedocs.io/en/latest/tools/bioinformatics/bwa/bwamem.html) to determine that we need to provide the following inputs: -We instantiate our tool with the named parameters we want to provide and pass that as the second parameter to the `janis.Workflow.step` method: +- `reads`: `FastqGzPair` (connect to `w.fastq`) +- `readGroupHeaderLine`: `String` (connect to `w.read_group`) +- `reference`: `FastaWithDict` (connect to `w.reference`) -### BWA MEM +We can connect them to the relevant inputs to get the following step definition: ```python w.step( - "bwamem", - BwaMemLatest( - reads=w.fastq, - readGroupHeaderLine=w.readGroup, + "bwamem", # identifier + BwaMemLatest( + reads=w.fastq, + readGroupHeaderLine=w.read_group, reference=w.reference ) ) ``` -### Samtools view +#### Samtools view + +We'll use a very similar pattern for Samtools View, except this time we'll reference the output of `bwamem`. From bwa mem's documentation, there is one output called `out` with type `Sam`. 
We'll connect this to `SamtoolsView`'s only input, called `sam`. -When creating the connection between `bwamem` and `samtoolsview`, we'll access the `out` output of `BwaMemLatest`. This will create a dependency of `"bwamem"` for `samtoolsview`. ```python -w.step("samtoolsview", SamToolsView_1_9(sam=w.bwamem.out)) +w.step( + "samtoolsview", + SamToolsView_1_9( + sam=w.bwamem.out + ) +) ``` -### SortSam +#### SortSam + +In addition to connecting the output of `samtoolsview` to Gatk4 SortSam, we want to tell SortSam to use the following values: -SortSam requires a number of values we want to set +- sortOrder: `"coordinate"` +- createIndex: `True` + +Instead of connecting an input or a step, we just provide the literal value. ```python w.step( @@ -89,16 +178,28 @@ w.step( Gatk4SortSam_4_1_2( bam=w.samtoolsview.out, sortOrder="coordinate", - createIndex=True, - validationStringency="SILENT", - maxRecordsInRam=5000000 + createIndex=True ) ) ``` -## Exposing outputs +### Exposing outputs + +> Further reading: [Creating an output](https://janis.readthedocs.io/en/latest/references/workflow.html#creating-an-output) -Outputs have a very similar syntax to both inputs and steps, they take an `identifier` and a named `source` parameter. We only want to output the resulting bam file from `sortsam`, which we can do with the following line: +Outputs have a very similar syntax to both inputs and steps, they take an `identifier` and a named `source` parameter. Here is the structure: + +```python +Workflow.output( + identifier: str, + datatype: DataType = None, + source: Node = None, + output_folder: List[Union[String, Node]] = None, + output_name: Union[String, Node] = None +) +``` + +Often, we don't want to specify the output data type, because we can let Janis do this for us. We'll talk about the `output_folder` and `output_name` in the next few sections. For now, we just have to specify an output identifier and a source. 
```python w.output("out", source=w.sortsam.out) @@ -109,38 +210,44 @@ w.output("out", source=w.sortsam.out) Hopefully you have a workflow that looks like the following! ```python -import janis +from janis_core import WorkflowBuilder, String -from janis_bioinformatics.tools.bwa.mem.latest import BwaMemLatest -from janis_bioinformatics.tools.samtools.view.view import SamToolsView_1_9 -from janis_bioinformatics.tools.gatk4 import Gatk4SortSam_4_1_2 from janis_bioinformatics.data_types import FastqGzPair, FastaWithDict -w = janis.WorkflowBuilder("alignmentWorkflow") +from janis_bioinformatics.tools.bwa import BwaMemLatest +from janis_bioinformatics.tools.samtools import SamToolsView_1_9 +from janis_bioinformatics.tools.gatk4 import Gatk4SortSam_4_1_2 + +w = WorkflowBuilder("alignmentWorkflow") # Inputs -w.input("readGroup", janis.String, value="@RG\\tID:NA12878\\tSM:NA12878\\tLB:NA12878\\tPL:ILLUMINA") -w.input("fastq", FastqGzPair, value="/path/to/reads.fastq") -w.input("reference", FastaWithDict, value="/path/to/reference.fasta") +w.input("sample_name", String) +w.input("read_group", String) +w.input("fastq", FastqGzPair) +w.input("reference", FastaWithDict) # Steps w.step( "bwamem", BwaMemLatest( reads=w.fastq, - readGroupHeaderLine=w.readGroup, + readGroupHeaderLine=w.read_group, reference=w.reference ) ) -w.step("samtoolsview", SamToolsView_1_9(sam=w.bwamem.out)) +w.step( + "samtoolsview", + SamToolsView_1_9( + sam=w.bwamem.out + ) +) + w.step( "sortsam", Gatk4SortSam_4_1_2( bam=w.samtoolsview.out, sortOrder="coordinate", - createIndex=True, - validationStringency="SILENT", - maxRecordsInRam=5000000 + createIndex=True ) ) @@ -151,5 +258,29 @@ w.output("out", source=w.sortsam.out) We can translate the following file into Workflow Description Language using janis from the terminal: ```bash -janis translate alignment.py wdl +janis translate tools/alignment.py wdl +``` + + +## Running the alignment workflow + ``` +janis run -o tutorial1 tools/alignment.py \ + --fastq 
data/BRCA1_R*.fastq.gz \ + --reference reference/hg38-brca1.fasta \ + --sample_name NA12878 \ + --read_group "@RG\tID:NA12878\tSM:NA12878\tLB:NA12878\tPL:ILLUMINA" +``` + +After the workflow has run, you'll see the outputs in the current directory: + +```bash +ls + +drwxr-xr-x mfranklin 1677682026 160B data +drwxr-xr-x mfranklin 1677682026 256B janis +-rw-r--r-- mfranklin wheel 2.7M out.bam +-rw-r--r-- mfranklin wheel 296B out.bam.bai +drwxr-xr-x mfranklin 1677682026 320B reference +drwxr-xr-x mfranklin 1677682026 128B tools +``` \ No newline at end of file From c739018d0fd7e1e65cd8c2f2377fea3f4fb95082 Mon Sep 17 00:00:00 2001 From: Michael Franklin Date: Mon, 16 Mar 2020 23:32:56 +1100 Subject: [PATCH 3/3] Update tutorial 2 using data from tut1 + improved instructions from W2 --- docs/tutorials/tutorial2.md | 201 +++++++++++++++++------------------- 1 file changed, 94 insertions(+), 107 deletions(-) diff --git a/docs/tutorials/tutorial2.md b/docs/tutorials/tutorial2.md index ab4e63d9..cc4648c5 100644 --- a/docs/tutorials/tutorial2.md +++ b/docs/tutorials/tutorial2.md @@ -1,62 +1,27 @@ # Tutorial 2 - Wrapping a new tool +> This tutorial builds on the content and output from [Tutorial 1](https://janis.readthedocs.io/en/latest/tutorials/tutorial1.html). + ## Introduction A CommandTool is the interface between Janis and a program to be executed. Simply put, a CommandTool has a name, a command, inputs, outputs and a container to run in. Inputs and arguments can have a prefix and / or position, and this is used to construct the command line. The Janis documentation for the [CommandTool](https://janis.readthedocs.io/en/latest/references/commandtool.html) gives an introduction to the tool structure and a template for constructing your own tool. 
A tool wrapper must provide all of the information to configure and run the specified tool, this includes the `base_command`, [janis.ToolInput](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-input), [janis.ToolOutput](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-output), a `container` and its version. -## Requirements - -You must have Python 3.6 and Janis installed: - -```bash -pip3 install janis-pipelines -``` - -You can check you have the correct version of Janis installed by running: - -```bash -$ janis -v --------------------- ------ -janis-core v0.7.1 -janis-assistant v0.7.8 -janis-unix v0.7.0 -janis-bioinformatics v0.7.1 --------------------- ------ -``` - -### Setup - -This tutorial is on checked in on GitHub with sample data. You can download this sample data and template files with the following: - -```bash -git clone https://github.com/PMCC-BioinformaticsCore/janis-workshops.git -cd janis-workshops/workshop3 -ls -lGh * # ls with extra options -``` - -You'll see a list of files within this repository: - -- `README.md` - *This file* -- `samtoolsflagstat.py` - The template for this tutorial -- `samtoolsflagstat-final.py` - The final command tool (also at the bottom of this file) -- `data/brca1.bam` - A Bam file that this tool can be run with -- `data/README.md` - Information about the data file - - ### Container > _Further information_: [Containerising your tools](https://janis.readthedocs.io/en/latest/tutorials/container.html) -> Guide on using containers +For portability, Janis requires that you specify an OCI compliant `container` (eg: Docker) for your tool. Often there will already be a container with some searching, however here's a guide on [preparing your tools in containers](https://janis.readthedocs.io/en/latest/tutorials/container.html) to ensure it works across all environments. -For portability, we require that you specify an OCI compliant `container` (eg: Docker) for your tool. 
Often there will already be a container with some searching, however here's a guide on [preparing your tools in containers](https://janis.readthedocs.io/en/latest/tutorials/container.html) to ensure it works across all environments. +## Preparation +The sample data to test this tool is computed in [Tutorial 1](https://janis.readthedocs.io/en/latest/tutorials/tutorial1.html). You can follow this tutorial, but running the example will require you to have completed and obtained the bam from the first tutorial. ## Samtools flagstat -In this workshop we're going to wrap the `samtools flagstat` tool. +In this tutorial we're going to wrap the `samtools flagstat` tool - flagstat counts the number of alignments for each FLAG type within a bam file. + ### Samtools project links @@ -84,21 +49,20 @@ Hence, we can isolate the following information: ### Command tool template -The following template is the minimum amount of information required to wrap a tool. For more information, see the [CommandToolBuilder documentation](https://janis.readthedocs.io/en/latest/references/commandtool.html). +The following template is the minimum amount of information required to wrap a tool. For more information, see the [CommandToolBuilder documentation](https://janis.readthedocs.io/en/latest/references/commandtool.html#janis.CommandToolBuilder). -> We've removed the optional fields: tool_module, tool_provider, metadata, cpu, memory from the following template. +> We've removed the optional fields: `tool_module`, `tool_provider`, `metadata`, `cpu`, `memory` from the following [template](https://janis.readthedocs.io/en/latest/references/commandtool.html#template). -```python -from typing import List, Optional, Union -import janis as j +We're going to use `Bam` and `TextFile` data types, so let's import them as well. 
-import janis_core as j +```python +from janis_core import CommandToolBuilder, ToolInput, ToolOutput, Int, Stdout -ToolName = j.CommandToolBuilder( +ToolName = CommandToolBuilder( tool: str="toolname", base_command=["base", "command"], - inputs: List[j.ToolInput]=[], - outputs: List[j.ToolOutput]=[], + inputs=[], # List[ToolInput] + outputs=[], # List[ToolOutput] container="container/name:version", version="version" ) @@ -106,10 +70,11 @@ ToolName = j.CommandToolBuilder( ### Tool information -Let's start by creating a file with this template: +Let's start by creating a file with this template inside a second output directory: ```bash -vim samtoolsflagstat.py +mkdir -p tools +vim tools/samtoolsflagstat.py ``` We can start by filling in the basic information: @@ -122,53 +87,70 @@ We can start by filling in the basic information: - `version` to be `"v1.9.0"` You'll have a class definition like the following + ```python -SamtoolsFlagstat = j.CommandToolBuilder( +SamtoolsFlagstat = CommandToolBuilder( tool: str="samtoolsflagstat", base_command=["samtools", "flagstat"], container="quay.io/biocontainers/samtools:1.9--h8571acd_11", version="1.9.0", - inputs: List[j.ToolInput]=[], - outputs: List[j.ToolOutput]=[], + inputs=[], # List[ToolInput] + outputs=[], # List[ToolOutput] ) ``` ### Inputs +> Further reading: [`ToolInput`](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-input) -We'll use the [ToolInput](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-input) class to represent these inputs. A `ToolInput` provides a mechanism for binding this input onto the command line (eg: prefix, position, transformations). See the documentation for more ways to configure a ToolInput. - -Our positional input is a Bam, so we'll import the Bam type from `janis` with the following line: +We'll use the [ToolInput](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-input) class to represent these inputs. 
A `ToolInput` provides a mechanism for binding this input onto the command line (eg: prefix, position, transformations). See the documentation for more ways to [configure a ToolInput](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-input). ```python -from janis.data_types import Bam +janis.ToolInput( + tag: str, + input_type: DataType, + position: Optional[int] = None, + prefix: Optional[str] = None, + # more configuration options + separate_value_from_prefix: bool = None, + prefix_applies_to_all_elements: bool = None, + presents_as: str = None, + secondaries_present_as: Dict[str, str] = None, + separator: str = None, + shell_quote: bool = None, + localise_file: bool = None, + default: Any = None, + doc: Optional[str] = None +) ``` -Then we can declare our two inputs: +> Nb: A ToolInput must have a `position` OR `prefix` in order to be bound onto the command line. If the prefix is specified with no position, a `position=0` is automatically applied. -1. Positional bam input +Now we can declare our two inputs: + +1. Positional bam input 2. Threads configuration input with the prefix `--threads` We're going to give our inputs a name through which we can reference them by. This allows us to specify a value from the command line, or connect the result of a previous step [within a workflow](https://janis.readthedocs.io/en/latest/tutorials/tutorial1.html#bwa-mem). ```python -SamtoolsFlagstat = j.CommandToolBuilder( - # tool information +SamtoolsFlagstat = CommandToolBuilder( + # ... tool information inputs=[ # 1. Positional bam input - j.ToolInput( + ToolInput( "bam", # name of our input Bam, position=1, doc="Input bam to generate statistics for" ), # 2. 
`threads` inputs - j.ToolInput( + ToolInput( "threads", # name of our input - j.Int(optional=True), + Int(optional=True), prefix="--threads", - doc="(-@) Number of additional threads to use [0] " + doc="(-@) Number of additional threads to use [0]" ) ], # outputs @@ -177,41 +159,55 @@ SamtoolsFlagstat = j.CommandToolBuilder( ### Outputs +> Further reading: [`ToolOutput`](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-output) + We'll use the [ToolOutput](https://janis.readthedocs.io/en/latest/references/commandtool.html#tool-output) class to collect and represent these outputs. A `ToolOutput` has a type, and if not using `stdout` we can provide a `glob` parameter. -The only output of `samtools flagstat` is the statistics that are written to `stdout`. We give this the name `"stats"`, and collect this with the `j.Stdout` data type: +```python +janis.ToolOutput( + tag: str, + output_type: DataType, + glob: Union[janis_core.types.selectors.Selector, str, None] = None, + # more configuration options + presents_as: str = None, + secondaries_present_as: Dict[str, str] = None, + doc: Optional[str] = None +) +``` + +The only output of `samtools flagstat` is the statistics that are written to `stdout`. We give this the name `"stats"`, and collect this with the `Stdout` data type. We can additionally tell Janis that the Stdout has type [`TextFile`](https://janis.readthedocs.io/en/latest/datatypes/textfile.html). ```python -SamtoolsFlagstat = j.CommandToolBuilder( - # tool information + inputs +SamtoolsFlagstat = CommandToolBuilder( + # ... 
tool information + inputs outputs=[ - j.ToolOutput("stats", j.Stdout) + ToolOutput("stats", Stdout(TextFile)) ] ) ``` - ### Tool definition Putting this all together, you should have the following tool definition: ```python -from typing import List, Optional, Union -import janis as j -from janis.data_types import Bam +from janis_core import CommandToolBuilder, ToolInput, ToolOutput, Int, Stdout + +from janis_unix.data_types import TextFile +from janis_bioinformatics.data_types import Bam -SamToolsFlagstat_1_9 = j.CommandToolBuilder( +SamtoolsFlagstat = CommandToolBuilder( tool="samtoolsflagstat", base_command=["samtools", "flagstat"], container="quay.io/biocontainers/samtools:1.9--h8571acd_11", version="v1.9.0", inputs=[ # 1. Positional bam input - j.ToolInput("bam", Bam, position=1), + ToolInput("bam", Bam, position=1), # 2. `threads` inputs - j.ToolInput("threads", j.Int(optional=True), prefix="--threads"), + ToolInput("threads", Int(optional=True), prefix="--threads"), ], - outputs=[j.ToolOutput("stats", j.Stdout)], + outputs=[ToolOutput("stats", Stdout(TextFile))], ) ``` @@ -222,11 +218,12 @@ We can test the translation of this from the CLI: > If you have multiple command tools or workflows declared in the same file, you will need to provide the `--name` parameter with the name of your tool. ```bash -janis translate samtoolsflagstat.py wdl # or cwl +janis translate tools/samtoolsflagstat.py wdl # or cwl ``` In the following translation, we can see the WDL representation of our tool. In particular, the `command` block gives us an indication of how the command line might look: -``` + +```wdl task samtoolsflagstat { input { Int? runtime_cpu @@ -234,19 +231,19 @@ task samtoolsflagstat { File bam Int? 
threads } - command { + command <<< samtools flagstat \ - ${"--threads " + threads} \ - ${bam} - } + ~{"--threads " + threads} \ + ~{bam} + >>> runtime { docker: "quay.io/biocontainers/samtools:1.9--h8571acd_11" cpu: if defined(runtime_cpu) then runtime_cpu else 1 - memory: if defined(runtime_memory) then "${runtime_memory}G" else "4G" + memory: if defined(runtime_memory) then "~{runtime_memory}G" else "4G" preemptible: 2 } output { - File out = stdout() + File stats = stdout() } } ``` @@ -255,10 +252,12 @@ task samtoolsflagstat { ### Running the workflow -We can call the `janis run` functionality (default CWLTool), and provide the data file to the input called `bam` with the following line: +> A reminder that the sample data for this section requires you to have completed Tutorial 1. + +We can call the `janis run` functionality, and use the output from tutorial1: ```bash -janis run samtoolsflagstat.py --bam data/brca1.bam +janis run -o tutorial2 tools/samtoolsflagstat.py --bam tutorial1/out.bam ``` OUTPUT: @@ -268,8 +267,8 @@ EngId: f9e89f Name: samtoolsflagstatWf Engine: cwltool -Task Dir: $HOME/janis/execution/samtoolsflagstatWf/20191114_155159_f9e89f/ -Exec Dir: None +Task Dir: $HOME/janis-tutorials/tutorial2/ +Exec Dir: $HOME/janis-tutorials/tutorial2/janis/execution/ Status: Completed Duration: 4s @@ -281,15 +280,13 @@ Jobs: [✓] samtoolsflagstat (N/A) Outputs: - - out: $HOME/janis/execution/samtoolsflagstatWf/20191114_155159_f9e89f/output/out -2019-11-14T15:52:05 [INFO]: Exiting - + - stats: $HOME/janis-tutorials/tutorial2/stats.txt ``` Janis (and CWLTool) said the tool executed correctly, let's check the output file: ```bash -cat $HOME/janis/execution/samtoolsflagstatWf/20191114_155159_f9e89f/output/out +cat tutorial2/stats.txt ``` ``` @@ -306,14 +303,4 @@ cat $HOME/janis/execution/samtoolsflagstatWf/20191114_155159_f9e89f/output/out 90 + 0 singletons (0.46% : N/A) 860 + 0 with mate mapped to a different chr 691 + 0 with mate mapped to a different chr 
(mapQ>=5) -``` - -## Summary - -- Learn about the structure of a CommandTool, -- Use an existing docker container, -- Wrapped the inputs, outputs and tool information in a Janis CommandTool wrapper, - -### Next steps - -- [Containerising a tool](https://janis.readthedocs.io/en/latest/tutorials/container.html) +``` \ No newline at end of file