Last time we set up our own instance in the de.NBI cloud. We created volumes and mounted them so that we can store our data persistently. We installed the OpenStack CLI and used it to access the cloud object storage, another means of storing data in the cloud. At the end we used the BiBiGrid tool to set up our own compute cluster. This is where we continue today: we will use the newly started cluster to test different workflow management systems.
The first candidate we will test is called Nextflow. If you have questions or get stuck during the exercises, have a look at the documentation: it is well written and answers most questions quickly.
First we will prepare a workspace. Since we will work on and share data between all instances of our cluster, we need to use the only directory that is accessible from all instances: /vol/spool
cd /vol/spool
mkdir -p cloud_computing/nextflow
cd cloud_computing/nextflow
With this we created a directory cloud_computing, which will gather all our experiments, as well as a directory nextflow, which we changed into. Now we just need to install Nextflow locally and we are good to go.
curl -s https://get.nextflow.io | bash
If everything went well you should see a prompt with the current version number and some additional information. Here is an example; your version number should be equal or higher, and the installation folder will of course be your own.
N E X T F L O W
version 21.04.1 build 5556
created 14-05-2021 15:20 UTC
cite doi:10.1038/nbt.3820
http://nextflow.io
Nextflow installation completed. Please note:
- the executable file `nextflow` has been created in the folder: /raid1/benedikt/cloud_computing/nextflow
- you may complete the installation by moving it to a directory in your $PATH
Let's test our new installation. What could be better than a good old "hello world!"?
./nextflow run hello
Shorter than you expected? Yes, because we are a lazy bunch and cheated a little bit. Without a "hello" script in the same directory, Nextflow looked for a matching pipeline in its GitHub organization. There it found one, cloned it using the integrated Git support and ran it locally. What you see are the results of running this script:
#!/usr/bin/env nextflow

cheers = Channel.from 'Bonjour', 'Ciao', 'Hello', 'Hola'

process sayHello {
    echo true

    input:
    val x from cheers

    script:
    """
    echo '$x world!'
    """
}
It should look like this:
N E X T F L O W ~ version 21.04.1
Launching `nextflow-io/hello` [boring_wilson] - revision: e6d9427e5b [master]
executor > local (4)
[a5/b5f3bb] process > sayHello (1) [100%] 4 of 4 ✔
Hola world!
Ciao world!
Hello world!
Bonjour world!
Great! But as you can see, this script was executed locally. We have multiple jobs and a whole cluster to execute them on. Can't we distribute our jobs across all our instances using the workload manager Slurm that was installed with our cluster? Sure, we just have to tell Nextflow to use Slurm instead of executing everything locally.
echo 'process.executor = "slurm"' > nextflow.config
With this we created the nextflow.config file. Nextflow looks for it at the start of every run and uses it for configuration. In this case we just told Nextflow to use Slurm as its executor.
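The same file can hold further settings if you ever need them. Below is a minimal hedged sketch of the equivalent block notation; the cpus value and the commented-out queue name are placeholders, not something this tutorial requires.
// nextflow.config -- block form, equivalent to the one-liner above
process {
    executor = 'slurm'
    cpus = 1                     // resources requested per task (placeholder value)
    // queue = 'some_partition'  // only needed if your Slurm setup uses named partitions
}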
But a simple "hello world!" is just too boring. Let's do something more useful and take a quick look at BLAST.
The "Basic Local Alignment Search Tool", or BLAST, is a bioinformatics tool for finding regions of similarity between biological sequences. The program compares nucleotide or protein sequences against sequence databases and calculates the statistical significance of the matches.
We will later download a database containing different antibiotic resistance genes. If an organism carries one or more of these genes, that is generally a cause for concern: in case of an infection with this organism, the choice of potential treatments could be severely limited by these resistances.
We will also download a file with an organism's genome. BLAST will then tell us whether this organism contains antibiotic resistance genes from our database and needs to be looked at more closely.
But first things first: we need BLAST. We could install it locally on every instance in our cluster, but that would be a little tedious. Instead we will use Nextflow's built-in support for the Anaconda package manager. You only need to declare which packages each process requires; Nextflow does the rest. If a package is missing on one of the instances, Nextflow fetches and installs it for you. Let's try it with BLAST. Open your favorite editor in your shell and create a blast.nf file with this content:
#!/usr/bin/env nextflow

process blast {
    conda 'bioconda::blast'
    echo true

    """
    blastp -h
    """
}
When you run this script, Nextflow will use Anaconda to install BLAST. BLAST is then called and prints its help page.
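You can launch it just like the hello example, this time pointing Nextflow at the local file:
./nextflow run blast.nf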
Perfect. Since BLAST is up and running, the only missing parts are the database and the input organism. Let's quickly take care of that.
cd /vol/spool/cloud_computing
mkdir input database
wget -O input/clostridium_botulinum.fasta https://raw.githubusercontent.com/bosterholz/MeRaGENE/dev/data/test_data/genome/clostridium_botulinum.fasta
cd database
git clone https://github.com/bosterholz/MeRaGENE.git
cd MeRaGENE
git checkout dev
cd ..
mv MeRaGENE/data/databases/resFinderDB_23082018/ .
rm -rf MeRaGENE/
rm resFinderDB_23082018/*_NA*
cd /vol/spool/cloud_computing/nextflow
You can now build your own BLAST pipeline using Nextflow. To give you a head start, these are the questions you will need to answer with Nextflow code (a few hedged snippets follow below):
- How do I use files or a whole folder as input?
- We have multiple database files, how do we use EACH of them?
- How do I get my output into a file?
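The snippets below are only a hedged sketch of how these questions can be answered in the DSL1 syntax used above; the names (blastSearch, db, query) and the results folder are assumptions, not an official solution. Because /vol/spool is shared between all instances, the database is passed as a plain path (val) so BLAST also finds its index files next to the .fsa file.
// Hedged sketch -- process and variable names are placeholders
// 1. A single file as input:
query = file('/vol/spool/cloud_computing/input/clostridium_botulinum.fasta')

// 2. A channel that emits EACH database file matching the glob:
db_paths = Channel.fromPath('/vol/spool/cloud_computing/database/resFinderDB_23082018/*.fsa')

process blastSearch {
    conda 'bioconda::blast'
    publishDir 'results'        // 3. declared output files are copied into ./results

    input:
    val db from db_paths        // one task per database; passed as a path on the shared volume

    output:
    file "${db.baseName}.out"

    """
    blastx -db ${db} -query ${query} -out ${db.baseName}.out
    """
}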
Here is a blast call template:
blastx -db YOUR-DATABASE-FSA-FILE -query YOUR-INPUT-FASTA -out NAME-OF-YOUR-OUTPUT
With this you should be good to go. Good luck!
Now that we have had a quick glimpse of how to handle Nextflow, we will test today's second candidate: Snakemake.
Let's set it up:
cd /vol/spool/cloud_computing
mkdir snakemake
cd snakemake/
wget https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-Linux-x86_64.sh
bash Mambaforge-Linux-x86_64.sh
You will see a dialogue during the installation of Mambaforge. Press ENTER, accept the license agreement with yes, then change the installation path to /vol/spool/cloud_computing/snakemake/mamba. We don't want to initialize conda, so answer the last question with no.
Now we need the tutorial data. We download and extract everything with:
wget https://github.com/snakemake/snakemake-tutorial-data/archive/v5.24.1.tar.gz
tar --wildcards -xf v5.24.1.tar.gz --strip 1 "*/data" "*/environment.yaml"
After a quick introduction to Snakemake we will try to replicate our BLAST example. To make things easier, we add BLAST to the downloaded environment.yaml. It lists the packages that will be installed when the environment is created, so we will start with BLAST already installed and don't need to add it later on. Open the environment.yaml with your favorite editor and add - blast =2.11 at the beginning of the package list.
Now we initialize our environment:
source /vol/spool/cloud_computing/snakemake/mamba/bin/activate base
mamba env create --name snakemake --file environment.yaml
source /vol/spool/cloud_computing/snakemake/mamba/bin/activate snakemake
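A quick check that the new environment is active; it should print the installed Snakemake version:
snakemake --version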
With the snakemake command available in your shell, it would be a bit boring to start with hello world for the second time, so we will look at a simple but real bioinformatics tool call. Create a file called snakefile and fill it with:
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/A.fastq"
    output:
        "mapped_reads/A.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
We downloaded all the data we need in the previous steps, so we can call Snakemake right away. In Snakemake we start with a so-called dry run: Snakemake checks that everything is in order and prints the jobs it would run and the files it would create, or tells us about problems it encountered.
snakemake -np mapped_reads/A.bam
Everything should be okay. We start the real run with:
snakemake --cores 1 mapped_reads/A.bam
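Behind the scenes, Snakemake substituted {input} and {output}, so for mapped_reads/A.bam the shell directive effectively ran:
bwa mem data/genome.fa data/samples/A.fastq | samtools view -Sb - > mapped_reads/A.bam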
That was a rather static run. Is there a way to do it in a more dynamic fashion? Yes, of course. Just change your snakefile as shown below.
rule bwa_map:
    input:
        "data/genome.fa",
        "data/samples/{sample}.fastq"
    output:
        "mapped_reads/{sample}.bam"
    shell:
        "bwa mem {input} | samtools view -Sb - > {output}"
Now try these dry runs:
snakemake -np mapped_reads/B.bam
snakemake -np mapped_reads/A.bam mapped_reads/B.bam
snakemake -np mapped_reads/{A,B}.bam
Do you see what happens? Try running these for real. Are there any problems?
You should now know enough to replicate our Nextflow BLAST call with Snakemake. Try it!
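If you need a nudge, the fragment below is one possible starting point rather than the official solution; the rule name, the output folder and the use of a {gene} wildcard are assumptions.
# Hedged sketch -- rule name, output location and wildcard layout are assumptions
rule blast_resistance:
    input:
        query="/vol/spool/cloud_computing/input/clostridium_botulinum.fasta",
        db="/vol/spool/cloud_computing/database/resFinderDB_23082018/{gene}_AA.fsa"
    output:
        "blast_results/{gene}.txt"
    shell:
        "blastx -query {input.query} -db {input.db} -out {output}"
A dry run such as snakemake -np blast_results/tetracycline.txt shows which database file the wildcard resolves to.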
CWL is not a piece of software but rather a standard for describing data analysis workflows. A range of different workflow engines can be used to execute CWL workflows. We call these workflow engines "cwl-runners", though some of them are compatible with other workflow languages as well.
CWL is used to describe two types of documents: CommandLineTool and Workflow.
A CommandLineTool describes a single tool that can be invoked from the command line, like echo or blast. Based on this description, a cwl-runner is able to execute the tool.
A Workflow is used to link multiple CommandLineTools (or other workflows), describing the flow of data between them.
A cwl-runner also needs to know which inputs to feed into a tool or workflow. This is done with a job file, where the inputs are described in YAML format.
For the purpose of executing our CWL code, we will be using the Toil workflow engine. To install Toil (and BLAST) in a virtual environment, we will use conda. Toil has several optional features, and the ability to interpret CWL code is one of them, so we have to specify that we want the cwl extra when we install Toil.
cd /vol/spool/cloud_computing
mkdir miniconda
cd miniconda
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
Accept the license (scroll through the text with ENTER, then type yes) and specify the installation path as /vol/spool/cloud_computing/miniconda/inst. Type no when asked about conda init. We will now activate the conda base environment, create another virtual environment for our work with CWL, and install Toil and BLAST.
source inst/bin/activate
conda create --name cwl
conda activate cwl
conda install pip
pip install toil[cwl] boto boto3
conda install -c bioconda blast
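To check that both installations worked, these commands should print the help text of Toil's CWL runner and the BLAST version:
toil-cwl-runner --help
blastx -version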
We need a piece of CWL code for Toil to execute. Once again, we will work in /vol/spool so the data is available on all nodes.
cd /vol/spool
mkdir -p cloud_computing/cwl
cd cloud_computing/cwl
Paste the following into a text file named exampleTool.cwl:
#!/usr/bin/env cwl-runner
cwlVersion: v1.2                 # Toil can handle different versions of the CWL specification
class: CommandLineTool
baseCommand: echo
requirements:
  InlineJavascriptRequirement: {}
inputs:
  message:                       # Name of our first (and only) input
    type: string                 # The type of the input
    inputBinding:
      position: 1                # Determines the order of inputs on the command line
outputs:                         # Here we tell the CommandLineTool which outputs to collect
  answer:                        # The internal name of our first (and only) output file
    type: stdout                 # echo writes to stdout, so we have to capture standard output
stdout: $(inputs.message).txt    # The captured standard output will be written to a file
                                 # (in this case, the input string with the suffix ".txt")
Paste the following into a text file named exampleJob.yml:
message: "Hello"
message2: "Hallo"
message3: "Bonjour"
message4: "Hola"
Now execute the CWL code on the local machine using Toil:
mkdir myWorkdir
toil-cwl-runner --workDir myWorkdir --outdir myOutputs --clean always exampleTool.cwl exampleJob.yml
Toil will run the code in exampleTool.cwl, using the input specified in exampleJob.yml, and save the results to myOutputs. We specified four different input strings in our exampleJob.yml, but exampleTool.cwl will only use the input given for message. Your output should turn up in the myOutputs directory.
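Assuming the job file above, the output file is named after the input string, so this should print the echoed message:
cat myOutputs/Hello.txt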
Let's save a multi-step CWL workflow into a file exampleWorkflow.cwl so Toil has multiple jobs to run.
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: Workflow

inputs:            # The different inputs to our workflow, each has a name and a type (string)
  message: string
  message2: string
  message3: string
  message4: string

steps:             # A list of steps the workflow has to carry out. Toil will decide the order of steps.
  print1:          # The internal name of the first step
    run: exampleTool.cwl    # The CommandLineTool that we will run in the first step
    in:                     # The input that the workflow will feed into the first step
      message: message
    out: [answer]
  print2:
    run: exampleTool.cwl
    in:
      message: message2     # Our CommandLineTool expects an input for the "message" parameter. We provide the "message2" string from our workflow inputs
    out: [answer]
  print3:
    run: exampleTool.cwl
    in:
      message: message3
    out: [answer]           # A list of outputs that the workflow will collect from the CommandLineTool. We specified the name "answer" in the CommandLineTool above
  print4:
    run: exampleTool.cwl
    in:
      message: message4
    out: [answer]

outputs:           # The outputs our workflow will create
  text1:           # The internal name of the first of four outputs
    type: File     # The type of the output. We know that our CommandLineTool will create a file
    outputSource: print1/answer   # Where to fetch the output (in this case from the print1 step). "answer" is the internal name we gave to the output earlier
  text2:
    type: File
    outputSource: print2/answer
  text3:
    type: File
    outputSource: print3/answer
  text4:
    type: File
    outputSource: print4/answer
Now submit the job to your Slurm cluster like this:
toil-cwl-runner --batchSystem slurm --disableCaching --jobStore myJobstore --workDir myWorkdir --outdir myOutputs --clean always exampleWorkflow.cwl exampleJob.yml
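While Toil is dispatching the steps, you can watch the jobs appear in the Slurm queue from a second shell:
squeue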
Again, all your outputs should show up in myOutputs.
As before, we want to use BLAST to look for antibiotic resistances in the Clostridium botulinum genome. The genome will be our BLAST query; the individual antibiotic resistance genes are the databases we want to search. Doing this on the command line for a single resistance gene might look like this:
blastx -query /vol/spool/cloud_computing/input/clostridium_botulinum.fasta -db /vol/spool/cloud_computing/database/resFinderDB_23082018/tetracycline_AA.fsa
Try to create a CWL CommandLineTool that can be used for this kind of blast search. It should be able to process the following job file:
target_fasta:
  class: File
  path: /vol/spool/cloud_computing/input/clostridium_botulinum.fasta
query_fasta:
  class: File
  path: /vol/spool/cloud_computing/database/resFinderDB_23082018/tetracycline_AA.fsa
For this task, it is important to know that the blast database for the tetracycline sequence actually consists of four files:
- tetracycline_AA.fsa
- tetracycline_AA.fsa.phr
- tetracycline_AA.fsa.pin
- tetracycline_AA.fsa.psq
Our CWL runner needs to know that files 2, 3 and 4 contain information pertaining to our primary file, and blast expects to find them in the same directory. For this reason, we need to specify the secondaryFiles property for our input parameter.
someInputParameter:
  type: File
  inputBinding:
    position: 5
    prefix: -somePrefix
  secondaryFiles:
    - .phr
    - .pin
    - .psq
This tells the CWL runner to also collect the files that have the same name as the primary input file, with .phr, .pin or .psq appended.
The CWL User Guide provides an excellent introduction to writing CWL, and you may also use it to look up individual topics.
When you have written your CommandLineTool, try using Toil to run it with the job file that was specified in this section.
Using Docker images to run our tools in containers solves a lot of problems for us. The first advantage is that we do not have to install software (like BLAST) on every node that processes our jobs. It also makes our workflows more robust and our analyses easier to reproduce, since the software always runs in exactly the same environment. You can add the following code to your CommandLineTool to use Docker.
hints:
  DockerRequirement:
    dockerPull: ncbi/blast:2.11.0
Now, when you run this tool, the CWL runner will check whether Docker is available on the system. If it is, the runner will download a Docker image that provides version 2.11.0 of BLAST.
Look at the examples for writing workflows and scattering workflows from the CWL User Guide. Can you build a workflow that uses your blast CommandLineTool to blast against multiple antibiotic resistance sequences? Use the following job file:
query_fasta:
  class: File
  path: /vol/spool/cloud_computing/input/clostridium_botulinum.fasta
database_array:
  - {class: File, path: /vol/spool/cloud_computing/database/resFinderDB_23082018/aminoglycoside_AA.fsa}
  - {class: File, path: /vol/spool/cloud_computing/database/resFinderDB_23082018/beta-lactam_AA.fsa}
  - {class: File, path: /vol/spool/cloud_computing/database/resFinderDB_23082018/colistin_AA.fsa}
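If you get stuck on the scattering part, the skeleton below shows the core idiom. It is only a hedged sketch: blastTool.cwl and the parameter names query, database and result stand for whatever you named them in your own CommandLineTool.
#!/usr/bin/env cwl-runner
cwlVersion: v1.2
class: Workflow
requirements:
  ScatterFeatureRequirement: {}   # needed before a step may use "scatter"

inputs:
  query_fasta: File
  database_array: File[]

steps:
  blast_all:
    run: blastTool.cwl            # placeholder: your CommandLineTool from the previous task
    scatter: database             # run this step once per element of database_array
    in:
      query: query_fasta          # placeholder parameter names -- use your own
      database: database_array
    out: [result]                 # placeholder output id from your CommandLineTool

outputs:
  blast_reports:
    type: File[]                  # one result file per database
    outputSource: blast_all/result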