Nextflow is a framework that simplifies writing complex parallel computational pipelines in a portable and reproducible manner. Parallelization is managed automatically by the framework and is implicitly defined by each process's input and output declarations.
Nextflow is a command line tool. It only requires a Unix-like operating system and Java 7 or 8 available in the running environment.
It can be installed with the following command:
curl -fsSL get.nextflow.io | bash
mv nextflow ~/bin
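This assumes that ~/bin is included in your PATH; if it is not, the tool can be launched as ./nextflow from the directory where it was downloaded.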
To check the installed version use the command below:
nextflow info
Version: 0.17.3 build 3495
Modified: 18-02-2016 11:00 UTC (12:00 CEST)
System: Mac OS X 10.10.5
Runtime: Groovy 2.4.5 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_40-b27
Encoding: UTF-8 (UTF-8)
Write a basic script to count the number of transcripts and exons in a gene annotation file.
#!/usr/bin/env nextflow

echo true

annotation = Channel.fromPath('/users/ngs00/refs/mm65.long.ok.gtf')

process count {
    input:
    file 'annotation.gtf' from annotation

    '''
    grep Sec23a annotation.gtf > Sec23a.gff
    awk '$3=="transcript"' Sec23a.gff | wc -l
    awk '$3=="exon"' Sec23a.gff | wc -l
    '''
}
This example introduces the basic syntax of a Nextflow process and shows how to execute any existing piece of code or script available in the hosting environment.
Save the code shown above in a file named tutorial1.nf, then run it using the following command:
nextflow run tutorial1.nf
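If the script runs correctly it prints the two counts for the Sec23a gene. Based on the collected results shown later in this tutorial for the same annotation file, the expected output should be 5 transcripts and 47 exons, i.e.:
5
47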
Parameters allow the same script to be used with different input values. By convention, Nextflow parameters are defined at the top of the script file, providing a default value for each of them.
#!/usr/bin/env nextflow

echo true

params.gene = 'Sec23a'
params.annot = '/users/ngs00/refs/mm65.long.ok.gtf'

annotation = Channel.fromPath(params.annot)

process count {
    input:
    file 'annotation.gtf' from annotation

    shell:
    '''
    grep !{params.gene} annotation.gtf > gff
    awk '$3=="transcript"' gff | wc -l
    awk '$3=="exon"' gff | wc -l
    '''
}
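Note that this script uses a shell: block instead of a plain script block: the shell syntax interpolates Nextflow variables with !{..} while leaving Bash variables such as the awk $3 untouched, so the commands need no extra escaping.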
The actual parameter value can be provided on the launch command line as an extra option, prefixing the parameter name with -- and then specifying a value, e.g. --gene Sec1.
Save the above code in a file named tutorial2.nf, then execute it providing a different gene name or annotation file as shown below:
nextflow run tutorial2.nf --gene Sec1
or:
nextflow run tutorial2.nf --annot /users/ngs00/refs/gencode.vM7.annotation.gtf --gene Sec23b
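Any parameter not specified on the command line keeps the default value defined in the script; for instance, the first command above still uses the default mm65.long.ok.gtf annotation file.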
Nextflow processes are implicitly executed in parallel whenever multiple inputs are provided. The following example executes a parallel task for each pair of gene name and annotation file specified.
#!/usr/bin/env nextflow

echo true

params.gene = 'Sec23a'
params.annot = '/users/ngs00/refs/mm65.long.ok.gtf'

annotation = Channel.fromPath(params.annot)
genes = params.gene.tokenize(', ')

process count {
    input:
    file 'annotation.gtf' from annotation
    each gene from genes

    shell:
    '''
    echo !{gene}
    grep !{gene} annotation.gtf > gff
    awk '$3=="transcript"' gff | wc -l
    awk '$3=="exon"' gff | wc -l
    '''
}
Save the above content to a file named tutorial3.nf, then run it providing more than one annotation file as shown below:
nextflow run tutorial3.nf --annot '/users/ngs00/refs/*.gtf'
or specifying multiple gene names:
nextflow run tutorial3.nf --gene Sec23a,Tulp1,Trex1
Each file matching the specified glob pattern will trigger the execution of a parallel count process. The same happens when multiple gene names are specified.
Note: the each clause results in the execution of a parallel task for all the combinations of annotation files with the specified gene names.
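For example, running with three gene names against two annotation files results in the execution of six count tasks, one per combination.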
This example shows how to collect the output produced by multiple parallel processes into a file and print the resulting content.
#!/usr/bin/env nextflow

params.gene = 'Sec23a'
params.annot = '/users/ngs00/refs/mm65.long.ok.gtf'

annotation = Channel.fromPath(params.annot)
genes = params.gene.tokenize(', ')

process count {
    input:
    each gene from genes
    file annot from annotation

    output:
    stdout into result

    shell:
    '''
    echo !{annot.baseName}
    echo !{gene}
    grep !{gene} !{annot} > gff
    awk '$3=="transcript"' gff | wc -l
    awk '$3=="exon"' gff | wc -l
    '''
}
result
    .map { str -> str.readLines().join(',') } // (1)
    .collectFile(newLine: true)               // (2)
    .println { it.text }                      // (3)
1. The map operator transforms the multi-line output into a comma-separated string.
2. The collectFile operator gathers the produced strings and appends them to a file.
3. The println operator prints the file content.
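To see how the three operators behave in isolation, here is a minimal standalone sketch that applies the same chain to a channel of two made-up multi-line strings (the input values are purely illustrative):
Channel.from('a\n1\n2', 'b\n3\n4')
    .map { str -> str.readLines().join(',') } // yields 'a,1,2' and 'b,3,4'
    .collectFile(newLine: true)               // appends both strings to a single file
    .println { it.text }                      // prints the resulting file content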
Save the above script to a file named tutorial4.nf, then run it by using the following command in your shell terminal:
nextflow run tutorial4.nf --annot '/users/ngs00/refs/*.gtf' --gene Sec23a,Tulp1
It will print an output similar to the following:
gencode.vM9.annotation,Tulp1,12,71
gencode.vM7.annotation,Sec23a,5,47
mm65.long.ok,Sec23a,5,47
gencode.v22.annotation,Sec23a,0,0
gencode.v22.annotation,Tulp1,0,0
gencode.vM7.annotation,Tulp1,12,71
mm65.long.ok,Tulp1,12,71
dmel-all-no-analysis-r6.05,Tulp1,0,0
dmel-all-no-analysis-r6.05,Sec23a,0,0
gencode.vM9.annotation,Sec23a,5,47
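Each line reports the annotation file base name, the gene name, the number of transcripts and the number of exons. The order of the lines can change between runs, because the tasks are executed in parallel and complete in a non-deterministic order.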
When a pipeline runs many compute-intensive tasks, a batch scheduler is required to submit the job executions to a cluster of computers.
Nextflow interacts with the batch scheduler in a transparent manner, without requiring any change to the pipeline code. It only requires a few settings in the pipeline configuration file:
process {
    executor = 'sge'
    queue = 'NGS'
    memory = '1 GB'
}
Save the content shown above in a file named nextflow.config, then launch the script execution as before:
nextflow run tutorial4.nf -bg --annot '/users/ngs00/refs/*.gtf' --gene Sec23a,Tulp1,Trex > log
You can check that the tasks have been submitted to the cluster using the following command:
qstat
The following platforms are currently supported:
- Sun/Open Grid Engine
- Univa Grid Engine
- Linux SLURM
- IBM LSF
- Torque/PBS
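Switching platform does not require any change to the pipeline script itself; only the executor setting in nextflow.config changes. As a minimal sketch, targeting a SLURM cluster instead would look like the following (the queue name here is hypothetical):
process {
    executor = 'slurm'
    queue = 'long'
    memory = '1 GB'
}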
Launch the script execution as shown below:
nextflow run tutorial4.nf --annot '/users/ngs00/refs/*.gtf' --gene Sec23a -resume
The -resume command line option makes Nextflow skip the execution of tasks that have already been completed successfully.
It will print an output similar to the following:
N E X T F L O W ~ version 0.17.3
Launching tutorial4.nf
[warm up] executor > sge
[85/145369] Cached process > count (2)
[35/054b18] Cached process > count (1)
[4a/36a5d1] Cached process > count (3)
gencode.vM7.annotation,Sec23a,5,47
mm65.long.ok,Sec23a,5,47
If you add other genes by using the --gene option, only the tasks required by the new input will be executed. For example:
nextflow run tutorial4.nf --annot '/users/ngs00/refs/*.gtf' --gene Sec23a,Tulp2,Trex2 -resume
Only the tasks for which a new input is specified will be executed.
Nextflow provides built-in support for the Git tool and the GitHub source code management platform. This makes it possible to share and deploy a pipeline hosted on GitHub with ease.
For the sake of this tutorial we will use RNA-Toy, a proof of concept RNA-Seq pipeline implemented with Nextflow.
The pipeline uses the following tools: Bowtie2, TopHat and Cufflinks.
In order to set up the required pipeline dependencies we will use Environment Modules. Add the following setting in your nextflow.config file:
process {
    module = 'Boost/1.55.0-goolf-1.4.10-no-OFED-Python-2.7.6:Python/2.7.8-goolf-1.4.10-no-OFED:Bowtie2/2.2.8-goolf-1.4.10-no-OFED:TopHat/2.1.0-goolf-1.4.10-no-OFED:Cufflinks/2.2.1-goolf-1.4.10-no-OFED'
}
Launch the pipeline execution using the following command:
nextflow run nextflow-io/rnatoy
The above command will automatically pull the pipeline from the GitHub repository and run it using a dataset included in the pipeline itself.
Users can provide their own dataset by specifying it on the launch command line.
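For example, assuming the pipeline exposes a --reads parameter for the input read pairs (as the rnatoy script does at the time of writing; check the repository for the exact parameter names), a run on custom data would look like the following, where the path is purely illustrative:
nextflow run nextflow-io/rnatoy --reads '/users/ngs00/data/*_{1,2}.fq'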
In the previous example we still needed to configure the tools used by the pipeline one by one, by loading the corresponding environment modules.
Nextflow provides built-in support for Docker containers that allows binary dependencies to be deployed automatically when the pipeline is executed.
Having the Docker engine installed, the previous example can be executed simply using the following command:
nextflow run nextflow-io/rnatoy -with-docker
The -with-docker command line option instructs Nextflow to pull the Docker image defined in the pipeline configuration file. The image contains all the binary dependencies required to run the pipeline script (i.e. Bowtie, TopHat, Cufflinks, etc.), thus there is no need to configure them manually.
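For reference, the image is declared in the pipeline's nextflow.config file through the container setting; a minimal sketch (the image name follows the rnatoy repository, the exact tag may differ) looks like:
process {
    container = 'nextflow/rnatoy'
}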
This greatly simplifies the pipeline deployment process and guarantees consistent results over time and across different platforms.