From 66407808058fdbbdd1b39b1b30cba5bb5d7c7802 Mon Sep 17 00:00:00 2001 From: Richard Lupat Date: Wed, 29 May 2024 11:04:35 +1000 Subject: [PATCH] Built site for gh-pages --- .nojekyll | 2 +- index.html | 2 +- search.json | 6 +++--- sessions/2_nf_dev_intro.html | 19 +++++++++++++++++-- sitemap.xml | 20 ++++++++++---------- workshops/3.1_creating_a_workflow.html | 2 +- 6 files changed, 33 insertions(+), 18 deletions(-) diff --git a/.nojekyll b/.nojekyll index 997d65c..5f37abd 100644 --- a/.nojekyll +++ b/.nojekyll @@ -1 +1 @@ -f2d36f16 \ No newline at end of file +9e9c49e2 \ No newline at end of file diff --git a/index.html b/index.html index 1380e02..d2c59ec 100644 --- a/index.html +++ b/index.html @@ -200,7 +200,7 @@

Curre -Creating a workflow +Basic to Create a Nextflow Workflow Introduction to developing bioinformatics workflow with Nextflow (Part1) 29th May 2024 diff --git a/search.json b/search.json index 5d6662e..3251abb 100644 --- a/search.json +++ b/search.json @@ -4,14 +4,14 @@ "href": "sessions/2_nf_dev_intro.html", "title": "Developing bioinformatics workflows with Nextflow", "section": "", - "text": "This workshop is designed to provide participants with a fundamental understanding of developing bioinformatics pipelines using Nextflow. This workshop aims to provide participants with the necessary skills required to create a Nextflow pipeline from scratch or from the nf-core template.\n\nCourse Presenters\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\n\n\n\nCourse Helpers\n\nSanduni Rajapaksa, Research Computing Facility\nSong Li, Bioinformatics Core Facility\n\n\n\nPrerequisites\n\nExperience with command line interface and cluster/slurm\nFamiliarity with the basic concept of workflows\nAccess to Peter Mac Cluster\nAttendance in the ‘Introduction to Nextflow and Running nf-core Workflows’ workshop, or an understanding of the Nextflow concepts outlined in the workshop material\n\n\n\nLearning Objectives:\nBy the end of this workshop, participants should be able to:\n\nDevelop a basic Nextflow workflow consisting of processes that use multiple scripting languages\nGain an understanding of Groovy and Nextflow syntax\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\nRe-use and import processes, modules, and sub-workflows into a Nextflow workflow\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes and conditional scripts within a process\nGain an understanding of Nextflow channel operators\nDevelop a basic Nextflow workflow with nf-core templates\nTroubleshoot known errors in workflow development\n\n\n\nSet up requirements\nPlease complete the Setup Instructions before the course.\nIf you have any trouble, please get in contact with us ASAP via Slack/Teams.\n\n\nWorkshop schedule\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nSetup\nFollow these instructions to install VS Code and setup your workspace\nPrior to workshop\n\n\nSession kick off\nSession kick off: Discuss learning outcomes and finalising workspace setup\nEvery week\n\n\nCreating a workflow\nIntroduction to Nextflow: Introduce nextflow’s core features and concepts; including CLI and how to run it on Peter Mac cluster\n29th May 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core." + "text": "This workshop is designed to provide participants with a fundamental understanding of developing bioinformatics pipelines using Nextflow. This workshop aims to provide participants with the necessary skills required to create a Nextflow pipeline from scratch or from the nf-core template.\n\nCourse Presenters\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\n\n\n\nCourse Helpers\n\nSanduni Rajapaksa, Research Computing Facility\nSong Li, Bioinformatics Core Facility\n\n\n\nPrerequisites\n\nExperience with command line interface and cluster/slurm\nFamiliarity with the basic concept of workflows\nAccess to Peter Mac Cluster\nAttendance in the ‘Introduction to Nextflow and Running nf-core Workflows’ workshop, or an understanding of the Nextflow concepts outlined in the workshop material\n\n\n\nLearning Objectives:\nBy the end of this workshop, participants should be able to:\n\nDevelop a basic Nextflow workflow consisting of processes that use multiple scripting languages\nGain an understanding of Groovy and Nextflow syntax\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\nRe-use and import processes, modules, and sub-workflows into a Nextflow workflow\nTest and set up profiles for a Nextflow workflow\nCreate conditional processes and conditional scripts within a process\nGain an understanding of Nextflow channel operators\nDevelop a basic Nextflow workflow with nf-core templates\nTroubleshoot known errors in workflow development\n\n\n\nSet up requirements\nPlease complete the Setup Instructions before the course.\nIf you have any trouble, please get in contact with us ASAP via Slack/Teams.\n\n\nWorkshop schedule\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nSetup\nFollow these instructions to install VS Code and setup your workspace\nPrior to workshop\n\n\nSession kick off\nSession kick off: Discuss learning outcomes and finalising workspace setup\nEvery week\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to nextflow channels, processes, data types and workflows\n29th May 2024\n\n\nDeveloping Reusable Workflows\nIntroduction to modules imports, sub-workflows, setting up test-profile, and common useful groovy functions\n5th Jun 2024\n\n\nWorking with nf-core Templates\nIntroduction to developing nextflow workflow with nf-core templates\n12th Jun 2024\n\n\nWorking with Nextflow Built-in Functions\nIntroduction to nextflow operators, metadata propagation, grouping, and splitting\n19th Jun 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core." }, { "objectID": "index.html", "href": "index.html", "title": "Peter Mac Internal Nextflow Workshops", "section": "", - "text": "These workshops are designed to provide participants with a foundational understanding of Nextflow and nf-core pipelines. Participants are expected to have prior experience with the command-line interface and working with cluster systems like Slurm. The primary goal of the workshop is to equip researchers with the skills needed to use nextflow and nf-core pipelines for their research data.\n\nCourse Developers & Maintainers\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\nSong Li, Bioinformatics Core Facility\nSanduni Rajapaksa, Research Computing Facility\n\n\n\nCurrent and Future Workshop Sessions\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nCreating a workflow\nIntroduction to developing bioinformatics workflow with Nextflow (Part1)\n29th May 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part2)\n5th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part3)\n12th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part4)\n19th Jun 2024\n\n\n\n\n\nPast Workshop Sessions\n\n\n\nSession\nOverview\nDate\n\n\n\n\nIntroduction to Nextflow and running nf-core workflows\nIntroduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n22nd Nov 2023\n\n\nIntroduction to Nextflow and running nf-core workflows\n(Re-run) Introduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n15th May 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Customising Nf-Core Workshop materials from Sydney Informatics Hub" + "text": "These workshops are designed to provide participants with a foundational understanding of Nextflow and nf-core pipelines. Participants are expected to have prior experience with the command-line interface and working with cluster systems like Slurm. The primary goal of the workshop is to equip researchers with the skills needed to use nextflow and nf-core pipelines for their research data.\n\nCourse Developers & Maintainers\n\nRichard Lupat, Bioinformatics Core Facility\nMiriam Yeung, Cancer Genomics Translational Research Centre\nSong Li, Bioinformatics Core Facility\nSanduni Rajapaksa, Research Computing Facility\n\n\n\nCurrent and Future Workshop Sessions\n\n\n\nLesson\nOverview\nDate\n\n\n\n\nBasic to Create a Nextflow Workflow\nIntroduction to developing bioinformatics workflow with Nextflow (Part1)\n29th May 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part2)\n5th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part3)\n12th Jun 2024\n\n\nTo-be-added\nIntroduction to developing bioinformatics workflow with Nextflow (Part4)\n19th Jun 2024\n\n\n\n\n\nPast Workshop Sessions\n\n\n\nSession\nOverview\nDate\n\n\n\n\nIntroduction to Nextflow and running nf-core workflows\nIntroduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n22nd Nov 2023\n\n\nIntroduction to Nextflow and running nf-core workflows\n(Re-run) Introduction to Nextflow: Introduce nextflow’s core features and concepts + nf-core + how to run it at PeterMac\n15th May 2024\n\n\n\n\n\nCredits and acknowledgement\nThis workshop is adapted from Customising Nf-Core Workshop materials from Sydney Informatics Hub" }, { "objectID": "workshops/2.3_tips_and_tricks.html", @@ -60,7 +60,7 @@ "href": "workshops/3.1_creating_a_workflow.html#creating-an-rnaseq-workflow", "title": "Nextflow Development - Creating a Nextflow Workflow", "section": "Creating an RNAseq Workflow", - "text": "Creating an RNAseq Workflow\n\n\n\n\n\n\nObjectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\n\n\n\n\n4.1.1. Define Workflow Parameters\nLet’s create a Nextflow script rnaseq.nf for a RNA-seq workflow. The code begins with a shebang, which declares Nextflow as the interpreter.\n#!/usr/bin/env nextflow\nOne way to define the workflow parameters is inside the Nextflow script.\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/.../training/nf-training/data/ggal/transcriptome.fa\"\nparams.multiqc = \"/.../training/nf-training/multiqc\"\n\nprintln \"reads: $params.reads\"\nWorkflow parameters can be defined and accessed inside the Nextflow script by prepending the prefix params to a variable name, separated by a dot character, eg. params.reads.\nDifferent data types can be assigned as a parameter in Nextflow. The reads parameter is defined as multiple .fq files. The transcriptome_file parameter is defined as one file, /.../training/nf-training/data/ggal/transcriptome.fa. The multiqc parameter is defined as a directory, /.../training/nf-training/data/ggal/multiqc.\nThe Groovy println command is then used to print the contents of the reads parameter, which is access with the $ character.\nRun the script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [astonishing_raman] DSL2 - revision: 8c9adc1772\nreads: /.../training/nf-training/data/ggal/*_{1,2}.fq\n\n\n\n4.1.2. Create a transcriptome index file\nCommands or scripts can be executed inside a process.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nThe INDEX process takes an input path, and assigns that input as the variable transcriptome. The path type qualifier will allow Nextflow to stage the files in the process execution directory, where they can be accessed by the script via the defined variable name, ie. transcriptome. The code between the three double-quotes of the script block will be executed, and accesses the input transcriptome variable using $. The output is a path, with a filename salmon_idx. The output path can also be defined using wildcards, eg. path \"*_idx\".\nNote that the name of the input file is not used and is only referenced to by the input variable name. This feature allows pipeline tasks to be self-contained and decoupled from the execution environment. As best practice, avoid referencing files that are not defined in the process script.\nTo execute the INDEX process, a workflow scope will need to be added.\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n}\nHere, the params.transcriptome_file parameter we defined earlier in the Nextflow script is used as an input into the INDEX process. The output of the process is assigned to the index_ch channel.\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\n\nERROR ~ Error executing process > 'INDEX'\n\nCaused by:\n Process `INDEX` terminated with an error exit status (127)\n\nCommand executed:\n\n salmon index --threads 1 -t transcriptome.fa -i salmon_index\n\nCommand exit status:\n 127\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: salmon: command not found\n\nWork dir:\n /.../work/85/495a21afcaaf5f94780aff6b2a964c\n\nTip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`\n\n -- Check '.nextflow.log' file for details\nWhen a process execution exits with a non-zero exit status, the workflow will be stopped. Nextflow will output the cause of the error, the command that caused the error, the exit status, the standard output (if available), the comand standard error, and the work directory where the process was executed.\nLet’s first look inside the process execution directory:\n>>> ls -a /.../work/85/495a21afcaaf5f94780aff6b2a964c \n\n. .command.begin .command.log .command.run .exitcode\n.. .command.err .command.out .command.sh transcriptome.fa\nWe can see that the input file transcriptome.fa has been staged inside this process execution directory by being symbolically linked. This allows it to be accessed by the script.\nInside the .command.err script, we can see that the salmon command was not found, resulting in the termination of the Nextflow workflow.\nSingularity containers can be used to execute the process within an environment that contains the package of interest. Create a config file nextflow.config containing the following:\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nThe container process directive can be used to specify the required container:\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_goldwasser] DSL2 - revision: bdebf34e16\nexecutor > local (1)\n[37/7ef8f0] process > INDEX [100%] 1 of 1 ✔\nThe newly created nextflow.config files does not need to be specified in the nextflow run command. This file is automatically searched for and used by Nextflow.\nAn alternative to singularity containers is the use of a module. Since the script block is executed as a Bash script, it can contain any command or script normally executed on the command line. If there is a module present in the host environment, it can be loaded as part of the process script.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n module purge\n module load salmon/1.3.0\n\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [reverent_liskov] DSL2 - revision: b74c22049d\nexecutor > local (1)\n[ba/3c12ab] process > INDEX [100%] 1 of 1 ✔\n\n\n\n4.1.3. Collect Read Files By Pairs\nPreviously, we have defined the reads parameter to be the following:\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nChallenge: Convert the reads parameter into a tuple channel called reads_ch, where the first element is a unique grouping key, and the second element is the paired .fq files. Then, view the contents of reads_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe fromFilePairs channel factory will automatically group input files into a tuple with a unique grouping key. The view() channel operator can be used to view the contents of the channel.\n>>> nextflow run rnaseq.nf\n\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\n\n\n\n\n\n4.1.4. Perform Expression Quantification\nLet’s add a new process QUANTIFICATION that uses both the indexed transcriptome file and the .fq file pairs to execute the salmon quant command.\nprocess QUANTIFICATION {\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nThe QUANTIFICATION process takes two inputs, the first is the path to the salmon_index created from the INDEX process. The second input is set to match the output of fromFilePairs – a tuple where the first element is a value (ie. grouping key), and the second element is a list of paths to the .fq reads.\nIn the script block, the salmon quant command saves the output of the tool as $sample_id. This output is emitted by the QUANTIFICATION process, using $ to access the Nextflow variable.\nChallenge:\nSet the following as the execution container for QUANTIFICATION:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\nAssign index_ch and reads_ch as the inputs to this process, and emit the process outputs as quant_ch. View the contents of quant_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nTo assign a container to a process, the container directive can be used.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nTo run the QUANTIFICATION process and emit the outputs as quant_ch, the following can be added to the end of the workflow block:\nquant_ch = QUANTIFICATION(index_ch, reads_ch)\nquant_ch.view()\nThe script can now be run:\n>>> nextflow run rnaseq.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [elated_cray] DSL2 - revision: abe41f4f69\nexecutor > local (4)\n[e5/e75095] process > INDEX [100%] 1 of 1 ✔\n[4c/68a000] process > QUANTIFICATION (1) [100%] 3 of 3 ✔\n/.../work/b1/d861d26d4d36864a17d2cec8d67c80/liver\n/.../work/b4/a6545471c1f949b2723d43a9cce05f/lung\n/.../work/4c/68a000f7c6503e8ae1fe4d0d3c93d8/gut\nIn the Nextflow output, we can see that the QUANTIFICATION process has been ran three times, since the reads_ch consists of three elements. Nextflow will automatically run the QUANTIFICATION process on each of the elements in the input channel, creating separate process execution work directories for each execution.\n\n\n\n\n\n4.1.5. Quality Control\nNow, let’s implement a FASTQC quality control process for the input fastq reads.\nChallenge:\nCreate a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second variable is assigned the varible reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:\nmkdir fastqc_${sample_id}_logs\nfastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\nTake fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nThe process FASTQC is created in rnaseq.nf. Since the input channel is a tuple, the process input declaration is a tuple containing elements that match the structure of the incoming channel. The first element of the tuple is assigned the variable sample_id, and the second element of the tuple is assigned the variable reads. The relevant container is specified using the container process directive.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\nIn the workflow scope, the following can be added:\nfastqc_ch = FASTQC(reads_ch)\nThe FASTQC process is called, taking reads_ch as an input. The output of the process is assigned to be fastqc_ch.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sad_jennings] DSL2 - revision: cfae7ccc0e\nexecutor > local (7)\n[b5/6bece3] process > INDEX [100%] 1 of 1 ✔\n[32/46f20b] process > QUANTIFICATION (3) [100%] 3 of 3 ✔\n[44/27aa8d] process > FASTQC (2) [100%] 3 of 3 ✔\nIn the Nextflow output, we can see that the FASTQC has been ran three times as expected, since the reads_ch consists of three elements.\n\n\n\n\n\n4.1.6. MultiQC Report\nSo far, the generated outputs have all been saved inside the Nextflow work directory. For the FASTQC process, the specified output directory is only created inside the process execution directory. To save results to a specified folder, the publishDir process directive can be used.\nLet’s create a new MULTIQC process in our workflow that takes the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the multiqc tool, and publish the process outputs to a directory outside of the process execution directory.\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nIn the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg. copy, which copies the file from the process execution directory to the specified output directory.\nAdd the following to the end of workflow scope:\nnultiqc_ch = MULTIQC(quant_ch, fastqc_ch)\nRun the pipeline, specifying an output directory using the outdir parameter:\nnextflow run rnaseq.nf --outdir \"results\"\nA results directory containing the output multiqc reports will be created outside of the process execution directory.\n>>> ls results\ngut.html liver.html lung.html\n\n\n\n\n\n\n\nKey points\n\n\n\n\nCommands or scripts can be executed inside a process\nEnvironments can be defined using the container process directive\nThe input declaration for a process must match the structure of the channel that is being passed into that process\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core" + "text": "Creating an RNAseq Workflow\n\n\n\n\n\n\nObjectives\n\n\n\n\nDevelop a Nextflow workflow\nRead data of different types into a Nextflow workflow\nOutput Nextflow process results to a predefined directory\n\n\n\n\n4.1.1. Define Workflow Parameters\nLet’s create a Nextflow script rnaseq.nf for a RNA-seq workflow. The code begins with a shebang, which declares Nextflow as the interpreter.\n#!/usr/bin/env nextflow\nOne way to define the workflow parameters is inside the Nextflow script.\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nparams.transcriptome_file = \"/.../training/nf-training/data/ggal/transcriptome.fa\"\nparams.multiqc = \"/.../training/nf-training/multiqc\"\n\nprintln \"reads: $params.reads\"\nWorkflow parameters can be defined and accessed inside the Nextflow script by prepending the prefix params to a variable name, separated by a dot character, eg. params.reads.\nDifferent data types can be assigned as a parameter in Nextflow. The reads parameter is defined as multiple .fq files. The transcriptome_file parameter is defined as one file, /.../training/nf-training/data/ggal/transcriptome.fa. The multiqc parameter is defined as a directory, /.../training/nf-training/data/ggal/multiqc.\nThe Groovy println command is then used to print the contents of the reads parameter, which is access with the $ character.\nRun the script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [astonishing_raman] DSL2 - revision: 8c9adc1772\nreads: /.../training/nf-training/data/ggal/*_{1,2}.fq\n\n\n\n4.1.2. Create a transcriptome index file\nCommands or scripts can be executed inside a process.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nThe INDEX process takes an input path, and assigns that input as the variable transcriptome. The path type qualifier will allow Nextflow to stage the files in the process execution directory, where they can be accessed by the script via the defined variable name, ie. transcriptome. The code between the three double-quotes of the script block will be executed, and accesses the input transcriptome variable using $. The output is a path, with a filename salmon_idx. The output path can also be defined using wildcards, eg. path \"*_idx\".\nNote that the name of the input file is not used and is only referenced to by the input variable name. This feature allows pipeline tasks to be self-contained and decoupled from the execution environment. As best practice, avoid referencing files that are not defined in the process script.\nTo execute the INDEX process, a workflow scope will need to be added.\nworkflow {\n index_ch = INDEX(params.transcriptome_file)\n}\nHere, the params.transcriptome_file parameter we defined earlier in the Nextflow script is used as an input into the INDEX process. The output of the process is assigned to the index_ch channel.\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\n\nERROR ~ Error executing process > 'INDEX'\n\nCaused by:\n Process `INDEX` terminated with an error exit status (127)\n\nCommand executed:\n\n salmon index --threads 1 -t transcriptome.fa -i salmon_index\n\nCommand exit status:\n 127\n\nCommand output:\n (empty)\n\nCommand error:\n .command.sh: line 2: salmon: command not found\n\nWork dir:\n /.../work/85/495a21afcaaf5f94780aff6b2a964c\n\nTip: you can try to figure out what's wrong by changing to the process work dir and showing the script file named `.command.sh`\n\n -- Check '.nextflow.log' file for details\nWhen a process execution exits with a non-zero exit status, the workflow will be stopped. Nextflow will output the cause of the error, the command that caused the error, the exit status, the standard output (if available), the comand standard error, and the work directory where the process was executed.\nLet’s first look inside the process execution directory:\n>>> ls -a /.../work/85/495a21afcaaf5f94780aff6b2a964c \n\n. .command.begin .command.log .command.run .exitcode\n.. .command.err .command.out .command.sh transcriptome.fa\nWe can see that the input file transcriptome.fa has been staged inside this process execution directory by being symbolically linked. This allows it to be accessed by the script.\nInside the .command.err script, we can see that the salmon command was not found, resulting in the termination of the Nextflow workflow.\nSingularity containers can be used to execute the process within an environment that contains the package of interest. Create a config file nextflow.config containing the following:\nsingularity {\n enabled = true\n autoMounts = true\n cacheDir = \"/config/binaries/singularity/containers_devel/nextflow\"\n}\nThe container process directive can be used to specify the required container:\nprocess INDEX {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [distraught_goldwasser] DSL2 - revision: bdebf34e16\nexecutor > local (1)\n[37/7ef8f0] process > INDEX [100%] 1 of 1 ✔\nThe newly created nextflow.config files does not need to be specified in the nextflow run command. This file is automatically searched for and used by Nextflow.\nAn alternative to singularity containers is the use of a module. Since the script block is executed as a Bash script, it can contain any command or script normally executed on the command line. If there is a module present in the host environment, it can be loaded as part of the process script.\nprocess INDEX {\n input:\n path transcriptome\n\n output:\n path \"salmon_idx\"\n\n script:\n \"\"\"\n module purge\n module load salmon/1.3.0\n\n salmon index --threads $task.cpus -t $transcriptome -i salmon_idx\n \"\"\"\n}\nRun the Nextflow script:\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [reverent_liskov] DSL2 - revision: b74c22049d\nexecutor > local (1)\n[ba/3c12ab] process > INDEX [100%] 1 of 1 ✔\n\n\n\n4.1.3. Collect Read Files By Pairs\nPreviously, we have defined the reads parameter to be the following:\nparams.reads = \"/.../training/nf-training/data/ggal/*_{1,2}.fq\"\nChallenge: Convert the reads parameter into a tuple channel called reads_ch, where the first element is a unique grouping key, and the second element is the paired .fq files. Then, view the contents of reads_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nreads_ch = Channel.fromFilePairs(\"$params.reads\")\nreads_ch.view()\nThe fromFilePairs channel factory will automatically group input files into a tuple with a unique grouping key. The view() channel operator can be used to view the contents of the channel.\n>>> nextflow run rnaseq.nf\n\n[gut, [/.../training/nf-training/data/ggal/gut_1.fq, /.../training/nf-training/data/ggal/gut_2.fq]]\n[liver, [/.../training/nf-training/data/ggal/liver_1.fq, /.../training/nf-training/data/ggal/liver_2.fq]]\n[lung, [/.../training/nf-training/data/ggal/lung_1.fq, /.../training/nf-training/data/ggal/lung_2.fq]]\n\n\n\n\n\n4.1.4. Perform Expression Quantification\nLet’s add a new process QUANTIFICATION that uses both the indexed transcriptome file and the .fq file pairs to execute the salmon quant command.\nprocess QUANTIFICATION {\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nThe QUANTIFICATION process takes two inputs, the first is the path to the salmon_index created from the INDEX process. The second input is set to match the output of fromFilePairs – a tuple where the first element is a value (ie. grouping key), and the second element is a list of paths to the .fq reads.\nIn the script block, the salmon quant command saves the output of the tool as $sample_id. This output is emitted by the QUANTIFICATION process, using $ to access the Nextflow variable.\nChallenge:\nSet the following as the execution container for QUANTIFICATION:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\nAssign index_ch and reads_ch as the inputs to this process, and emit the process outputs as quant_ch. View the contents of quant_ch\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nTo assign a container to a process, the container directive can be used.\nprocess QUANTIFICATION {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-salmon-1.10.1--h7e5ed60_0.img\"\n\n input:\n path salmon_index\n tuple val(sample_id), path(reads)\n\n output:\n path \"$sample_id\"\n\n script:\n \"\"\"\n salmon quant --threads $task.cpus --libType=U \\\n -i $salmon_index -1 ${reads[0]} -2 ${reads[1]} -o $sample_id\n \"\"\"\n}\nTo run the QUANTIFICATION process and emit the outputs as quant_ch, the following can be added to the end of the workflow block:\nquant_ch = QUANTIFICATION(index_ch, reads_ch)\nquant_ch.view()\nThe script can now be run:\n>>> nextflow run rnaseq.nf \nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [elated_cray] DSL2 - revision: abe41f4f69\nexecutor > local (4)\n[e5/e75095] process > INDEX [100%] 1 of 1 ✔\n[4c/68a000] process > QUANTIFICATION (1) [100%] 3 of 3 ✔\n/.../work/b1/d861d26d4d36864a17d2cec8d67c80/liver\n/.../work/b4/a6545471c1f949b2723d43a9cce05f/lung\n/.../work/4c/68a000f7c6503e8ae1fe4d0d3c93d8/gut\nIn the Nextflow output, we can see that the QUANTIFICATION process has been ran three times, since the reads_ch consists of three elements. Nextflow will automatically run the QUANTIFICATION process on each of the elements in the input channel, creating separate process execution work directories for each execution.\n\n\n\n\n\n4.1.5. Quality Control\nNow, let’s implement a FASTQC quality control process for the input fastq reads.\nChallenge:\nCreate a process called FASTQC that takes reads_ch as an input, and declares the process input to be a tuple matching the structure of reads_ch, where the first element is assigned the variable sample_id, and the second variable is assigned the varible reads. This FASTQC process will first create an output directory fastqc_${sample_id}_logs, then perform fastqc on the input reads and save the results in the newly created directory fastqc_${sample_id}_logs:\nmkdir fastqc_${sample_id}_logs\nfastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\nTake fastqc_${sample_id}_logs as the output of the process, and assign it to the channel fastqc_ch. Finally, specify the process container to be the following:\n/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\n\n\n\n\n\n\nAnswer\n\n\n\n\n\nThe process FASTQC is created in rnaseq.nf. Since the input channel is a tuple, the process input declaration is a tuple containing elements that match the structure of the incoming channel. The first element of the tuple is assigned the variable sample_id, and the second element of the tuple is assigned the variable reads. The relevant container is specified using the container process directive.\nprocess FASTQC {\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-fastqc-0.12.1--hdfd78af_0.img\"\n\n input:\n tuple val(sample_id), path(reads)\n\n output:\n path \"fastqc_${sample_id}_logs\"\n\n script:\n \"\"\"\n mkdir fastqc_${sample_id}_logs\n fastqc -o fastqc_${sample_id}_logs -f fastq -q ${reads}\n \"\"\"\n}\nIn the workflow scope, the following can be added:\nfastqc_ch = FASTQC(reads_ch)\nThe FASTQC process is called, taking reads_ch as an input. The output of the process is assigned to be fastqc_ch.\n>>> nextflow run rnaseq.nf\nN E X T F L O W ~ version 23.04.1\nLaunching `rnaseq.nf` [sad_jennings] DSL2 - revision: cfae7ccc0e\nexecutor > local (7)\n[b5/6bece3] process > INDEX [100%] 1 of 1 ✔\n[32/46f20b] process > QUANTIFICATION (3) [100%] 3 of 3 ✔\n[44/27aa8d] process > FASTQC (2) [100%] 3 of 3 ✔\nIn the Nextflow output, we can see that the FASTQC has been ran three times as expected, since the reads_ch consists of three elements.\n\n\n\n\n\n4.1.6. MultiQC Report\nSo far, the generated outputs have all been saved inside the Nextflow work directory. For the FASTQC process, the specified output directory is only created inside the process execution directory. To save results to a specified folder, the publishDir process directive can be used.\nLet’s create a new MULTIQC process in our workflow that takes the outputs from the QUANTIFICATION and FASTQC processes to create a final report using the multiqc tool, and publish the process outputs to a directory outside of the process execution directory.\nprocess MULTIQC {\n publishDir params.outdir, mode:'copy'\n container \"/config/binaries/singularity/containers_devel/nextflow/depot.galaxyproject.org-singularity-multiqc-1.21--pyhdfd78af_0.img\"\n\n input:\n path quantification\n path fastqc\n\n output:\n path \"*.html\"\n\n script:\n \"\"\"\n multiqc . --filename $quantification\n \"\"\"\n}\nIn the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg. copy, which copies the file from the process execution directory to the specified output directory.\nAdd the following to the end of workflow scope:\nmultiqc_ch = MULTIQC(quant_ch, fastqc_ch)\nRun the pipeline, specifying an output directory using the outdir parameter:\nnextflow run rnaseq.nf --outdir \"results\"\nA results directory containing the output multiqc reports will be created outside of the process execution directory.\n>>> ls results\ngut.html liver.html lung.html\n\n\n\n\n\n\n\nKey points\n\n\n\n\nCommands or scripts can be executed inside a process\nEnvironments can be defined using the container process directive\nThe input declaration for a process must match the structure of the channel that is being passed into that process\n\n\n\n\nThis workshop is adapted from Fundamentals Training, Advanced Training, Developer Tutorials, and Nextflow Patterns materials from Nextflow and nf-core" }, { "objectID": "workshops/2.2_troubleshooting.html", diff --git a/sessions/2_nf_dev_intro.html b/sessions/2_nf_dev_intro.html index 06bde34..6f84e84 100644 --- a/sessions/2_nf_dev_intro.html +++ b/sessions/2_nf_dev_intro.html @@ -248,10 +248,25 @@

Workshop schedule

Every week -Creating a workflow -Introduction to Nextflow: Introduce nextflow’s core features and concepts; including CLI and how to run it on Peter Mac cluster +Basic to Create a Nextflow Workflow +Introduction to nextflow channels, processes, data types and workflows 29th May 2024 + +Developing Reusable Workflows +Introduction to modules imports, sub-workflows, setting up test-profile, and common useful groovy functions +5th Jun 2024 + + +Working with nf-core Templates +Introduction to developing nextflow workflow with nf-core templates +12th Jun 2024 + + +Working with Nextflow Built-in Functions +Introduction to nextflow operators, metadata propagation, grouping, and splitting +19th Jun 2024 + diff --git a/sitemap.xml b/sitemap.xml index 88a95e5..50896b8 100644 --- a/sitemap.xml +++ b/sitemap.xml @@ -2,42 +2,42 @@ https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/2_nf_dev_intro.html - 2024-05-28T16:30:47.814Z + 2024-05-29T01:04:34.157Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/index.html - 2024-05-28T16:30:46.965Z + 2024-05-29T01:04:33.255Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.3_tips_and_tricks.html - 2024-05-28T16:30:45.064Z + 2024-05-29T01:04:31.453Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.2_intro_nf_core.html - 2024-05-28T16:30:44.104Z + 2024-05-29T01:04:30.469Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/1.1_intro_nextflow.html - 2024-05-28T16:30:42.450Z + 2024-05-29T01:04:28.781Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/3.1_creating_a_workflow.html - 2024-05-28T16:30:41.749Z + 2024-05-29T01:04:28.073Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.2_troubleshooting.html - 2024-05-28T16:30:43.270Z + 2024-05-29T01:04:29.577Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/00_setup.html - 2024-05-28T16:30:44.559Z + 2024-05-29T01:04:30.910Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/workshops/2.1_customise_and_run.html - 2024-05-28T16:30:46.447Z + 2024-05-29T01:04:32.863Z https://PMCC-BioinformaticsCore.github.io/nextflow-intro-workshop/sessions/1_intro_run_nf.html - 2024-05-28T16:30:47.391Z + 2024-05-29T01:04:33.703Z diff --git a/workshops/3.1_creating_a_workflow.html b/workshops/3.1_creating_a_workflow.html index f11940b..38e0cb2 100644 --- a/workshops/3.1_creating_a_workflow.html +++ b/workshops/3.1_creating_a_workflow.html @@ -896,7 +896,7 @@

4.1.6. MultiQC Repo }

In the MULTIQC process, the multiqc command is performed on both quantification and fastqc inputs, and publishes the report to a directory defined by the outdir parameter. Only files that match the declaration in the output block are published, not all the outputs of a process. By default, files are published to the target folder creating a symbolic link to the file produced in the process execution directory. This behavior can be modified using the mode option, eg. copy, which copies the file from the process execution directory to the specified output directory.

Add the following to the end of workflow scope:

-
nultiqc_ch = MULTIQC(quant_ch, fastqc_ch)
+
multiqc_ch = MULTIQC(quant_ch, fastqc_ch)

Run the pipeline, specifying an output directory using the outdir parameter:

nextflow run rnaseq.nf --outdir "results"

A results directory containing the output multiqc reports will be created outside of the process execution directory.