mskilab-org/nf-JaBbA is a new state-of-the-art bioinformatics pipeline from mskilab-org for running JaBbA, an MIP-based joint inference of copy number and rearrangement state in cancer whole-genome sequence data. It runs all the prerequisite modules necessary for JaBbA. This pipeline is built to handle only tumor-normal pairs as input and is designed and tested to run on human samples.
We drew our inspiration and ideas from nf-core/Sarek, a workflow designed to detect variants in whole-genome or targeted sequencing data. nf-jabba is built using Nextflow and is implemented with Nextflow DSL2. All modules use Docker and Singularity containers to make the pipeline easily reproducible and to keep its dependencies maintained. Some of the modules/processes come from nf-core/modules.
This pipeline has been designed to start from scratch using FASTQ files, or to start directly from BAM files as input. File paths should be supplied in a CSV file (please refer to the documentation below for the input format of the .csv file). We incorporated a modified version of the Alignment step from nf-core/Sarek into the nf-JaBbA pipeline.
To run the pipeline, first create an --input
samplesheet.csv
file with your file paths. A typical input should look like this:
patient,sex,status,sample,lane,fastq_1,fastq_2
TCXX49,XX,0,TCXX49_N,lane_1,/path/to/fastq_1.fq.gz,/path/to/fastq_2.fq.gz
Each row represents a pair of FASTQ files (paired-end) for a sample. A typical command for running the pipeline is as follows:
nextflow run mskilab-org/nf-jabba --input ./samplesheet.csv --outdir ./results --genome GATK.GRCh37 --tools <TOOLS> -profile singularity
This will launch the pipeline and run all the processes with the tools specified in --tools, through to JaBbA. It will create the following files in the working directory from which the command is run:
work # Directory containing the nextflow working files
<OUTDIR> # Finished results in specified location (defined with --outdir)
.nextflow_log # Log file from Nextflow
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
If you wish to use the same parameters for multiple runs, rather than specifying each flag in the command, you can specify these in a params file.
Pipeline settings can be provided in a yaml or json file via -params-file <file>.
⚠️ Do not use -c <file> to specify parameters as this will result in errors. Custom config files specified with -c must only be used for tuning process resource specifications, other infrastructural tweaks (such as output directories), or module arguments (args).
The above pipeline run specified with a params file in yaml format:
nextflow run mskilab-org/nf-jabba -profile singularity -params-file params.yaml
with params.yaml
containing:
input: './samplesheet.csv'
outdir: './results/'
genome: 'GATK.GRCh37'
tools: 'svaba,hetpileups,...,jabba'
<...>
You can also generate such YAML/JSON files via nf-core/launch.
You need to create a samplesheet with information about the samples you want to run the pipeline on, and specify its path with the --input flag. Make sure the input file is a comma-separated file and contains the headers discussed below. It is highly recommended to provide absolute paths rather than relative paths for inputs inside the samplesheet.
To specify a paired tumor-normal sample, use the same patient ID, a different sample ID, and the respective status for each: a tumor sample should have 1 in the status field, and a normal sample 0. If there are multiple sample IDs, nf-jabba will treat them as separate samples and write the results to separate folders per patient. All runs are kept separate by patient, so there is no need to worry about outputs getting mixed.
--input '[path to samplesheet file]'
A typical samplesheet should be populated with the column names shown below:
| Column Name | Description |
|---|---|
| patient | Patient or sample ID. This should differentiate each patient/sample. Note: each patient can have multiple sample names. |
| sample | Sample ID for each patient. Should differentiate between tumor and normal. Sample IDs should be unique to patient IDs. |
| lane | If starting with FASTQ files and there are multiple lanes for a sample, give the lane name. Required for --step alignment. |
| sex | If known, provide the sex of the patient: XY for male, XX for female, and NA for other or unknown. |
| status | Whether the sample is tumor or normal. For normal, write 0; for tumor, write 1. |
| fastq_1 | Full path to the FASTQ file for read 1. The extension should be .fastq.gz or .fq.gz. Required for --step alignment. |
| fastq_2 | Full path to the FASTQ file for read 2. The extension should be .fastq.gz or .fq.gz. Required for --step alignment. |
| bam | Full path to the BAM file. The extension should be .bam. Required for --step sv_calling. |
| bai | Full path to the BAM index file. The extension should be .bam.bai. Required for --step sv_calling. |
| cram | Full path to the CRAM file. The extension should be .cram. Required for --step sv_calling if the file is of type CRAM. |
| crai | Full path to the CRAM index file. The extension should be .cram.crai. Required for --step sv_calling if the file is of type CRAM. |
| table | Full path to the recalibration table file. Required for --step recalibrate. |
| vcf | Full path to the VCF file. Required for --step jabba. |
| hets | Full path to the HetPileups .txt file. Required for --step jabba. |
There are multiple --steps in nf-jabba. Each tool is separate so that you can run only the tools you need rather than every tool in the pipeline. There are 2 primary --steps in the pipeline that lead to JaBbA. Full input csv files for these 2 steps are shown below:
--step alignment
patient,sex,status,sample,lane,fastq_1,fastq_2
TCXX49,XX,0,TCXX49_N,lane_1,/path/to/fastq_1.fq.gz,/path/to/fastq_2.fq.gz
TCXX49,XX,0,TCXX49_N,lane_2,/path/to/fastq_1.fq.gz,/path/to/fastq_2.fq.gz
TCXX49,XX,1,TCXX49_T,lane_1,/path/to/fastq_1.fq.gz,/path/to/fastq_2.fq.gz
TCXX49,XX,1,TCXX49_T,lane_2,/path/to/fastq_1.fq.gz,/path/to/fastq_2.fq.gz
TCXX52,NA,0,TCXX52_N,lane_1,/path/to/fastq_1.fq.gz,/path/to/fastq_2.fq.gz
TCXX52,NA,1,TCXX52_T,lane_1,/path/to/fastq_1.fq.gz,/path/to/fastq_2.fq.gz
--step sv_calling
patient,sex,status,sample,bam,bai
TCXX49,XX,0,TCXX49_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX49,XX,1,TCXX49_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,0,TCXX52_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,1,TCXX52_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
Note: If you are using CRAM files, just replace the bam and bai headers with cram and crai and provide the CRAM and CRAI paths there.
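For example, a run starting from BAMs at SV calling might look like this (a sketch; svaba is one of the tools shown in the params file example above):

nextflow run mskilab-org/nf-jabba --input ./samplesheet.csv --outdir ./results --genome GATK.GRCh37 --step sv_calling --tools svaba -profile singularity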
There are also many secondary --steps in this pipeline that are designed to run only a specific --tools for a module, which alone is not adequate to reach JaBbA. You can also run only JaBbA if all of its inputs are available and provided in the --input csv file (an example command is shown after the jabba samplesheet below). Below, we provide the other secondary steps and the required input csv file for each step.
--step markduplicates
patient,sex,status,sample,bam,bai
TCXX49,XX,0,TCXX49_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX49,XX,1,TCXX49_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,0,TCXX52_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,1,TCXX52_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
Note: If you are using CRAM files, just replace the bam and bai headers with cram and crai and provide the CRAM and CRAI paths there.
--step prepare_recalibration
patient,sex,status,sample,bam,bai
TCXX49,XX,0,TCXX49_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX49,XX,1,TCXX49_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,0,TCXX52_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,1,TCXX52_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
Note: If you are using CRAM files, just replace the bam and bai headers with cram and crai and provide the CRAM and CRAI paths there.
--step recalibrate
patient,sex,status,sample,bam,bai,table
TCXX49,XX,0,TCXX49_N,/path/to/alignment.bam,/path/to/alignment.bam.bai,TCXX49_N.table
TCXX49,XX,1,TCXX49_T,/path/to/alignment.bam,/path/to/alignment.bam.bai,TCXX49_T.table
TCXX52,NA,0,TCXX52_N,/path/to/alignment.bam,/path/to/alignment.bam.bai,TCXX52_N.table
TCXX52,NA,1,TCXX52_T,/path/to/alignment.bam,/path/to/alignment.bam.bai,TCXX52_T.table
--step fragcounter
patient,sex,status,sample,bam,bai
TCXX49,XX,0,TCXX49_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX49,XX,1,TCXX49_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,0,TCXX52_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,1,TCXX52_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
Note: If you are using CRAM files, just replace the bam and bai headers with cram and crai and provide the CRAM and CRAI paths there.
--step dryclean
(Note: you should also specify --tools dryclean to use Dryclean. This step also has CBS; if you want to perform both Dryclean and CBS, use --tools dryclean,cbs.)
patient,sex,status,sample,cov
TCXX49,XX,0,TCXX49_N,/path/to/coverage.rds
TCXX49,XX,1,TCXX49_T,/path/to/coverage.rds
TCXX52,NA,0,TCXX52_N,/path/to/coverage.rds
TCXX52,NA,1,TCXX52_T,/path/to/coverage.rds
--step hetpileups
(Note: you should also specify --tools hetpileups to use HetPileups.)
patient,sex,status,sample,bam,bai
TCXX49,XX,0,TCXX49_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX49,XX,1,TCXX49_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,0,TCXX52_N,/path/to/alignment.bam,/path/to/alignment.bam.bai
TCXX52,NA,1,TCXX52_T,/path/to/alignment.bam,/path/to/alignment.bam.bai
Note: HetPileups does not support cram files; you must use bam files for this step.
--step ascat
(Note: you should also specify --tools ascat to use ASCAT.)
patient,sex,status,sample,cov,hets
TCXX49,XX,1,TCXX49_T,/path/to/coverage.rds,/path/to/hetpileups/sites.txt
TCXX52,NA,1,TCXX52_T,/path/to/coverage.rds,/path/to/hetpileups/sites.txt
--step jabba
(Note: you should also specify --tools jabba to use JaBbA.)
patient,sex,status,sample,cov,vcf
TCXX49,XX,1,TCXX49_T,/path/to/tumor/coverage.rds,/path/to/sv_caller/tumor/somatic.vcf
TCXX52,NA,1,TCXX52_T,/path/to/tumor/coverage.rds,/path/to/sv_caller/tumor/somatic.vcf
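For example, to run only JaBbA from these inputs (a sketch following the same flag pattern as the commands above):

nextflow run mskilab-org/nf-jabba --input ./samplesheet.csv --outdir ./results --genome GATK.GRCh37 --step jabba --tools jabba -profile singularity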
When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
nextflow pull mskilab-org/nf-jabba
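If you want to keep runs reproducible, you can also pin a specific revision with Nextflow's core -r option (the revision placeholder below must be replaced with a real tag or branch name):

nextflow run mskilab-org/nf-jabba -r <revision> -profile singularity -params-file params.yaml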
NB: These options are part of Nextflow and use a single hyphen (pipeline parameters use a double-hyphen).
Use the -profile parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments.
Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Shifter, Charliecloud, Apptainer, Conda) - see below.
We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility; however, when this is not possible, Conda is also supported.
The pipeline also dynamically loads configurations from https://github.com/nf-core/configs when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the nf-core/configs documentation.
Note that multiple profiles can be loaded, for example: -profile test,docker
- the order of arguments is important!
They are loaded in sequence, so later profiles can overwrite earlier profiles.
If -profile is not specified, the pipeline will run locally and expect all software to be installed and available on the PATH. This is not recommended, since it can lead to different results on different machines dependent on the computer environment.
- test
  - A profile with a complete configuration for automated testing
  - Includes links to test data so needs no other parameters
- docker
  - A generic configuration profile to be used with Docker
- singularity
  - A generic configuration profile to be used with Singularity
- podman
  - A generic configuration profile to be used with Podman
- shifter
  - A generic configuration profile to be used with Shifter
- charliecloud
  - A generic configuration profile to be used with Charliecloud
- apptainer
  - A generic configuration profile to be used with Apptainer
- conda
  - A generic configuration profile to be used with Conda. Please only use Conda as a last resort, i.e. when it's not possible to run the pipeline with Docker, Singularity, Podman, Shifter, Charliecloud, or Apptainer.
Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. For input to be considered the same, not only the names must be identical but the files' contents as well. For more info about this parameter, see this blog post.
You can also supply a run name to resume a specific run: -resume [run-name]. Use the nextflow log command to show previous run names.
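For example (a sketch using the commands named above):

nextflow log   # list previous runs and their run names
nextflow run mskilab-org/nf-jabba -profile singularity -params-file params.yaml -resume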
Use the -c option to specify the path to a specific config file (this is a core Nextflow option). See the nf-core website documentation for more information.
Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.
The Nextflow -bg
flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
Alternatively, you can use screen/tmux or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).
To debug any step or process that failed, check the current execution_trace*.txt file inside the <outdir>/pipeline_info/ folder and note the hash for that process. Then go into the work folder and use that hash to locate the working directory for the process. There should be multiple .command.* files inside that folder corresponding to your run, including the log, script, trace, and error files. Helpfully, you can run the .command.sh script locally to check exactly where it breaks and replicate the issue (though you might need to edit the command a bit to run it successfully locally).
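A sketch of that workflow (the hash 4e/9d2f31 below is hypothetical; take yours from the trace file):

grep FAILED <outdir>/pipeline_info/execution_trace*.txt   # find the failed process and its hash
cd work/4e/9d2f31*                                        # the glob completes the truncated hash
cat .command.log                                          # inspect the tool's output
bash .command.sh                                          # re-run the failing command locally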
Whilst the default requirements set within the pipeline will hopefully work for most people and with most input data, you may find that you want to customise the compute resources that the pipeline requests. Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with any of the error codes specified here it will automatically be resubmitted with higher requests (2 x original, then 3 x original). If it still fails after the third attempt then the pipeline execution is stopped.
To change the resource requests, please see the max resources and tuning workflow resources section of the nf-core website.
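As a minimal sketch (the process name JABBA and the file name custom.config are assumptions; check the pipeline's actual process names before relying on this):

// custom.config: resource tuning only, passed with -c
process {
    withName: 'JABBA' {
        cpus   = 8
        memory = 64.GB
        time   = 24.h
    }
}

This file would then be supplied alongside the params file:

nextflow run mskilab-org/nf-jabba -profile singularity -params-file params.yaml -c custom.config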
In some cases you may wish to change which container or conda environment a step of the pipeline uses for a particular tool. By default nf-core pipelines use containers and software from the biocontainers or bioconda projects. However, in some cases the pipeline-specified version may be out of date.
To use a different container from the default container or conda environment specified in a pipeline, please see the updating tool versions section of the nf-core website.
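As a rough sketch of the underlying Nextflow mechanism (the process name and container URI here are placeholders, not the pipeline's actual values):

process {
    withName: 'SVABA' {
        container = '<registry>/<image>:<tag>'
    }
}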
A pipeline might not always support every possible argument or option of a particular tool used in the pipeline. Fortunately, nf-core pipelines give users some freedom to insert additional parameters that the pipeline does not include by default.
To learn how to provide additional arguments to a particular tool of the pipeline, please see the customising tool arguments section of the nf-core website.
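nf-core-style modules typically read extra arguments from ext.args, so a sketch of such an override looks like this (the process name and argument string are placeholders):

process {
    withName: 'SVABA' {
        ext.args = '<additional tool arguments>'
    }
}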
In most cases, you will only need to create a custom config as a one-off, but if you and others within your organisation are likely to be running nf-core pipelines regularly and need to use the same settings each time, it may be a good idea to request that your custom config file is uploaded to the nf-core/configs git repository. Before you do this, please test that the config file works with your pipeline of choice using the -c parameter. You can then create a pull request to the nf-core/configs repository with the addition of your config file, an associated documentation file (see examples in nf-core/configs/docs), and an amendment to nfcore_custom.config to include your custom profile.
See the main Nextflow documentation for more information about creating your own configuration files.
If you have any questions or issues please send us a message on Slack on the #configs
channel.
In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in ~/.bashrc or ~/.bash_profile):
export NXF_OPTS='-Xms1g -Xmx4g'