From 00315a5d295de2793a566d5437b915854ef50e4f Mon Sep 17 00:00:00 2001 From: Shihab Dider Date: Tue, 3 Dec 2024 12:11:18 -0500 Subject: [PATCH 1/4] docs: update README --- README.md | 168 ++++++++++++++++++++++++++++++++---------------------- 1 file changed, 101 insertions(+), 67 deletions(-) diff --git a/README.md b/README.md index ddd89f9..0ea8023 100644 --- a/README.md +++ b/README.md @@ -5,48 +5,28 @@ [![Nextflow](https://img.shields.io/badge/nextflow%20DSL2-%E2%89%A523.04.0-23aa62.svg)](https://www.nextflow.io/) [![run with docker](https://img.shields.io/badge/run%20with-docker-0db7ed?labelColor=000000&logo=docker)](https://www.docker.com/) [![run with singularity](https://img.shields.io/badge/run%20with-singularity-1d355c.svg?labelColor=000000)](https://sylabs.io/docs/) -[![Launch on Nextflow Tower](https://img.shields.io/badge/Launch%20%F0%9F%9A%80-Nextflow%20Tower-%234256e7)](https://tower.nf/launch?pipeline=https://github.com/mskilab-org/nf-jabba) - -## Citations - -An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file. - -This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE). - -> **Most large structural variants in cancer genomes can be detected without long reads.** -> Choo, ZN., Behr, J.M., Deshpande, A. et al. -> -> _Nat Genet_ 2023 Nov 09. doi: [https://doi.org/10.1038/s41588-023-01540-6](https://doi.org/10.1038/s41588-023-01540-6) - -> **The nf-core framework for community-curated bioinformatics pipelines.** -> -> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen. -> -> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). 
## Introduction

-**mskilab-org/nf-JaBbA** is a new state-of-the-art bioinformatics pipeline from [`mskilab-org`](https://www.mskilab.org/) for running [`JaBbA`](https://github.com/mskilab-org/JaBbA/tree/master), our algorithm for doing MIP based joint inference of copy number and rearrangement state in cancer whole genome sequence data. This pipeline runs all the pre-requisite modules and generates the necessary inputs for running JaBbA. It is designed to take tumor-normal pairs of human samples as input.
-
-We took inspiration from [`nf-core/Sarek`](https://github.com/nf-core/sarek), a workflow for detecting variants in whole genome or targeted sequencing data. **`nf-jabba`** is built using [`Nextflow`](https://www.nextflow.io/) and the `Nextflow DSL2`. All the modules use [`Docker`](https://www.docker.com/) and [`Singularity`](https://sylabs.io/docs/) containers, for easy execution and reproducibility. Some of the modules/processes are derived from open source [`nf-core/modules`](https://github.com/nf-core/modules).
-
-This pipeline has been designed to start from **FASTQ** files or directly from **BAM** files. Paths to these files should be supplied in a **CSV** file (*please refer to the section below for the input format of the .csv file*).
+**mskilab-org/nf-casereports** is a bioinformatics pipeline from [`mskilab-org`](https://www.mskilab.org/) for running [`JaBbA`](https://github.com/mskilab-org/JaBbA/), our algorithm for MIP-based joint inference of copy number and rearrangement state in cancer whole-genome sequence data. The pipeline runs JaBbA's prerequisite tools (among others) and generates the necessary inputs both for running JaBbA and for loading into [case-reports](https://github.com/mskilab-org/case-report), our clinical front-end. It is designed to take paired tumor-normal samples or tumor-only samples as input.

## Workflow Summary:
-1. 
Alignment to Reference Genome (currently supports `BWA-MEM` & `BWA-MEM2`; a modified version of the `Alignment` step from `nf-core/Sarek` is used here). -) -2. Quality Control (using [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) -3. Trimming (must turn on using `--trim_fastq`) (using `fastp`) -4. Marking Duplicates (using `GATK MarkDuplicates`) -5. Base recalibration (using `GATK BaseRecalibrator`) -6. Applying BQSR (using `GATK ApplyBQSR`) -7. Performing structural variant calling (using [`SVABA`](https://github.com/walaj/svaba) and/or [`GRIDSS`](https://github.com/PapenfussLab/gridss); must mention using `--tools`) -8. Perform pileups (using mskilab's custom `HetPileups` module; must mention using `--tools`) -9. Generate raw coverages and correct for GC & Mappability bias (using [`fragCounter`](https://github.com/mskilab-org/fragCounter); must mention using `--tools`) -10. Remove biological and technical noise from coverage data. (using [`Dryclean`](https://github.com/mskilab-org/dryclean); must mention using `--tools`) -11. Perform segmentation using tumor/normal ratios of corrected read counts, (using the `CBS` (circular binary segmentation) algorithm; must mention using `--tools`) -12. Purity & ploidy estimation (currently supports [`ASCAT`](https://www.crick.ac.uk/research/labs/peter-van-loo/software) to pass ploidy values to JaBbA; must mention using `--tools`) -13. Execute JaBbA (using inputs from `Dryclean`, `CBS`, `HetPileups` and/or `ASCAT`; must mention using `--tools`) +1. Align to Reference Genome (currently supports `BWA-MEM`, `BWA-MEM2`, and GPU accelerated `fq2bam`). +2. 
Quality Control (using [`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/), [`Picard CollectWGSMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037269351-CollectWgsMetrics-Picard), [`Picard CollectMultipleMetrics`](https://gatk.broadinstitute.org/hc/en-us/articles/360037594031-CollectMultipleMetrics-Picard), and [`GATK4 EstimateLibraryComplexity`](https://gatk.broadinstitute.org/hc/en-us/articles/360037428891-EstimateLibraryComplexity-Picard))
+3. Mark Duplicates (using [`GATK MarkDuplicates`](https://gatk.broadinstitute.org/hc/en-us/articles/360037052812-MarkDuplicates-Picard))
+4. Base recalibration (using [`GATK BaseRecalibrator`](https://gatk.broadinstitute.org/hc/en-us/articles/360036898312-BaseRecalibrator))
+5. Apply BQSR (using [`GATK ApplyBQSR`](https://gatk.broadinstitute.org/hc/en-us/articles/360037055712-ApplyBQSR))
+6. Perform structural variant calling (using [`GRIDSS`](https://github.com/PapenfussLab/gridss))
+7. Perform pileups (using [`AMBER`](https://github.com/hartwigmedical/hmftools/blob/master/amber/README.md))
+8. Generate raw coverages and correct for GC & mappability bias (using [`fragCounter`](https://github.com/mskilab-org/fragCounter))
+9. Remove biological and technical noise from coverage data (using [`Dryclean`](https://github.com/mskilab-org/dryclean))
+10. Perform segmentation using tumor/normal ratios of corrected read counts (using the `CBS` (circular binary segmentation) algorithm)
+11. Purity & ploidy estimation (using [`PURPLE`](https://github.com/hartwigmedical/hmftools/blob/master/purple/README.md))
+12. Junction copy number estimation and event calling (using [`JaBbA`](https://github.com/mskilab-org/JaBbA/))
+13. Call SNVs and indels (using [`SAGE`](https://github.com/hartwigmedical/hmftools/blob/master/sage/README.md))
+14. Annotate variants (using [`SnpEff`](https://pcingola.github.io/SnpEff/))
+15. Assign mutational signatures (using [`SigProfiler`](https://github.com/AlexandrovLab/SigProfilerAssignment/))
+16. Detect HRD (Homologous Recombination Deficiency) (using [`HRDetect`](https://github.com/Nik-Zainal-Group/signature.tools.lib))

## Usage

@@ -58,27 +38,40 @@ This pipeline has been designed to start from **FASTQ** files or directly from *

### Setting up the ***samplesheet.csv*** file for input:

-You need to create a samplesheet with information regarding the samples you want to run the pipeline on. You need to specify the path of your **samplesheet** using the `--input` flag to specify the location. Make sure the input file is a *comma-separated* file and contains the headers discussed below. *It is highly recommended to provide the **absolute path** for inputs inside the samplesheet rather than relative paths.*
+You need to create a samplesheet with information regarding the samples you
+want to run the pipeline on. Specify the path of your **samplesheet**
+using the `--input` flag. Make sure the
+input `samplesheet.csv` file is a *comma-separated* file and contains the
+headers discussed below. *It is highly recommended to provide the **absolute
+path** for inputs inside the samplesheet rather than relative paths.*

-To mention a sample as paired tumor-normal, it has to be specified with the same `patient` ID, a different `sample`, and their respective `status`. A **1** in the `status` field indicates a tumor sample, while a **0** indicates a normal sample. If there are multiple `sample` IDs, `nf-jabba` will consider them as separate samples and output the results in separate folders based on the `patient` attribute. All the runs will be separated by `patient`, to ensure that there is no mixing of outputs.
+For paired tumor-normal samples, use the same `patient` ID, but different
+`sample` names. 
Indicate their respective tumor/normal `status`, where **1** in +the `status` field indicates a tumor sample, and **0** indicates a normal +sample. You may pass multiple `sample` IDs per patient, `nf-casereports` will +consider them as separate samples belonging to the same patient and output the +results accordingly. -You need to specify the desired output root directory using `--outdir` flag. The outputs will then be stored in your designated folder, organized by `tool` and `sample`. +Specify the desired output root directory using the `--outdir` flag. +The outputs will be organized first by `tool` and then `sample`. -To run the pipeline from the beginning, first create an `--input` `sampleSheet.csv` file with your file paths. A typical input whould look like this: +The input samplesheet should look like this: ```csv patient,sex,status,sample,lane,fastq_1,fastq_2 TCXX49,XX,0,TCXX49_N,lane_1,/path/to/fastq_1.fq.gz,/path/to/fastq_2.gz ``` -Each row represents a pair of fastq files (paired end) for each sample. -After the input file is ready, you can run the pipeline using: + +Each row represents a pair of fastq files (paired end) for a single sample (in +this case a normal sample, status: 0). After the input file is ready, you can +run the pipeline using: ```bash nextflow run mskilab-org/nf-jabba \ - -profile \ + -profile \ --input samplesheet.csv \ --outdir \ - --tools \ + --tools \ --genome ``` > **Warning:** @@ -89,25 +82,56 @@ nextflow run mskilab-org/nf-jabba \ ### Discussion of expected fields in input file and expected inputs for each `--step` -A typical sample sheet should populate with the column names as shown below: - -| Column Name | Description | -|-----------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------| -| patient | Patient or Sample ID. This should differentiate each patient/sample. *Note*: Each patient can have multiple sample names. 
| -| sample | Sample ID for each Patient. Should differentiate between tumor and normal. Sample IDs should be unique to Patient IDs | -| lane | If starting with FASTQ files, and if there are multiple lanes for each sample for each patient, mention lane name. **Required for `--step alignment`. | -| sex | If known, please provide the sex for the patient. For instance if **Male** type XY, else if **Female** type XX, otherwise put NA. | -| status | This should indicate if your sample is **tumor** or **normal**. For **normal**, write 0, and for **tumor**, write 1. | -| fastq_1 | Full Path to FASTQ file read 1. The extension should be `.fastq.gz` or `.fq.gz`. **Required** for `--step alignment`. | -| fastq_2 | Full Path to FASTQ file read 2. The extension should be `.fastq.gz` or `.fq.gz`. **Required** for `--step alignment`. | -| bam | Full Path to BAM file. The extension should be `.bam`. **Required** for `--step sv_calling`. | -| bai | Full Path to BAM index file. The extension should be `.bam.bai`. **Required** for `--step sv_calling`. | -| cram | Full Path to CRAM file. The extension should be `.cram`. **Required** for `--step sv_calling` if file is of type `CRAM`. | -| crai | Full Path to CRAM index file. The extension should be `.cram.crai`. **Required** for `--step sv_calling` if file is of type `CRAM`. | -| table | Full path to Recalibration table file. **Required** for `--step recalibrate`. | -| vcf | Full path to VCF file. **Required** for `--step jabba`. | -| hets | Full path to HetPileups .txt file. **Required** for `--step jabba`. | - +A typical sample sheet can populate with all or some of the column names as +shown below. The pipeline will use the information provided in the samplesheet +and the tools specified in the run to parsimoniously run the steps of the +pipeline to generate all remaining outputs. + +**N.B You do not need to supply all the columns in the table below. The table represents all the possible inputs that can be passed. 
If you are starting from BAMs, just pass the `bam` and `bai` columns. If you are starting from FASTQs, pass `fastq_1` (and `fastq_2` for paired reads). If you have already generated other outputs, you may pass them as well to prevent the pipeline from running tools for which you already have outputs.**
+
+| Column Name | Description |
+|---------------------|----------------------------------------------------------------------------------------------------------------------------------------------|
+| patient | (required) Patient or Sample ID. This should differentiate each patient/sample. *Note*: Each patient can have multiple sample names. |
+| sample | (required) Sample ID for each Patient. Should differentiate between tumor and normal (e.g. `sample1_t` vs. `sample1_n`). Sample IDs should be unique to Patient IDs. |
+| lane | If starting with FASTQ files, and if there are multiple lanes for each sample for each patient, mention the lane name. |
+| sex | If known, please provide the sex for the patient. For instance if **Male** type XY, else if **Female** type XX, otherwise put NA. |
+| status | (required) This should indicate if your sample is **tumor** or **normal**. For **normal**, write 0, and for **tumor**, write 1. |
+| fastq_1 | Full path to FASTQ file read 1. The extension should be `.fastq.gz` or `.fq.gz`. |
+| fastq_2 | Full path to FASTQ file read 2 (if paired reads). The extension should be `.fastq.gz` or `.fq.gz`. |
+| bam | Full path to BAM file. The extension should be `.bam`. |
+| bai | Full path to BAM index file. The extension should be `.bam.bai`. |
+| hets | Full path to sites.txt file. |
+| amber_dir | Full path to AMBER output directory. |
+| frag_cov | Full path to the fragCounter coverage file. |
+| dryclean_cov | Full path to the Dryclean corrected coverage file. |
+| ploidy | Ploidies for each sample. |
+| seg | Full path to the CBS segmented file. |
+| nseg | Full path to the CBS segmented file for normal samples. 
| vcf | Full path to the GRIDSS VCF file. |
+| vcf_tbi | Full path to the GRIDSS VCF index file. |
+| jabba_rds | Full path to the JaBbA RDS (`jabba.simple.rds`) file. |
+| jabba_gg | Full path to the JaBbA gGraph (`jabba.gg.rds`) file. |
+| ni_balanced_gg | Full path to the non-integer balanced gGraph (`non_integer.balanced.gg.rds`) file. |
+| lp_phased_gg | Full path to the LP phased gGraph (`lp_phased.balanced.gg.rds`) file. |
+| events | Full path to the events file. |
+| fusions | Full path to the fusions file. |
+| snv_somatic_vcf | Full path to the somatic SNV VCF file. |
+| snv_somatic_tbi | Full path to the somatic SNV VCF index file. |
+| snv_germline_vcf | Full path to the germline SNV VCF file. |
+| snv_germline_tbi | Full path to the germline SNV VCF index file. |
+| variant_somatic_ann | Full path to the somatic SNV annotated VCF file. |
+| variant_somatic_bcf | Full path to the somatic SNV BCF file. |
+| variant_germline_ann | Full path to the germline SNV annotated VCF file. |
+| variant_germline_bcf | Full path to the germline SNV BCF file. |
+| snv_multiplicity | Full path to the SNV multiplicity file. |
+| sbs_signatures | Full path to the SBS signatures file. |
+| indel_signatures | Full path to the indel signatures file. |
+| signatures_matrix | Full path to the signatures matrix file. |
+| hrdetect | Full path to the HRDetect file. |
+
+## Tumor-Only Samples
+
+For tumor-only samples, simply add the flag `--tumor_only true` to the nextflow command. The pipeline will then run in tumor-only mode.
+

For more information regarding the pipeline usage and the inputs necessary for each step, please follow the [Usage](docs/usage.md) documentation.

@@ -154,16 +178,26 @@ To debug any step or process that failed, first check your current `execution_tra

## Credits

-`nf-jabba` was written by [`Tanubrata Dey`](https://github.com/tanubrata) and [`Shihab Dider`](https://github.com/shihabdider) at the Perlmutter Cancer Center and the New York Genome Center. 
+`nf-casereports` was written by [`Shihab Dider`](https://github.com/shihabdider) and [`Tanubrata Dey`](https://github.com/tanubrata) at the Perlmutter Cancer Center and the New York Genome Center.

We thank the following people for their extensive guidance in the development of this pipeline:

- [Marcin Imielinski](https://github.com/imielinski)
- [Joel Rosiene](https://github.com/jrosiene)

+## Citations

-## Contributions and Support
+An extensive list of references for the tools used by the pipeline can be found in the [`CITATIONS.md`](CITATIONS.md) file.

-If you would like to contribute to this pipeline, please see the [contributing guidelines](.github/CONTRIBUTING.md).
+This pipeline uses code and infrastructure developed and maintained by the [nf-core](https://nf-co.re) community, reused here under the [MIT license](https://github.com/nf-core/tools/blob/master/LICENSE).

+> **Most large structural variants in cancer genomes can be detected without long reads.**
+> Choo, ZN., Behr, J.M., Deshpande, A. et al.
+>
+> _Nat Genet_ 2023 Nov 09. doi: [https://doi.org/10.1038/s41588-023-01540-6](https://doi.org/10.1038/s41588-023-01540-6)

+> **The nf-core framework for community-curated bioinformatics pipelines.**
+>
+> Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.
+>
+> _Nat Biotechnol._ 2020 Feb 13. doi: [10.1038/s41587-020-0439-x](https://dx.doi.org/10.1038/s41587-020-0439-x). 
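The samplesheet conventions described in the README above (required `patient`, `sample`, and `status` columns, with `status` 0 for normal and 1 for tumor) can be sanity-checked before launching a run. This is an illustrative sketch only — the `validate_samplesheet` helper and its error messages are hypothetical, not part of the pipeline; only the column names and the 0/1 convention come from the README:

```python
import csv
import io

REQUIRED = {"patient", "sample", "status"}

def validate_samplesheet(text):
    """Check the conventions from the README table: required columns are
    non-empty, `status` is 0 (normal) or 1 (tumor), and each
    (patient, sample, lane) combination appears only once."""
    errors = []
    seen = set()
    for lineno, row in enumerate(csv.DictReader(io.StringIO(text)), start=2):
        filled = {k for k, v in row.items() if v not in (None, "")}
        missing = REQUIRED - filled
        if missing:
            errors.append(f"line {lineno}: missing {sorted(missing)}")
            continue
        if row["status"] not in ("0", "1"):
            errors.append(f"line {lineno}: status must be 0 (normal) or 1 (tumor)")
        key = (row["patient"], row["sample"], row.get("lane", ""))
        if key in seen:
            errors.append(f"line {lineno}: duplicate row for {key}")
        seen.add(key)
    return errors

# The example sheet from the README: one normal sample (status 0)
sheet = """\
patient,sex,status,sample,lane,fastq_1,fastq_2
TCXX49,XX,0,TCXX49_N,lane_1,/path/to/fastq_1.fq.gz,/path/to/fastq_2.gz
"""
print(validate_samplesheet(sheet))  # []
```

A tumor sample for the same patient would reuse `TCXX49` in the `patient` column with a different `sample` name and `status` 1.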
From ad959055ee984635be89811282b138a211b21caa Mon Sep 17 00:00:00 2001 From: Shihab Dider Date: Wed, 4 Dec 2024 11:59:57 -0500 Subject: [PATCH 2/4] fix: tumor-normal split for coverage --- workflows/nfcasereports.nf | 41 ++++++++++++++++++++++---------------- 1 file changed, 24 insertions(+), 17 deletions(-) diff --git a/workflows/nfcasereports.nf b/workflows/nfcasereports.nf index 68aad59..fe41b47 100644 --- a/workflows/nfcasereports.nf +++ b/workflows/nfcasereports.nf @@ -1066,6 +1066,8 @@ workflow NFCASEREPORTS { normal_frag_cov = Channel.empty() .mix(NORMAL_FRAGCOUNTER.out.fragcounter_cov) .mix(fragcounter_existing_outputs.normal) + + normal_frag_cov_for_merge = normal_frag_cov.map { meta, frag_cov -> [ meta.sample, meta, frag_cov ] } } TUMOR_FRAGCOUNTER(bam_fragcounter_status.tumor) @@ -1074,6 +1076,8 @@ workflow NFCASEREPORTS { .mix(TUMOR_FRAGCOUNTER.out.fragcounter_cov) .mix(fragcounter_existing_outputs.tumor) + tumor_frag_cov_for_merge = tumor_frag_cov.map { meta, frag_cov -> [ meta.sample, meta, frag_cov ] } + // Only need one versions because its just one program (fragcounter) versions = versions.mix(TUMOR_FRAGCOUNTER.out.versions) } @@ -1083,15 +1087,15 @@ workflow NFCASEREPORTS { if (tools_used.contains("all") || tools_used.contains("dryclean")) { cov_dryclean_inputs = inputs .filter { it.dryclean_cov.isEmpty() } - .map { it -> [it.meta] } + .map { it -> [it.meta.sample, it.meta] } .branch{ - normal: it[0].status == 0 - tumor: it[0].status == 1 + normal: it[1].status == 0 + tumor: it[1].status == 1 } - cov_dryclean_tumor_input = tumor_frag_cov + cov_dryclean_tumor_input = tumor_frag_cov_for_merge .join(cov_dryclean_inputs.tumor) - .map{ it -> [ it[0], it[1] ] } // meta, frag_cov + .map{ it -> [ it[1], it[2] ] } // meta, frag_cov dryclean_existing_outputs = inputs .map { it -> [it.meta, it.dryclean_cov] } @@ -1114,15 +1118,18 @@ workflow NFCASEREPORTS { versions = versions.mix(TUMOR_DRYCLEAN.out.versions) if (!params.tumor_only) { - 
cov_dryclean_normal_input = normal_frag_cov + cov_dryclean_normal_input = normal_frag_cov_for_merge .join(cov_dryclean_inputs.normal) - .map{ it -> [ it[0], it[1] ] } // meta, frag_cov + .map{ it -> [ it[1], it[2] ] } // meta, frag_cov NORMAL_DRYCLEAN(cov_dryclean_normal_input) dryclean_normal_cov = Channel.empty() .mix(NORMAL_DRYCLEAN.out.dryclean_cov) .mix(dryclean_existing_outputs.normal) + + dryclean_normal_cov_for_merge = dryclean_normal_cov + .map { it -> [ it[0].patient, it[1] ] } // meta.patient, dryclean_cov } } @@ -1131,22 +1138,22 @@ workflow NFCASEREPORTS { if (tools_used.contains("all") || tools_used.contains("cbs")) { cbs_inputs = inputs .filter { it.seg.isEmpty() || it.nseg.isEmpty() } - .map { it -> [it.meta] } + .map { it -> [it.meta.patient, it.meta] } .branch{ - normal: it[0].status == 0 - tumor: it[0].status == 1 + normal: it[1].status == 0 + tumor: it[1].status == 1 } - cbs_tumor_input = dryclean_tumor_cov - .join(cbs_inputs.tumor) - .map{ it -> [ it[0].patient, it[0], it[1] ] } // meta.patient, meta, dryclean tumor cov + cbs_tumor_input = cbs_inputs.tumor + .join(dryclean_tumor_cov_for_merge) + .map{ it -> [ it[0], it[1], it[2] ] } // meta.patient, meta, dryclean tumor cov if (params.tumor_only) { cov_cbs = cbs_tumor_input.map { patient, meta, tumor_cov -> [ meta, tumor_cov, [] ] } } else { - cbs_normal_input = dryclean_normal_cov - .join(cbs_inputs.normal) - .map{ it -> [ it[0].patient, it[0], it[1] ] } // meta.patient, meta, dryclean normal cov + cbs_normal_input = cbs_inputs.normal + .join(dryclean_normal_cov_for_merge) + .map{ it -> [ it[0], it[1], it[2] ] } // meta.patient, meta, dryclean normal cov cov_cbs = cbs_tumor_input.cross(cbs_normal_input) .map { tumor, normal -> @@ -1623,7 +1630,7 @@ workflow NFCASEREPORTS { if (tools_used.contains("all") || tools_used.contains("fusions")) { fusions_inputs = inputs.filter { it.fusions.isEmpty() }.map { it -> [it.meta.patient, it.meta] } - if (tools_used.contains("non_integer_balance")) { + if 
(tools_used.contains("non_integer_balance") || tools_used.contains("all")) { fusions_input_non_integer_balance = non_integer_balance_balanced_gg_for_merge .join(fusions_inputs) .map { it -> [ it[0], it[1] ] } // meta.patient, balanced_gg From 1536588ddcb48123b1f751c83d26a82f75d57fbc Mon Sep 17 00:00:00 2001 From: Shihab Dider Date: Wed, 4 Dec 2024 12:01:18 -0500 Subject: [PATCH 3/4] fix: increase resources for qc --- conf/base.config | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/conf/base.config b/conf/base.config index efde261..566fe7e 100644 --- a/conf/base.config +++ b/conf/base.config @@ -28,7 +28,7 @@ process { // See https://www.nextflow.io/docs/latest/config.html#config-process-selectors withLabel:process_single { cpus = { check_max( 1 , 'cpus' ) } - memory = { check_max( 6.GB * task.attempt, 'memory' ) } + memory = { check_max( 18.GB * task.attempt, 'memory' ) } time = { check_max( 4.h * task.attempt, 'time' ) } } withLabel:process_low { From a778eff2c9450dc3da60580617a399d5eebbf598 Mon Sep 17 00:00:00 2001 From: Shihab Dider Date: Tue, 7 Jan 2025 10:08:24 -0500 Subject: [PATCH 4/4] fix: memory blowup --- conf/base.config | 4 ++-- conf/modules/aligner.config | 10 ++++++++++ conf/modules/recalibrate.config | 8 ++++---- modules/nf-core/bwamem2/index/main.nf | 2 +- nextflow.config | 2 +- tests/test_runs/full_test/params.json | 2 +- workflows/nfcasereports.nf | 8 ++++++++ 7 files changed, 27 insertions(+), 9 deletions(-) diff --git a/conf/base.config b/conf/base.config index 566fe7e..a022789 100644 --- a/conf/base.config +++ b/conf/base.config @@ -65,9 +65,9 @@ process { cpus = { check_max( 12 * task.attempt, 'cpus' ) } memory = { check_max( 4.GB * task.attempt, 'memory' ) } } - withName: 'BWAMEM1_MEM|BWAMEM2_MEM' { + withName: 'BWAMEM2_MEM|BWAMEM2_MEM' { cpus = { check_max( 24 * task.attempt, 'cpus' ) } - memory = { check_max( 30.GB * task.attempt, 'memory' ) } + memory = { check_max( 72.GB * task.attempt, 'memory' ) } } withName: 
'PARABRICKS_FQ2BAM' { cpus = { check_max( 24 * task.attempt, 'cpus' ) } diff --git a/conf/modules/aligner.config b/conf/modules/aligner.config index 7e783a8..62d46de 100644 --- a/conf/modules/aligner.config +++ b/conf/modules/aligner.config @@ -84,4 +84,14 @@ process { withName: 'MERGE_BAM' { ext.prefix = { "${meta.id}.sorted" } } + + + withName: 'CRAM_TO_BAM_FINAL' { + publishDir = [ + mode: params.publish_dir_mode, + path: { "${params.outdir}/alignment/final/${meta.id}/" }, + pattern: "*{bam,bai}", + ] + } } + diff --git a/conf/modules/recalibrate.config b/conf/modules/recalibrate.config index fe11ecb..5e3fe0b 100644 --- a/conf/modules/recalibrate.config +++ b/conf/modules/recalibrate.config @@ -18,7 +18,7 @@ process { withName: 'GATK4_APPLYBQSR|GATK4_APPLYBQSR_SPARK' { ext.prefix = { meta.num_intervals <= 1 ? "${meta.id}.recal" : "${meta.id}_${intervals.simpleName}.recal" } publishDir = [ - enabled: !params.save_output_as_bam, + enabled: params.save_mapped, mode: params.publish_dir_mode, path: { "${params.outdir}/alignment/" }, pattern: "*cram", @@ -30,7 +30,7 @@ process { ext.prefix = { "${meta.id}.recal" } publishDir = [ - enabled: !params.save_output_as_bam, + enabled: params.save_mapped, mode: params.publish_dir_mode, path: { "${params.outdir}/alignment/recalibrated/${meta.id}/" }, pattern: "*cram" @@ -39,7 +39,7 @@ process { withName: 'MSKILABORG_NFJABBA:NFJABBA:(BAM_APPLYBQSR|BAM_APPLYBQSR_SPARK):CRAM_MERGE_INDEX_SAMTOOLS:INDEX_CRAM' { publishDir = [ - enabled: !params.save_output_as_bam, + enabled: params.save_mapped, mode: params.publish_dir_mode, path: { "${params.outdir}/alignment/recalibrated/${meta.id}/" }, pattern: "*{recal.cram,recal.cram.crai}" @@ -50,7 +50,7 @@ process { ext.prefix = { "${meta.id}.recal" } publishDir = [ - enabled: params.save_output_as_bam, + enabled: params.save_mapped, mode: params.publish_dir_mode, path: { "${params.outdir}/alignment/recalibrated/${meta.id}/" }, pattern: "*{recal.bam,recal.bam.bai}" diff --git 
a/modules/nf-core/bwamem2/index/main.nf b/modules/nf-core/bwamem2/index/main.nf
index 9fabda2..7c6036c 100644
--- a/modules/nf-core/bwamem2/index/main.nf
+++ b/modules/nf-core/bwamem2/index/main.nf
@@ -1,6 +1,6 @@
 process BWAMEM2_INDEX {
     tag "$fasta"
-    label 'process_low'
+    label 'process_high'
 
     conda "bioconda::bwa-mem2=2.2.1"
     container "${ workflow.containerEngine == 'singularity' && !task.ext.singularity_pull_docker_container ?
diff --git a/nextflow.config b/nextflow.config
index 2405679..9d9fbdb 100644
--- a/nextflow.config
+++ b/nextflow.config
@@ -48,7 +48,7 @@ params {
    fq2bam_mark_duplicates = true // Whether fq2bam should mark duplicates, set false if not using fq2bam
    fq2bam_low_memory = false // Set to true if using fq2bam with gpus that have <24GB memory
    optical_duplicate_pixel_distance = 2500 // For computing optical duplicates, 2500 for NovaSeqX+
-    save_mapped = true // Mapped BAMs are saved
+    save_mapped = false // Mapped BAMs are not saved by default
    save_output_as_bam = true // Output files from alignment are saved as bam by default and not as cram files
    seq_center = null // No sequencing center to be written in read group CN field by aligner
    seq_platform = null // Default platform written in read group PL field by aligner, null by default. 
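The memory bumps in `conf/base.config` above (18.GB for `process_single`, 72.GB for BWA-MEM2) follow nf-core's `check_max` pattern: the request grows linearly with `task.attempt` on each retry, but is clamped to the pipeline's global `--max_memory`. A rough Python sketch of that arithmetic (units simplified to whole GB; the 256 GB cap is a hypothetical `--max_memory` value, not from the patch):

```python
def check_max(request_gb, max_gb):
    """Clamp a resource request to the configured maximum,
    as nf-core's check_max(obj, 'memory') does."""
    return min(request_gb, max_gb)

def memory_gb(base_gb, attempt, max_gb=256):
    """Mimic `memory = { check_max( base.GB * task.attempt, 'memory' ) }`:
    each retry scales the request linearly, up to --max_memory."""
    return check_max(base_gb * attempt, max_gb)

# The BWA-MEM2 block above requests 72 GB on the first attempt,
# then escalates on retries until the cap kicks in.
print([memory_gb(72, attempt) for attempt in (1, 2, 3, 4)])  # [72, 144, 216, 256]
```

This is why raising a base value (6 GB to 18 GB, 30 GB to 72 GB) also raises every retry tier, which matters on schedulers that kill over-subscribed jobs.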
diff --git a/tests/test_runs/full_test/params.json b/tests/test_runs/full_test/params.json index e12c32a..bba1b60 100644 --- a/tests/test_runs/full_test/params.json +++ b/tests/test_runs/full_test/params.json @@ -6,7 +6,7 @@ "bwa": "/gpfs/commons/home/sdider/DB/GATK/bwa/", "outdir": "./results", "pon_dryclean": "/gpfs/commons/home/tdey/data/dryclean/MONSTER_PON_RAW/MONSTER_PON_RAW_SORTED/fixed.detergent.rds", - "tools": "fusions", + "tools": "bamqc", "field_dryclean": "reads", "genome": "GATK.GRCh37", "email": "shihabdider@gmail.com" diff --git a/workflows/nfcasereports.nf b/workflows/nfcasereports.nf index fe41b47..c2b2eed 100644 --- a/workflows/nfcasereports.nf +++ b/workflows/nfcasereports.nf @@ -368,6 +368,14 @@ inputs = inputs ch_items.meta = ch_items.meta - ch_items.meta.subMap('lane') + [num_lanes: num_lanes.toInteger(), read_group: read_group.toString(), size: 1] + } else if (ch_items.fastq_2) { + ch_items.meta = ch_items.meta + [id: ch_items.meta.sample.toString()] + def CN = params.seq_center ? "CN:${params.seq_center}\\t" : '' + + def flowcell = flowcellLaneFromFastq(ch_items.fastq_1) + def read_group = "\"@RG\\tID:${flowcell}.${ch_items.meta.sample}\\t${CN}PU:${ch_items.meta.sample}\\tSM:${ch_items.meta.patient}_${ch_items.meta.sample}\\tLB:${ch_items.meta.sample}\\tDS:${params.fasta}\\tPL:${params.seq_platform}\"" + + ch_items.meta = ch_items.meta + [num_lanes: num_lanes.toInteger(), read_group: read_group.toString(), size: 1] } else if (ch_items.meta.lane && ch_items.bam) { ch_items.meta = ch_items.meta + [id: "${ch_items.meta.sample}-${ch_items.meta.lane}".toString()] def CN = params.seq_center ? "CN:${params.seq_center}\\t" : ''
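The substance of the `fix: tumor-normal split for coverage` patch above is re-keying channels (`*_for_merge` variants keyed by `meta.sample` or `meta.patient`) before calling `join`, so tuples pair on an explicit scalar key instead of relying on whole `meta` maps matching. A Python sketch of that join semantics (illustrative only; the sample data and `join_by_key` helper are hypothetical, and Nextflow's real `join` operates on asynchronous channels):

```python
def join_by_key(left, right):
    """Inner-join two lists of (key, *values) tuples on their first element,
    mimicking Nextflow's channel `join` operator for unique keys."""
    lookup = {item[0]: item[1:] for item in right}
    return [item + lookup[item[0]] for item in left if item[0] in lookup]

# Channels re-keyed by meta['sample'], as in tumor_frag_cov_for_merge
tumor_frag_cov = [("s1_t", {"patient": "p1", "status": 1}, "s1_t.cov.rds")]
cov_dryclean_tumor = [("s1_t", {"patient": "p1", "status": 1})]

joined = join_by_key(tumor_frag_cov, cov_dryclean_tumor)
# Drop the join key and keep [meta, frag_cov], like .map{ it -> [it[1], it[2]] }
dryclean_input = [[item[1], item[2]] for item in joined]
print(dryclean_input[0][1])  # s1_t.cov.rds
```

Joining on a scalar key is what lets the tumor and normal branches stay separate: a normal-sample coverage keyed by its own `meta.sample` can never pair with a tumor-side input row.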