Merge pull request #129 from Plant-Food-Research-Open/dev

Release candidate for 0.6.0
Plant-Food-Research-Open · Dec 20, 2024 · 023488c
2 parents ee702d7 + d069633
commit 023488c
Showing 44 changed files with 2,767 additions and 131 deletions.
diff --git a/.github/workflows/branch.yml b/.github/workflows/branch.yml
@@ -1,15 +1,15 @@
 name: nf-core branch protection
-# This workflow is triggered on PRs to master branch on the repository
-# It fails when someone tries to make a PR against the nf-core `master` branch instead of `dev`
+# This workflow is triggered on PRs to main branch on the repository
+# It fails when someone tries to make a PR against the Plant-Food-Research-Open `main` branch instead of `dev`
 on:
   pull_request_target:
-    branches: [master]
+    branches: [main]
 
 jobs:
   test:
     runs-on: ubuntu-latest
     steps:
-      # PRs to the nf-core repo master branch are only ok if coming from the nf-core repo `dev` or any `patch` branches
+      # PRs to the nf-core repo main branch are only ok if coming from the nf-core repo `dev` or any `patch` branches
       - name: Check PRs
         if: github.repository == 'Plant-Food-Research-Open/genepal'
         run: |
@@ -22,7 +22,7 @@ jobs:
         uses: mshick/add-pr-comment@b8f338c590a895d50bcbfa6c5859251edc8952fc # v2
         with:
           message: |
-            ## This PR is against the `master` branch :x:
+            ## This PR is against the `main` branch :x:
 
             * Do not close this PR
             * Click _Edit_ and change the `base` to `dev`
@@ -32,9 +32,9 @@ jobs:
 
             Hi @${{ github.event.pull_request.user.login }},
 
-            It looks like this pull-request is has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `master` branch.
-            The `master` branch on nf-core repositories should always contain code from the latest release.
-            Because of this, PRs to `master` are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch.
+            It looks like this pull-request is has been made against the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `main` branch.
+            The `main` branch should always contain code from the latest release.
+            Because of this, PRs to `main` are only allowed if they come from the [${{github.event.pull_request.head.repo.full_name }}](https://github.com/${{github.event.pull_request.head.repo.full_name }}) `dev` branch.
 
             You do not need to close this PR, you can change the target branch to `dev` by clicking the _"Edit"_ button at the top of this page.
             Note that even after this, the test will continue to show as failing until you push a new commit.
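
Review note: the shell body of the `Check PRs` step is not shown in this hunk, so the following is only a minimal Python sketch of the rule stated in the workflow comment (PRs to `main` are acceptable from this repository's `dev` branch or from a `patch` branch). The function name `pr_to_main_allowed` and the exact matching of the branch name `patch` are assumptions made for illustration, not the workflow's actual implementation.

```python
# Hypothetical sketch of the branch-protection rule described in the comment.
# The repository name is taken from the `if:` condition in the workflow.
UPSTREAM_REPO = "Plant-Food-Research-Open/genepal"


def pr_to_main_allowed(head_repo: str, head_branch: str) -> bool:
    """Return True if a PR targeting `main` should pass the check."""
    # Case 1: the PR comes from the upstream repository's `dev` branch.
    from_upstream_dev = head_repo == UPSTREAM_REPO and head_branch == "dev"
    # Case 2 (assumption): the head branch is named exactly `patch`.
    from_patch = head_branch == "patch"
    return from_upstream_dev or from_patch


if __name__ == "__main__":
    print(pr_to_main_allowed("Plant-Food-Research-Open/genepal", "dev"))
    print(pr_to_main_allowed("someone/genepal", "dev"))
```

Under this reading, a PR from a fork's `dev` branch fails the check and receives the comment above asking the author to retarget the PR to `dev`.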

diff --git a/.github/workflows/ci.yml b/.github/workflows/ci.yml
@@ -55,7 +55,7 @@ jobs:
         uses: actions/[email protected]
 
       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v2
+        uses: nf-core/setup-nextflow@v2.0.0
         with:
           version: "${{ matrix.NXF_VER }}"
 

diff --git a/.github/workflows/download_pipeline.yml b/.github/workflows/download_pipeline.yml
@@ -2,7 +2,7 @@ name: Test successful pipeline download with 'nf-core pipelines download'
 
 # Run the workflow when:
 #  - dispatched manually
-#  - when a PR is opened or reopened to master branch
+#  - when a PR is opened or reopened to main branch
 #  - the head branch of the pull request is updated, i.e. if fixes for a release are pushed last minute to dev.
 on:
   workflow_dispatch:
@@ -17,10 +17,10 @@ on:
       - edited
       - synchronize
     branches:
-      - master
+      - main
   pull_request_target:
     branches:
-      - master
+      - main
 
 env:
   NXF_ANSI_LOG: false
@@ -30,7 +30,7 @@ jobs:
     runs-on: ubuntu-latest
     steps:
       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v2
+        uses: nf-core/setup-nextflow@v2.0.0
 
       - name: Disk space cleanup
         uses: jlumbroso/free-disk-space@54081f138730dfa15788a46383842cd2f914a1be # v1.3.1

diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml
@@ -34,7 +34,7 @@ jobs:
         uses: actions/checkout@0ad4b8fadaa221de15dcec353f45205ec38ea70b # v4
 
       - name: Install Nextflow
-        uses: nf-core/setup-nextflow@v2
+        uses: nf-core/setup-nextflow@v2.0.0
 
       - uses: actions/setup-python@82c7e631bb3cdc910f68e0081d67478d79c6982d # v5
         with:

diff --git a/.nf-core.yml b/.nf-core.yml
@@ -30,5 +30,5 @@ template:
   outdir: .
   skip_features:
     - igenomes
-  version: 0.5.0
+  version: 0.6.0
 update: null
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -3,6 +3,32 @@
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## v0.6.0 - [20-Dec-2024]
+
+### `Added`
+
+1. Added cDNA and CDS outputs to <OUTPUT_DIR>/annotations/<SAMPLE> directory [#118](https://github.com/Plant-Food-Research-Open/genepal/issues/118)
+2. Added parameter `add_attrs_to_proteins_cds_fastas`
+3. Added parameter `filter_genes_by_aa_length` with default set to `24` which allows removal of genes with ORFs shorter than 24 [#125](https://github.com/Plant-Food-Research-Open/genepal/issues/125)
+
+### `Fixed`
+
+1. Fixed an issue where TSEBRA failed because LIFTOFF lifted non-protein coding genes [#121](https://github.com/Plant-Food-Research-Open/genepal/issues/121)
+2. Switched branch name from `master` to `main` in the GHA CIs
+3. Fixed an issue in `genepal_report.Rmd` which caused the pangene matrix plot to fail when the number of clusters exceeded 65536 [#124](https://github.com/Plant-Food-Research-Open/genepal/issues/124)
+4. Fixed an issue where `GENEPALREPORT` process failed due to OOM kill signal from SLURM [#123](https://github.com/Plant-Food-Research-Open/genepal/issues/123)
+5. Fixed an issue where Gff merge after liftoff failed when one of the Gff files did not contain any genes
+6. Fixed an issue where `gxf_fasta_agat_spaddintrons_spextractsequences` crashed due to short introns [#89](https://github.com/Plant-Food-Research-Open/genepal/issues/89)
+
+### `Dependencies`
+
+1. Nextflow!>=24.04.2
+2. [email protected]
+
+### `Deprecated`
+
+1. Removed parameter `add_attrs_to_proteins_fasta`
+
 ## v0.5.0 - [21-Nov-2024]
 
 ### `Added`

diff --git a/CITATION.cff b/CITATION.cff
@@ -31,7 +31,7 @@ authors:
   - family-names: "Thomson"
     given-names: "Susan"
 title: "genepal: A Nextflow pipeline for genome and pan-genome annotation"
-version: 0.5.0
+version: 0.6.0
 date-released: 2024-11-21
 url: "https://github.com/Plant-Food-Research-Open/genepal"
 doi: 10.5281/zenodo.14195006
diff --git a/README.md b/README.md
@@ -35,14 +35,16 @@
   - Merge multi-reference liftoffs
   - Remove liftoff transcripts marked by _valid_ORF=False_
   - Remove liftoff genes with any intron shorter than 10 bp
-  - Remove rRNA and tRNA from liftoff
+  - Remove rRNA, tRNA and other non-protein coding models from liftoff
   - Optionally, allow or remove iso-forms
   - Remove BRAKER models from Liftoff loci
   - Merge Liftoff and BRAKER models
   - Optionally, remove models without any EggNOG-mapper hits
 - [EggNOG-mapper](https://github.com/eggnogdb/eggnog-mapper): Add functional annotation to gff
 - [GenomeTools](https://github.com/genometools/genometools): GFF format validation
-- [GffRead](https://github.com/gpertea/gffread): Extraction of protein sequences
+- [GffRead](https://github.com/gpertea/gffread)
+  - Extraction of protein sequences
+  - Optionally, remove models with ORFs shorter than `N` amino acids
 - [OrthoFinder](https://github.com/davidemms/OrthoFinder): Perform phylogenetic orthology inference across genomes
 - [GffCompare](https://github.com/gpertea/gffcompare): Compare and benchmark against an existing annotation
 - [BUSCO](https://gitlab.com/ezlab/busco): Completeness statistics for genome and annotation through proteins
@@ -97,7 +99,7 @@ sbatch ./pfr_genepal
 
 plant-food-research-open/genepal workflows were originally scripted by Jason Shiller ([@jasonshiller](https://github.com/jasonshiller)). Usman Rashid ([@gallvp](https://github.com/gallvp)) wrote the Nextflow pipeline.
 
-We thank the following people for their extensive assistance in the development of this pipeline:
+We thank the following people for extensive assistance in the development of the pipeline,
 
 - Cecilia Deng [@CeciliaDeng](https://github.com/CeciliaDeng)
 - Charles David [@charlesdavid](https://github.com/charlesdavid)
@@ -107,6 +109,10 @@ We thank the following people for their extensive assistance in the development
 - Susan Thomson [@cflsjt](https://github.com/cflsjt)
 - Ting-Hsuan Chen [@ting-hsuan-chen](https://github.com/ting-hsuan-chen)
 
+and for contributions to the codebase,
+
+- Liam Le Lievre [@liamlelievre](https://github.com/liamlelievre)
+
 The pipeline uses nf-core modules contributed by following authors:
 
 <a href="https://github.com/gallvp"><img src="https://github.com/gallvp.png" width="50" height="50"></a>
@@ -139,6 +145,7 @@ The pipeline uses nf-core modules contributed by following authors:
 <a href="https://github.com/charles-plessy"><img src="https://github.com/charles-plessy.png" width="50" height="50"></a>
 <a href="https://github.com/bunop"><img src="https://github.com/bunop.png" width="50" height="50"></a>
 <a href="https://github.com/abhi18av"><img src="https://github.com/abhi18av.png" width="50" height="50"></a>
+<a href="https://github.com/liamlelievre"><img src="https://github.com/liamlelievre.png" width="50" height="50"></a>
 
 ## Contributions and Support
 

diff --git a/assets/multiqc_config.yml b/assets/multiqc_config.yml
@@ -1,7 +1,7 @@
 report_comment: >
   This report has been generated by the <a href="https://github.com/plant-food-research-open/genepal" target="_blank">plant-food-research-open/genepal</a>
   analysis pipeline. For information about how to interpret these results, please see the
-  <a href="https://github.com/plant-food-research-open/genepal/blob/0.5.0/docs/usage.md" target="_blank">documentation</a>.
+  <a href="https://github.com/plant-food-research-open/genepal/blob/0.6.0/docs/usage.md" target="_blank">documentation</a>.
 
 report_section_order:
   "plant-food-research-open-genepal-methods-description":

diff --git a/bin/genepal_report.Rmd b/bin/genepal_report.Rmd
@@ -190,22 +190,43 @@ cat("<br>")
 
 
 ```{r pheatmap, eval=(exists("n0_df") && !is.null(n0_df$heatmap)), results='hide', fig.align='center', fig.cap="Heatmap showing number of proteins present in each orthocluster (clusters where all individuals have 1 copy are excluded). Columns = Orthologue cluster, Row = Individual", fig.width=7, fig.height=7, dpi=150, warning=FALSE}
-pheatmap(n0_df$heatmap,
+
+# Max 65536 allowed
+# https://github.com/Plant-Food-Research-Open/genepal/issues/124
+
+n_cols <- ncol(n0_df$heatmap)
+max_cols_allowed <- min(n_cols, 5000)
+
+# Approach 1: Random selection of columns
+# selected_cols <- sample(n_cols, max_cols_allowed)
+
+# Approach 2: First N largest clusters
+selected_cols <- order(colSums(n0_df$heatmap), decreasing = TRUE)[seq(1, max_cols_allowed)]
+
+prefix_text <- ""
+
+if ( n_cols != max_cols_allowed ) {
+  prefix_text <- paste0("Top ", max_cols_allowed, " ")
+}
+
+pheatmap(n0_df$heatmap[, selected_cols],
   show_colnames = FALSE,
-  main = "Orthologue clusters containing accessory proteins",
+  main = paste0(prefix_text, "Orthologue clusters"),
   legend = TRUE,
   legend_labels = TRUE,
   border_color = "white"
 )
 
-pheatmap(n0_df$heatmap,
+pheatmap(n0_df$heatmap[, selected_cols],
   filename = file.path(outputs_folder, "pangene.matrix.heatmap.pdf"),
   show_colnames = FALSE,
-  main = "Orthologue clusters containing accessory proteins",
+  main = paste0(prefix_text, "Orthologue clusters"),
   legend = TRUE,
   legend_labels = TRUE,
   border_color = "white"
 )
+
+write.csv(x = transform_hogs(n0o), file = file.path(outputs_folder, "pangenome.matrix.csv"), row.names = FALSE)
 ```
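Review note: the column selection in this hunk ("Approach 2") caps the heatmap at 5000 columns and keeps the clusters with the largest column sums. A minimal Python sketch of the same idea follows, assuming the matrix is a plain list of rows; the names `select_top_columns` and `heatmap_title` are hypothetical and only mirror `selected_cols` and `prefix_text` from the R code.

```python
# Sketch of the "first N largest clusters" selection, assuming a
# row-major matrix (rows = individuals, columns = orthologue clusters).
def select_top_columns(matrix, max_cols=5000):
    """Return indices of the columns with the largest sums, largest first."""
    if not matrix:
        return []
    n_cols = len(matrix[0])
    keep = min(n_cols, max_cols)  # never ask for more columns than exist
    col_sums = [sum(row[j] for row in matrix) for j in range(n_cols)]
    # Stable sort: ties keep their original left-to-right order.
    order = sorted(range(n_cols), key=lambda j: col_sums[j], reverse=True)
    return order[:keep]


def heatmap_title(n_cols, kept):
    """Prefix the title with 'Top N' only when columns were dropped."""
    prefix = f"Top {kept} " if n_cols != kept else ""
    return prefix + "Orthologue clusters"


if __name__ == "__main__":
    counts = [[1, 5, 2], [0, 3, 2]]  # column sums: 1, 8, 4
    cols = select_top_columns(counts, max_cols=2)
    print(cols, heatmap_title(3, len(cols)))
```

Ranking by column sum keeps the plot deterministic between runs, which the commented-out random sampling ("Approach 1") would not.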
 
 

diff --git a/conf/base.config b/conf/base.config
@@ -74,4 +74,7 @@ process {
         cpus = { 8          * task.attempt  }
         time = { 7.days     * task.attempt  }
     }
+    withName:GENEPALREPORT {
+        memory = { 20.GB    * task.attempt  }
+    }
 }
diff --git a/conf/modules.config b/conf/modules.config
@@ -199,7 +199,7 @@ process { // SUBWORKFLOW: FASTA_LIFTOFF
     }
 
     withName: '.*:FASTA_LIFTOFF:GFFREAD_BEFORE_LIFTOFF' {
-        ext.args = '--no-pseudo --keep-genes'
+        ext.args = '--no-pseudo --keep-genes -C'
     }
 
     withName: '.*:FASTA_LIFTOFF:MERGE_LIFTOFF_ANNOTATIONS' {
@@ -212,7 +212,7 @@ process { // SUBWORKFLOW: FASTA_LIFTOFF
 
     withName: '.*:FASTA_LIFTOFF:GFFREAD_AFTER_LIFTOFF' {
         ext.prefix = { "${meta.id}.liftoff" }
-        ext.args = '--keep-genes'
+        ext.args = '--no-pseudo --keep-genes -C'
     }
 
     withName: '.*:FASTA_LIFTOFF:GFF_TSEBRA_SPFILTERFEATUREFROMKILLLIST:AGAT_CONVERTSPGFF2GTF' {
@@ -240,6 +240,10 @@ process { // SUBWORKFLOW: GFF_MERGE_CLEANUP
         ext.prefix = { "${meta.id}.liftoff.braker" }
     }
 
+    withName: '.*:GFF_MERGE_CLEANUP:FILTER_BY_ORF_SIZE' {
+        ext.args = params.filter_genes_by_aa_length ? "--no-pseudo --keep-genes -C -l ${ ( params.filter_genes_by_aa_length + 1 ) * 3 }" : ''
+    }
+
     withName: '.*:GFF_MERGE_CLEANUP:GT_GFF3' {
         ext.args = '-tidy -retainids -sort'
     }
@@ -286,7 +290,7 @@ process { // SUBWORKFLOW: GFF_STORE
     }
 
     withName: '.*:GFF_STORE:EXTRACT_PROTEINS' {
-        ext.args = params.add_attrs_to_proteins_fasta ? '-F -D -y' : '-y'
+        ext.args = params.add_attrs_to_proteins_cds_fastas ? '-F -D -y' : '-y'
         ext.prefix = { "${meta.id}.pep" }
 
         publishDir = [
@@ -295,6 +299,27 @@ process { // SUBWORKFLOW: GFF_STORE
             saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
         ]
     }
+
+    withName: '.*:GFF_STORE:EXTRACT_CDS' {
+        ext.args = params.add_attrs_to_proteins_cds_fastas ? '-F -D -x' : '-x'
+        ext.prefix = { "${meta.id}.cds" }
+
+        publishDir = [
+            path: { "${params.outdir}/annotations/$meta.id" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
+    withName: '.*:GFF_STORE:EXTRACT_CDNA' {
+        ext.args = params.add_attrs_to_proteins_cds_fastas ? '-F -D -w' : '-w'
+        ext.prefix = { "${meta.id}.cdna" }
+
+        publishDir = [
+            path: { "${params.outdir}/annotations/$meta.id" },
+            mode: params.publish_dir_mode,
+            saveAs: { filename -> filename.equals('versions.yml') ? null : filename },
+        ]
+    }
 }
 
 process { // SUBWORKFLOW: FASTA_ORTHOFINDER
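
Review note: the `FILTER_BY_ORF_SIZE` arguments convert the amino-acid threshold into a nucleotide length with `(N + 1) * 3`. The `+ 1` presumably accounts for the stop codon, since `docs/parameters.md` describes the threshold as "excluding the stop codon". The Python sketch below reproduces only that arithmetic and the null check; the function name `filter_by_orf_size_args` is hypothetical.

```python
# Sketch of the ext.args expression for FILTER_BY_ORF_SIZE.
# None (null) or 0 disables the filter, matching Groovy truthiness.
def filter_by_orf_size_args(filter_genes_by_aa_length):
    """Build the argument string, or '' when the filter is skipped."""
    if not filter_genes_by_aa_length:
        return ""
    # N amino acids plus one stop codon, three nucleotides each.
    min_nt = (filter_genes_by_aa_length + 1) * 3
    return f"--no-pseudo --keep-genes -C -l {min_nt}"


if __name__ == "__main__":
    print(filter_by_orf_size_args(24))    # default threshold of 24 aa
    print(repr(filter_by_orf_size_args(None)))
```

With the default of `24`, the length passed to `-l` is `(24 + 1) * 3 = 75`.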

diff --git a/docs/output.md b/docs/output.md
@@ -169,6 +169,8 @@ If more than one genome is included in the pipeline, [ORTHOFINDER](https://githu
   - `Y/`
     - `Y.gt.gff3`: Final annotation file for genome `Y` which contains gene models and their functional annotations
     - `Y.pep.fasta`: Protein sequences for the gene models
+    - `Y.cdna.fasta`: cDNA sequences for the gene models
+    - `Y.cds.fasta`: Coding sequences for the gene models
 
 </details>
 

diff --git a/docs/parameters.md b/docs/parameters.md
@@ -59,19 +59,20 @@ A Nextflow pipeline for consensus, phased and pan-genome annotation.
 
 ## Post-annotation filtering options
 
-| Parameter                     | Description                                                       | Type      | Default | Required | Hidden |
-| ----------------------------- | ----------------------------------------------------------------- | --------- | ------- | -------- | ------ |
-| `allow_isoforms`              | Allow multiple isoforms for gene models                           | `boolean` | True    |          |        |
-| `enforce_full_intron_support` | Require every model to have external evidence for all its introns | `boolean` | True    |          |        |
-| `filter_liftoff_by_hints`     | Use BRAKER hints to filter Liftoff models                         | `boolean` | True    |          |        |
-| `eggnogmapper_purge_nohits`   | Purge transcripts which do not have a hit against eggnog          | `boolean` |         |          |        |
+| Parameter                     | Description                                                                                                                                                     | Type      | Default | Required | Hidden |
+| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------- | --------- | ------- | -------- | ------ |
+| `allow_isoforms`              | Allow multiple isoforms for gene models                                                                                                                         | `boolean` | True    |          |        |
+| `enforce_full_intron_support` | Require every model to have external evidence for all its introns                                                                                               | `boolean` | True    |          |        |
+| `filter_liftoff_by_hints`     | Use BRAKER hints to filter Liftoff models                                                                                                                       | `boolean` | True    |          |        |
+| `eggnogmapper_purge_nohits`   | Purge transcripts which do not have a hit against eggnog                                                                                                        | `boolean` |         |          |        |
+| `filter_genes_by_aa_length`   | Filter genes with open reading frames shorter than the specified number of amino acids excluding the stop codon. If set to `null`, this filter step is skipped. | `integer` | 24      |          |        |
 
 ## Annotation output options
 
-| Parameter                     | Description                          | Type      | Default | Required | Hidden |
-| ----------------------------- | ------------------------------------ | --------- | ------- | -------- | ------ |
-| `braker_save_outputs`         | Save BRAKER files                    | `boolean` |         |          |        |
-| `add_attrs_to_proteins_fasta` | Add gff attributes to proteins fasta | `boolean` |         |          |        |
+| Parameter                          | Description                                   | Type      | Default | Required | Hidden |
+| ---------------------------------- | --------------------------------------------- | --------- | ------- | -------- | ------ |
+| `braker_save_outputs`              | Save BRAKER files                             | `boolean` |         |          |        |
+| `add_attrs_to_proteins_cds_fastas` | Add gff attributes to proteins/cDNA/CDS fasta | `boolean` |         |          |        |
 
 ## Evaluation options
 

diff --git a/modules.json b/modules.json
@@ -111,7 +111,7 @@
                     },
                     "gxf_fasta_agat_spaddintrons_spextractsequences": {
                         "branch": "main",
-                        "git_sha": "7bf6fbca23edc94490ffa6709f52b2f71c6fb130",
+                        "git_sha": "ed4146008dbdcfd4823252b456de32059e2d07f4",
                         "installed_by": ["subworkflows"]
                     }
                 }