
Commit

[NCBI Scrub Standalone Workflows] Correct output declarations for the number of spots removed (#610)

* correct ncbi_scrub_human_spots_removed output variable

* fix broken tables

* add documentation for NCBI-Scrub

---------

Co-authored-by: Sage Wright <[email protected]>
cimendes and sage-wright authored Sep 17, 2024
1 parent 6d44d5f commit dab1316
Showing 15 changed files with 103 additions and 15 deletions.
2 changes: 1 addition & 1 deletion docs/workflows/data_export/zip_column_content.md
Original file line number Diff line number Diff line change
@@ -4,7 +4,7 @@ title: Zip_Column_Content

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |||||
| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Exporting Data From Terra](../../workflows_overview/workflows_type.md/#exporting-data-from-terra) | [Any taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.1.0 | Yes | Set-level |

10 changes: 5 additions & 5 deletions docs/workflows/genomic_characterization/theiacov.md
@@ -573,7 +573,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
!!! tip ""
These tasks are performed regardless of organism, and perform read trimming and various quality control steps.

??? task "`versioning`: Version capture for TheiaEuk"
??? task "`versioning`: Version capture for TheiaCoV"

The `versioning` task captures the workflow version from the GitHub (code repository) version.
@@ -739,7 +739,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
| Tasks | [task_ncbi_scrub.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_ncbi_scrub.wdl#L68) (SE subtask)<br>[task_artic_guppyplex.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_artic_guppyplex.wdl)<br>[task_kraken2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/contamination/task_kraken2.wdl#L3)|
| Software Source Code | [NCBI Scrub on GitHub](https://github.com/ncbi/sra-human-scrubber)<br>[Artic on GitHub](https://github.com/artic-network/fieldbioinformatics)<br>[Kraken2 on GitHub](https://github.com/DerrickWood/kraken2/) |
| Software Documentation | [NCBI Scrub](<https://github.com/ncbi/sra-human-scrubber/blob/master/README.md>)<br>[Artic pipeline](https://artic.readthedocs.io/en/latest/?badge=latest)<br>[Kraken2](https://github.com/DerrickWood/kraken2/wiki) |
| Original Publication(s) | [*STAT: a fast, scalable, MinHash-based *k*-mer tool to assess Sequence Read Archive next-generation sequence submissions](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02490-0)<br>*[Improved metagenomic analysis with Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0) |
| Original Publication(s) | [STAT: a fast, scalable, MinHash-based *k*-mer tool to assess Sequence Read Archive next-generation sequence submissions](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02490-0)<br>[Improved metagenomic analysis with Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0) |

#### Assembly tasks

@@ -765,7 +765,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
| Tasks | [task_bwa.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/alignment/task_bwa.wdl)<br>[task_ivar_primer_trim.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_ivar_primer_trim.wdl)<br>[task_assembly_metrics.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/basic_statistics/task_assembly_metrics.wdl)<br>[task_ivar_variant_call.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/gene_typing/variant_detection/task_ivar_variant_call.wdl)<br>[task_ivar_consensus.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_ivar_consensus.wdl) |
| Software Source Code | [BWA on GitHub](https://github.com/lh3/bwa), [iVar on GitHub](https://andersen-lab.github.io/ivar/html/) |
| Software Documentation | [BWA on SourceForge](https://bio-bwa.sourceforge.net/), [iVar on GitHub](https://andersen-lab.github.io/ivar/html/) |
| Original Publication(s) | [*Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM](https://doi.org/10.48550/arXiv.1303.3997)<br>[*An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar](http://dx.doi.org/10.1186/s13059-018-1618-7) |
| Original Publication(s) | [Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM](https://doi.org/10.48550/arXiv.1303.3997)<br>[An amplicon-based sequencing framework for accurately measuring intrahost virus diversity using PrimalSeq and iVar](http://dx.doi.org/10.1186/s13059-018-1618-7) |

??? toggle "`artic_consensus`: Alignment, Primer Trimming, Variant Detection, and Consensus ==_for non-flu organisms in ONT & ClearLabs workflows_=="

@@ -794,7 +794,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
| --- | --- |
| Task | [task_irma.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_irma.wdl) |
| Software Documentation | [IRMA website](https://wonder.cdc.gov/amd/flu/irma/) |
| Original Publication(s) | [*Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3030-6) |
| Original Publication(s) | [Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3030-6) |

#### Organism-specific characterization tasks {#org-specific-tasks}

@@ -873,7 +873,7 @@ All input reads are processed through "core tasks" in the TheiaCoV Illumina, ONT
| --- | --- |
| Task | [task_irma.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/assembly/task_irma.wdl) |
| Software Documentation | [IRMA website](https://wonder.cdc.gov/amd/flu/irma/) |
| Original Publication(s) | [*Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3030-6) |
| Original Publication(s) | [Viral deep sequencing needs an adaptive approach: IRMA, the iterative refinement meta-assembler](https://bmcgenomics.biomedcentral.com/articles/10.1186/s12864-016-3030-6) |

??? task "`abricate`"

2 changes: 0 additions & 2 deletions docs/workflows/phylogenetic_construction/core_gene_snp.md
@@ -18,8 +18,6 @@ The Core_Gene_SNP workflow is intended for pangenome analysis, core gene alignme

### Inputs

### Optional User Inputs

For further detail regarding PIRATE options, please see [PIRATE's documentation](https://github.com/SionBayliss/PIRATE). For further detail regarding IQ-TREE options, please see the [IQ-TREE command reference](http://www.iqtree.org/doc/Command-Reference).

This workflow runs on the set level.
1 change: 1 addition & 0 deletions docs/workflows/phylogenetic_construction/mashtree_fasta.md
@@ -64,6 +64,7 @@ By default, this task appends a Phandango coloring tag to color all items from t
### Outputs

| **Variable** | **Type** | **Description** |
|---|---|---|
| mashtree_docker | String | The Docker image used to run the mashtree task |
| mashtree_filtered_metadata | File | Optional output file with filtered metadata that is only produced if the optional `summarize_data` task is used |
| mashtree_matrix | File | The SNP matrix made |
2 changes: 1 addition & 1 deletion docs/workflows/phylogenetic_construction/snippy_tree.md
@@ -226,7 +226,7 @@ Sequencing data used in the Snippy_Tree workflow must:

The SNP-distance output can be visualized using software such as [Phandango](http://jameshadfield.github.io/phandango/#/main) to explore the relationships between the genomic sequences. The task adds a Phandango coloring tag (:c1) to the column names in the output matrix to ensure that all columns are colored with the same color scheme throughout.

- **Technical details**
!!! techdetails "SNP-dists Technical Details"
| | Links |
| --- | --- |
2 changes: 1 addition & 1 deletion docs/workflows/phylogenetic_placement/usher.md
@@ -14,7 +14,7 @@

While this workflow is technically a set-level workflow, it works on the sample-level too. When run on the set-level, the samples are placed with respect to each other.

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** | |
| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| usher_workflow | **assembly_fasta** | Array[File] | The assembly files for the samples you want to place on the pre-existing phylogenetic tree; can either be a set of samples, an individual sample, or multiple individual samples | | Required |
| usher_workflow | **organism** | String | What organism to run UShER on; the following organisms have default global phylogenies and reference files provided: sars-cov-2, mpox, RSV-A, RSV-B. | | Required |
2 changes: 1 addition & 1 deletion docs/workflows/public_data_sharing/mercury_prep_n_batch.md
@@ -62,7 +62,7 @@ To help users collect all required metadata, we have created the following Excel

This workflow runs on the set-level.

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** | |
| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| mercury_prep_n_batch | **gcp_bucket_uri** | String | Google bucket where your SRA reads will be temporarily stored before transferring to SRA. Example: "gs://theiagen_sra_transfer" | | Required |
| mercury_prep_n_batch | **sample_names** | Array[String] | The samples you want to submit | | Required |
2 changes: 1 addition & 1 deletion docs/workflows/public_data_sharing/terra_2_ncbi.md
@@ -84,7 +84,7 @@ We are constantly working on improving these spreadsheets and they will be updat

### Running the Workflow

We recommend running a test submission before your first production submission to ensure that all data has been formatted correctly. Please contact Theiagen (`[email protected]`) to get this set up.
We recommend running a test submission before your first production submission to ensure that all data has been formatted correctly. Please contact Theiagen (<[email protected]>) to get this set up.

In the test submission, any real BioProject accession numbers you provide will not be recognized. You will have to make a "fake" or "test" BioProject. This cannot be done through the NCBI portal. Theiagen can provide assistance in creating this as it requires manual command-line work on the NCBI FTP using the account they provided for you.

84 changes: 84 additions & 0 deletions docs/workflows/standalone/ncbi_scrub.md
@@ -0,0 +1,84 @@
# NCBI_Scrub

## Quick Facts

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.2.1 | Yes | Sample-level |

## NCBI Scrub Workflows

NCBI Scrub, also known as the Human Read Removal Tool (HRRT), is based on the [SRA Taxonomy Analysis Tool](https://doi.org/10.1186/s13059-021-02490-0). It takes a FASTQ file as input and produces a FASTQ file in which all reads identified as potentially of human origin are either removed (default) or masked with 'N'.
There are two NCBI_Scrub workflows:

- `NCBI_Scrub_PE` is compatible with **Illumina paired-end data**
- `NCBI_Scrub_SE` is compatible with **Illumina single-end data**

### Inputs

| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** | **Workflow** |
|---|---|---|---|---|---|---|
| dehost_pe or dehost_se | **read1** | File | The forward read file (or the only read file, for single-end data) in FASTQ format | | Required | PE, SE |
| dehost_pe or dehost_se | **read2** | File | The reverse read file in FASTQ format | | Required | PE |
| dehost_pe or dehost_se | **samplename** | String | The name of the sample | | Required | PE, SE |
| kraken2 | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | PE, SE |
| kraken2 | **disk_size** | Int | Amount of storage (in GB) to allocate to the task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional | PE, SE |
| kraken2 | **docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | PE, SE |
| kraken2 | **kraken2_db** | String | The database used to run Kraken2 | /kraken2-db | Optional | PE, SE |
| kraken2 | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | PE, SE |
| kraken2 | **read2** | File | Internal component, do not modify | | Do not modify, Optional | SE |
| kraken2 | **target_organism** | String | The organism whose abundance the user wants to check in their reads. This should be a proper taxonomic name recognized by the Kraken database. | | Optional | PE, SE |
| ncbi_scrub_pe or ncbi_scrub_se | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | PE, SE |
| ncbi_scrub_pe or ncbi_scrub_se | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | PE, SE |
| ncbi_scrub_pe or ncbi_scrub_se | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/ncbi/sra-human-scrubber:2.2.1 | Optional | PE, SE |
| ncbi_scrub_pe or ncbi_scrub_se | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | PE, SE |
| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional | PE, SE |
| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional | PE, SE |

### Workflow Tasks

This workflow is composed of two tasks: one to dehost the input reads, and another to screen the cleaned reads with Kraken2 against a viral+human database.

??? task "`ncbi_scrub`: human read removal tool"
Briefly, the HRRT employs a k-mer database constructed of k-mers from Eukaryota derived from all human RefSeq records and subtracts any k-mers found in non-Eukaryota RefSeq records. The remaining set of k-mers compose the database used to identify human reads by the removal tool.

!!! techdetails "Tool Name Technical Details"
| | Links |
| --- | --- |
| Task | [task_ncbi_scrub.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/read_filtering/task_ncbi_scrub.wdl) |
| Software Source Code | [HRRT on GitHub](https://github.com/ncbi/sra-human-scrubber) |
| Software Documentation | [HRRT on NCBI](https://ncbiinsights.ncbi.nlm.nih.gov/2023/02/02/scrubbing-human-sequences-sra-submissions/) |
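
The k-mer screening idea above can be sketched in a few lines of Python. This is a toy illustration of the technique only, not the HRRT implementation; the 31-mer size and the mask/remove switch are assumptions for the demo:

```python
def kmers(seq, k=31):
    """Yield all overlapping k-mers of a sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_db(human_refs, nonhuman_refs, k=31):
    """K-mers found in human references, minus any shared with non-human records."""
    human = {km for ref in human_refs for km in kmers(ref, k)}
    nonhuman = {km for ref in nonhuman_refs for km in kmers(ref, k)}
    return human - nonhuman

def scrub(reads, db, k=31, remove=True):
    """Drop (default) or N-mask any read containing a database k-mer."""
    clean = []
    for read in reads:
        if any(km in db for km in kmers(read, k)):
            if not remove:
                clean.append("N" * len(read))  # mask instead of remove
        else:
            clean.append(read)
    return clean
```

The real tool streams FASTQ records against a compact pre-built database, but the subtraction step is what keeps reads from shared (non-human-specific) sequence from being flagged.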

??? task "`kraken2`: taxonomic profiling"

Kraken2 is a bioinformatics tool originally designed for metagenomic applications. It has additionally proven valuable for validating taxonomic assignments and checking contamination of single-species (e.g. bacterial isolate, eukaryotic isolate, viral isolate, etc.) whole genome sequence data.

In this workflow, Kraken2 is run on the dehosted reads produced by the `ncbi_scrub` task.

!!! info "Database-dependent"
The NCBI_Scrub workflows automatically use a viral+human Kraken2 database.

!!! techdetails "Kraken2 Technical Details"
| | Links |
| --- | --- |
| Task | [task_kraken2.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/taxon_id/contamination/task_kraken2.wdl) |
| Software Source Code | [Kraken2 on GitHub](https://github.com/DerrickWood/kraken2/) |
| Software Documentation | <https://github.com/DerrickWood/kraken2/wiki> |
| Original Publication(s) | [Improved metagenomic analysis with Kraken 2](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1891-0) |
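
The percent-abundance outputs (e.g. `kraken_human_dehosted`) are clade-level percentages read from the Kraken2 report. A minimal sketch of that extraction, assuming the standard six-column tab-separated report format (percent, clade reads, direct reads, rank code, taxid, indented name):

```python
def percent_for_taxon(report_lines, name):
    """Return the clade-level percent abundance for a taxon name
    from a six-column Kraken2 report, or 0.0 if the taxon is absent."""
    for line in report_lines:
        # fields: percent, clade reads, direct reads, rank code, taxid, name
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 6 and fields[5].strip() == name:
            return float(fields[0])
    return 0.0
```

For example, `percent_for_taxon(open("report.txt"), "Homo sapiens")` would return the residual human percentage after dehosting.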

### Outputs

| **Variable** | **Type** | **Description** | **Workflow** |
|---|---|---|---|
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal | PE, SE |
| kraken_report_dehosted | File | Full Kraken report after host removal | PE, SE |
| kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal | PE, SE |
| kraken_version_dehosted | String | Version of Kraken2 software used | PE, SE |
| ncbi_scrub_docker | String | Docker image used to run HRRT | PE, SE |
| ncbi_scrub_human_spots_removed | Int | Number of spots removed (or masked) | PE, SE |
| ncbi_scrub_pe_analysis_date | String | Date of analysis | PE, SE |
| ncbi_scrub_pe_version | String | Version of HRRT software used | PE, SE |
| read1_dehosted | File | Dehosted forward reads | PE, SE |
| read2_dehosted | File | Dehosted reverse reads | PE |
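
The `ncbi_scrub_human_spots_removed` count that this commit corrects can be sanity-checked by comparing read counts before and after dehosting. A sketch assuming uncompressed FASTQ input (four lines per record); for paired data, count a single mate:

```python
def count_fastq_reads(lines):
    """Count records in an uncompressed FASTQ (4 lines per record)."""
    n = sum(1 for _ in lines)
    if n % 4 != 0:
        raise ValueError("truncated FASTQ")
    return n // 4

def spots_removed(raw_lines, dehosted_lines):
    """Spots removed = raw read count minus dehosted read count."""
    return count_fastq_reads(raw_lines) - count_fastq_reads(dehosted_lines)
```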

1 change: 1 addition & 0 deletions docs/workflows_overview/workflows_alphabetically.md
@@ -25,6 +25,7 @@ title: Alphabetical Workflows
| [**MashTree_FASTA**](../workflows/phylogenetic_construction/mashtree_fasta.md)| Mash-distance based phylogenetic analysis from assemblies | Bacteria, Mycotics, Viral | Set-level | Some optional features incompatible, Yes | v2.1.0 | [MashTree_FASTA_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/MashTree_FASTA_PHB:main?tab=info) |
| [**Mercury_Prep_N_Batch**](../workflows/public_data_sharing/mercury_prep_n_batch.md)| Prepare metadata and sequence data for submission to NCBI and GISAID | Influenza, Monkeypox virus, SARS-CoV-2, Viral | Set-level | No | v2.2.0 | [Mercury_Prep_N_Batch_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Mercury_Prep_N_Batch_PHB:main?tab=info) |
| [**NCBI-AMRFinderPlus**](../workflows/standalone/ncbi_amrfinderplus.md)| Runs NCBI's AMRFinderPlus on genome assemblies (bacterial and fungal) | Bacteria, Mycotics | Sample-level | Yes | v2.0.0 | [NCBI-AMRFinderPlus_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/NCBI-AMRFinderPlus_PHB:main?tab=info) |
| [**NCBI_Scrub**](../workflows/standalone/ncbi_scrub.md)| Runs NCBI's HRRT on Illumina FASTQs | Any taxa | Sample-level | Yes | v2.2.1 | [NCBI_Scrub_PE_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/NCBI_Scrub_PE_PHB:main?tab=info)<br>[NCBI_Scrub_SE_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/NCBI_Scrub_SE_PHB:main?tab=info) |
| [**Pangolin_Update**](../workflows/genomic_characterization/pangolin_update.md) | Update Pangolin assignments | SARS-CoV-2, Viral | Sample-level | Yes | v2.0.0 | [Pangolin_Update_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Pangolin_Update_PHB:main?tab=info) |
| [**RASUSA**](../workflows/standalone/rasusa.md)| Randomly subsample sequencing reads to a specified coverage | Any taxa | Sample-level | Yes | v2.0.0 | [RASUSA_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/RASUSA_PHB:main?tab=info) |
| [**Rename_FASTQ**](../workflows/standalone/rename_fastq.md)| Rename paired-end or single-end read files in a Terra data table in a non-destructive way | Any taxa | Sample-level | Yes | v2.1.0 | [Rename_FASTQ_PHB](https://dockstore.org/workflows/github.com/theiagen/public_health_bioinformatics/Rename_FASTQ_PHB:im-utilities-rename-files?tab=info) |