[Kraken2] Split database from Kraken2 TheiaCoV task #608

Closed
wants to merge 56 commits into from
Changes from 39 commits
fd3ed46
split kraken database and tool, use standalone task with default data…
jrotieno Aug 28, 2024
428be7f
renaming kraken2 task calls to just kraken instead of suffixing with …
jrotieno Sep 6, 2024
201e298
update output name
jrotieno Sep 6, 2024
3b99abf
renaming kraken outputs to kraken2
jrotieno Sep 6, 2024
7f672a0
additional kraken outputs
jrotieno Sep 6, 2024
0ea39a8
Merge branch 'main' into jro-kraken-split-database-and-task
jrotieno Sep 6, 2024
7be05c0
clearlabs outputs fix
jrotieno Sep 6, 2024
0ca1515
Merge branch 'jro-kraken-split-database-and-task' of https://github.c…
jrotieno Sep 6, 2024
9aa1681
updating RSV Kraken2 target organism identifiers and exposing the Kra…
jrotieno Sep 9, 2024
88f2279
md5sum
jrotieno Sep 9, 2024
f9a2ef9
inputs to manage CI errors
jrotieno Sep 17, 2024
b838b15
CI error, again!
jrotieno Sep 17, 2024
e10df04
adding a test kraken database for CI
jrotieno Sep 17, 2024
c388970
fix test theiacov inputs
jrotieno Sep 17, 2024
9e3eadf
md5sum
jrotieno Sep 20, 2024
bd6fba9
optional target_organism for theiacov SE
jrotieno Sep 20, 2024
8e8bc0b
md5sum
jrotieno Sep 20, 2024
d69d01d
new test database
jrotieno Sep 30, 2024
a2a68fb
updated test kraken database
jrotieno Sep 30, 2024
644f69e
md5sum
jrotieno Sep 30, 2024
912327c
update CI for kraken2 report in theiacov clearlabs, ilmn pe, and ilmn…
kapsakcj Sep 30, 2024
64a619e
update CI
cimendes Oct 7, 2024
258e5f3
Merge branch 'main' into jro-kraken-split-database-and-task
cimendes Oct 7, 2024
0d73768
update ci
cimendes Oct 7, 2024
07c081e
fiz ouput workflow name
cimendes Oct 7, 2024
c7925c4
update docs - kraken2 standalone
cimendes Oct 7, 2024
098d982
update ci again
cimendes Oct 7, 2024
0659d74
hide call_kraken from input table
cimendes Oct 17, 2024
d7c8795
update input table for TheiaCoV
cimendes Oct 17, 2024
576efa8
update outputs for theiacov
cimendes Oct 17, 2024
07736d4
report SC2 proportion only if target organisms is SC2 - TheiaCoV clea…
cimendes Oct 18, 2024
9ab5ac0
update CI
cimendes Oct 18, 2024
7bbf779
make TheiaCoV ONT compatible
cimendes Oct 18, 2024
84292df
update docs - theiacov outputs
cimendes Oct 18, 2024
cc69a98
CI once more
cimendes Oct 18, 2024
e4022ca
forgot to change output types
cimendes Oct 21, 2024
2629708
solve parsing issue - it was a BUG!!!! :bug:
cimendes Oct 21, 2024
6fe1875
no more bugs hopefully :buh:
cimendes Oct 21, 2024
729ba4a
this CI is never happy :bug:
cimendes Oct 21, 2024
e9969fc
Merge branch 'main' into jro-kraken-split-database-and-task
cimendes Oct 25, 2024
5fe7123
rename kraken2 outputs to match other theiacov, rename kraken2_db input
cimendes Oct 25, 2024
88bc2f7
kraken2_db
cimendes Oct 25, 2024
bb90802
kraken2_db
cimendes Oct 25, 2024
fe5e93d
kraken2_db
cimendes Oct 25, 2024
a28ff96
kraken -> kraken2
cimendes Oct 25, 2024
e529d19
kraken -> kraken2
cimendes Oct 25, 2024
72dc913
more kraken -> kraken2
cimendes Oct 25, 2024
9e17028
kraken -> kraken2 continued
cimendes Oct 25, 2024
bcddf62
more kraken -> kraken2
cimendes Oct 25, 2024
4b6d267
krakren -> kraken2
cimendes Oct 25, 2024
bc72236
last kraken -> kraken2 (ignoring nullabor)
cimendes Oct 25, 2024
283d7e1
fix output declaration
cimendes Oct 25, 2024
6e7e0ef
forgot about the ncbi_scrub standalone wfs again
cimendes Oct 25, 2024
35e9d74
update CI for theiaprok and making freyja_fastq functional with the n…
cimendes Oct 25, 2024
11d56dd
change output type
cimendes Oct 25, 2024
51b1c45
add missing pe
cimendes Oct 25, 2024
31 changes: 18 additions & 13 deletions docs/workflows/genomic_characterization/theiacov.md
@@ -217,19 +217,22 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| ivar_consensus | **stats_n_coverage_primtrim_disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | SE,PE | HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2 |
| ivar_consensus | **stats_n_coverage_primtrim_docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/samtools:1.15 | Optional | SE,PE | HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2 |
| ivar_consensus | **stats_n_coverage_primtrim_memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | SE,PE | HIV, MPXV, WNV, rsv_a, rsv_b, sars-cov-2 |
| kraken2_dehosted | **classified_out** | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **docker_image** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **kraken2_db** | String | The database used to run Kraken2 | /kraken2-db | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | CL | sars-cov-2 |
| kraken2_dehosted | **read2** | File | Internal component, do not modify | | Do not modify, Optional | CL | sars-cov-2 |
| kraken2_dehosted | **unclassified_out** | String | Allows user to rename the unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional | CL | sars-cov-2 |
| kraken2_raw | **classified_out** | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional | CL | sars-cov-2 |
| kraken2_raw | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | CL | sars-cov-2 |
| kraken2_raw | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | CL | sars-cov-2 |
| kraken2_raw | **docker_image** | Int | Docker container used in this task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | CL | sars-cov-2 |
| kraken2_raw | **docker** | String | Docker container used in this task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.0.8-beta_hv | Optional | CL | sars-cov-2 |
| kraken2_raw | **kraken2_db** | String | The database used to run Kraken2 | /kraken2-db | Optional | CL | sars-cov-2 |
| kraken2_raw | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 8 | Optional | CL | sars-cov-2 |
| kraken2_raw | **read_processing** | String | The tool used for trimming of primers from reads. Options are trimmomatic and fastp | trimmomatic | Optional | | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| kraken2_raw | **read2** | File | Internal component, do not modify | | Do not modify, Optional | CL | sars-cov-2 |
| kraken2_raw | **unclassified_out** | String | Allows user to rename the unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional | CL | sars-cov-2 |
| nanoplot_clean | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | ONT | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| nanoplot_clean | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | ONT | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| nanoplot_clean | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/nanoplot:1.40.0 | Optional | ONT | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
@@ -373,6 +376,7 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| workflow name | **flu_segment** | String | Influenza genome segment being analyzed. Options: "HA" or "NA". | HA | Optional, Required | FASTA | |
| workflow name | **flu_subtype** | String | The influenza subtype being analyzed. Options: "Yamagata", "Victoria", "H1N1", "H3N2", "H5N1". Automatically determined. | | Optional | FASTA | |
| workflow name | **genome_length** | Int | Use to specify the expected genome length | | Optional | FASTA, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| workflow name | **kraken_db** | File | A Kraken2 database in .tar.gz format. Must contain viral and human sequences. | gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz | Optional | CL, ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| workflow name | **max_genome_length** | Int | Maximum genome length able to pass read screening | 2673870 | Optional | ONT, PE, SE | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| workflow name | **max_length** | Int | Maximum length for a read based on the SARS-CoV-2 primer scheme | 700 | Optional | ONT | HIV, MPXV, WNV, flu, rsv_a, rsv_b, sars-cov-2 |
| workflow name | **medaka_docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/artic-ncov2019:1.3.0-medaka-1.4.3 | Optional | CL | |
@@ -1035,16 +1039,17 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| ivar_vcf | File | iVar tsv output converted to VCF format | PE, SE |
| ivar_version_consensus | String | Version of iVar for running the iVar consensus command | PE, SE |
| ivar_version_primtrim | String | Version of iVar for running the iVar trim command | PE, SE |
| kraken_human | Float | Percent of human read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken_report | File | Full Kraken report | CL, ONT, PE, SE |
| kraken_report_dehosted | File | Full Kraken report after host removal | CL, ONT, PE |
| kraken_sc2 | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken_sc2_dehosted | Float | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken_target_organism | String | Percent of target organism read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken_target_organism_dehosted | String | Percent of target organism read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken_target_organism_name | String | The name of the target organism; e.g., "Monkeypox" or "Human immunodeficiency virus" | CL, ONT, PE, SE |
| kraken_version | String | Version of Kraken software used | CL, ONT, PE, SE |
| kraken2_database | String | Database file used for Kraken2 analysis | CL, ONT, PE, SE |
| kraken2_human | Float | Percent of human read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken2_human_dehosted | Float | Percent of human read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken2_report | File | Full Kraken report | CL, ONT, PE, SE |
| kraken2_report_dehosted | File | Full Kraken report after host removal | CL, ONT, PE |
| kraken2_sc2 | String | Percent of SARS-CoV-2 read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken2_sc2_dehosted | String | Percent of SARS-CoV-2 read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken2_target_organism | String | Percent of target organism read data detected using the Kraken2 software | CL, ONT, PE, SE |
| kraken2_target_organism_dehosted | String | Percent of target organism read data detected using the Kraken2 software after host removal | CL, ONT, PE |
| kraken2_target_organism_name | String | The name of the target organism; e.g., "Monkeypox" or "Human immunodeficiency virus" | CL, ONT, PE, SE |
| kraken2_version | String | Version of Kraken software used | CL, ONT, PE, SE |
| meanbaseq_trim | Float | Mean quality of the nucleotide basecalls aligned to the reference genome after primer trimming | CL, ONT, PE, SE |
| meanmapq_trim | Float | Mean quality of the mapped reads to the reference genome after primer trimming | CL, ONT, PE, SE |
| medaka_reference | String | Reference sequence used in medaka task | CL, ONT |
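
The `kraken2_human`, `kraken2_sc2`, and `kraken2_target_organism` outputs above are percentages read from the Kraken2 report. As a minimal sketch of how such a value can be recovered from `kraken2_report`, the Python below parses the standard tab-separated report layout (percent of fragments in the clade, clade fragment count, directly assigned count, rank code, NCBI taxonomy ID, indented name). This is not the workflow's own parsing code; the function names are hypothetical, and the taxonomy IDs used in the usage note (9606 for human, 2697049 for SARS-CoV-2) are the NCBI IDs rather than values taken from this PR.

```python
from typing import Dict, NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percent of fragments rooted at this taxon
    clade_reads: int    # fragments covered by the clade
    direct_reads: int   # fragments assigned directly to this taxon
    rank: str           # rank code, e.g. "U", "R", "S"
    taxid: int          # NCBI taxonomy ID
    name: str           # scientific name with indentation stripped


def parse_kraken2_report(text: str) -> Dict[int, ReportRow]:
    """Index the rows of a Kraken2 report by taxonomy ID (hypothetical helper)."""
    rows: Dict[int, ReportRow] = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        # The first three and last three columns are stable; reports written
        # with --report-minimizer-data carry two extra columns in between.
        percent, clade_reads, direct_reads = fields[:3]
        rank, taxid, name = fields[-3:]
        rows[int(taxid)] = ReportRow(
            float(percent), int(clade_reads), int(direct_reads),
            rank.strip(), int(taxid), name.strip(),
        )
    return rows


def percent_for_taxid(rows: Dict[int, ReportRow], taxid: int) -> float:
    """Return the clade percentage for a taxon, or 0.0 if it is absent."""
    row = rows.get(taxid)
    return row.percent if row is not None else 0.0
```

Calling `percent_for_taxid(parse_kraken2_report(report_text), 9606)` would give the human fraction, and the same call with 2697049 the SARS-CoV-2 fraction. Returning 0.0 for a missing taxon is a design choice of this sketch, not documented workflow behavior.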
36 changes: 25 additions & 11 deletions docs/workflows/standalone/kraken2.md
@@ -33,10 +33,11 @@ Besides the data input types, there are minimal differences between these two wo
| Database name | Database Description | Suggested Applications | GCP URI (for usage in Terra) | Source | Database Size (GB) | Date of Last Update |
| --- | --- | --- | --- | --- | --- | --- |
| **Kalamari v5.1** | Kalamari is a database of complete public assemblies, that has been fine-tuned for enteric pathogens and is backed by trusted institutions. [Full list available here (in chromosomes.tsv and plasmids.tsv)](https://github.com/lskatz/Kalamari/tree/master/src) | Single-isolate enteric bacterial pathogen analysis (Salmonella, Escherichia, Shigella, Listeria, Campylobacter, Vibrio, Yersinia) | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2.kalamari_5.1.tar.gz`** | ‣ | 1.5 | 18/5/2022 |
| **standard 8GB** | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) capped at 8GB | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_08gb_20240112.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 7.5 | 12/1/2024 |
| **standard 16GB** | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) capped at 16GB | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_16gb_20240112.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 15 | 12/1/2024 |
| **standard** | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_20240112.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 72 | 18/4/2023 |
| **standard 16GB** | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) capped at 16GB | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_16gb_20240112.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 15 | 12/1/2024 |
| **standard 8GB** | Standard RefSeq database (archaea, bacteria, viral, plasmid, human, UniVec_Core) capped at 8GB | Prokaryotic or viral organisms, but for enteric pathogens, we recommend Kalamari | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_08gb_20240112.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 7.5 | 12/1/2024 |
| **viral** | RefSeq viral | Viral metagenomics | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_viral_20240112.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 0.6 | 12/1/2024 |
| **viral with human** | Refseq viral plus human (GRCh38) | Viral metagenomics | **`gs://theiagen-large-public-files-rp/terra/databases/kraken2/kraken2_humanGRCh38_viralRefSeq_20240828.tar.gz`** | Theiagen Genomics | 2.76 | 10/7/2024 |
| **EuPathDB48** | Eukaryotic pathogen genomes with contaminants removed. [Full list available here](https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt) | Eukaryotic organisms (Candida spp., Aspergillus spp., etc) | **`gs://theiagen-public-files-rp/terra/theiaprok-files/k2_eupathdb48_20201113.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 30.3 | 13/11/2020 |
| **EuPathDB48** | Eukaryotic pathogen genomes with contaminants removed. [Full list available here](https://genome-idx.s3.amazonaws.com/kraken/k2_eupathdb48_20201113/EuPathDB48_Contents.txt) | Eukaryotic organisms (Candida spp., Aspergillus spp., etc) | **`gs://theiagen-large-public-files-rp/terra/databases/kraken/k2_eupathdb48_20230407.tar.gz`** | https://benlangmead.github.io/aws-indexes/k2 | 11 | 7/4/2023 |

@@ -48,13 +49,13 @@ Besides the data input types, there are minimal differences between these two wo
| *workflow_name | **read1** | File | | | Required | ONT, PE, SE |
| *workflow_name | **read2** | File | | | Required for PE only | PE |
| *workflow_name | **samplename** | String | | | Required | ONT, PE, SE |
| kraken2_pe or kraken2_se | **classified_out** | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional | ONT, PE, SE |
| kraken2_pe or kraken2_se | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | ONT, PE, SE |
| kraken2_pe or kraken2_se | **disk_size** | Int | GB of storage to request for VM used to run the kraken2 task. Increase this when using large (>30GB kraken2 databases such as the "k2_standard" database) | 100 | Optional | ONT, PE, SE |
| kraken2_pe or kraken2_se | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional | ONT, PE, SE |
| kraken2_pe or kraken2_se | **kraken2_args** | String | Allows a user to supply additional kraken2 command-line arguments | | Optional | ONT, PE, SE |
| kraken2_pe or kraken2_se | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional | ONT, PE, SE |
| kraken2_pe or kraken2_se | **unclassified_out** | String | Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional | ONT, PE, SE |
| kraken2 | **classified_out** | String | Allows user to rename the classified FASTQ files output. Must include .fastq as the suffix | classified#.fastq | Optional | ONT, PE, SE |
| kraken2 | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | ONT, PE, SE |
| kraken2 | **disk_size** | Int | GB of storage to request for VM used to run the kraken2 task. Increase this when using large (>30GB) kraken2 databases such as the "k2_standard" database | 100 | Optional | ONT, PE, SE |
| kraken2 | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/staphb/kraken2:2.1.2-no-db | Optional | ONT, PE, SE |
| kraken2 | **kraken2_args** | String | Allows a user to supply additional kraken2 command-line arguments | | Optional | ONT, PE, SE |
| kraken2 | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 32 | Optional | ONT, PE, SE |
| kraken2 | **unclassified_out** | String | Allows user to rename unclassified FASTQ files output. Must include .fastq as the suffix | unclassified#.fastq | Optional | ONT, PE, SE |
| krona | **cpu** | Int | Number of CPUs to allocate to the task | 4 | Optional | PE, SE |
| krona | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 100 | Optional | PE, SE |
| krona | **docker** | String | The Docker container to use for the task | us-docker.pkg.dev/general-theiagen/biocontainers/krona:2.7.1--pl526_5 | Optional | PE, SE |
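
The `disk_size` note above says to raise the allocation for large databases. A rough sketch of building a Cromwell/Terra-style inputs JSON with scaled resources follows. Several things here are assumptions rather than documented behavior: the workflow name passed in, the `kraken2_db` input key (suggested by the commit messages, not by the rows shown), and the sizing rules (three times the database size for disk, database size plus 8 GB for memory). The floors of 100 GB disk and 32 GB memory are the defaults from the table.

```python
import json
import math
from typing import Dict, Union

InputValue = Union[str, int]


def build_kraken2_inputs(
    workflow_name: str,
    samplename: str,
    read1: str,
    db_uri: str,
    db_size_gb: float,
) -> Dict[str, InputValue]:
    """Assemble an inputs mapping with resources scaled to the database size."""
    # Hypothetical rule of thumb: leave room for the compressed archive plus
    # its extracted contents, never going below the documented default (100).
    disk_size = max(100, math.ceil(db_size_gb * 3))
    # Kraken2 loads the database into RAM unless memory mapping is used, so
    # memory should exceed the database size; 32 is the documented default.
    memory = max(32, math.ceil(db_size_gb) + 8)
    return {
        f"{workflow_name}.samplename": samplename,
        f"{workflow_name}.read1": read1,
        f"{workflow_name}.kraken2_db": db_uri,  # assumed input name
        f"{workflow_name}.kraken2.disk_size": disk_size,
        f"{workflow_name}.kraken2.memory": memory,
    }


if __name__ == "__main__":
    # "kraken2_se_wf" and the read path are placeholders for illustration.
    inputs = build_kraken2_inputs(
        "kraken2_se_wf",
        "sample01",
        "gs://example-bucket/sample01.fastq.gz",
        "gs://theiagen-large-public-files-rp/terra/databases/kraken2/k2_standard_20240112.tar.gz",
        72,
    )
    print(json.dumps(inputs, indent=2))
```

The multipliers are deliberately conservative guesses; actual requirements depend on the database and should be checked against a test run.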
@@ -133,7 +134,13 @@ When assessing the taxonomic identity of a single isolate's sequence, it is norm

[Krona](https://github.com/marbl/Krona) produces an interactive report that allows hierarchical data, such as the one from Kraken2, to be explored with zooming, multi-layered pie charts. These pie charts are intuitive and highly responsive.

Krona will only output hierarchical results for bacterial organisms in its current implementation.
!!! warning

Krona will only output hierarchical results for **bacterial organisms** in its current implementation.

!!! warning

Krona is only available for Kraken reports generated with **Illumina data**, paired-end or single-end.

??? toggle "Example Krona report"

@@ -146,4 +153,11 @@ Krona will only output hierarchical results for bacterial organisms in its curre
| --- | --- |
| Software Source Code | [Kraken2 on GitHub](https://github.com/DerrickWood/kraken2/) |
| Software Documentation | <https://github.com/DerrickWood/kraken2/blob/master/docs/MANUAL.markdown> |
| Original Publication(s) | [Improved metagenomic analysis with Kraken 2](https://link.springer.com/article/10.1186/s13059-019-1891-0) |
| Original Publication(s) | [Improved metagenomic analysis with Kraken 2](https://link.springer.com/article/10.1186/s13059-019-1891-0) |

!!! techdetails "Krona Technical Details"
| | Links |
| --- | --- |
| Software Source Code | [Krona on GitHub](https://github.com/marbl/Krona) |
| Software Documentation | <https://github.com/marbl/Krona/wiki> |
| Original Publication(s) | [Interactive metagenomic visualization in a Web browser](https://doi.org/10.1186/1471-2105-12-385) |