diff --git a/docs/workflows/genomic_characterization/theiaprok.md b/docs/workflows/genomic_characterization/theiaprok.md
index c1ee1574d..02f9277a4 100644
--- a/docs/workflows/genomic_characterization/theiaprok.md
+++ b/docs/workflows/genomic_characterization/theiaprok.md
@@ -4,7 +4,7 @@
 
 | **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
 |---|---|---|---|---|
-| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.2.0 | Yes, some optional features incompatible | Sample-level |
+| [Genomic Characterization](../../workflows_overview/workflows_type.md/#genomic-characterization) | [Bacteria](../../workflows_overview/workflows_kingdom.md/#bacteria) | PHB v2.3.0 | Yes, some optional features incompatible | Sample-level |
 
 ## TheiaProk Workflows
 
@@ -78,6 +78,12 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al
 | *workflow name | **originating_lab** | String | Will be used in the "originating_lab" column in any taxon-specific tables created in the Export Taxon Tables task | | Optional | FASTA, ONT, PE, SE |
 | *workflow name | **perform_characterization** | Boolean | Set to "false" if you want to only generate an assembly and relevant QC metrics and skip all characterization tasks | TRUE | Optional | FASTA, ONT, PE, SE |
 | *workflow name | **qc_check_table** | File | TSV value with taxons for rows and QC values for columns; internal cells represent user-determined QC thresholds; if provided, turns on the QC Check task. Click on the variable name for an example QC Check table | | Optional | FASTA, ONT, PE, SE |
+| *workflow name | **read1_lane2** | File | If provided, the Concatenate_Illumina_Lanes subworkflow will concatenate the files from all lanes for the same read direction before any subsequent analysis | | Optional | PE, SE |
+| *workflow name | **read1_lane3** | File | If provided, the Concatenate_Illumina_Lanes subworkflow will concatenate the files from all lanes for the same read direction before any subsequent analysis | | Optional | PE, SE |
+| *workflow name | **read1_lane4** | File | If provided, the Concatenate_Illumina_Lanes subworkflow will concatenate the files from all lanes for the same read direction before any subsequent analysis | | Optional | PE, SE |
+| *workflow name | **read2_lane2** | File | If provided, the Concatenate_Illumina_Lanes subworkflow will concatenate the files from all lanes for the same read direction before any subsequent analysis | | Optional | PE, SE |
+| *workflow name | **read2_lane3** | File | If provided, the Concatenate_Illumina_Lanes subworkflow will concatenate the files from all lanes for the same read direction before any subsequent analysis | | Optional | PE, SE |
+| *workflow name | **read2_lane4** | File | If provided, the Concatenate_Illumina_Lanes subworkflow will concatenate the files from all lanes for the same read direction before any subsequent analysis | | Optional | PE, SE |
 | *workflow name | **run_id** | String | Will be used in the "run_id" column in any taxon-specific tables created in the Export Taxon Tables task | | Optional | FASTA, ONT, PE, SE |
 | *workflow name | **seq_method** | String | Will be used in the "seq_id" column in any taxon-specific tables created in the Export Taxon Tables task | | Optional | FASTA, ONT, PE, SE |
 | *workflow name | **skip_mash** | Boolean | If true, skips estimation of genome size and coverage in read screening steps. As a result, providing true also prevents screening using these parameters. | TRUE | Optional | ONT, SE |
@@ -603,6 +609,17 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al
         | --- | --- |
         | Task | [task_versioning.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/task_versioning.wdl) |
 
+??? task "`concatenate_illumina_lanes`: Concatenate Multi-Lane Illumina FASTQs ==_for Illumina only_=="
+
+    The `concatenate_illumina_lanes` task concatenates Illumina FASTQ files from multiple lanes into a single file. This task only runs if the `read1_lane2` input file has been provided. All read1 lanes are concatenated together and are used in subsequent tasks, as are the read2 lanes. These concatenated files are also provided as output.
+
+    !!! techdetails "Concatenate Illumina Lanes Technical Details"
+        The `concatenate_illumina_lanes` task is run twice, once for raw reads and once for clean reads. The task is the same for both PE and SE workflows.
+
+        | | Links |
+        | --- | --- |
+        | Task | [wf_concatenate_illumina_lanes.wdl](https://github.com/theiagen/public_health_bioinformatics/blob/main/workflows/utilities/file_handling/wf_concatenate_illumina_lanes.wdl) |
+
 ??? task "`screen`: Total Raw Read Quantification and Genome Size Estimation"
 
     The [`screen`](https://github.com/theiagen/public_health_bioinformatics/blob/main/tasks/quality_control/comparisons/task_screen.wdl) task ensures the quantity of sequence data is sufficient to undertake genomic analysis. It uses [`fastq-scan`](https://github.com/rpetit3/fastq-scan) and bash commands for quantification of reads and base pairs, and [mash](https://mash.readthedocs.io/en/latest/index.html) sketching to estimate the genome size and its coverage. At each step, the results are assessed relative to pass/fail criteria and thresholds that may be defined by optional user inputs. Samples that do not meet these criteria will not be processed further by the workflow:
@@ -705,12 +722,12 @@ All input reads are processed through "[core tasks](#core-tasks-performed-for-al
     1. **Species Groups**:
         - MIDAS clusters bacterial genomes based on 96.5% sequence identity, forming over 5,950 species groups from 31,007 genomes. These groups align with the gold-standard species definition (95% ANI), ensuring highly accurate species identification.
 
-    2. **Genomic Data Structure**:
+    1. **Genomic Data Structure**:
         - **Marker Genes**: Contains 15 universal single-copy genes used to estimate species abundance.
         - **Representative Genome**: Each species group has a selected representative genome, which minimizes genetic variation and aids in accurate SNP identification.
         - **Pan-genome**: The database includes clusters of non-redundant genes, with options for multi-level clustering (e.g., 99%, 95%, 90% identity), enabling MIDAS to identify gene content within strains at various clustering thresholds.
 
-    3. **Taxonomic Annotation**:
+    1. **Taxonomic Annotation**:
         - Genomes are annotated based on consensus Latin names. Discrepancies in name assignments may occur due to factors like unclassified genomes or genus-level ambiguities.
 
 ---
diff --git a/docs/workflows/standalone/concatenate_illumina_lanes.md b/docs/workflows/standalone/concatenate_illumina_lanes.md
new file mode 100644
index 000000000..282844fa4
--- /dev/null
+++ b/docs/workflows/standalone/concatenate_illumina_lanes.md
@@ -0,0 +1,47 @@
+# Concatenate Illumina Lanes
+
+## Quick Facts
+
+| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
+|---|---|---|---|---|
+| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | PHB v2.3.0 | Yes | Sample-level |
+
+## Concatenate_Illumina_Lanes_PHB
+
+Some Illumina machines produce multi-lane FASTQ files for a single sample. This workflow concatenates the multiple lanes into a single FASTQ file per read type (forward or reverse).
+
+### Inputs
+
+| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
+|---|---|---|---|---|---|
+| concatenate_illumina_lanes | **read1_lane1** | File | The first lane for the forward reads | | Required |
+| concatenate_illumina_lanes | **read1_lane2** | File | The second lane for the forward reads | | Required |
+| concatenate_illumina_lanes | **samplename** | String | The name of the sample, used to name the output files | | Required |
+| cat_lanes | **cpu** | Int | Number of CPUs to allocate to the task | 2 | Optional |
+| cat_lanes | **disk_size** | Int | Amount of storage (in GB) to allocate to the task | 50 | Optional |
+| cat_lanes | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/utility:1.2" | Optional |
+| cat_lanes | **memory** | Int | Amount of memory/RAM (in GB) to allocate to the task | 4 | Optional |
+| concatenate_illumina_lanes | **read1_lane3** | File | The third lane for the forward reads | | Optional |
+| concatenate_illumina_lanes | **read1_lane4** | File | The fourth lane for the forward reads | | Optional |
+| concatenate_illumina_lanes | **read2_lane1** | File | The first lane for the reverse reads | | Optional |
+| concatenate_illumina_lanes | **read2_lane2** | File | The second lane for the reverse reads | | Optional |
+| concatenate_illumina_lanes | **read2_lane3** | File | The third lane for the reverse reads | | Optional |
+| concatenate_illumina_lanes | **read2_lane4** | File | The fourth lane for the reverse reads | | Optional |
+| version_capture | **docker** | String | The Docker container to use for the task | "us-docker.pkg.dev/general-theiagen/theiagen/alpine-plus-bash:3.20.0" | Optional |
+| version_capture | **timezone** | String | Set the time zone to get an accurate date of analysis (uses UTC by default) | | Optional |
+
+### Workflow Tasks
+
+This workflow concatenates the Illumina lanes for forward and (if provided) reverse reads. The output files are named as follows:
+
+- Forward reads: `_merged_R1.fastq.gz`
+- Reverse reads: `_merged_R2.fastq.gz`
+
+### Outputs
+
+| **Variable** | **Type** | **Description** |
+|---|---|---|
+| concatenate_illumina_lanes_analysis_date | String | Date of analysis |
+| concatenate_illumina_lanes_version | String | Version of PHB used for the analysis |
+| read1_concatenated | File | Concatenated forward reads |
+| read2_concatenated | File | Concatenated reverse reads |