update docs and workflow style part 1
fraser-combe committed Nov 5, 2024
1 parent e4167f4 commit b5636ec
Showing 6 changed files with 43 additions and 51 deletions.
69 changes: 39 additions & 30 deletions docs/workflows/standalone/dorado_basecalling.md
@@ -4,9 +4,9 @@

| **Workflow Type** | **Applicable Kingdom** | **Last Known Changes** | **Command-line Compatibility** | **Workflow Level** |
|---|---|---|---|---|
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | Dorado v1.0 | Yes | Sample-level |
| [Standalone](../../workflows_overview/workflows_type.md/#standalone) | [Any Taxa](../../workflows_overview/workflows_kingdom.md/#any-taxa) | v2.2.1 | Yes | Sample-level |

## Dorado Basecalling Overview
## Dorado_Basecalling_PHB

The Dorado Basecalling workflow converts Oxford Nanopore `POD5` sequencing files into `FASTQ` format in a GPU-accelerated environment. It is suited to high-throughput applications where fast and accurate basecalling is essential. The workflow uploads the FASTQ files to a user-designated Terra table for downstream analysis.

@@ -27,34 +27,24 @@ Automatic Detection: When set to sup, hac, or fast, Dorado will automatically se
- `[email protected]`
- `[email protected]`
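The model setting described above can be sketched as a small classifier. This is only an illustration, not the workflow's code: the helper name `resolve_model` and the `"auto"`/`"explicit"` labels are hypothetical, and only the three keywords `sup`, `hac`, and `fast` come from the documentation.

```python
# Hypothetical helper; not part of the workflow or of Dorado itself.
AUTO_KEYWORDS = {"sup", "hac", "fast"}  # accuracy keywords named in the docs


def resolve_model(dorado_model: str = "sup") -> tuple[str, str]:
    """Classify the dorado_model input.

    Returns ("auto", keyword) when Dorado is expected to select the model
    automatically, or ("explicit", name) when a full model name was given.
    """
    value = dorado_model.strip()
    if not value:
        raise ValueError("dorado_model must not be empty")
    if value in AUTO_KEYWORDS:
        return ("auto", value)
    # Anything else is treated as a full model name and passed through as-is.
    return ("explicit", value)
```

Under these assumptions, `resolve_model("hac")` reports automatic selection, while any other non-empty string is passed through unchanged as an explicit model name.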

### Workflow Structure
### Inputs

1. **Dorado Basecalling**: Converts `POD5` files to 'SAM' files using the specified model.
2. **Samtools Convert**: Converts the generated SAM files to BAM for efficient processing.
3. **Dorado Demultiplexing**: Demultiplexes BAM files to produce barcode-specific FASTQ files.
4. **FASTQ File Transfer**: Transfers files to Terra for downstream analysis.
5. **Terra Table Creation**: Generates a Terra table with the uploaded FASTQ files for downstream analyses.

---

## Inputs

| **Task** | **Variable** | **Type** | **Description** | **Default Value** | **Required** |
| **Terra Task Name** | **Variable** | **Type** | **Description** | **Default Value** | **Terra Status** |
|---|---|---|---|---|---|
| Basecalling | **input_files** | Array[File] | Array of `POD5` files for basecalling | None | Yes |
| Basecalling | **dorado_model** | Boolean | Model accuracy or full model name (default: 'sup')| sup | No |
| Basecalling | **fastq_file_name** | String | Prefix for naming output FASTQ files | None | Yes |
| Basecalling | **kit_name** | String | Sequencing kit name used (e.g., `SQK-RPB114-24`). | None | Yes |
| Basecalling | **cpu** | Int | Number of CPUs allocated | 8 | No |
| Basecalling | **memory** | String | Amount of memory to allocate | 32GB | No |
| Basecalling | **gpuCount** | Int | Number of GPUs to use | 1 | No |
| Basecalling | **gpuType** | String | Type of GPU (e.g., `nvidia-tesla-t4`). | nvidia-tesla-t4 | No |
| Demultiplexing | **fastq_upload_path** | String | Location to upload FASTQ files on Terra (copy path from terra folder) | None | Yes |
| Demultiplexing | **fastq_file_name** | String | Prefix for naming output FASTQ files| None| Yes |
| Terra Table | **terra_project** | String | Terra project ID for the final FASTQ file upload to the Terra table | None | Yes |
| Terra Table | **terra_workspace** | String | Terra workspace name for final fastq file upload to Terra table | None | Yes |

---
| dorado_basecalling_workflow | **input_files** | Array[File] | Array of `POD5` files for basecalling | None | Required |
| dorado_basecalling_workflow | **dorado_model** | String | Model accuracy or full model name (default: 'sup') | "sup" | Optional |
| dorado_basecalling_workflow | **fastq_file_name** | String | Prefix for naming output FASTQ files | None | Required |
| dorado_basecalling_workflow | **kit_name** | String | Sequencing kit name used (e.g., `SQK-RPB114-24`) | None | Required |
| basecall_task.basecall | **cpu** | Int | Number of CPUs allocated | 8 | Optional |
| basecall_task.basecall | **memory** | Int | Amount of memory to allocate (GB) | 32 | Optional |
| basecall_task.basecall | **gpuCount** | Int | Number of GPUs to use | 1 | Optional |
| basecall_task.basecall | **gpuType** | String | Type of GPU (e.g., `nvidia-tesla-t4`) | "nvidia-tesla-t4" | Optional |
| dorado_basecalling_workflow | **fastq_upload_path** | String | Terra folder path for uploading FASTQ files | None | Required |
| dorado_basecalling_workflow | **terra_project** | String | Terra project ID for FASTQ file upload | None | Required |
| dorado_basecalling_workflow | **terra_workspace** | String | Terra workspace for final FASTQ file upload | None | Required |
| dorado_basecalling_workflow | **paired_end** | Boolean | Indicates if data is paired-end | false | Optional |
| dorado_basecalling_workflow | **assembly_data** | Boolean | Indicates if the data is for assembly | false | Optional |
| dorado_basecalling_workflow | **file_ending** | String? | File extension pattern for identifying files (e.g., ".fastq.gz") | None | Optional |
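As a rough illustration of the required workflow-level inputs in the updated table, the sketch below assembles an inputs file for a command-line run. It assumes the common `workflow_name.input_name` key convention for WDL inputs JSON; the helper `build_inputs` and every bucket, project, and workspace value are placeholders, not values from this repository.

```python
import json

WORKFLOW = "dorado_basecalling_workflow"
# Inputs marked "Required" in the updated table.
REQUIRED = (
    "input_files",
    "fastq_file_name",
    "kit_name",
    "fastq_upload_path",
    "terra_project",
    "terra_workspace",
)


def build_inputs(**values):
    """Return an inputs dict keyed as '<workflow>.<input>' (hypothetical helper)."""
    missing = [name for name in REQUIRED if values.get(name) in (None, "", [])]
    if missing:
        raise ValueError(f"missing required inputs: {', '.join(missing)}")
    return {f"{WORKFLOW}.{name}": value for name, value in values.items()}


if __name__ == "__main__":
    inputs = build_inputs(
        input_files=["gs://example-bucket/run1/reads_0.pod5"],  # placeholder path
        fastq_file_name="projectname",
        kit_name="SQK-RPB114-24",  # example kit name from the table
        fastq_upload_path="gs://example-bucket/fastq/",  # placeholder path
        terra_project="example-project",  # placeholder
        terra_workspace="example-workspace",  # placeholder
    )
    print(json.dumps(inputs, indent=2))
```

Optional inputs such as `dorado_model` can be passed the same way; omitting them leaves the defaults listed in the table in effect.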

!!! info "Detailed Input Information"
- **dorado_model**: If set to 'sup', 'hac', or 'fast', the workflow will run with automatic model selection. If a full model name is provided, Dorado will use that model directly.
@@ -71,9 +61,27 @@ Automatic Detection: When set to sup, hac, or fast, Dorado will automatically se
- **Accepted Prefix**: `projectname-barcode01.fastq.gz`
- **Not Recommended**: `projectname_2024_test-barcode01.fastq.gz` (would recognize only `projectname` as the sample name, leading to ambiguity with multiple files).
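The naming pitfall above can be reproduced with a short sketch. It assumes the sample name is everything before the first underscore, falling back to the text before the first dot when there is no underscore. That rule is an inference from the two examples, not a statement of how the Terra table task is implemented, and `sample_name_from_fastq` is a hypothetical helper.

```python
def sample_name_from_fastq(filename: str) -> str:
    """Derive a sample name from a FASTQ file name (assumed rule, see lead-in)."""
    base = filename.rsplit("/", 1)[-1]  # drop any directory part
    if "_" in base:
        # Assumed rule: keep only the text before the first underscore.
        return base.split("_", 1)[0]
    # No underscore: strip the extension (everything from the first dot).
    return base.split(".", 1)[0]


if __name__ == "__main__":
    for name in (
        "projectname-barcode01.fastq.gz",
        "projectname_2024_test-barcode01.fastq.gz",
        "projectname_2024_test-barcode02.fastq.gz",
    ):
        print(name, "->", sample_name_from_fastq(name))
```

With this rule, both `projectname_2024_test-*` files collapse to the same sample name, which is the ambiguity the note warns about, while the hyphen-only prefix keeps each barcode distinct.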

---
### Workflow Tasks

This workflow is composed of several tasks that basecall, convert, demultiplex, and upload Oxford Nanopore sequencing data:

??? task "`Dorado Basecalling`: Converts `POD5` files to `SAM` files"
The basecalling task takes `POD5` files as input and converts them to `SAM` format using the specified model. This step uses GPU acceleration for efficient processing.

??? task "`Samtools Convert`: Converts SAM to BAM"
Once the SAM files are generated, this task converts them into BAM format, optimizing them for downstream applications and saving storage space.

??? task "`Dorado Demultiplexing`: Produces barcode-specific FASTQ files"
This task demultiplexes the BAM files based on barcodes, generating individual FASTQ files for each barcode to support further analyses.

??? task "`FASTQ File Transfer`: Transfers files to Terra"
After demultiplexing, the FASTQ files are uploaded to Terra for storage and potential use in other workflows.

??? task "`Terra Table Creation`: Creates a Terra table with FASTQ files"
A Terra table is created to index the uploaded FASTQ files, enabling easy access and integration with other workflows for downstream analyses.
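To make the SAM-to-BAM step concrete, here is a minimal sketch of the kind of command such a task could run. The task's actual command block is not shown in this diff, so this is an assumption: it uses the standard `samtools view -b -o` invocation and the `output/sam` and `output/bam` directories that appear in the tasks' `glob()` outputs. The helper name is hypothetical.

```python
from pathlib import PurePosixPath


def samtools_convert_command(sam_path: str, bam_dir: str = "output/bam") -> list[str]:
    """Build a `samtools view` command converting one SAM file to BAM.

    Hypothetical helper: it only assembles the argument list and does not
    run samtools.
    """
    # Reuse the SAM file's base name, swapping the extension for .bam.
    bam_path = PurePosixPath(bam_dir) / (PurePosixPath(sam_path).stem + ".bam")
    # -b selects BAM output; -o names the output file.
    return ["samtools", "view", "-b", "-o", str(bam_path), sam_path]


if __name__ == "__main__":
    print(" ".join(samtools_convert_command("output/sam/reads_0.sam")))
```

Returning an argument list rather than a shell string keeps the sketch safe to pass to `subprocess.run` without quoting concerns.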


## Outputs
### Outputs

| **Variable** | **Type** | **Description** |
|---|---|---|
@@ -82,5 +90,6 @@ Automatic Detection: When set to sup, hac, or fast, Dorado will automatically se
| **logs** | Array[File] | Log files from the demultiplexing process |
| **terra_table_tsv** | File | TSV file for Terra table upload |

## References
<!-- -->
><https://github.com/nanoporetech/dorado/>
6 changes: 3 additions & 3 deletions mkdocs.yml
@@ -52,7 +52,7 @@ nav:
- Zip_Column_Content: workflows/data_export/zip_column_content.md
- Standalone:
- Cauris_CladeTyper: workflows/standalone/cauris_cladetyper.md
- Dorado_basecalling: workflows/standalone/dorado_basecalling.md
- Dorado_Basecalling: workflows/standalone/dorado_basecalling.md
- GAMBIT_Query: workflows/standalone/gambit_query.md
- Kraken2: workflows/standalone/kraken2.md
- NCBI-AMRFinderPlus: workflows/standalone/ncbi_amrfinderplus.md
@@ -68,7 +68,7 @@ nav:
- BaseSpace_Fetch: workflows/data_import/basespace_fetch.md
- Concatenate_Column_Content: workflows/data_export/concatenate_column_content.md
- Create_Terra_Table: workflows/data_import/create_terra_table.md
- Dorado_basecalling: workflows/standalone/dorado_basecalling.md
- Dorado_Basecalling: workflows/standalone/dorado_basecalling.md
- Kraken2: workflows/standalone/kraken2.md
- NCBI-Scrub: workflows/standalone/ncbi_scrub.md
- RASUSA: workflows/standalone/rasusa.md
@@ -128,7 +128,7 @@ nav:
- Core_Gene_SNP: workflows/phylogenetic_construction/core_gene_snp.md
- Create_Terra_Table: workflows/data_import/create_terra_table.md
- CZGenEpi_Prep: workflows/phylogenetic_construction/czgenepi_prep.md
- Dorado_basecalling: workflows/standalone/dorado_basecalling.md
- Dorado_Basecalling: workflows/standalone/dorado_basecalling.md
- Find_Shared_Variants: workflows/phylogenetic_construction/find_shared_variants.md
- Freyja Workflow Series: workflows/genomic_characterization/freyja.md
- GAMBIT_Query: workflows/standalone/gambit_query.md
2 changes: 0 additions & 2 deletions tasks/basecalling/task_dorado_basecall.wdl
@@ -37,11 +37,9 @@ task basecall {

echo "Basecalling completed for ~{input_file}. SAM file: $sam_file"
>>>

output {
Array[File] sam_files = glob("output/sam/*.sam")
}

runtime {
docker: docker
cpu: cpu
1 change: 0 additions & 1 deletion tasks/basecalling/task_dorado_demux.wdl
@@ -85,7 +85,6 @@ task dorado_demux {
output {
Array[File] fastq_files = glob("~{fastq_file_name}-*.fastq.gz")
}

runtime {
docker: docker
cpu: cpu
1 change: 0 additions & 1 deletion tasks/basecalling/task_samtools_convert.wdl
@@ -32,7 +32,6 @@ task samtools_convert {
output {
Array[File] bam_files = glob("output/bam/*.bam")
}

runtime {
docker: docker
cpu: cpu
15 changes: 1 addition & 14 deletions workflows/utilities/wf_dorado_basecalling.wdl
@@ -10,7 +10,6 @@ workflow dorado_basecalling_workflow {
meta {
description: "GPU-accelerated workflow for basecalling Oxford Nanopore POD5 files, generating SAM outputs and supporting downstream demultiplexing and FASTQ output."
}

input {
Array[File] input_files
String dorado_model = "sup"
@@ -22,42 +21,31 @@ workflow dorado_basecalling_workflow {
String? file_ending
String terra_project
String terra_workspace
String fastq_file_name
Int cpu = 8
Int memory = 32
Int disk_size = 100
String fastq_file_name
}

scatter (file in input_files) {
call basecall_task.basecall as basecall_step {
input:
input_file = file,
dorado_model = dorado_model,
kit_name = kit_name,
cpu = cpu,
memory = memory,
disk_size = disk_size
}
}

call samtools_convert_task.samtools_convert {
input:
sam_files = flatten(basecall_step.sam_files)
}

call dorado_demux_task.dorado_demux {
input:
bam_files = samtools_convert.bam_files,
kit_name = kit_name,
fastq_file_name = fastq_file_name
}

call transfer_fastq_files.transfer_files as transfer_files {
input:
files_to_transfer = dorado_demux.fastq_files,
target_bucket = fastq_upload_path
}

if (defined(transfer_files.transferred_files)) {
call terra_fastq_table.create_terra_table as create_terra_table {
input:
@@ -70,7 +58,6 @@ workflow dorado_basecalling_workflow {
terra_workspace = terra_workspace
}
}

output {
Array[File] fastq_files = dorado_demux.fastq_files
File? terra_table_tsv = create_terra_table.terra_table_to_upload