Commit

Merge branch 'main' into smw-tbprofiler-updates-dev
sage-wright authored Nov 12, 2024
2 parents ab6334a + 2669f99 commit f996a96
Showing 30 changed files with 144 additions and 56 deletions.
4 changes: 4 additions & 0 deletions docs/workflows/genomic_characterization/freyja.md
@@ -327,12 +327,16 @@ The main output file used in subsequent Freyja workflows is found under the `fre
| bwa_version | String | Version of BWA used to map read data to the reference genome | PE, SE |
| fastp_html_report | File | The HTML report made with fastp | PE, SE |
| fastp_version | String | Version of fastp software used | PE, SE |
| fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length | PE, SE |
| fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length | PE |
| fastq_scan_num_reads_clean_pairs | String | Number of clean read pairs | PE |
| fastq_scan_num_reads_clean1 | Int | Number of clean forward reads | PE, SE |
| fastq_scan_num_reads_clean2 | Int | Number of clean reverse reads | PE |
| fastq_scan_num_reads_raw_pairs | String | Number of raw read pairs | PE |
| fastq_scan_num_reads_raw1 | Int | Number of raw forward reads | PE, SE |
| fastq_scan_num_reads_raw2 | Int | Number of raw reverse reads | PE |
| fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length | PE, SE |
| fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length | PE |
| fastq_scan_version | String | Version of fastq_scan used for read QC analysis | PE, SE |
| fastqc_clean1_html | File | Graphical visualization of clean forward read quality from fastqc to open in an internet browser | PE, SE |
| fastqc_clean2_html | File | Graphical visualization of clean reverse read quality from fastqc to open in an internet browser | PE |
4 changes: 4 additions & 0 deletions docs/workflows/genomic_characterization/theiacov.md
@@ -1026,6 +1026,8 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| est_percent_gene_coverage_tsv | File | Percent coverage for each gene in the organism being analyzed (depending on the organism input) | CL, ONT, PE, SE |
| fastp_html_report | File | HTML report for fastp | PE, SE |
| fastp_version | String | Fastp version used | PE, SE |
| fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length | PE, SE, CL |
| fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length | PE |
| fastq_scan_num_reads_clean_pairs | String | Number of paired reads after filtering as determined by fastq_scan | PE |
| fastq_scan_num_reads_clean1 | Int | Number of forward reads after filtering as determined by fastq_scan | CL, PE, SE |
| fastq_scan_num_reads_clean2 | Int | Number of reverse reads after filtering as determined by fastq_scan | PE |
@@ -1036,6 +1038,8 @@ All TheiaCoV Workflows (not TheiaCoV_FASTA_Batch)
| fastq_scan_r1_mean_q_raw | Float | Forward read mean quality value before quality trimming and adapter removal | |
| fastq_scan_r1_mean_readlength_clean | Float | Forward read mean read length value after quality trimming and adapter removal | |
| fastq_scan_r1_mean_readlength_raw | Float | Forward read mean read length value before quality trimming and adapter removal | |
| fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length | PE, SE, CL |
| fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length | PE |
| fastq_scan_version | String | Version of fastq_scan used for read QC analysis | CL, PE, SE |
| fastqc_clean1_html | File | Graphical visualization of clean forward read quality from fastqc to open in an internet browser | PE, SE |
| fastqc_clean2_html | File | Graphical visualization of clean reverse read quality from fastqc to open in an internet browser | PE |
4 changes: 4 additions & 0 deletions docs/workflows/genomic_characterization/theiaeuk.md
@@ -484,6 +484,10 @@ The TheiaEuk workflow automatically activates taxa-specific tasks after identifi
| cg_pipeline_report | File | TSV file of read metrics from raw reads, including average read length, number of reads, and estimated genome coverage |
| est_coverage_clean | Float | Estimated coverage calculated from clean reads and genome length |
| est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length |
| fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length |
| fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length |
| r1_mean_q_clean | Float | Mean quality score of clean forward reads |
| r1_mean_q_raw | Float | Mean quality score of raw forward reads |
| r2_mean_q_clean | Float | Mean quality score of clean reverse reads |
4 changes: 4 additions & 0 deletions docs/workflows/genomic_characterization/theiameta.md
@@ -295,12 +295,16 @@ The TheiaMeta_Illumina_PE workflow processes Illumina paired-end (PE) reads ge
| fastp_html_report | File | Report file for fastp in HTML format |
| fastp_version | String | Version of fastp used |
| fastq_scan_docker | String | Docker image of fastq_scan |
| fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length |
| fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length |
| fastq_scan_num_reads_clean_pairs | String | Number of read pairs after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean1 | Int | Number of forward reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_clean2 | Int | Number of reverse reads after cleaning as calculated by fastq_scan |
| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs as calculated by fastq_scan |
| fastq_scan_num_reads_raw1 | Int | Number of input forward reads as calculated by fastq_scan |
| fastq_scan_num_reads_raw2 | Int | Number of input reverse reads as calculated by fastq_scan |
| fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length |
| fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length |
| fastq_scan_version | String | fastq_scan version |
| fastqc_clean1_html | File | Graphical visualization of clean forward read quality from fastqc to open in an internet browser |
| fastqc_clean2_html | File | Graphical visualization of clean reverse read quality from fastqc to open in an internet browser |
4 changes: 4 additions & 0 deletions docs/workflows/genomic_characterization/theiaprok.md
@@ -1731,12 +1731,16 @@ The TheiaProk workflows automatically activate taxa-specific sub-workflows after
| est_coverage_raw | Float | Estimated coverage calculated from raw reads and genome length | ONT, PE, SE |
| fastp_html_report | File | The HTML report made with fastp | PE, SE |
| fastp_version | String | Version of fastp software used | PE, SE |
| fastq_scan_clean1_json | File | JSON file output from `fastq-scan` containing summary stats about clean forward read quality and length | PE, SE |
| fastq_scan_clean2_json | File | JSON file output from `fastq-scan` containing summary stats about clean reverse read quality and length | PE |
| fastq_scan_num_reads_clean_pairs | String | Number of read pairs after cleaning as calculated by fastq_scan | PE |
| fastq_scan_num_reads_clean1 | Int | Number of forward reads after cleaning as calculated by fastq_scan | PE, SE |
| fastq_scan_num_reads_clean2 | Int | Number of reverse reads after cleaning as calculated by fastq_scan | PE |
| fastq_scan_num_reads_raw_pairs | String | Number of input read pairs calculated by fastq_scan | PE |
| fastq_scan_num_reads_raw1 | Int | Number of input forward reads calculated by fastq_scan | PE, SE |
| fastq_scan_num_reads_raw2 | Int | Number of input reverse reads calculated by fastq_scan | PE |
| fastq_scan_raw1_json | File | JSON file output from `fastq-scan` containing summary stats about raw forward read quality and length | PE, SE |
| fastq_scan_raw2_json | File | JSON file output from `fastq-scan` containing summary stats about raw reverse read quality and length | PE |
| fastq_scan_version | String | Version of fastq-scan software used | PE, SE |
| fastqc_clean1_html | File | Graphical visualization of clean forward read quality from fastqc to open in an internet browser | PE, SE |
| fastqc_clean2_html | File | Graphical visualization of clean reverse read quality from fastqc to open in an internet browser | PE |
65 changes: 41 additions & 24 deletions tasks/quality_control/basic_statistics/task_fastq_scan.wdl
@@ -6,14 +6,16 @@ task fastq_scan_pe {
File read2
String read1_name = basename(basename(basename(read1, ".gz"), ".fastq"), ".fq")
String read2_name = basename(basename(basename(read2, ".gz"), ".fastq"), ".fq")
Int disk_size = 100
String docker = "quay.io/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
Int disk_size = 50
String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-scan:1.0.1--h4ac6f70_3"
Int memory = 2
Int cpu = 2
Int cpu = 1
}
command <<<
# capture date and version
date | tee DATE
# exit task in case anything fails in one-liners or variables are unset
set -euo pipefail

# capture version
fastq-scan -v | tee VERSION

# set cat command based on compression
@@ -24,11 +26,21 @@ task fastq_scan_pe {
fi

# capture forward read stats
echo "DEBUG: running fastq-scan on $(basename ~{read1})"
eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json
cat ~{read1_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ1_SEQS
# using simple redirect so STDOUT is not confusing
jq .qc_stats.read_total ~{read1_name}_fastq-scan.json > READ1_SEQS
echo "DEBUG: number of reads in $(basename ~{read1}): $(cat READ1_SEQS)"
read1_seqs=$(cat READ1_SEQS)
echo

# capture reverse read stats
echo "DEBUG: running fastq-scan on $(basename ~{read2})"
eval "${cat_reads} ~{read2}" | fastq-scan | tee ~{read2_name}_fastq-scan.json
cat ~{read2_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ2_SEQS

# using simple redirect so STDOUT is not confusing
jq .qc_stats.read_total ~{read2_name}_fastq-scan.json > READ2_SEQS
echo "DEBUG: number of reads in $(basename ~{read2}): $(cat READ2_SEQS)"
read2_seqs=$(cat READ2_SEQS)

# capture number of read pairs
@@ -37,26 +49,27 @@ task fastq_scan_pe {
else
read_pairs="Uneven pairs: R1=${read1_seqs}, R2=${read2_seqs}"
fi

echo $read_pairs | tee READ_PAIRS

# use simple redirect so STDOUT is not confusing
echo "$read_pairs" > READ_PAIRS
echo "DEBUG: number of read pairs: $(cat READ_PAIRS)"
>>>
output {
File read1_fastq_scan_report = "~{read1_name}_fastq-scan.json"
File read2_fastq_scan_report = "~{read2_name}_fastq-scan.json"
File read1_fastq_scan_json = "~{read1_name}_fastq-scan.json"
File read2_fastq_scan_json = "~{read2_name}_fastq-scan.json"
Int read1_seq = read_int("READ1_SEQS")
Int read2_seq = read_int("READ2_SEQS")
String read_pairs = read_string("READ_PAIRS")
String version = read_string("VERSION")
String pipeline_date = read_string("DATE")
String fastq_scan_docker = docker
}
runtime {
docker: docker
memory: memory + " GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB" # TES
preemptible: 0
disk: disk_size + " GB"
preemptible: 1
maxRetries: 3
}
}
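
The nested `basename()` calls in the input block and the read-pair message can be sketched in plain shell. This is a minimal sketch, not the task's code: `strip_fastq_ext` and `summarize_pairs` are hypothetical helper names, and the equality test is an assumption because the hunk hides the `if` branch; only the "Uneven pairs" message is taken from the diff.

```shell
# Hypothetical helper mirroring basename(basename(basename(read, ".gz"), ".fastq"), ".fq").
# Each call removes the directory part and, if present, the given suffix.
strip_fastq_ext() {
  name=$(basename "$1" .gz)
  name=$(basename "$name" .fastq)
  basename "$name" .fq
}

# Hypothetical helper: report the pair count when both files agree,
# otherwise emit the "Uneven pairs" message seen in the task.
# The -eq comparison is an assumption; the diff does not show the condition.
summarize_pairs() {
  if [ "$1" -eq "$2" ]; then
    echo "$1"
  else
    echo "Uneven pairs: R1=$1, R2=$2"
  fi
}

strip_fastq_ext /data/sample_R1.fastq.gz
summarize_pairs 10 8
```

Because the pair summary can be either a number or a message, a `String` output type for `read_pairs` is the natural fit, which matches the `String` type listed for the `*_pairs` columns in the documentation tables.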
@@ -65,14 +78,16 @@ task fastq_scan_se {
input {
File read1
String read1_name = basename(basename(basename(read1, ".gz"), ".fastq"), ".fq")
Int disk_size = 100
Int disk_size = 50
Int memory = 2
Int cpu = 2
String docker = "quay.io/biocontainers/fastq-scan:0.4.4--h7d875b9_1"
Int cpu = 1
String docker = "us-docker.pkg.dev/general-theiagen/biocontainers/fastq-scan:1.0.1--h4ac6f70_3"
}
command <<<
# capture date and version
date | tee DATE
# exit task in case anything fails in one-liners or variables are unset
set -euo pipefail

# capture version
fastq-scan -v | tee VERSION

# set cat command based on compression
@@ -83,23 +98,25 @@ task fastq_scan_se {
fi

# capture forward read stats
echo "DEBUG: running fastq-scan on $(basename ~{read1})"
eval "${cat_reads} ~{read1}" | fastq-scan | tee ~{read1_name}_fastq-scan.json
cat ~{read1_name}_fastq-scan.json | jq .qc_stats.read_total | tee READ1_SEQS
# using simple redirect so STDOUT is not confusing
jq .qc_stats.read_total ~{read1_name}_fastq-scan.json > READ1_SEQS
echo "DEBUG: number of reads in $(basename ~{read1}): $(cat READ1_SEQS)"
>>>
output {
File fastq_scan_report = "~{read1_name}_fastq-scan.json"
File fastq_scan_json = "~{read1_name}_fastq-scan.json"
Int read1_seq = read_int("READ1_SEQS")
String version = read_string("VERSION")
String pipeline_date = read_string("DATE")
String fastq_scan_docker = docker
}
runtime {
docker: docker
memory: memory + " GB"
cpu: cpu
disks: "local-disk " + disk_size + " SSD"
disk: disk_size + " GB" # TES
preemptible: 0
disk: disk_size + " GB"
preemptible: 1
maxRetries: 3
}
}
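
Both tasks now start with `set -euo pipefail`, which matters because the read stats come from a pipeline (`... | fastq-scan | tee ...`). The sketch below shows the difference using `false | cat` as a stand-in for a failing first stage; it assumes `bash` is on the PATH, and the variable names are illustrative only.

```shell
# Run each case in a child bash so the options do not leak into this shell.

# Default behaviour: the pipeline's status is that of the last command (cat).
status_default=$(bash -c 'false | cat; echo $?')

# With pipefail: the status is that of the rightmost command that failed.
status_pipefail=$(bash -c 'set -o pipefail; false | cat; echo $?')

# With -e added, the failing pipeline ends the script before the echo runs,
# so nothing is captured. "|| true" keeps this outer script going.
output_strict=$(bash -c 'set -euo pipefail; false | cat; echo "still running"' || true)

echo "default=${status_default} pipefail=${status_pipefail} strict='${output_strict}'"
```

Without these options, a failure while decompressing or scanning the reads could go unnoticed as long as `tee` itself succeeded.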
13 changes: 11 additions & 2 deletions tasks/utilities/data_export/task_broad_terra_tools.wdl
@@ -35,6 +35,10 @@ task export_taxon_tables {
Int? num_reads_raw2
String? num_reads_raw_pairs
String? fastq_scan_version
File? fastq_scan_raw1_json
File? fastq_scan_raw2_json
File? fastq_scan_clean1_json
File? fastq_scan_clean2_json
Int? num_reads_clean1
Int? num_reads_clean2
String? num_reads_clean_pairs
@@ -390,7 +394,8 @@ task export_taxon_tables {
volatile: true
}
command <<<

set -euo pipefail

# capture taxon and corresponding table names from input taxon_tables
taxon_array=($(cut -f1 ~{taxon_tables} | tail +2))
echo "Taxon array: ${taxon_array[*]}"
@@ -446,6 +451,10 @@ task export_taxon_tables {
"num_reads_raw2": "~{num_reads_raw2}",
"num_reads_raw_pairs": "~{num_reads_raw_pairs}",
"fastq_scan_version": "~{fastq_scan_version}",
"fastq_scan_raw1_json": "~{fastq_scan_raw1_json}",
"fastq_scan_raw2_json": "~{fastq_scan_raw2_json}",
"fastq_scan_clean1_json": "~{fastq_scan_clean1_json}",
"fastq_scan_clean2_json": "~{fastq_scan_clean2_json}",
"num_reads_clean1": "~{num_reads_clean1}",
"num_reads_clean2": "~{num_reads_clean2}",
"num_reads_clean_pairs": "~{num_reads_clean_pairs}",
@@ -778,7 +787,7 @@ task export_taxon_tables {
"agrvate_version": "~{agrvate_version}",
"agrvate_docker": "~{agrvate_docker}",
"srst2_vibrio_detailed_tsv": "~{srst2_vibrio_detailed_tsv}",
"srst2_vibrio_version": "~{srst2_vibrio_version}",~
"srst2_vibrio_version": "~{srst2_vibrio_version}",
"srst2_vibrio_docker": "~{srst2_vibrio_docker}",
"srst2_vibrio_database": "~{srst2_vibrio_database}",
"srst2_vibrio_ctxA": "~{srst2_vibrio_ctxA}",
1 change: 1 addition & 0 deletions tasks/utilities/data_export/task_export_two_tsvs.wdl
@@ -18,6 +18,7 @@ task export_two_tsvs {
volatile: true
}
command <<<
set -euo pipefail
python3 /scripts/export_large_tsv/export_large_tsv.py --project ~{terra_project1} --workspace ~{terra_workspace1} --entity_type ~{datatable1} --tsv_filename "~{datatable1}_table1.tsv"

# check if second project is provided; if not, use first
2 changes: 2 additions & 0 deletions tasks/utilities/data_handling/task_summarize_data.wdl
@@ -23,6 +23,8 @@ task summarize_data {
volatile: true
}
command <<<
set -euo pipefail

# when running on terra, comment out all input_table mentions
python3 /scripts/export_large_tsv/export_large_tsv.py --project "~{terra_project}" --workspace "~{terra_workspace}" --entity_type ~{terra_table} --tsv_filename ~{terra_table}-data.tsv

2 changes: 2 additions & 0 deletions tasks/utilities/data_handling/task_theiacov_fasta_batch.wdl
@@ -28,6 +28,8 @@ task sm_theiacov_fasta_wrangling { # the sm stands for supermassive
Int memory = 4
}
command <<<
set -euo pipefail

# check if nextclade json file exists
if [ -f ~{nextclade_json} ]; then
# this line splits into individual json files
4 changes: 4 additions & 0 deletions tasks/utilities/data_import/task_create_terra_table.wdl
@@ -146,6 +146,10 @@ task create_terra_table {
done <filelist-fullpath.txt

echo "DEBUG: terra table created, now beginning upload"

# set error handling to exit if the subsequent import_large_tsv.py task fails
set -euo pipefail

python3 /scripts/import_large_tsv/import_large_tsv.py --project "~{terra_project}" --workspace "~{terra_workspace}" --tsv terra_table_to_upload.tsv
>>>
output {
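
In `create_terra_table`, `set -euo pipefail` is added just before the upload step rather than at the top of the command block. Shell options only affect commands that run after they are set, so earlier failures are still tolerated. A minimal sketch of that behaviour, assuming `bash` is on the PATH and using `false` as a stand-in for any failing command:

```shell
# The first "false" runs before strict mode is enabled and is ignored;
# the second runs after it and stops the script, so "after" is never printed.
late_output=$(bash -c '
  false
  echo "before"
  set -euo pipefail
  false
  echo "after"
' || true)

echo "captured: ${late_output}"
```

Placing the options late limits the stricter behaviour to the `import_large_tsv.py` call, which is what the added comment in the task describes.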
2 changes: 2 additions & 0 deletions tasks/utilities/file_handling/task_transfer_files.wdl
@@ -14,6 +14,8 @@ task transfer_files {
volatile: true
}
command <<<
set -euo pipefail

file_path_array="~{sep=' ' files_to_transfer}"

gsutil -m cp -n ${file_path_array[@]} ~{target_bucket}
2 changes: 2 additions & 0 deletions tasks/utilities/submission/task_submission.wdl
@@ -23,6 +23,8 @@ task prune_table {
volatile: true
}
command <<<
set -euo pipefail

# when running on terra, comment out all input_table mentions
python3 /scripts/export_large_tsv/export_large_tsv.py --project "~{project_name}" --workspace "~{workspace_name}" --entity_type ~{table_name} --tsv_filename ~{table_name}-data.tsv

1 change: 0 additions & 1 deletion tests/config/environment.yml
@@ -2,7 +2,6 @@ name: pytest-env-CI
channels:
- conda-forge
- bioconda
- defaults
dependencies:
- python >=3.7
- cromwell=86
10 changes: 4 additions & 6 deletions tests/workflows/theiacov/test_wf_theiacov_clearlabs.yml
@@ -115,17 +115,16 @@
- path: miniwdl_run/call-fastq_scan_clean_reads/inputs.json
contains: ["read1", "clearlabs"]
- path: miniwdl_run/call-fastq_scan_clean_reads/outputs.json
contains: ["fastq_scan_se", "pipeline_date", "read1_seq"]
contains: ["fastq_scan_se", "read1_seq"]
- path: miniwdl_run/call-fastq_scan_clean_reads/stderr.txt
- path: miniwdl_run/call-fastq_scan_clean_reads/stderr.txt.offset
- path: miniwdl_run/call-fastq_scan_clean_reads/stdout.txt
- path: miniwdl_run/call-fastq_scan_clean_reads/task.log
contains: ["wdl", "theiacov_clearlabs", "fastq_scan_clean_reads", "done"]
- path: miniwdl_run/call-fastq_scan_clean_reads/work/DATE
- path: miniwdl_run/call-fastq_scan_clean_reads/work/READ1_SEQS
md5sum: 097e79b36919c8377c56088363e3d8b7
- path: miniwdl_run/call-fastq_scan_clean_reads/work/VERSION
md5sum: 8e4e9cdfbacc9021a3175ccbbbde002b
md5sum: a59bb42644e35c09b8fa8087156fa4c2
- path: miniwdl_run/call-fastq_scan_clean_reads/work/_miniwdl_inputs/0/clearlabs_R1_dehosted.fastq.gz
- path: miniwdl_run/call-fastq_scan_clean_reads/work/clearlabs_R1_dehosted_fastq-scan.json
md5sum: 869dd2e934c600bba35f30f08e2da7c9
@@ -134,17 +133,16 @@
- path: miniwdl_run/call-fastq_scan_raw_reads/inputs.json
contains: ["read1", "clearlabs"]
- path: miniwdl_run/call-fastq_scan_raw_reads/outputs.json
contains: ["fastq_scan_se", "pipeline_date", "read1_seq"]
contains: ["fastq_scan_se", "read1_seq"]
- path: miniwdl_run/call-fastq_scan_raw_reads/stderr.txt
- path: miniwdl_run/call-fastq_scan_raw_reads/stderr.txt.offset
- path: miniwdl_run/call-fastq_scan_raw_reads/stdout.txt
- path: miniwdl_run/call-fastq_scan_raw_reads/task.log
contains: ["wdl", "theiacov_clearlabs", "fastq_scan_raw_reads", "done"]
- path: miniwdl_run/call-fastq_scan_raw_reads/work/DATE
- path: miniwdl_run/call-fastq_scan_raw_reads/work/READ1_SEQS
md5sum: 097e79b36919c8377c56088363e3d8b7
- path: miniwdl_run/call-fastq_scan_raw_reads/work/VERSION
md5sum: 8e4e9cdfbacc9021a3175ccbbbde002b
md5sum: a59bb42644e35c09b8fa8087156fa4c2
- path: miniwdl_run/call-fastq_scan_raw_reads/work/_miniwdl_inputs/0/clearlabs.fastq.gz
- path: miniwdl_run/call-fastq_scan_raw_reads/work/clearlabs_fastq-scan.json
md5sum: 869dd2e934c600bba35f30f08e2da7c9