Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Snippy_Variants QC outputs to Snippy_Tree and Snippy_Sreamline workflow outputs #592

Merged
merged 16 commits into from
Nov 6, 2024
Merged
Show file tree
Hide file tree
Changes from 9 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -188,6 +188,7 @@ For all cases:
| snippy_centroid_version | String | Centroid version used |
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_concatenated_variants | File | The concatenated variants file |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. The file is a tab-separated values (TSV) file with the following columns:<br>- samplename<br>- reads_aligned_to_reference<br>- total_reads<br>- percent_reads_aligned<br>- variants_total<br>- percent_ref_coverage<br>- #rname<br>- startpos<br>- endpos<br>- numreads<br>- covbases<br>- coverage<br>- meandepth<br>- meanbaseq<br>- meanmapq<br><br>The last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome. |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
| snippy_final_tree | File | Final phylogenetic tree produced by Snippy_Streamline |
Expand Down
1 change: 1 addition & 0 deletions docs/workflows/phylogenetic_construction/snippy_tree.md
Original file line number Diff line number Diff line change
Expand Up @@ -312,6 +312,7 @@ Sequencing data used in the Snippy_Tree workflow must:
|---|---|---|
| snippy_cg_snp_matrix | File | CSV file of core genome pairwise SNP distances between samples, calculated from the final alignment |
| snippy_concatenated_variants | File | Concatenated snippy_results file across all samples in the set |
| snippy_combined_qc_metrics | File | Combined QC metrics file containing concatenated QC metrics from all samples. The file is a tab-separated values (TSV) file with the following columns:<br>- samplename<br>- reads_aligned_to_reference<br>- total_reads<br>- percent_reads_aligned<br>- variants_total<br>- percent_ref_coverage<br>- #rname<br>- startpos<br>- endpos<br>- numreads<br>- covbases<br>- coverage<br>- meandepth<br>- meanbaseq<br>- meanmapq<br><br>The last set of columns (`#rname` to `meanmapq`) may repeat for each chromosome or contig in the reference genome. |
| snippy_filtered_metadata | File | TSV recording the columns of the Terra data table that were used in the summarize_data task |
| snippy_final_alignment | File | Final alignment (FASTA file) used to generate the tree (either after snippy alignment, gubbins recombination removal, and/or core site selection with SNP-sites) |
| snippy_final_tree | File | Newick tree produced from the final alignment. Depending on user input for core_genome, the tree could be a core genome tree (default when core_genome is true) or whole genome tree (if core_genome is false) |
Expand Down
45 changes: 39 additions & 6 deletions tasks/gene_typing/variant_detection/task_snippy_variants.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -85,23 +85,55 @@ task snippy_variants {
reference_length_passed_depth=$(cat "~{samplename}/~{samplename}_depth_~{min_coverage}.tsv" | wc -l)
echo $reference_length_passed_depth | tee REFERENCE_LENGTH_PASSED_DEPTH

# check if reference_length is equal to 0, if so, output a warning
if [ "$reference_length" -eq 0 ]; then
echo "Could not compute percent reference coverage: reference length is 0" > PERCENT_REF_COVERAGE
echo "0" > PERCENT_REF_COVERAGE
fraser-combe marked this conversation as resolved.
Show resolved Hide resolved
else
# compute percent reference coverage
echo $reference_length_passed_depth $reference_length | awk '{ print ($1/$2)*100 }' > PERCENT_REF_COVERAGE
echo $reference_length_passed_depth $reference_length | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_REF_COVERAGE
fi

# Compute percentage of reads aligned
reads_aligned=$(cat READS_ALIGNED_TO_REFERENCE)
total_reads=$(samtools view -c "~{samplename}/~{samplename}.bam")
echo $total_reads > TOTAL_READS
if [ "$total_reads" -eq 0 ]; then
echo "Could not compute percent reads aligned: total reads is 0" > PERCENT_READS_ALIGNED
echo "0" > PERCENT_READS_ALIGNED
else
echo $reads_aligned $total_reads | awk '{ print ($1/$2)*100 }' > PERCENT_READS_ALIGNED
echo $reads_aligned $total_reads | awk '{ printf("%.2f", ($1/$2)*100) }' > PERCENT_READS_ALIGNED
fi

# Create QC metrics file
line_count=$(wc -l < "~{samplename}/~{samplename}_coverage.tsv")
# Check the number of lines in the coverage file, to consider scenarios e.g. for V. cholerae that has two chromosomes and therefore coverage metrics per chromosome
if [ "$line_count" -eq 2 ]; then
head -n 1 "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_HEADER
sed -n '2p' "~{samplename}/~{samplename}_coverage.tsv" | tr ' ' '\t' > COVERAGE_VALUES
elif [ "$line_count" -gt 2 ]; then
# Multiple chromosomes (header + multiple data lines)
header=$(head -n 1 "~{samplename}/~{samplename}_coverage.tsv")
output_header=""
output_values=""
# while loop to iterate over each line in the coverage file
while read -r line; do
if [ -z "$output_header" ]; then
output_header="$header"
output_values="$line"
else
output_header="$output_header\t$header"
output_values="$output_values\t$line"
fi
done < <(tail -n +2 "~{samplename}/~{samplename}_coverage.tsv")
echo "$output_header" | tr ' ' '\t' > COVERAGE_HEADER
echo "$output_values" | tr ' ' '\t' > COVERAGE_VALUES
else
# Coverage file has insufficient data
echo "Coverage file has insufficient data." > COVERAGE_HEADER
echo "" > COVERAGE_VALUES
fi

# Build the QC metrics file
echo -e "samplename\treads_aligned_to_reference\ttotal_reads\tpercent_reads_aligned\tvariants_total\tpercent_ref_coverage\t$(cat COVERAGE_HEADER)" > "~{samplename}/~{samplename}_qc_metrics.tsv"
echo -e "~{samplename}\t$reads_aligned\t$total_reads\t$(cat PERCENT_READS_ALIGNED)\t$(cat VARIANTS_TOTAL)\t$(cat PERCENT_REF_COVERAGE)\t$(cat COVERAGE_VALUES)" >> "~{samplename}/~{samplename}_qc_metrics.tsv"

>>>
output {
String snippy_variants_version = read_string("VERSION")
Expand All @@ -120,6 +152,7 @@ task snippy_variants {
String snippy_variants_ref_length = read_string("REFERENCE_LENGTH")
String snippy_variants_ref_length_passed_depth = read_string("REFERENCE_LENGTH_PASSED_DEPTH")
String snippy_variants_percent_ref_coverage = read_string("PERCENT_REF_COVERAGE")
File snippy_variants_qc_metrics = "~{samplename}/~{samplename}_qc_metrics.tsv"
String snippy_variants_percent_reads_aligned = read_string("PERCENT_READS_ALIGNED")
}
runtime {
Expand Down
4 changes: 3 additions & 1 deletion workflows/phylogenetics/wf_snippy_streamline.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -50,7 +50,8 @@ workflow snippy_streamline {
tree_name = tree_name_updated,
snippy_variants_outdir_tarball = snippy_variants_wf.snippy_variants_outdir_tarball,
samplenames = samplenames,
reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta])
reference_genome_file = select_first([reference_genome_file, ncbi_datasets_download_genome_accession.ncbi_datasets_assembly_fasta]),
snippy_variants_qc_metrics = snippy_variants_wf.snippy_variants_qc_metrics
}
call versioning.version_capture {
input:
Expand Down Expand Up @@ -111,5 +112,6 @@ workflow snippy_streamline {
File? snippy_filtered_metadata = snippy_tree_wf.snippy_filtered_metadata
File? snippy_concatenated_variants = snippy_tree_wf.snippy_concatenated_variants
File? snippy_shared_variants_table = snippy_tree_wf.snippy_shared_variants_table
File? snippy_combined_qc_metrics = snippy_tree_wf.snippy_combined_qc_metrics
}
}
12 changes: 12 additions & 0 deletions workflows/phylogenetics/wf_snippy_tree.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -23,6 +23,7 @@ workflow snippy_tree_wf {
Boolean use_gubbins = true
Boolean core_genome = true
Boolean call_shared_variants = true
Array[File]? snippy_variants_qc_metrics

String? data_summary_terra_project
String? data_summary_terra_workspace
Expand Down Expand Up @@ -186,6 +187,14 @@ workflow snippy_tree_wf {
concatenated_file_name = tree_name_updated
}
}
if (defined(snippy_variants_qc_metrics)) {
call file_handling.cat_files as concatenate_qc_metrics {
input:
files_to_cat = select_first([snippy_variants_qc_metrics]),
concatenated_file_name = tree_name_updated + "_combined_qc_metrics.tsv",
skip_extra_headers = true
}
}
call versioning.version_capture {
input:
}
Expand Down Expand Up @@ -233,5 +242,8 @@ workflow snippy_tree_wf {
# shared snps outputs
File? snippy_concatenated_variants = concatenate_variants.concatenated_variants
File? snippy_shared_variants_table = shared_variants.shared_variants_table

# combined qc metrics
File? snippy_combined_qc_metrics = concatenate_qc_metrics.concatenated_files
}
}
1 change: 1 addition & 0 deletions workflows/standalone_modules/wf_snippy_variants.wdl
Original file line number Diff line number Diff line change
Expand Up @@ -70,6 +70,7 @@ workflow snippy_variants_wf {
File snippy_variants_coverage_tsv = snippy_variants.snippy_variants_coverage_tsv
Int snippy_variants_num_variants = snippy_variants.snippy_variants_num_variants
Float snippy_variants_percent_ref_coverage = snippy_variants.snippy_variants_percent_ref_coverage
File snippy_variants_qc_metrics = snippy_variants.snippy_variants_qc_metrics
Float snippy_variants_percent_reads_aligned = snippy_variants.snippy_variants_percent_reads_aligned
# snippy gene query outputs
String? snippy_variants_query = snippy_gene_query.snippy_variants_query
Expand Down