Updates closes #56 and additions to the output.md file

sanger-tol · Aug 8, 2024 · 6008006 · 6008006
1 parent 7405a9b
commit 6008006
Showing 9 changed files with 359 additions and 30 deletions.
diff --git a/docs/images/mqc_fastqc_adapter.png b/docs/images/mqc_fastqc_adapter.png
diff --git a/docs/images/mqc_fastqc_counts.png b/docs/images/mqc_fastqc_counts.png
diff --git a/docs/images/mqc_fastqc_quality.png b/docs/images/mqc_fastqc_quality.png
diff --git a/docs/output.md b/docs/output.md
@@ -12,46 +12,374 @@ The directories listed below will be created in the results directory after the
 
 The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps:
 
-- [FastQC](#fastqc) - Raw read QC
-- [MultiQC](#multiqc) - Aggregate report describing results and QC from the whole pipeline
+- [YamlInput](#yamlinput) -
+- [Validate TaxID](#validate-taxid) -
+- [Filter Fasta](#filter-fasta) -
+- [GC Content](#gc-content) -
+- [Generate Genome](#generate-genome) -
+- [Trailing Ns Check](#trailing-ns-check) -
+- [Get KMERS profile](#get-kmers-profile) -
+- [Extract Tiara Hits](#extract-tiara-hits) -
+- [Mito organellar blast](#mito-organellar-blast) -
+- [Plastid organellar blast](#plastid-organellar-blast) -
+- [Run FCS Adaptor](#run-fcs-adaptor) -
+- [Run FCS-GX](#run-fcs-gx) -
+- [Pacbio Barcode Check](#pacbio-barcode-check) -
+- [Run Read Coverage](#run-read-coverage) -
+- [Run Vecscreen](#run-vecscreen) -
+- [Run NT Kraken](#run-nt-kraken) -
+- [Nucleotide Diamond Blast](#nucleotide-diamond-blast) -
+- [Uniprot Diamond Blast](#uniprot-diamond-blast) -
+- [Create BTK dataset](#create-btk-dataset) -
+- [Autofilter and check assembly](#autofilter-and-check-assembly) -
+- [Generate samplesheet](#generate-samplesheet) -
+- [Sanger-TOL BTK](#sanger-tol-btk) -
+- [Merge BTK datasets](#merge-btk-datasets) -
+- [ASCC Merge Tables](#ascc-merge-tables) -
 - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution
 
-### FastQC
+### YamlInput
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `fastqc/`
-  - `*_fastqc.html`: FastQC report containing quality metrics.
-  - `*_fastqc.zip`: Zip archive containing the FastQC report, tab-delimited data file and plot images.
+- `NA`
 
 </details>
 
-[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the [FastQC help pages](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/Help/).
+YamlInput parses the input yaml into channels for later use in the pipeline.
 
-![MultiQC - FastQC sequence counts plot](images/mqc_fastqc_counts.png)
 
-![MultiQC - FastQC mean quality scores plot](images/mqc_fastqc_quality.png)
+### Validate TaxID
 
-![MultiQC - FastQC adapter content plot](images/mqc_fastqc_adapter.png)
+<details markdown="1">
+<summary>Output files</summary>
+
+- `NA`
+
+</details>
+
+Validate TaxID scans the NCBI taxdump to ensure that the input taxid is present.
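
The check described above can be sketched in Python as follows. This is only an illustration, not the pipeline's own code: it assumes the taxdump's `nodes.dmp` file is the one scanned, and the function name `taxid_in_nodes` is hypothetical.

```python
import io


def taxid_in_nodes(nodes_handle, taxid):
    """Return True if `taxid` appears as a tax_id in an NCBI nodes.dmp stream."""
    target = str(taxid)
    for line in nodes_handle:
        # nodes.dmp fields are separated by "\t|\t"; the first field is the tax_id.
        if line.split("\t|\t", 1)[0].strip() == target:
            return True
    return False


# Two made-up rows in nodes.dmp layout, used only for demonstration.
example = io.StringIO(
    "1\t|\t1\t|\tno rank\t|\n"
    "9606\t|\t9605\t|\tspecies\t|\n"
)
print(taxid_in_nodes(example, 9606))
```

Streaming the file line by line keeps memory use flat, which matters because the full taxdump has several million rows.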
+
+
+### Filter Fasta
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `filter/`
+  - `*filtered.fasta` - A FASTA file containing only the sequences below a given length threshold.
+
+</details>
+
+By default, scaffolds above 1.9Gb are removed from the assembly, as scaffolds of this size are unlikely to truly be contamination. Scaffolds larger than this also use a significant amount of resources, which hinders production environments.
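
A minimal Python sketch of this kind of length filter is shown below. It is an assumption-laden illustration rather than the module's implementation: the helper names `read_fasta` and `filter_by_length` are hypothetical, and only the 1.9Gb default comes from the text above.

```python
import io

MAX_LENGTH = 1_900_000_000  # 1.9Gb default, taken from the description above


def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA stream."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(records, max_length=MAX_LENGTH):
    """Keep only records whose sequence is no longer than `max_length`."""
    for header, seq in records:
        if len(seq) <= max_length:
            yield header, seq


# Demonstration with a tiny threshold so the effect is visible.
demo = io.StringIO(">scaffold_1\nACGTACGT\n>scaffold_2\nACG\n")
for header, seq in filter_by_length(read_fasta(demo), max_length=4):
    print(header, len(seq))
```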
+
+
+### GC Content
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `gc/`
+  - `*-GC_CONTENT.txt` - A text file describing the GC content of the input genome.
+
+</details>
+
+Calculates the GC content of the input genome.
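
For reference, GC content is the fraction of G and C bases in a sequence. The sketch below assumes that ambiguous bases such as N are excluded from the denominator; the module may handle them differently, and the function name `gc_content` is hypothetical.

```python
def gc_content(sequence):
    """Return the fraction of G and C among the A, C, G and T bases (Ns ignored)."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    # Guard against empty or all-N input.
    return gc / total if total else 0.0


print(gc_content("ATGCNN"))  # 2 of the 4 unambiguous bases are G or C -> 0.5
```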
+
+
+### Generate Genome
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `generate/`
+  - `*.genome` - An index-like file describing the input genome.
+
+</details>
+
+Generates an index-like file listing each scaffold in the input genome and its length.
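
As a rough illustration, a file of scaffold names and lengths could be built as below. The tab-separated layout is an assumption (the text above only says the file holds the scaffold and its length), and `genome_lines` is a hypothetical helper name.

```python
def genome_lines(records):
    """Return one 'name<TAB>length' line per (name, sequence) record."""
    return [f"{name}\t{len(seq)}" for name, seq in records]


# Made-up records used only for demonstration.
records = [("scaffold_1", "ACGTACGT"), ("scaffold_2", "ACG")]
print("\n".join(genome_lines(records)))
```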
+
+
+### Trailing Ns Check
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `trailingns/`
+  - `*_trim_Ns` - A text file containing a report of the Ns found in the genome.
+
+</details>
+
+Checks the input genome for trailing Ns and writes a text report of the Ns found.
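
One way to detect terminal runs of Ns is sketched below. It is an assumption about what the check looks for, based only on the section title; the report format of the real module is not described here, and `terminal_n_runs` is a hypothetical name.

```python
def terminal_n_runs(sequence):
    """Return (leading, trailing) counts of N bases at the ends of a sequence."""
    seq = sequence.upper()
    leading = len(seq) - len(seq.lstrip("N"))
    trailing = len(seq) - len(seq.rstrip("N"))
    return leading, trailing


print(terminal_n_runs("NNACGTN"))  # two leading Ns and one trailing N
```

Note that a sequence made up entirely of Ns is counted in full at both ends.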
+
+
+### Get KMERS profile
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `get/`
+  - `*_KMER_COUNTS.csv` - A CSV file containing kmers and their counts.
+
+</details>
+
+Produces a CSV file containing kmers and their counts.
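
Kmer counting itself can be sketched with a sliding window, as below. The column names `kmer` and `count`, the choice of `k`, and both function names are assumptions made for illustration; the real `*_KMER_COUNTS.csv` layout may differ.

```python
import csv
import io
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in the sequence."""
    seq = sequence.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def kmer_counts_csv(sequence, k):
    """Render the kmer counts as CSV text, sorted by kmer."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["kmer", "count"])  # hypothetical header row
    for kmer, n in sorted(count_kmers(sequence, k).items()):
        writer.writerow([kmer, n])
    return buf.getvalue()


print(kmer_counts_csv("ACGAC", 2))
```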
+
+
+### Extract Tiara Hits
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `tiara/`
+  - `*.{txt,txt.gz}` - A text file containing classifications of potential contaminants.
+  - `log_*.{txt,txt.gz}` - A log of the tiara run.
+  - `*.{fasta,fasta.gz}` - An output fasta file.
+
+</details>
+
+Tiara ...
+
+
+### Mito Organellar Blast
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `blast/`
+  - `*.tsv` - A TSV file containing potential contaminants.
+
+</details>
+
+A BlastN-based subworkflow run on the input genome to filter potential contaminants from it.
+
+
+### Plastid Organellar Blast
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `blast/`
+  - `*.tsv` - A TSV file containing potential contaminants.
+
+</details>
+
+A BlastN-based subworkflow run on the input genome to filter potential contaminants from it.
+
+
+### Run FCS Adaptor
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `fcs/`
+  - `*.fcs_adaptor_report.txt` - A text file containing potential adaptor sequences and locations.
+  - `*.cleaned_sequences.fa.gz` - Cleaned fasta file.
+  - `*.fcs_adaptor.log` - Log of the fcs run.
+  - `*.pipeline_args.yaml` - Arguments to FCS Adaptor.
+  - `*.skipped_trims.jsonl` - Skipped sequences.
+
+</details>
+
+FCS Adaptor identifies potential locations of retained adaptor sequences from the sequencing run.
+
+
+### Run FCS-GX
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `fcs/`
+  - `*out/*.fcs_gx_report.txt` - A text file containing potential contaminant locations.
+  - `out/*.taxonomy.rpt` - Taxonomy report of the potential contaminants.
+
+</details>
+
+FCS-GX identifies potential locations of contaminant sequences.
+
+
+### Pacbio Barcode Check
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `filter/`
+  - `*_filtered.txt` - Text file of barcodes found in the genome.
+
+</details>
+
+Uses BlastN to identify where given barcode sequences may be in the genome.
+
+
+### Run Read Coverage
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `samtools/`
+  - `*.bam` - Aligned BAM file.
+  - `*_average_coverage.txt` - Text file containing the coverage information for the genome.
+
+</details>
+
+Maps the read data to the input genome and calculates the average coverage across it.
+
+
+### Run Vecscreen
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `summarise/`
+  - `*.vecscreen_contamination` - A text file containing potential vector contaminant locations.
+
+</details>
+
+Vecscreen identifies vector contamination in the input sequence.
+
+
+### Run NT Kraken
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `kraken2/`
+  - `*.classified{.,_}*` - Fastq file containing classified sequence.
+  - `*.unclassified{.,_}*` - Fastq file containing unclassified sequence.
+  - `*classifiedreads.txt` - A text file containing a report on reads which have been classified.
+  - `*report.txt` - Report of Kraken2 run.
+- `get/`
+  - `*txt` - Text file containing lineage information of the reported metagenomic data.
+
+</details>
+
+Kraken assigns taxonomic labels to metagenomic DNA sequences and optionally outputs the fastq of these data.
+
+
+### Nucleotide Diamond Blast
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `diamond/`
+  - `*.txt` - A text file containing the genomic locations of hits and scores.
+- `reformat/`
+  - `*text` - A reformatted text file containing the full genomic location of hits and scores.
+- `convert/`
+  - `*.hits` - A file containing all hits above the cutoff.
+
+</details>
+
+Diamond Blast is a sequence aligner for translated and protein sequences. Here it is used to identify contamination using the NCBI database.
+
+
+### Uniprot Diamond Blast
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `diamond/`
+  - `*.txt` - A text file containing the genomic locations of hits and scores.
+- `reformat/`
+  - `*text` - A reformatted text file containing the full genomic location of hits and scores.
+- `convert/`
+  - `*.hits` - A file containing all hits above the cutoff.
+
+</details>
+
+Diamond Blast is a sequence aligner for translated and protein sequences. Here it is used to identify contamination using the UniProt database.
+
+
+### Create BTK dataset
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `create/`
+  - `btk_datasets/` - A BTK dataset folder containing data compatible with BTK Viewer.
+  - `btk_summary_table_full.tsv` - A TSV file summarising the dataset.
+
+</details>
+
+Create BTK dataset creates a BTK dataset folder compatible with BTK Viewer.
+
+
+### Autofilter and check assembly
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `autofilter/`
+  - `autofiltered.fasta` - The decontaminated input genome.
+  - `ABNORMAL_CHECK.csv` - Combined FCS and Tiara summary of contamination.
+  - `assembly_filtering_removed_sequences.txt` - Sequences deemed contamination and removed from the above assembly.
+  - `fcs-gx_alarm_indicator_file.txt` - Contains text to control the running of BlobToolKit.
+
+</details>
+
+Autofilter and check assembly returns a decontaminated genome file as well as summaries of the contamination found.
+
+
+### Generate samplesheet
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `generate/`
+  - `*.csv` - A CSV file containing data locations, for use in BlobToolKit.
+
+</details>
+
+This produces a CSV containing information on the read data for use in BlobToolKit.
+
+
+### Sanger-TOL BTK
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `sanger/`
+  - `*_btk_out/blobtoolkit/${meta.id}*/` - The BTK dataset folder generated by BTK.
+  - `*_btk_out/blobtoolkit/plots/` - The plots for display in BTK Viewer.
+  - `*_btk_out/blobtoolkit/${meta.id}*/summary.json.gz` - The Summary.json file...
+  - `*_btk_out/busco/*` - The BUSCO results returned by BTK.
+  - `*_btk_out/multiqc/*` - The MultiQC results returned by BTK.
+  - `blobtoolkit_pipeline_info` - The pipeline_info folder.
+
+</details>
+
+Sanger-Tol/BlobToolKit is a Nextflow re-implementation of the Snakemake-based BlobToolKit pipeline. It produces interactive plots used to identify true contamination and separate sequence from the main assembly.
+
+
+### Merge BTK datasets
+
+<details markdown="1">
+<summary>Output files</summary>
+
+- `merge/`
+  - `merged_datasets` - A BTK dataset.
+  - `merged_datasets/btk_busco_summary_table_full.tsv` - A TSV file containing a summary of the BTK BUSCO results.
+
+</details>
+
+This module merges the Create_btk_dataset folder with the Sanger-tol BTK dataset to create one unified dataset for use with BTK Viewer.
 
-> **NB:** The FastQC plots displayed in the MultiQC report shows _untrimmed_ reads. They may contain adapter sequence and potentially regions with low quality.
 
-### MultiQC
+### ASCC Merge Tables
 
 <details markdown="1">
 <summary>Output files</summary>
 
-- `multiqc/`
-  - `multiqc_report.html`: a standalone HTML file that can be viewed in your web browser.
-  - `multiqc_data/`: directory containing parsed statistics from the different tools used in the pipeline.
-  - `multiqc_plots/`: directory containing static images from the report in various formats.
+- `ascc/`
+  - `*_contamination_check_merged_table.csv` - ....
+  - `*_contamination_check_merged_table_extended.csv` - ....
+  - `*_phylum_counts_and_coverage.csv` - A CSV report containing information on the hits per phylum and the coverage of the hits.
 
 </details>
 
-[MultiQC](http://multiqc.info) is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
+Merge Tables merges the summary reports from a number of modules in order to create a single set of reports.
 
-Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see <http://multiqc.info>.
 
 ### Pipeline information