diff --git a/.github/workflows/linting.yml b/.github/workflows/linting.yml index 888cb4bc..457b5f39 100644 --- a/.github/workflows/linting.yml +++ b/.github/workflows/linting.yml @@ -14,9 +14,11 @@ jobs: EditorConfig: runs-on: ubuntu-latest steps: - - uses: actions/checkout@v3 + - uses: actions/checkout@v4 - - uses: actions/setup-node@v3 + - uses: actions/setup-node@v4 + with: + node-version: "20.11.0" - name: Install editorconfig-checker run: npm install -g editorconfig-checker diff --git a/README.md b/README.md index 25ab772d..03712f50 100644 --- a/README.md +++ b/README.md @@ -8,30 +8,38 @@ --- -# THIS IS AN IN-DEVELOPMENT PIPELINE THAT IS CURRENTLY NOT READY FOR ANY USE - -AS SUCH YOU MAY FIND THAT THE DOCUMENTATION DOES NOT MATCH THE CODE AND IT MAY NOT WORK +## Introduction -ONCE THE PIPELINE REACHES A USABLE STATE A TAGGED RELEASE/PRE-RELEASE WILL BE MADE +**sanger-tol/ascc** is a bioinformatics pipeline for detecting cobionts and contaminants in genome assemblies. ASCC stands for Assembly Screen for Cobionts and Contaminants. The pipeline aggregates tools such as BLAST, GC and coverage calculation, FCS-adaptor, FCS-GX, VecScreen, BlobToolKit, the BlobToolKit pipeline, Tiara, Kraken, Diamond BLASTX, and kmer counting with kcounter+scipy. The main outputs are: ---- +- A CSV table with taxonomic classifications of the sequences from the constituent tools. +- A BlobToolKit dataset that can contain variables that are not present in BlobToolKit datasets produced by the BlobToolKit pipeline (https://github.com/sanger-tol/blobtoolkit) on its own. For example, ASCC can incorporate FCS-GX results into a BlobToolKit dataset. +- Individual report files for adapter, PacBio barcode and organellar contaminants. + The only required input file for ASCC is the assembly FASTA file. Optional inputs are sequencing reads and organellar FASTA files. All individual components of the pipeline are optional, so it is possible to do lightweight runs with assemblies that have a simple composition of species and comprehensive runs with assemblies with complex composition. -## Introduction +![sanger-tol/ascc overview diagram](docs/images/ascc_overview_diagram.png) -**sanger-tol/ascc** is a bioinformatics pipeline that ... +1. Run a selection of processes from the list below (pick any that you think will be useful). - +- FCS-GX +- FCS-adaptor +- VecScreen +- Tiara +- BlobToolKit Pipeline +- nt BLAST +- nr and Uniprot Diamond BLASTX +- GC and coverage calculation +- PacBio barcodes screen +- Organellar BLAST +- nt Kraken2 +- kmer counting + dimensionality reduction - - +2. Postprocess the results of the previous step to produce summary files. The processes run in the previous step determine which summary files can be generated. The possible outputs are: -1. Read QC ([`FastQC`](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)) -2. Present QC for raw reads ([`MultiQC`](http://multiqc.info/)) +- CSV table of sequence classification results +- BlobToolKit dataset +- CSV table of average coverage per phylum +- Adapter and organellar contamination report files ## Usage @@ -63,8 +71,8 @@ Now, you can run the pipeline using: ```bash nextflow run sanger-tol/ascc \ -profile <docker/singularity/.../institute> \ - --input samplesheet.csv \ - --outdir <OUTDIR> + --input <YAML> \ + --outdir <OUTDIR> -entry SANGERTOL_ASCC --include ALL ``` > **Warning:** @@ -74,11 +82,9 @@ nextflow run sanger-tol/ascc \ ## Credits -sanger-tol/ascc was originally written by eeaunin.
- -We thank the following people for their extensive assistance in the development of this pipeline: +sanger-tol/ascc was written by [Eerik Aunin](https://github.com/eeaunin), [Damon Lee Pointon](https://github.com/DLBPointon), [James Torrance](https://github.com/jt8-sanger), [Ying Sims](https://github.com/yumisims) and [Will Eagles](https://github.com/weaglesBio). Pipeline development was supervised by [Shane A. McCarthy](https://github.com/mcshane) and [Matthieu Muffato](https://github.com/muffato). - +We thank [Michael Paulini](https://github.com/epaule), Camilla Santos, [Noah Gettle](https://github.com/gettl008) and [Ksenia Krasheninnikova](https://github.com/ksenia-krasheninnikova) for testing the pipeline. ## Contributions and Support diff --git a/docs/images/ascc_overview_diagram.png b/docs/images/ascc_overview_diagram.png new file mode 100644 index 00000000..71a5478d Binary files /dev/null and b/docs/images/ascc_overview_diagram.png differ diff --git a/docs/output.md b/docs/output.md index 99d0d6b2..363e6332 100644 --- a/docs/output.md +++ b/docs/output.md @@ -2,371 +2,362 @@ ## Introduction -This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline. +This document describes the output produced by the pipeline. The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory. - - ## Pipeline overview The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: -- [YamlInput](#yamlinput) - -- [Validate TaxID](#validate-taxid) - -- [Filter Fasta](#filter-fasta) - -- [GC Content](#gc-content) - -- [Generate Genome](#generate-genome) - -- [Trailing Ns Check](#trailing-ns-check) - -- [Get KMERS profile](#get-kmers-profile) - -- [Extract Tiara Hits](#extract-tiara-hits) - -- [Mito organellar blast](#mito-organellar-blast) - +### Processes that produce the main outputs: + +- [Trailing Ns Check](#trailing-ns-check) +- [Mito Organellar Blast](#mito-organellar-blast) - - [Plastid organellar blast](#plastid-organellar-blast) - - [Run FCS Adaptor](#run-fcs-adaptor) - -- [Run FCS-GX](#run-fcs-gx) - - [Pacbio Barcode Check](#pacbio-barcode-check) - - [Run Read Coverage](#run-read-coverage) - -- [Run Vecscreen](#run-vecscreen) - -- [Run NT Kraken](#run-nt-kraken) - -- [Nucleotide Diamond Blast](#nucleotide-diamond-blast) - -- [Uniprot Diamond Blast](#uniprot-diamond-blast) - -- [Create BTK dataset](#create-btk-dataset) - -- [Autofilter and check assembly](#autofilter-and-check-assembly) - -- [Generate samplesheet](#generate-samplesheet) - +- [Run VecScreen](#run-vecscreen) - +- [Create BTK Dataset](#create-btk-dataset) - +- [Autofilter and Check Assembly](#autofilter-and-check-assembly) - - [Sanger-TOL BTK](#sanger-tol-btk) - - [Merge BTK datasets](#merge-btk-datasets) - - [ASCC Merge Tables](#ascc-merge-tables) - - [Pipeline information](#pipeline-information) - Report metrics generated during the workflow execution -### YamlInput +### Processes that produce intermediate outputs: + +- [YamlInput](#yamlinput) - +- [Generate samplesheet](#generate-samplesheet) - +- [Validate TaxID](#validate-taxid) - +- [Generate Genome](#generate-genome) - +- [Filter Fasta](#filter-fasta) - +- [GC Content](#gc-content) - +- [Get kmers profile](#get-kmers-profile) - +- [Extract Tiara Hits](#extract-tiara-hits) - +- [Run FCS-GX](#run-fcs-gx) - +- [Run nt 
Kraken](#run-nt-kraken) - +- [nr Diamond BLASTX](#nr-diamond-blastx) - +- [Uniprot Diamond BLASTX](#uniprot-diamond-blastx) - + +## Main outputs + +### Trailing Ns Check
Output files -- `NA` +- `trailingns/` + `*_trim_Ns` - A text file containing a report of trailing Ns found in the genome.
-YamlInput parses the input yaml into channels for later use in the pipeline. +A text file containing a report of trailing Ns found in the genome. Trailing Ns occur when a nucleotide sequence starts or ends with Ns instead of A, G, C or T nucleotides. It is advisable to trim off the trailing Ns from sequences in the assembly. If the sequence remaining after trimming is shorter than 200 bp, the script recommends removing it from the assembly. -### Validate TaxID +### Mito Organellar Blast
Output files - -- `NA` - +- `organelle/` + `*-mitochondrial_genome.contamination_recommendation` - A file that contains the names of sequences that are suspected mitochondrial contaminants in the nuclear DNA assembly, tagged as either "REMOVE" or "Investigate" depending on the BLAST hit alignment length and percentage identity. The file is empty if there are no suspected mitochondrial contaminants.
-Validate TaxID scans through the taxdump to ensure that the input taxid is present in the nxbi taxdump. +This subworkflow uses BLAST against a user-provided mitochondrial sequence to detect leftover organellar sequences in the assembly file that should contain only chromosomal DNA sequences. A BLAST nucleotide database is made from the user-provided organellar sequence. BLAST with the chromosomal DNA assembly is then run against this database with the following settings: -task megablast -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 -dust yes -evalue 0.0001 -perc_identity 80 -soft_masking true. The BLAST results are filtered to keep only hits with an alignment length of at least 200 bp. +Depending on the alignment length and percentage identity, the script can recommend an action for dealing with the putative organellar sequence: either "REMOVE" or "Investigate".
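As a sketch, the database construction and search described above map to BLAST+ commands along these lines (file names are illustrative; the pipeline's own modules wrap these calls):

```bash
# Build a nucleotide BLAST database from the user-provided mitochondrial sequence
makeblastdb -in mitochondrion.fasta -dbtype nucl -out mito_db

# Search the nuclear assembly against it with the settings listed above
# (-outfmt 6 is an assumed tabular output format, not confirmed by the docs)
blastn -task megablast -query assembly.fasta -db mito_db \
  -word_size 28 -best_hit_overhang 0.1 -best_hit_score_edge 0.1 \
  -dust yes -evalue 0.0001 -perc_identity 80 -soft_masking true \
  -outfmt 6 -out mito_hits.tsv
```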
-### Filter Fasta +### Plastid Organellar Blast
-Output files - -- `filter/` - `*filtered.fasta` - A fasta file that has been filtered for sequences below a given threshold. +- `organelle/` + `*-plastid_genome.contamination_recommendation` - A file that contains the names of sequences that are suspected plastid contaminants in the nuclear DNA assembly, tagged as either "REMOVE" or "Investigate" depending on the BLAST hit alignment length and percentage identity. The file is empty if there are no suspected plastid contaminants.
-By default scaffolds above 1.9Gb are removed from the assembly, as scaffolds of this size are unlikely to truely have contamination. There is also the issue that scaffolds larger than this use a significant amount of resources which hinders production environments. +This subworkflow uses BLAST against a user-provided plastid sequence to detect leftover organellar sequences in the assembly file that should contain only chromosomal sequences. The method is the same as in the Mito Organellar Blast part. -### GC Content +### Run FCS-adaptor
Output files -- `gc/` - `*-GC_CONTENT.txt` - A text file describing the GC content of the input genome. +- `fcs/` + `*.fcs_adaptor_report.txt` - A text file containing potential adaptor sequences and locations. + `*.cleaned_sequences.fa.gz` - Cleaned FASTA file. + `*.fcs_adaptor.log` - Log of the FCS-adaptor run. + `*.pipeline_args.yaml` - Arguments passed to FCS-adaptor. + `*.skipped_trims.jsonl` - Skipped sequences.
-Calculating the GC content of the input genome. +FCS-adaptor (https://github.com/ncbi/fcs) is NCBI software for detecting adapter contamination in genome assemblies. FCS-adaptor uses a built-in database of adapter sequences, provided by NCBI. The FCS-adaptor report lists the identified potential locations of retained adapter sequences from the sequencing run. -### Generate Genome +### Pacbio Barcode Check
Output files -- `generate/` - `*.genome` - An index-like file describing the input genome. +- `filter/` + `*_filtered.txt` - A text file listing PacBio barcode sequences found in the genome. The file is empty if no contamination was found.
-An index-like file containing the scaffold and scaffold length of the input genome. +Uses BlastN to identify retained PacBio multiplexing barcode contamination in the assembly. The PacBio multiplexing barcode sequences are stored as the pacbio_adaptors.fa file in the assets directory of this pipeline. -### Trailing Ns Check +### Run Read Coverage
Output files - -- `trailingns/` - `*_trim_Ns` - A text file containing a report of the Ns found in the genome. + `*.bam` - BAM file with aligned reads. + `*_average_coverage.txt` - Text file containing the coverage information for the genome.
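A sketch of how a sorted BAM and a per-sequence average coverage table like the above can be produced with minimap2 and samtools (assumed invocations and file names; the pipeline's own modules, described below, may use different settings):

```bash
# Map PacBio HiFi reads to the assembly and sort the alignments
minimap2 -ax map-hifi assembly.fasta hifi_reads.fastq.gz | samtools sort -o mapped.bam
samtools index mapped.bam

# Mean depth per sequence: `samtools coverage` reports it in column 7 (meandepth)
samtools coverage mapped.bam | cut -f 1,7 > average_coverage.txt
```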
-A text file containing a report of the Ns found in the genome. +Mapping the read data to the input genome with minimap2 (https://github.com/lh3/minimap2) and calculating the average coverage per sequence. The reads used for mapping can be PacBio HiFi reads or paired-end Illumina reads. -### Get KMERS profile +### Run VecScreen
Output files -- `get/` - `*_KMER_COUNTS.csv` - A csv file containing kmers and their counts. +- `summarise/` + `*.vecscreen_contamination` - A text file containing potential vector contaminant locations. The file is empty if no potential contaminants were found.
-A csv file containing kmers and their counts. +VecScreen (https://www.ncbi.nlm.nih.gov/tools/vecscreen/) is a tool for detecting adapter and vector contamination in genome assemblies. It is an older tool than FCS-adaptor. Its advantage over FCS-adaptor is that it can use a custom database of contaminant sequences made by the user, whereas FCS-adaptor comes with its built-in database. -### Extract Tiara Hits +### Create BTK Dataset -
+
Output files -- `tiara/` - `*.{txt,txt.gz}` - A text file containing classifications of potential contaminants. - `log_*.{txt,txt.gz}` - A log of the tiara run. - `*.{fasta,fasta.gz}` - An output fasta file. +- `create/` + `btk_datasets/` - A BlobToolKit (https://blobtoolkit.genomehubs.org) dataset folder containing data compatible with BTK viewer (https://blobtoolkit.genomehubs.org/blobtools2/blobtools2-tutorials/opening-a-dataset-in-the-viewer/). + `btk_summary_table_full.tsv` - A TSV file summarising the contents of the BlobToolKit dataset. This file is created using the `blobtools filter --table` command of BlobToolKit.
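For reference, such a table can be exported from a dataset folder like so (a minimal sketch; the pipeline may pass additional options):

```bash
# Export a tabular summary of the variables stored in the dataset folder
blobtools filter --table btk_summary_table_full.tsv btk_datasets/
```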
-Tiara ... +Creates a BlobToolKit dataset folder compatible with BlobToolKit viewer (https://blobtoolkit.genomehubs.org/blobtools2/blobtools2-tutorials/opening-a-dataset-in-the-viewer/). The BlobToolKit dataset created by ASCC can contain many more variables than the BlobToolKit pipeline (https://github.com/sanger-tol/blobtoolkit) produces on its own. -### Mito Organellar Blast +### Autofilter and Check Assembly
Output files -- `blast/` - `*.tsv` - A tsv file containing potential contaminants. +- `autofilter/` + `autofiltered.fasta` - The decontaminated input genome. The decontamination is based on the results of FCS-GX. + `ABNORMAL_CHECK.csv` - Combined FCS-GX and Tiara summary of contamination. + `assembly_filtering_removed_sequences.txt` - Sequences deemed contamination by FCS-GX (labelled with the EXCLUDE tag by FCS-GX) and removed from the above assembly. + `fcs-gx_alarm_indicator_file.txt` - Contains text to control the running of the BlobToolKit pipeline. If enough contamination is found by FCS-GX, an alarm is triggered to switch on the running of the BlobToolKit pipeline.
-A BlastN based subworkflow used on the input genome to filter potential contaminants from the genome. +Autofilter and check assembly returns a decontaminated genome file as well as summaries of the contamination found. -### Chloro Organellar Blast +### Sanger-TOL BTK
Output files -- `blast/` - `*.tsv` - A tsv file containing potential contaminants. +- `sanger-tol-btk/` + `*_btk_out/blobtoolkit/${meta.id}*/` - The BlobToolKit dataset folder generated by the sanger-tol/blobtoolkit pipeline. + `*_btk_out/blobtoolkit/plots/` - BlobToolKit plots as PNG images, exported from the BlobToolKit dataset using blobtk (https://pypi.org/project/blobtk/). + `*_btk_out/blobtoolkit/${meta.id}*/summary.json.gz` - The summary.json.gz file of the BlobToolKit dataset. It contains assembly metrics. + `*_btk_out/busco/*` - The BUSCO results returned by BlobToolKit. + `*_btk_out/multiqc/*` - The MultiQC results returned by BlobToolKit. + `blobtoolkit_pipeline_info` - The pipeline_info folder.
-A BlastN based subworkflow used on the input genome to filter potential contaminants from the genome. +Sanger-Tol/BlobToolKit (https://github.com/sanger-tol/blobtoolkit) is a Nextflow re-implementation of the Snakemake-based BlobToolKit pipeline (https://github.com/blobtoolkit/pipeline) and produces interactive plots used to identify contamination or cobionts and separate these sequences from the main assembly. -### Run FCS Adaptor +### Merge BTK Datasets
Output files -- `fcs/` - `*.fcs_adaptor_report.txt` - A text file containing potential adaptor sequences and locations. - `*.cleaned_sequences.fa.gz` - Cleaned fasta file. - `*.fcs_adaptor.log` - Log of the fcs run. - `*.pipeline_args.yaml` - Arguments to FCS Adaptor - `*.skipped_trims.jsonl` - Skipped sequences +- `merge/` + `merged_datasets` - A merged BlobToolKit dataset folder. + `merged_datasets/btk_busco_summary_table_full.tsv` - A TSV file containing a summary of the BUSCO results from the BlobToolKit run.
-FCS Adaptor Identified potential locations of retained adaptor sequences from the sequencing run. +This module merges the Create_btk_dataset folder with the Sanger-TOL BTK dataset to create one unified dataset for use with the BlobToolKit viewer. -### Run FCS-GX +### ASCC Merge Tables
Output files -- `fcs/` - `*out/*.fcs_gx_report.txt` - A text file containing potential contaminant locations. - `out/*.taxonomy.rpt` - Taxonomy report of the potential contaminants. - +- `ASCC-main-output/` +`*_contamination_check_merged_table.csv` - A CSV table that contains the results of most parts of the pipeline (GC content, coverage, Tiara, Kraken, kmers dimensionality reduction, Diamond, BLAST, FCS-GX, BlobToolKit pipeline) for each sequence in the input assembly file. +If a set of prerequisite steps has been run (nt BLAST, nr Diamond, Uniprot Diamond, read mapping for coverage calculation, Tiara, nt Kraken and the creation of a BlobToolKit dataset), the pipeline tries to put together a phylum-level combined classification of the input sequences. It first uses BlobToolKit's `bestsum_phylum`, then fills the gaps (caused by `no-hit` sequences) with results from Tiara, and then the remaining gaps are filled with results from nt Kraken. The combined classification is in the `merged_classif` column. The `merged_classif_source` column says which tool's output the classification for each sequence is based on. The automated classification usually has some flaws but is still useful as a starting point for determining the phyla that the input sequences belong to. + `*_phylum_counts_and_coverage.csv` - A CSV report containing information on the hits per phylum and the average coverage per phylum. This file can only be generated if the `merged_classif` variable has been produced in the `*_contamination_check_merged_table.csv` table, as described above.
-FCS-GX Identified potential locations of contaminant sequences. +Merge Tables merges the summary reports from a number of modules in order to create a single set of reports. -### Pacbio Barcode Check +### Pipeline Information
Output files -- `filter/` - `*_filtered.txt` - Text file of barcodes found in the genome. +- `pipeline_info/` + - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. + - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameters are used when running the pipeline. + - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`.
-Uses BlastN to identify where given barcode sequences may be in the genome. +[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. -### Run Read Coverage +## Intermediate outputs + +These files are produced by the pipeline's modules but they stay in Nextflow's work directory and are not included on their own in the final output. + +### YamlInput
Output files -- `samtools/` - `*.bam` - Aligned BAM file. - `*_average_coverage.txt` - Text file containing the coverage information for the genome +- `NA`
-Mapping the read data to the input genome and calculating the average coverage across it. +YamlInput parses the input YAML file into channels for later use in the pipeline. -### Run Vecscreen +### Validate TaxID
Output files -- `summarise/` - `*.vecscreen_contamination` - A text file containing potential vector contaminant locations. +- `NA`
-Vecscreen identifies vector contamination in the input sequence. +Validate TaxID scans through the NCBI taxdump file to ensure that the taxonomy ID (taxID) provided by the user is present in the taxdump. The taxdump originates from https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/. The taxID might be absent from the taxdump either because the user has provided an invalid taxID or because the taxdump is out of date. -### Run NT Kraken +### Filter FASTA
Output files -- `kraken2/` - `*.classified{.,_}*'` - Fastq file containing classified sequence. - `*.unclassified{.,_}*'` - Fastq file containing unclassified sequence. - `*classifiedreads.txt` - A text file containing a report on reads which have been classified. - `*report.txt` - Report of Kraken2 run. -- `get/` - `*txt` - Text file containing lineage information of the reported meta genomic data. +`*filtered.fasta` - A FASTA file that has been filtered to keep only sequences below a given length threshold.
-Kraken assigns taxonomic labels to metagenomic DNA sequences and optionally outputs the fastq of these data. +By default scaffolds above 1.9 Gb are removed from the assembly, as scaffolds of this size are unlikely to truly have contamination. There is also the issue that scaffolds larger than this use a significant amount of resources, which hinders production environments. Furthermore, [FCS-GX](https://github.com/ncbi/fcs) does not work with sequences larger than 2 Gb.
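An equivalent length filter can be sketched with seqkit (purely illustrative; the pipeline uses its own filtering module):

```bash
# Keep only sequences of at most 1.9 Gb (1,900,000,000 bp)
seqkit seq --max-len 1900000000 assembly.fasta > assembly_filtered.fasta
```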
-### Nucleotide Diamond Blast +### GC Content
Output files -- `diamond/` - `*.txt` - A text file containing the genomic locations of hits and scores. -- `reformat/` - `*text` - A Reformated text file continaing the full genomic location of hits and scores. -- `convert/` - `*.hits` - A file containing all hits above the cutoff. +`*-GC_CONTENT.txt` - A tab-separated table describing the GC content of the input genome. The first column contains the sequence names and the second column contains the GC content of each sequence. The GC content is expressed as a fraction: number of G and C nucleotides in the sequence divided by the number of all nucleotides in the sequence.
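The fraction described above can be reproduced with a short awk sketch (illustrative only; the pipeline computes this in its own module):

```bash
# Print "sequence name <TAB> GC fraction" for each FASTA record
awk '/^>/ { if (name) printf "%s\t%.4f\n", name, gc / len;
            name = substr($1, 2); gc = 0; len = 0; next }
     { seq = toupper($0); len += length(seq); gc += gsub(/[GC]/, "", seq) }
     END { if (name) printf "%s\t%.4f\n", name, gc / len }' assembly.fasta
```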
-Diamond Blast is a sequence aligner for translated and protein sequences, here it is used do identify contamination usin the NCBI db +Calculating the GC content of each sequence in the input genome. -### Uniprot Diamond Blast +### Generate Genome
Output files -- `diamond/` - `*.txt` - A text file containing the genomic locations of hits and scores. -- `reformat/` - `*text` - A Reformated text file continaing the full genomic location of hits and scores. -- `convert/` - `*.hits` - A file containing all hits above the cutoff. +`*.genome` - An index-like file describing the input genome.
-Diamond Blast is a sequence aligner for translated and protein sequences, here it is used do identify contamination usin the Uniprot db +An index-like file listing the scaffolds of the input genome and their lengths. -### Create BTK dataset +### Get kmers Profile
Output files -- `create/` - `btk_datasets/` - A btk dataset folder containing data compatible with BTK viewer. - `btk_summary_table_full.tsv` - A TSV file summarising the dataset. +`*_KMER_COUNTS.csv` - A CSV file containing the counts of kmers (by default: 7mers) in each sequence in the assembly. +`KMERS_dim_reduction_embeddings_combined.csv` - A CSV file with the results of dimensionality reduction of kmer counts. The dimensionality reduction embeddings help to separate sequences in the assembly by their origin (sequences originating from the same species likely appear close together in an embedding). When setting up a run, the user can choose multiple methods for dimensionality reduction.
-Create BTK, creates a BTK_dataset folder compatible with BTK viewer. +A CSV file containing the counts of kmers (by default: 7mers) in each sequence in the assembly. Also, a file with the results of dimensionality reduction of kmer counts. +The following dimensionality reduction methods are available: PCA (principal component analysis), kernel PCA, PCA with SVD (singular value decomposition) solver, UMAP (uniform manifold approximation and projection), t-SNE (t-distributed stochastic neighbor embedding), LLE (locally linear embedding), MDS (multidimensional scaling), SE (spectral embedding), random trees, autoencoder and NMF (non-negative matrix factorisation). +The first two dimensions of the dimensionality reduction embeddings are used as the x and y coordinates when visualising the results in BlobToolKit. -### Extract Tiara Hits +### Run FCS-GX
Output files -- `autofilter/` - `autofiltered.fasta` - The decontaminated input genome. - `ABNORMAL_CHECK.csv` - Combined FCS and Tiara summary of contamination. - `assembly_filtering_removed_sequences.txt` - Sequences deemed contamination and removed from the above assembly. - `fcs-gx_alarm_indicator_file.txt` - Contains text to control the running of Blobtoolkit. +`TIARA.txt` - A text file containing classifications of the input DNA sequences. Each sequence gets assigned one of the following labels: archaea, bacteria, prokarya, eukarya, organelle or unknown. +`log_*.txt` - A log of the Tiara run.
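A minimal Tiara invocation that produces a classification table like `TIARA.txt` above might look as follows (assumed defaults; the pipeline's exact parameters may differ):

```bash
# Classify each assembly sequence (eukarya, bacteria, archaea, organelle, unknown...)
tiara -i assembly.fasta -o TIARA.txt -t 4
```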
-Autofilter and check assembly returns a decontaminated genome file as well as summaries of the contamination found. +Tiara (https://github.com/ibe-uw/tiara) uses a neural network to classify DNA sequences. -### Generate samplesheet +### Run FCS-GX
Output files -- `generate/` - `*.csv` - A CSV file containing data locations, for use in Blobtoolkit. +- `fcs/` + `*out/*.fcs_gx_report.txt` - A text file containing potential contaminant locations. + `out/*.taxonomy.rpt` - Taxonomy report of the potential contaminants.
-This produces a CSV containing information on the read data for use in BlobToolKit. +FCS-GX (https://github.com/ncbi/fcs) is NCBI software that detects contaminants in genome assemblies using a cross-species aligner. It uses its own database, provided by NCBI. -### Sanger-TOL BTK +### Run nt Kraken
Output files -- `sanger/` - `*_btk_out/blobtoolkit/${meta.id}*/` - The BTK dataset folder generated by BTK. - `*_btk_out/blobtoolkit/plots/` - The plots for display in BTK Viewer. - `*_btk_out/blobtoolkit/${meta.id}*/summary.json.gz` - The Summary.json file... - `*_btk_out/busco/*` - The BUSCO results returned by BTK. - `*_btk_out/multiqc/*` - The MultiQC results returned by BTK. - `blobtoolkit_pipeline_info` - The pipeline_info folder. +`*.kraken2.classifiedreads.txt` - A text file containing classifications for each input DNA sequence, generated by Kraken2. +`*.kraken2.report.txt` - Summary of the Kraken2 run, generated by Kraken2. +`_nt_kraken_lineage_file.txt` - Kraken2 lineages for each input DNA sequence, reformatted as a CSV table to make it possible to merge this information into a table that contains sequence classifications from other tools, e.g. BLAST and Diamond.
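A sketch of the kind of Kraken2 run that yields the per-sequence classifications and report above (database path and thread count are illustrative):

```bash
# Classify assembly sequences against an nt-based Kraken2 database
kraken2 --db /path/to/nt_kraken2_db --threads 8 \
  --output assembly.kraken2.classifiedreads.txt \
  --report assembly.kraken2.report.txt \
  assembly.fasta
```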
-Sanger-Tol/BlobToolKit is a Nextflow re-implementation of the snakemake based BlobToolKit pipeline and produces interactive plots used to identify true contamination and seperate sequence from the main assembly. +Kraken (https://github.com/DerrickWood/kraken2) assigns taxonomic labels to input DNA sequences by comparing them to a database of kmers. ASCC uses a Kraken database made from the sequences of the NCBI nt database. The FASTA sequences of the NCBI nt database are available at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. -### Merge BTK datasets +### nr Diamond BLASTX
Output files -- `merge/` - `merged_datasets` - A BTK dataset. - `merged_datasets/btk_busco_summary_table_full.tsv` - A TSV file containing a summary of the btk busco results. +`*.txt` - A tabular text file containing the raw output of running Diamond BLASTX with sampled chunks of the assembly. The file contains BLASTX hits and scores. Format: outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames sskingdoms sphylums salltitles +`full_coords.tsv` - A tabular text file containing the results from Diamond BLASTX where the coordinates of the BLASTX hits in assembly chunks have been converted to coordinates in the full sequences of the assembly. +`*_diamond_blastx_top_hits.csv` - A file containing Diamond BLASTX top hits for each sequence in the input assembly file. +`*_diamond_outfmt6.tsv` - The `full_coords.tsv` file reformatted to make it compatible with BlobToolKit, so that the hits in it can be added to a BlobToolKit dataset. Format: outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
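A sketch of the underlying Diamond call with the first output format listed above (illustrative paths; chunking and coordinate conversion are handled by the pipeline):

```bash
# Translated search of assembly chunks against an nr Diamond database
diamond blastx --db nr.dmnd --query assembly_chunks.fasta \
  --outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend \
    sstart send evalue bitscore staxids sscinames sskingdoms sphylums salltitles \
  --out diamond_hits.txt
```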
-This module merged the Create_btk_dataset folder with the Sanger-tol BTK dataset to create one unified dataset for use with btk viewer. +Diamond (https://github.com/bbuchfink/diamond) is a sequence aligner for protein sequences and translated nucleotide sequences. Here it is used to identify contamination using the NCBI nr database. The FASTA sequences of the NCBI nr database are available at https://ftp.ncbi.nlm.nih.gov/blast/db/FASTA/. -### ASCC Merge Tables +### Uniprot Diamond BLASTX
Output files -- `ascc/` - `*_contamination_check_merged_table.csv` - .... - `*_contamination_check_merged_table_extended.csv` - .... - `*_phylum_counts_and_coverage.csv` - A CSV report containing information on the hits per phylum and the coverage of the hits.. +`*.txt` - A tabular text file containing the raw output of running Diamond BLASTX with sampled chunks of the assembly. The file contains BLASTX hits and scores. Format: outfmt 6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore staxids sscinames sskingdoms sphylums salltitles +`full_coords.tsv` - A tabular text file containing the results from Diamond BLASTX where the coordinates of the BLASTX hits in assembly chunks have been converted to coordinates in the full sequences of the assembly. +`*_diamond_blastx_top_hits.csv` - A file containing Diamond BLASTX top hits for each sequence in the input assembly file. +`*_diamond_outfmt6.tsv` - The `full_coords.tsv` file reformatted to make it compatible with BlobToolKit, so that the hits in it can be added to a BlobToolKit dataset. Format: outfmt 6 qseqid staxids bitscore qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
+Diamond (https://github.com/bbuchfink/diamond) is a sequence aligner for protein sequences and translated nucleotide sequences. Here it is used to identify contamination using the Uniprot database. -Merge Tables merged the summary reports from a number of modules inorder to create a single set of reports. - -### Pipeline information +### Generate Samplesheet
Output files -- `pipeline_info/` - - Reports generated by Nextflow: `execution_report.html`, `execution_timeline.html`, `execution_trace.txt` and `pipeline_dag.dot`/`pipeline_dag.svg`. - - Reports generated by the pipeline: `pipeline_report.html`, `pipeline_report.txt` and `software_versions.yml`. The `pipeline_report*` files will only be present if the `--email` / `--email_on_fail` parameter's are used when running the pipeline. - - Reformatted samplesheet files used as input to the pipeline: `samplesheet.valid.csv`. +- `generate/` + `*.csv` - A CSV file containing data locations for use in BlobToolKit.
-[Nextflow](https://www.nextflow.io/docs/latest/tracing.html) provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage. +This produces a CSV containing information on the read data for use in BlobToolKit.