Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Updated README.md to follow DAAD format and also updated url to Species ID MASH database now hosted in Zenodo.org
1 parent d70a025, commit 1d3ba78
Showing 5 changed files with 149 additions and 68 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,8 +9,77 @@ | |
`ECTyper` is a standalone versatile serotyping module for _Escherichia coli_. It supports both _fasta_ (assembled) and _fastq_ (raw reads) file formats. | ||
The tool provides convenient species identification coupled to a quality control module, giving a complete and transparent report on _E. coli_ serotyping that is suitable for reference laboratories. | ||
|
||
# Introduction | ||
*Escherichia coli* is a priority foodborne pathogen of public health concern and a popular model organism. Phenotypic characterization such as serotyping, toxin typing and pathotyping provides critical information for surveillance, outbreak detection and research, including source attribution, outbreak cluster assignment, pathogenicity potential and risk assessment. | ||
|
||
# Dependencies: | ||
`ECTyper` uses whole-genome sequencing (WGS) data for _E. coli_ characterization, including species identification, *in silico* serotyping covering O and H antigens, Shiga toxin typing and DEC pathotyping. It is a versatile, scalable, easy-to-use tool that provides key information on _E. coli_ from both raw and assembled inputs. | ||
|
||
As WGS becomes standard within public health and research laboratories, it is important to harness the high throughput and resolution of this technology to provide accurate and rapid typing of _E. coli_ at scale in public health, clinical and research contexts. | ||
|
||
## Citation | ||
Bessonov, Kyrylo, Chad Laing, James Robertson, Irene Yong, Kim Ziebell, Victor PJ Gannon, Anil Nichani, Gitanjali Arya, John HE Nash, and Sara Christianson. "ECTyper: in silico Escherichia coli serotype and species prediction from raw and assembled whole-genome sequence data." Microbial genomics 7, no. 12 (2021): 000728. [https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728](https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728) | ||
|
||
## Contact | ||
For any questions, issues or comments, please open a GitHub issue or reach out to [Kyrylo Bessonov]([email protected]). | ||
|
||
# Installation | ||
Multiple installation options are available depending on the user's context and needs. The most convenient option is the `conda` package, as it installs all required dependencies. | ||
|
||
### Images | ||
Docker and Singularity images are also available from [https://biocontainers.pro/tools/ectyper](https://biocontainers.pro/tools/ectyper), which can be useful for Nextflow workflows or hassle-free deployment. | ||
|
||
### Databases | ||
ECTyper uses multiple databases: | ||
- the species identification database is available from [https://zenodo.org/records/10211569](https://zenodo.org/records/10211569) | ||
- the O and H antigen allele sequences are stored in [ectyper_alleles_db.json](ectyper/Data/ectyper_alleles_db.json) | ||
- the toxin and pathotype signature marker sequences are stored in [ectyper_patho_stx_toxin_typing_database.json](ectyper/Data/ectyper_patho_stx_toxin_typing_database.json) | ||
|
||
## Option 1: As a conda package | ||
Optionally if you do not have a conda environment, get and install `miniconda` or `anaconda`: | ||
|
||
``` | ||
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh | ||
bash miniconda.sh -b -p $HOME/miniconda | ||
echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc | ||
source ~/.bashrc | ||
``` | ||
|
||
Install the latest `ectyper` conda package from the `bioconda` channel: | ||
|
||
``` | ||
conda install -c bioconda ectyper | ||
``` | ||
|
||
## Option 2: Install using pip | ||
Install with the `pip3` utility. This installs the Python dependencies but not the [non-Python dependencies](#dependencies), which must be installed separately. | ||
``` | ||
pip3 install ectyper | ||
``` | ||
## Option 3: From source code | ||
The third option is to install from source, which allows maximum control over the installation process. | ||
|
||
Install dependencies. On Ubuntu distro run | ||
``` | ||
apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk | ||
``` | ||
|
||
Install python dependencies via `pip`: | ||
``` | ||
pip3 install pandas biopython | ||
``` | ||
Clone the repository or checkout a particular release (e.g `v1.0.0`, `v2.0.0` etc.): | ||
``` | ||
git clone https://github.com/phac-nml/ecoli_serotyping.git | ||
cd ecoli_serotyping | ||
git checkout v1.0.0 #optionally checkout a specific release version | ||
``` | ||
|
||
Finally, install ectyper using either of the following commands: | ||
``` | ||
python3 setup.py install # option 1 | ||
pip3 install . # option 2 | ||
``` | ||
## Compatibility | ||
### Dependencies: | ||
- python >= 3.5 | ||
- bcftools >= 1.8 | ||
- blast == 2.7.1 | ||
|
@@ -19,58 +88,26 @@ The tool provides convenient species identification coupled to quality control m | |
- bowtie2 >= 2.3.4.1 | ||
- mash >= 2.0 | ||
|
||
# Python packages: | ||
### Python packages: | ||
- biopython >= 1.70 | ||
- pandas >= 0.23.1 | ||
- requests >= 2.0 | ||
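
Before a source install it can help to confirm that the external tools are visible on `PATH`. The following is a minimal sketch, not part of ECTyper: the executable names are assumptions derived from the package names in this README (for example, `blastn` for the BLAST+ package), and the check does not verify version numbers.

```python
import shutil

# Executable names are assumptions based on the packages listed in the README
# (for example, the BLAST+ package provides the `blastn` binary).
REQUIRED_TOOLS = ["samtools", "bowtie2", "mash", "bcftools", "blastn", "seqtk"]


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the tool names that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing executables:", ", ".join(missing))
    else:
        print("All external tools were found on PATH.")
```

An empty list means every listed executable was found; versions still need to be checked by hand against the minimums above.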
|
||
|
||
# Installation | ||
|
||
## Option 1: As a conda package | ||
1. If you do not have conda environment, get and install `miniconda` or `anaconda`: | ||
|
||
```wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh | ||
bash miniconda.sh -b -p $HOME/miniconda | ||
echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc | ||
source ~/.bashrc``` | ||
2. Install conda package from `bioconda` channel | ||
```conda install -c bioconda ectyper``` | ||
## Option 2: From the source directly | ||
Second option is to install from the source. | ||
1. Install dependencies. On Ubuntu distro run | ||
``` | ||
apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk | ||
``` | ||
1. Install python dependencies via `pip`: | ||
|
||
``` | ||
pip3 install pandas biopython | ||
``` | ||
|
||
1. Clone the repository or checkout a particular release (e.g v1.0.0, etc.): | ||
|
||
``` | ||
git clone https://github.com/phac-nml/ecoli_serotyping.git | ||
git checkout v1.0.0 #optionally checkout release version | ||
``` | ||
|
||
1. Install ectyper: `python3 setup.py install` | ||
|
||
# Basic Usage | ||
# Getting started | ||
## Basic Usage | ||
1. Put the fasta/fastq files for serotyping analyses in one folder (concatenate paired raw reads files if you would like them to be considered a single entity) | ||
1. `ectyper -i [file path] -o [output_dir]` | ||
1. View the results on the console or with `cat [output_dir]/output.tsv` | ||
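
The concatenation of paired raw read files mentioned in step 1 could be sketched as follows. This is an illustrative helper, not part of ECTyper; the function name and the file names in the usage note are hypothetical.

```python
import shutil
from pathlib import Path


def concatenate_reads(read_files, combined_path):
    """Append each input file to combined_path, in the order given."""
    combined_path = Path(combined_path)
    with combined_path.open("wb") as combined:
        for read_file in read_files:
            # Copy in binary mode so the file content is passed through unchanged.
            with Path(read_file).open("rb") as handle:
                shutil.copyfileobj(handle, combined)
    return combined_path
```

For example, `concatenate_reads(["sample_R1.fastq", "sample_R2.fastq"], "sample.fastq")` would write one combined file that can then be passed to `ectyper -i`.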
|
||
# Example Usage | ||
* `ectyper -i ecoliA.fasta` for a single file | ||
* `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir` | ||
* `ectyper -i ecoliA.fasta,ecoliB.fastq,ecoliC.fna` for multiple files (comma-delimited) | ||
* `ectyper -i ecoli_folder` for a folder (all files in the folder will be checked by the tool) | ||
## Example Input Scenarios | ||
* `ectyper -i ecoliA.fasta` for a single file (the output folder will be named using `ectyper_<date>_<time>` pattern) | ||
* `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir` folder | ||
* `ectyper -i ecoliA.fasta ecoliB.fastq ecoli_folder/` for multiple files and a directory, separated by spaces | ||
* `ectyper -i ecoliA.fasta ecoliB.fastq,ecoliC.fna` for multiple files, separated by spaces or commas | ||
* `ectyper -i ecoli_folder` to scan for input files in a folder and its subdirectories (all files found will be checked by the tool) | ||
* `ectyper -i ecoli_folder/*.fasta` to scan for FASTA input files in a folder only (paths containing the `*` wildcard do not search subdirectories) | ||
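
When ECTyper is launched from a Python script rather than a shell, the input scenarios above can be assembled into an argument list. The sketch below uses only the `-i` and `-o` options shown in this README; the helper name is hypothetical. Wildcard patterns are expanded explicitly because no shell is involved to do it.

```python
import glob


def build_ectyper_command(inputs, output_dir=None):
    """Build an ectyper argument list from files, folders and wildcard patterns."""
    paths = []
    for item in inputs:
        if "*" in item:
            # Expand the pattern here, as a shell would before calling ectyper.
            paths.extend(sorted(glob.glob(item)))
        else:
            paths.append(item)
    command = ["ectyper", "-i", *paths]
    if output_dir is not None:
        command += ["-o", output_dir]
    return command


# To run it (requires an installed ectyper):
#   import subprocess
#   subprocess.run(build_ectyper_command(["ecoliA.fasta"], "output_dir"), check=True)
```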
|
||
# Advanced Usage | ||
## Advanced Usage | ||
``` | ||
usage: ectyper [-h] [-V] -i INPUT [-c CORES] [-opid PERCENTIDENTITYOTYPE] | ||
[-hpid PERCENTIDENTITYHTYPE] [-oplen PERCENTLENGTHOTYPE] | ||
|
@@ -112,7 +149,8 @@ optional arguments: | |
Data/ectyper_database.json for more information | ||
``` | ||
|
||
# Fine-tunning parameters | ||
|
||
## Configuration and fine-tuning parameters | ||
`ECTyper` requires only minimal options to run (`-i` and `-o`) but allows for extensive configuration to accommodate a wide variety of typing scenarios. | ||
|
||
| Parameter| Explanation | Usage scenario | | ||
|
@@ -125,8 +163,23 @@ optional arguments: | |
| `-r` | Specify a custom MASH sketch of reference genomes that will be used for species inference | User has a new assembled genome that is not available in the NCBI RefSeq database. Make sure to add metadata to `assembly_summary_refseq.txt` and provide a custom accession number that starts with the `GCF_` prefix| | ||
|`--dbpath`| Provide custom appended database of O and H antigen reference alleles in JSON format following structure and field names as default database `ectyper_alleles_db.json` | User wants to add new alleles to the alleles database to improve typing performance | | ||
|
||
# Data Input | ||
Both raw reads and assembled genomes are accepted, in FASTQ and FASTA formats respectively, from any sequencing platform. The tool was designed for single-sample inputs, but has been shown to work on multi-taxa metagenomic raw-read FASTQ inputs. | ||
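
As a quick pre-flight check, the two formats can be told apart by the first character of the first record: FASTA headers start with `>` and FASTQ headers start with `@`. The sketch below is an illustration of that convention for uncompressed text files and is not how ECTyper itself detects file types; the function name is hypothetical.

```python
def guess_sequence_format(path):
    """Guess 'fasta' or 'fastq' from the first non-empty line of a text file."""
    with open(path, "r") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip leading blank lines
            if line.startswith(">"):
                return "fasta"
            if line.startswith("@"):
                return "fastq"
            return "unknown"
    return "unknown"  # empty file
```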
|
||
# Quality Control (QC) module | ||
# Data Output | ||
The output of the tool is stored in text files, with the main report stored in the tab-delimited `output.tsv` file. | ||
|
||
The BLASTN hits of the O and H antigen database are stored in `blastn_output_alleles.txt` tab-delimited file. | ||
|
||
The log messages are stored in the `ectyper.log` text file. | ||
``` | ||
{out folder name} | ||
├── blastn_output_alleles.txt | ||
├── ectyper.log | ||
└── output.tsv | ||
``` | ||
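
A downstream script might first confirm that the three files shown above exist and then load the main report. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and it assumes that `output.tsv` starts with a header row naming its columns.

```python
import csv
from pathlib import Path

# File names taken from the output folder layout shown above.
EXPECTED_FILES = ("blastn_output_alleles.txt", "ectyper.log", "output.tsv")


def missing_output_files(output_dir):
    """Return the expected output file names that are absent from output_dir."""
    folder = Path(output_dir)
    return [name for name in EXPECTED_FILES if not (folder / name).is_file()]


def read_report(output_dir):
    """Load output.tsv as a list of dicts keyed by the header row (assumed present)."""
    with (Path(output_dir) / "output.tsv").open(newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))
```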
|
||
## Quality Control (QC) module | ||
To make the results and typing metrics easier to interpret, the following QC codes were developed. | ||
These codes make it possible to quickly filter "reportable" and "non-reportable" samples. The QC module is tightly linked to the ECTyper allele database, specifically to the `MinPident` and `MinPcov` fields. | ||
For each reference allele, minimum `%identity` and `%coverage` values were determined as a function of potential "cross-talk" between antigens (i.e. multiple potential antigen calls at a given setting). | ||
|
@@ -144,7 +197,7 @@ The QC module covers the following serotyping scenarios. More scenarios might be | |
|WARNING (H NON-REPORT)|H antigen alleles do not meet min %id or %cov thresholds| | ||
|WARNING (O and H NON-REPORT)| Both O and H antigen alleles do not meet min %identity or %coverage thresholds| | ||
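
The QC codes lend themselves to a simple filter over report rows. The sketch below rests on two assumptions that are not confirmed here: that the QC code lives in a column named `QC`, and that only codes beginning with `PASS` should count as reportable. Both function names are hypothetical.

```python
def is_reportable(qc_code):
    """Treat only QC codes that begin with PASS as reportable (an assumption)."""
    return qc_code.strip().upper().startswith("PASS")


def split_by_qc(rows, qc_column="QC"):
    """Split report rows (dicts) into reportable and flagged lists."""
    reportable, flagged = [], []
    for row in rows:
        target = reportable if is_reportable(row[qc_column]) else flagged
        target.append(row)
    return reportable, flagged
```

Rows are expected as dictionaries, for example as produced by `csv.DictReader` with a tab delimiter.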
|
||
# Report format | ||
## Report format | ||
`ECTyper` capitalizes on a concise minimum output coupled to easy results interpretation and reporting. `ECTyper v1.0` serotyping results are available in a tab-delimited `output.tsv` file consisting of the 16 columns listed below: | ||
|
||
1. **Name**: Sample name (usually a unique identifier) | ||
|
@@ -173,6 +226,24 @@ Selected columns from the `ECTyper` typical report are shown below. | |
EC20151709|Escherichia coli|O157:H43|Based on 3 allele(s)|PASS (REPORTABLE)|wzx:1;wzy:0.999;fliC:1|O157-5-wzx-origin;O157-9-wzy-origin;H43-1-fliC-origin;|100;99.916;99.934; | 100;100;100; | contig00002;contig00002;contig00003; | 62558-63949;64651-65835;59962-61467; | 1392;1185;1506; |v1.0 (2020-05-07) | - | | ||
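
Several fields in the sample row above pack one value per gene into a semicolon-delimited string. A minimal sketch for unpacking them, using the values from that row, might look like this; the function names are hypothetical.

```python
def parse_gene_scores(field):
    """Turn 'wzx:1;wzy:0.999;fliC:1' into a gene-to-score dictionary."""
    scores = {}
    for item in field.split(";"):
        item = item.strip()
        if item:
            gene, _, score = item.partition(":")
            scores[gene] = float(score)
    return scores


def parse_number_list(field):
    """Turn '100;99.916;99.934;' into a list of floats, ignoring empty items."""
    return [float(item) for item in field.split(";") if item.strip()]
```

The trailing semicolon and surrounding spaces seen in the sample row are tolerated by both helpers.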
|
||
|
||
FAQs | ||
|
||
## FAQ | ||
|
||
**Can ECTyper be run on multiple samples in a directory?** | ||
|
||
ECTyper provides flexible ways to specify inputs stored in different locations. One can provide paths to several directories separated by spaces. In addition, one can specify the file type to look for in a given directory or directories. Note that a path containing the star `*` symbol only matches files in the specified directory and does not look in subdirectories. For example, | ||
|
||
- Process all files in `folder1` and `folder2` directories and file `sample.fasta` located in `folder3` | ||
|
||
`ectyper -i folder1/ folder2/ folder3/sample.fasta -o ectyper_results` | ||
- Process all fasta files in `folder1` and all fastq files in `folder2`. All sub-directories in those 2 folders will be ignored. To process those sub-folders either specify path to them or provide paths to directories without the `*` wildcard symbol. | ||
|
||
`ectyper -i folder1/*.fasta folder2/*.fastq` | ||
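
The difference between a wildcard path and a plain directory path can be illustrated with Python's `pathlib`. This sketch only mirrors the behaviour described above (a pattern matches files in the named directory, a directory is searched together with its sub-directories); it is not ECTyper's own file discovery code, and the function names are hypothetical.

```python
from pathlib import Path


def files_matching_pattern(folder, pattern="*.fasta"):
    """Files directly inside folder that match pattern; sub-directories are ignored."""
    return sorted(p for p in Path(folder).glob(pattern) if p.is_file())


def files_in_tree(folder):
    """All files inside folder and every sub-directory below it."""
    return sorted(p for p in Path(folder).rglob("*") if p.is_file())
```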
|
||
**Why does ECTyper sometimes report O-antigen results separated by a forward slash `/`?** | ||
|
||
Some O-antigens display a very high degree of homology and are very hard to discern even using wet-lab agglutination assays. Even using both the `wzx` and `wzy` genes, it is not possible to reliably resolve those O-antigens. The 16 high similarity groups were identified by [Joensen, Katrine G., et al.](https://journals.asm.org/doi/full/10.1128/jcm.00008-15). Thus, if a given O-antigen is a member of any of those high similarity groups, all potential O-antigens are reported separated by `/`, such as group 9, reported as `O17/O44/O73/O77/O106`. | ||
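
Downstream code that consumes the serotype string may want the candidate O-antigens as a list. A minimal sketch, assuming the `O:H` layout seen in the sample report (for example `O157:H43`), follows; the function name and the `H18` value in the usage note are hypothetical.

```python
def split_serotype(serotype):
    """Split 'O17/O44:H18' into (['O17', 'O44'], 'H18')."""
    o_part, _, h_part = serotype.partition(":")
    return o_part.split("/"), h_part
```

For instance, `split_serotype("O17/O44/O73/O77/O106:H18")` yields the five candidate O-antigens and `H18`, while an unambiguous call such as `O157:H43` yields a single-element list.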
|
||
|
||
# Availability | ||
|
@@ -188,3 +259,14 @@ EC20151709|Escherichia coli|O157:H43|Based on 3 allele(s)|PASS (REPORTABLE)|wzx: | |
|[Galaxy Europe](https://usegalaxy.eu/root?tool_id=ectyper)| Galaxy public server to execute your analysis from anywhere|Web-based| | ||
|[IRIDA plugin](https://github.com/phac-nml/irida-plugin-ectyper)| IRIDA instances can easily install ECTyper as an additional pipeline|Web-based| | ||
|
||
# Legal and Compliance Information | ||
|
||
Copyright Government of Canada 2024 | ||
|
||
Written by: National Microbiology Laboratory, Public Health Agency of Canada | ||
|
||
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at: | ||
|
||
[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0) | ||
|
||
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License. |