diff --git a/.github/workflows/github-actions.yaml b/.github/workflows/github-actions.yaml new file mode 100644 index 0000000..e0b6590 --- /dev/null +++ b/.github/workflows/github-actions.yaml @@ -0,0 +1,41 @@ +# This workflow will install Python dependencies, run tests and lint with a single version of Python +# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python + +name: Python application + +on: + push: + branches: [ "master", "v2.0.0" ] + pull_request: + branches: [ "master", "v2.0.0" ] + +permissions: + contents: read + +jobs: + build: + + runs-on: ubuntu-22.04 + + steps: + - uses: actions/checkout@v4 + - name: Set up Python 3.12 + uses: actions/setup-python@v4 + with: + python-version: "3.12" + - name: Install dependencies + run: | + sudo apt-get update + sudo apt-get install samtools bowtie2 mash bcftools ncbi-blast+ seqtk libcurl4-openssl-dev libssl-dev ca-certificates -y + sudo apt-get install python3-pip python3-dev python3-pandas python3-requests python3-biopython -y + python3 -m pip install --upgrade pip setuptools + pip3 install pytest + if [ -f requirements.txt ]; then + pip3 install -r requirements.txt; + else + pip3 install -e . + fi + ectyper_init + - name: Test with pytest + run: | + pytest -o log_cli=true --basetemp=tmp-pytest diff --git a/CHANGELOG.md b/CHANGELOG.md index e6b8ca8..c332bd1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -38,4 +38,15 @@ specified and MASH distance to RefSeq genomes fails * Now species verification via MASH distance to RefSeq genomes and E.coli specific alleles is done only if `--verify` parameter is specified. 
* If `--verify` is not specified, all input genomes are treated as E.coli without doing any species verification + +**v2.0.0** +* Updated species identification module, now based on GTDB plus a custom Escherichia and Shigella sketch, covering all known bacterial species +* Implemented pathotyping covering 7 DEC *Escherichia coli* pathotypes (`DAEC`, `EAEC`, `EHEC`, `EIEC`, `EPEC`, `ETEC` and `STEC`) supporting simultaneous presence of multiple signatures (e.g. `ETEC/STEC`). Note that `EHEC` is reported as `EHEC-STEC` as this is a more severe subtype of `STEC`. +* Implemented Shiga toxin 1 and 2 typing supporting multiple toxin signatures present in a single sample. + * A total of 4 *stx1* subtypes are supported: `stx1a`, `stx1c`, `stx1d` and `stx1e`. + * A total of 15 *stx2* subtypes are supported: `stx2a`, `stx2b`, `stx2c`, `stx2d`, `stx2e`, `stx2f`, `stx2g`, `stx2h`, `stx2i`, `stx2j`, `stx2k`, `stx2l`, `stx2m`, `stx2n`, `stx2o`. +* New database of pathotypes and toxins in a clear, transparent JSON format, composed of key virulence factors drawn from both BioNumerics and literature sources +* Support for gzip-compressed inputs (`fastq.gz` and `fasta.gz`), saving storage and increasing versatility +* Additional toxin typing covering enterohemolysin A (`ehxA`), hemolysin E (`hlyE`) and hemolysin A (`hlyA`) + \ No newline at end of file diff --git a/Dockerfile b/Dockerfile new file mode 100644 index 0000000..f0aa42e --- /dev/null +++ b/Dockerfile @@ -0,0 +1,13 @@ +FROM ubuntu:22.04 +ENV DEBIAN_FRONTEND="noninteractive" TZ="America/New_York" +RUN apt update && apt install git python3-pip -y +RUN apt install libcurl4-openssl-dev libssl-dev -y +RUN pip3 install Cython numpy +RUN apt install mash ncbi-blast+ bowtie2 seqtk samtools bcftools -y +RUN git clone https://github.com/phac-nml/ecoli_serotyping.git +# install the tool and initialize its species ID MASH database +RUN cd ecoli_serotyping && git checkout v2.0.0 && pip3 install . 
+RUN ectyper_init + +#build image: docker build --tag ectyper:2.0.0 . +#type a sample: docker run -it --rm -v $PWD:/mnt ectyper:2.0.0 ectyper -i /mnt/assembly.fasta -o /mnt/temp/ --pathotype \ No newline at end of file diff --git a/README.md b/README.md index dc8d3f0..75c9bb7 100644 --- a/README.md +++ b/README.md @@ -9,10 +9,81 @@ # ECTyper (an easy typer) `ECTyper` is a standalone versatile serotyping module for _Escherichia coli_. It supports both _fasta_ (assembled) and _fastq_ (raw reads) file formats. -The tool provides convenient species identification coupled to quality control module giving a complete, transparent and reference laboratories suitable report on E.coli serotyping. +The tool provides convenient species identification coupled with a quality control module, giving a complete, transparent report on *E. coli* serotyping, Shiga toxin typing and pathotyping that is suitable for reference laboratories. +# Introduction +*Escherichia coli* is a priority foodborne pathogen of public health concern and a popular model organism. Phenotypic characterization such as serotyping, toxin typing and pathotyping provides critical information for surveillance, outbreak detection and research, including source attribution, outbreak cluster assignment, pathogenicity potential and risk assessment. -# Dependencies: + +`ECTyper` uses whole-genome sequencing (WGS) data for *E. coli* characterization, including species identification, *in silico* serotyping covering O and H antigens, Shiga toxin typing and DEC pathotyping. It is a versatile, scalable, easy-to-use tool that provides key information on *E. coli* from both raw and assembled inputs. + +As WGS becomes standard within public health and research laboratories, it is important to harness the high throughput and resolution of this technology to provide accurate and rapid typing of *E. coli* at scale in public health, clinical and research contexts. 
+ +## Citation +If you find `ECTyper` useful, please cite the following paper: + +> Bessonov, Kyrylo, Chad Laing, James Robertson, Irene Yong, Kim Ziebell, Victor PJ Gannon, Anil Nichani, Gitanjali Arya, John HE Nash, and Sara Christianson. **"ECTyper: in silico Escherichia coli serotype and species prediction from raw and assembled whole-genome sequence data."** Microbial genomics 7, no. 12 (2021): 000728. [https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728](https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728) + +## Contact +For any questions, issues or comments, please open a GitHub issue or reach out to [Kyrylo Bessonov](mailto:kyrylo.bessonov@phac-aspc.gc.ca). + +# Installation +Multiple installation options are available depending on the user's context and needs. The most convenient option is the `conda` package, as it installs all required dependencies. + +### Docker and Singularity images availability +Docker and Singularity images are also available from [https://biocontainers.pro/tools/ectyper](https://biocontainers.pro/tools/ectyper), which can be useful for Nextflow pipelines or hassle-free deployment. + +### Databases +ECTyper uses multiple databases: + - the species identification database is available from the [Zenodo](https://doi.org/10.5281/zenodo.10211568) repository + - the O and H antigen allele sequences are stored in [ectyper_alleles_db.json](ectyper/Data/ectyper_alleles_db.json) + - the toxin and pathotype signature marker sequences are stored in [ectyper_patho_stx_toxin_typing_database.json](ectyper/Data/ectyper_patho_stx_toxin_typing_database.json) + +## Option 1: As a conda package +If you do not already have a conda environment, get and install `miniconda` or `anaconda`: + + ``` + wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh + bash miniconda.sh -b -p $HOME/miniconda + echo ". 
$HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc + source ~/.bashrc + ``` + +Install the latest `ectyper` conda package from the `bioconda` channel: + + ``` + conda install -c bioconda ectyper + ``` + +## Option 2: Install using pip +Install using the `pip3` utility, which installs the Python dependencies but not the [non-Python dependencies](#dependencies): + ``` + pip3 install ectyper + ``` +## Option 3: From source code +The third option is to install from source, which allows maximum control over the installation process. + +Install the non-Python dependencies. On Ubuntu, run: + ``` + apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk + ``` + +Install the Python dependencies via `pip`: + ``` + pip3 install pandas biopython + ``` +Clone the repository and optionally check out a particular release (e.g. `v1.0.0`, `v2.0.0`): + ``` + git clone https://github.com/phac-nml/ecoli_serotyping.git && cd ecoli_serotyping + git checkout v2.0.0 # optionally check out a specific release version + ``` + +Finally, install ectyper from source using one of the following commands: +``` +python3 setup.py install # option 1 +pip3 install . # option 2 +``` +## Compatibility +### Dependencies: - python >= 3.5 - bcftools >= 1.8 - blast == 2.7.1 @@ -21,72 +92,46 @@ The tool provides convenient species identification coupled to quality control m - bowtie2 >= 2.3.4.1 - mash >= 2.0 -# Python packages: +### Python packages: - biopython >= 1.70 - pandas >= 0.23.1 - requests >= 2.0 - -# Installation - -## Option 1: As a conda package -1. If you do not have conda environment, get and install `miniconda` or `anaconda`: - - ```wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh - bash miniconda.sh -b -p $HOME/miniconda - echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc - source ~/.bashrc``` - -2. Install conda package from `bioconda` channel - ```conda install -c bioconda ectyper``` - -## Option 2: From the source directly -Second option is to install from the source. -1. Install dependencies. 
On Ubuntu distro run -``` -apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk -``` -1. Install python dependencies via `pip`: - -``` -pip3 install pandas biopython -``` - -1. Clone the repository or checkout a particular release (e.g v1.0.0, etc.): - -``` -git clone https://github.com/phac-nml/ecoli_serotyping.git -git checkout v1.0.0 #optionally checkout release version -``` - -1. Install ectyper: `python3 setup.py install` - -# Basic Usage +# Getting started +## Basic Usage 1. Put the fasta/fastq files for serotyping analyses in one folder (concatenate paired raw reads files if you would like them to be considered a single entity) 1. `ectyper -i [file path] -o [output_dir]` 1. View the results on the console or in `cat [output folder]/output.csv` -# Example Usage -* `ectyper -i ecoliA.fasta` for a single file -* `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir` -* `ectyper -i ecoliA.fasta,ecoliB.fastq,ecoliC.fna` for multiple files (comma-delimited) -* `ectyper -i ecoli_folder` for a folder (all files in the folder will be checked by the tool) +## Example Input Scenarios +* `ectyper -i ecoliA.fasta` for a single file (the output folder will be named using `ectyper__