Updated README.md to follow DAAD format and also updated url to Species ID MASH database now hosted in Zenodo.org
phac-nml · Sep 12, 2024 · 1d3ba78 · 1d3ba78
1 parent d70a025
commit 1d3ba78

Showing 5 changed files with 149 additions and 68 deletions.
diff --git a/README.md b/README.md
@@ -9,8 +9,77 @@
 `ECTyper` is a standalone versatile serotyping module for _Escherichia coli_. It supports both _fasta_ (assembled) and _fastq_ (raw reads) file formats.
 The tool provides convenient species identification coupled to quality control module giving a complete, transparent and reference laboratories suitable report on E.coli serotyping.
 
+# Introduction
+*Escherichia coli* is a priority foodborne pathogen of public health concern and a popular model organism. Phenotypic characterization such as serotyping, toxin typing and pathotyping provides critical information for surveillance, outbreak detection and research, including source attribution, outbreak cluster assignment, pathogenicity potential and risk assessment.
 
-# Dependencies:
+`ECTyper` uses whole-genome sequencing (WGS) data for *E. coli* characterization, including species identification, *in silico* serotyping covering O and H antigens, Shiga toxin typing and DEC pathotyping. It is a versatile, scalable, easy-to-use tool that extracts key information on *E. coli* from both raw and assembled inputs.
+
+As WGS becomes standard within public health and research laboratories, it is important to harness the high throughput and resolution of this technology to provide accurate and rapid typing of *E. coli* at scale in public health, clinical and research contexts.
+
+## Citation
+Bessonov, Kyrylo, Chad Laing, James Robertson, Irene Yong, Kim Ziebell, Victor PJ Gannon, Anil Nichani, Gitanjali Arya, John HE Nash, and Sara Christianson. "ECTyper: in silico Escherichia coli serotype and species prediction from raw and assembled whole-genome sequence data." Microbial genomics 7, no. 12 (2021): 000728. [https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728](https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000728)
+
+## Contact
+For any questions, issues or comments please open a GitHub issue or reach out to [Kyrylo Bessonov]([email protected]).
+
+# Installation
+Multiple installation options are available depending on the user context and needs. The most convenient option is the `conda` package, as it installs all required dependencies.
+
+### Images
+Docker and Singularity images are also available from [https://biocontainers.pro/tools/ectyper](https://biocontainers.pro/tools/ectyper), which can be useful for Nextflow workflows or hassle-free deployment.
+
+### Databases
+ECTyper uses multiple databases 
+  - the species identification database is available from [https://zenodo.org/records/10211569](https://zenodo.org/records/10211569)
+  - the O and H antigen allele sequences are stored in [ectyper_alleles_db.json](ectyper/Data/ectyper_alleles_db.json)
+  - the toxin and pathotype signature marker sequences are stored in [ectyper_patho_stx_toxin_typing_database.json](ectyper/Data/ectyper_patho_stx_toxin_typing_database.json)
+
+## Option 1: As a conda package
+If you do not already have conda, first download and install `miniconda` or `anaconda`:
+
+  ```
+  wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
+  bash miniconda.sh -b -p $HOME/miniconda
+  echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
+  source ~/.bashrc
+  ```
+
+Install the latest `ectyper` conda package from the `bioconda` channel:
+
+  ```
+  conda install -c bioconda ectyper
+  ```
+
+## Option 2: Install using pip
+Install with the `pip3` utility. This installs `ectyper` and its Python dependencies but not the [non-Python dependencies](#dependencies), which must be installed separately.
+  ```
+  pip3 install ectyper
+  ```
+## Option 3: From source code
+The third option is to install from source, which gives maximum control over the installation process.
+
+Install the non-Python dependencies. On Ubuntu, run:
+  ```
+  apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk
+  ```
+
+Install python dependencies via `pip`:
+  ```
+  pip3 install pandas biopython
+  ```
+Clone the repository and, optionally, check out a particular release (e.g. `v1.0.0`, `v2.0.0`):
+  ```
+  git clone https://github.com/phac-nml/ecoli_serotyping.git
+  cd ecoli_serotyping
+  git checkout v1.0.0 # optional: check out a specific release version
+  ```
+
+Finally, install ectyper from the repository root using one of the following commands:
+```
+python3 setup.py install # option 1
+pip3 install .   # option 2
+```
+## Compatibility
+### Dependencies:
 - python >= 3.5
 - bcftools >= 1.8
 - blast == 2.7.1
@@ -19,58 +88,26 @@ The tool provides convenient species identification coupled to quality control m
 - bowtie2 >= 2.3.4.1
 - mash >= 2.0
 
-# Python packages:
+### Python packages:
 - biopython >= 1.70
 - pandas >= 0.23.1
 - requests >= 2.0
 
-
-# Installation
-
-## Option 1: As a conda package
-1. If you do not have conda environment, get and install `miniconda` or `anaconda`:
-
-    ```wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh
-    bash miniconda.sh -b -p $HOME/miniconda
-    echo ". $HOME/miniconda/etc/profile.d/conda.sh" >> ~/.bashrc
-    source ~/.bashrc```
-    
-2. Install conda package from `bioconda` channel 
-	```conda install -c bioconda ectyper```
-
-## Option 2: From the source directly
-Second option is to install from the source.
-1. Install dependencies. On Ubuntu distro run
-```
-apt install samtools bowtie2 mash bcftools ncbi-blast+ seqtk
-```
-1. Install python dependencies via `pip`:
-
-```
-pip3 install pandas biopython
-```
-
-1. Clone the repository or checkout a particular release (e.g v1.0.0, etc.):
-
-```
-git clone https://github.com/phac-nml/ecoli_serotyping.git
-git checkout v1.0.0 #optionally checkout release version
-```
-
-1. Install ectyper: `python3 setup.py install`
-
-# Basic Usage
+# Getting started
+## Basic Usage
 1. Put the fasta/fastq files for serotyping analyses in one folder (concatenate paired raw reads files if you would like them to be considered a single entity)
 1. `ectyper -i [file path] -o [output_dir]`
 1. View the results on the console or with `cat [output folder]/output.tsv`
 
-# Example Usage
-* `ectyper -i ecoliA.fasta`  for a single file
-* `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir`
-* `ectyper -i ecoliA.fasta,ecoliB.fastq,ecoliC.fna`	for multiple files  (comma-delimited)
-* `ectyper -i ecoli_folder`	for a folder (all files in the folder will be checked by the tool)
+## Example Input Scenarios
+* `ectyper -i ecoliA.fasta`  for a single file (the output folder will be named using `ectyper_<date>_<time>` pattern)
+* `ectyper -i ecoliA.fasta -o output_dir` for a single file, results stored in `output_dir` folder
+* `ectyper -i ecoliA.fasta ecoliB.fastq ecoli_folder/` for multiple files and a directory separated by spaces
+* `ectyper -i ecoliA.fasta ecoliB.fastq,ecoliC.fna` for multiple files separated by spaces or commas
+* `ectyper -i ecoli_folder` to scan for input files in a folder and subdirectories (all files in the folder will be checked by the tool)
+* `ectyper -i ecoli_folder/*.fasta` to scan for FASTA input files in a folder only (subdirectories are not searched when the path contains the `*` wildcard)
 
-# Advanced Usage
+## Advanced Usage
 ```
 usage: ectyper [-h] [-V] -i INPUT [-c CORES] [-opid PERCENTIDENTITYOTYPE]
                [-hpid PERCENTIDENTITYHTYPE] [-oplen PERCENTLENGTHOTYPE]
@@ -112,7 +149,8 @@ optional arguments:
                         Data/ectyper_database.json for more information
 ```
 
-# Fine-tunning parameters
+
+## Configuration and fine-tuning parameters
 `ECTyper` requires minimum options to run (`-i` and `-o`) but allows for extensive configuration to accommodate a wide variety of typing scenarios
 
 | Parameter|      Explanation                                                 | Usage scenario                                                                    |
@@ -125,8 +163,23 @@ optional arguments:
 | `-r`     |  Specify custom MASH sketch of reference genomes that will be used for species inference | User has a new assembled genome that is not available in NCBI RefSeq database. Make sure to add metadata to `assembly_summary_refseq.txt` and provide custom accession number that start with `GCF_` prefix|
 |`--dbpath`|  Provide custom appended database of O and H antigen reference alleles in JSON format following structure and field names as default database `ectyper_alleles_db.json` | User wants to add new alleles to the alleles database to improve typing performance |
 
+# Data Input
+Both raw reads (FASTQ) and assembled genomes (FASTA) are accepted from any sequencing platform. The tool was designed for single-sample inputs, but has been shown to work on multi-taxa metagenomic raw-read FASTQ inputs.
 
-# Quality Control (QC) module
+# Data Output
+The tool writes its output to text files; the main report is the tab-delimited `output.tsv` file.
+
+The BLASTN hits against the O and H antigen database are stored in the tab-delimited `blastn_output_alleles.txt` file.
+
+Log messages are stored in the `ectyper.log` text file:
+```
+{out folder name}
+├── blastn_output_alleles.txt
+├── ectyper.log
+└── output.tsv
+```
+
+## Quality Control (QC) module
 To provide an easier interpretation of the results and typing metrics, following QC codes were developed. 
 These codes allow to quickly filter "reportable" and "non-reportable" samples. The QC module is tightly linked to ECTyper allele database, specifically, `MinPident` and `MinPcov` fields.
 For each reference allele minimum `%identity` and `%coverage` values were determined as a function of potential "cross-talk" between antigens (i.e. multiple potential antigen calls at a given setting).
@@ -144,7 +197,7 @@ The QC module covers the following serotyping scenarios. More scenarios might be
 |WARNING (H NON-REPORT)|H antigen alleles do not meet min %id or %cov thresholds|
 |WARNING (O and H NON-REPORT)| Both O and H antigen alleles do not meet min %identity or %coverage thresholds|
 
-# Report format
+## Report format
 `ECTyper` capitalizes on a concise minimum output coupled to easy results interpretation and reporting. `ECTyper v1.0` serotyping results are available in a tab-delimited `output.tsv` file consisting of the 16 columns listed below:
 
 1. **Name**: Sample name (usually a unique identifier) 
@@ -173,6 +226,24 @@ Selected columns from the `ECTyper` typical report are shown below.
 EC20151709|Escherichia coli|O157:H43|Based on 3 allele(s)|PASS (REPORTABLE)|wzx:1;wzy:0.999;fliC:1|O157-5-wzx-origin;O157-9-wzy-origin;H43-1-fliC-origin;|100;99.916;99.934;   |   100;100;100;  |  contig00002;contig00002;contig00003; |   62558-63949;64651-65835;59962-61467;   | 1392;1185;1506; |v1.0 (2020-05-07)  |     - |
 
 
+## FAQ
+
+**Can ECTyper be run on multiple samples in a directory?**
+
+Yes. ECTyper provides flexible ways to specify inputs stored in different locations. One can provide paths to several directories separated by spaces. In addition, one can specify the file type to look for in a given directory or directories. Note that a path containing the star `*` symbol only matches files in the specified directory and does not look in subdirectories. For example,
+
+- Process all files in `folder1` and `folder2` directories and file `sample.fasta` located in `folder3` 
+
+    `ectyper -i folder1/ folder2/ folder3/sample.fasta -o ectyper_results` 
+- Process all fasta files in `folder1` and all fastq files in `folder2`. All sub-directories in those 2 folders will be ignored. To process those sub-folders either specify path to them or provide paths to directories without the `*` wildcard symbol. 
+
+  `ectyper -i folder1/*.fasta folder2/*.fastq` 
+
+**Why does ECTyper sometimes report O-antigen results separated by a forward slash `/`?**
+
+Some O-antigens display a very high degree of homology and are hard to distinguish even with wet-lab agglutination assays. Even using both the `wzx` and `wzy` genes, these O-antigens cannot be reliably resolved. The 16 high-similarity groups were identified by [Joensen, Katrine G., et al.](https://journals.asm.org/doi/full/10.1128/jcm.00008-15). Thus, if a given O-antigen belongs to one of these groups, all potential O-antigens are reported separated by `/`, for example group 9 is reported as `O17/O44/O73/O77/O106`.
 
 
 # Availability
@@ -188,3 +259,14 @@ EC20151709|Escherichia coli|O157:H43|Based on 3 allele(s)|PASS (REPORTABLE)|wzx:
 |[Galaxy Europe](https://usegalaxy.eu/root?tool_id=ectyper)| Galaxy public server to execute your analysis from anywhere|Web-based| 
 |[IRIDA plugin](https://github.com/phac-nml/irida-plugin-ectyper)| IRIDA instances could easily install additional pipeline|Web-based|
 
+# Legal and Compliance Information
+
+Copyright Government of Canada 2024
+
+Written by: National Microbiology Laboratory, Public Health Agency of Canada
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this work except in compliance with the License. You may obtain a copy of the License at:
+
+[http://www.apache.org/licenses/LICENSE-2.0](http://www.apache.org/licenses/LICENSE-2.0)
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
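
The README changes above describe a tab-delimited `output.tsv` report and QC flags such as `PASS (REPORTABLE)`. As a minimal sketch of how such a report could be filtered for reportable samples, the snippet below parses an in-memory table with the standard `csv` module. The `Name` column and the QC flag values come from the README; the `Serotype` and `QC` header names and the second sample row are assumptions made for illustration only.

```python
import csv
import io

# Hypothetical two-row report. The first row reuses values from the README example;
# the second row and the "Serotype"/"QC" header names are invented for illustration.
SAMPLE_TSV = (
    "Name\tSerotype\tQC\n"
    "EC20151709\tO157:H43\tPASS (REPORTABLE)\n"
    "sampleB\tO1:H7\tWARNING (H NON-REPORT)\n"
)


def reportable_samples(handle, qc_column="QC"):
    """Return the rows whose QC flag starts with 'PASS'."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [row for row in reader if row[qc_column].startswith("PASS")]


passed = reportable_samples(io.StringIO(SAMPLE_TSV))
for row in passed:
    print(row["Name"], row["Serotype"])
```

To apply the same function to a real run, an open file handle for `output.tsv` would be passed instead of the `StringIO` object, after checking the actual header names in the file.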
diff --git a/ectyper/commandLineOptions.py b/ectyper/commandLineOptions.py
@@ -63,7 +63,7 @@ def checkdbversion():
 
     parser.add_argument(
         "--maxdirdepth",
-        help="Maximum number of directories to descend when searching an input directory of files",
+        help="Maximum number of directories to descend when searching an input directory of files [default %(default)s levels]. Only works on path inputs not containing '*' wildcard",
         default=0, 
         type=int,   
         required=False
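
The help string above relies on the `%(default)s` placeholder that `argparse` expands when rendering help. The following is a minimal standalone sketch of that behaviour; the parser name `ectyper-sketch` and the shortened help text are assumptions, and only the `--maxdirdepth` option mirrors the diff.

```python
import argparse

# Standalone parser for illustration; not the real ectyper command-line parser.
parser = argparse.ArgumentParser(prog="ectyper-sketch")
parser.add_argument(
    "--maxdirdepth",
    help="Maximum number of directories to descend [default %(default)s levels]",
    default=0,
    type=int,
    required=False,
)

# Collapse whitespace because argparse wraps long help text across lines.
help_text = " ".join(parser.format_help().split())
print(help_text)

# The option is parsed as an integer; omitting it falls back to the default.
args = parser.parse_args(["--maxdirdepth", "2"])
print(args.maxdirdepth)
```

Keeping the default in one place (the `default=` argument) and interpolating it into the help text avoids the two drifting apart if the default is changed later.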

diff --git a/ectyper/definitions.py b/ectyper/definitions.py
@@ -32,7 +32,7 @@
                          '15':['O89','O101','O162'],
                          '16':['O169','O183']
                          }
-MASH_URLS = ["https://drive.usercontent.google.com/download?id=1p0XVb7PuiApYk5ndjLksIc3RcDmUwi6L&export=download&confirm=f"]
+MASH_URLS = ["https://zenodo.org/records/10211569/files/EnteroRef_GTDBSketch_20231003_V2.msh?download=1"]
 
 HIGH_SIMILARITY_THRESHOLD_O = 0.00771 # alleles that are 99.23% apart will be reported as mixed call ~ 8 nt difference on average
 MIN_O_IDENTITY_LS = 95 #low similarity group O antigen min identity threshold to pre-filter BLAST output  (identical to global threshold)

diff --git a/ectyper/ectyper.py b/ectyper/ectyper.py
@@ -66,7 +66,7 @@ def run_program():
     args = commandLineOptions.parse_command_line()
 
 
-    output_directory = create_output_directory(args.output)
+    output_directory = create_output_directory(args)
 
     # Create a file handler for log messages in the output directory for the root thread
     fh = logging.FileHandler(os.path.join(output_directory, 'ectyper.log'), 'w', 'utf-8')
@@ -121,6 +121,7 @@ def run_program():
     os.makedirs(temp_dir, exist_ok=True)
 
     LOG.info("Gathering genome files list ...")
+
     input_files_list = genomeFunctions.get_files_as_list(args.input, args.maxdirdepth)
     raw_genome_files = decompress_gunzip_files(input_files_list, temp_dir)
 
@@ -256,9 +257,9 @@ def getOantigenHighSimilarGroup(final_predictions, sample):
 
 
 
-def create_output_directory(output_dir):
+def create_output_directory(args):
     """
-    Create the output directory for ectyper
+    Create the output directory for ectyper if it does not already exist
 
     :param args: Parsed command-line arguments; `args.output` holds the user-specified output directory, if any
     :return: The output directory
@@ -267,26 +268,27 @@ def create_output_directory(output_dir):
 
 
 
-    if output_dir is None:
+    if args.output is None:
         date_dir = ''.join([
             'ectyper_',
             str(datetime.datetime.now().date()),
             '_',
             str(datetime.datetime.now().time()).replace(':', '.')
         ])
         out_dir = os.path.join(definitions.WORKPLACE_DIR, date_dir)
+        args.output = out_dir
     else:
-        if os.path.isabs(output_dir):
-            out_dir = output_dir
+        if os.path.isabs(args.output):
+            out_dir = args.output 
         else:
-            out_dir = os.path.join(definitions.WORKPLACE_DIR, output_dir)
+            out_dir = os.path.join(definitions.WORKPLACE_DIR, args.output)
 
     if not os.path.exists(out_dir):
         os.makedirs(out_dir)
-
+    
     # clean previous ECTyper output files if the directory was used in previous runs 
     for file in definitions.OUTPUT_FILES_LIST:
-        path2file = os.path.join(output_dir,file)
+        path2file = os.path.join(out_dir,file)
         if os.path.exists(path2file):
             LOG.info(f"Cleaning ECTyper previous files. Removing previously generated {path2file} ...")
             os.remove(path2file) 
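
The hunk above changes the cleanup loop to join file names onto the resolved `out_dir` instead of the raw option, which is `None` when no output directory is given. A simplified sketch of the same flow is shown below. The function name `resolve_output_directory`, the `workplace_dir` parameter and the `OUTPUT_FILES` list are stand-ins assumed for illustration (the real code uses `definitions.WORKPLACE_DIR` and `definitions.OUTPUT_FILES_LIST`, whose contents are not shown in this diff).

```python
import datetime
import os
import tempfile
from types import SimpleNamespace

# Stand-in for definitions.OUTPUT_FILES_LIST; the exact contents are assumed.
OUTPUT_FILES = ["output.tsv", "blastn_output_alleles.txt", "ectyper.log"]


def resolve_output_directory(args, workplace_dir):
    """Create (or reuse) the output directory and remove stale report files."""
    if args.output is None:
        now = datetime.datetime.now()
        stamp = f"ectyper_{now.date()}_{str(now.time()).replace(':', '.')}"
        out_dir = os.path.join(workplace_dir, stamp)
        args.output = out_dir  # record the generated path for later steps
    elif os.path.isabs(args.output):
        out_dir = args.output
    else:
        out_dir = os.path.join(workplace_dir, args.output)

    os.makedirs(out_dir, exist_ok=True)

    # Join on the resolved out_dir, never on the raw (possibly None) option.
    for name in OUTPUT_FILES:
        stale = os.path.join(out_dir, name)
        if os.path.exists(stale):
            os.remove(stale)
    return out_dir


with tempfile.TemporaryDirectory() as workplace:
    # Simulate a directory left over from a previous run.
    previous = os.path.join(workplace, "run1")
    os.makedirs(previous)
    open(os.path.join(previous, "output.tsv"), "w").close()

    reused = resolve_output_directory(SimpleNamespace(output="run1"), workplace)
    stale_removed = not os.path.exists(os.path.join(reused, "output.tsv"))

    # No output option given: a timestamped directory is generated and recorded.
    default_args = SimpleNamespace(output=None)
    generated = resolve_output_directory(default_args, workplace)
    generated_exists = os.path.isdir(generated)
```

Taking the timestamp once and reusing it for both the date and time parts is a small deviation from the diff, chosen so the two parts cannot straddle midnight.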

diff --git a/ectyper/genomeFunctions.py b/ectyper/genomeFunctions.py
@@ -5,7 +5,7 @@
 '''
 
 import logging
-import os
+import os, glob
 import tempfile
 from tarfile import is_tarfile
 from Bio import SeqIO
@@ -28,7 +28,7 @@ def get_files_as_list(files_or_directories, max_depth_level):
     directory specified (where each file name is its absolute path).
 
     Args:
-        file_or_directory (str): file or directory name given on commandline
+        file_or_directory (str): file or directory name given on command line
 
     Returns:
         files_list (list(str)): List of all the files found.
@@ -38,11 +38,9 @@ def get_files_as_list(files_or_directories, max_depth_level):
 
 
     init_min_dir_level = min([os.path.abspath(p).count(os.sep)+1 if os.path.isdir(p) else os.path.abspath(p).count(os.sep) for p in files_or_directories])
-
     for file_or_directory in sorted([os.path.abspath(p) for p in files_or_directories if len(p) != 0]):
-
         dir_level_current = get_relative_directory_level(file_or_directory, init_min_dir_level)
-
+ 
         if dir_level_current > max_depth_level:
             LOG.info(f"Directory level exceeded ({dir_level_current} > {max_depth_level}), skipping {file_or_directory} ...")
             continue
@@ -53,9 +51,9 @@ def get_files_as_list(files_or_directories, max_depth_level):
             # Create a list containing the file names
             for root, dirs, files in os.walk(os.path.abspath(file_or_directory)):
                 dir_level = get_relative_directory_level(root, init_min_dir_level)
-                LOG.info(f"In '{root}' level {dir_level} identified {len(dirs)} sub-directory(ies) and {len(files)} file(s) ...")
                 if dir_level > max_depth_level:
                     continue
+                LOG.info(f"In '{root}' level {dir_level} identified {len(dirs)} sub-directory(ies) and {len(files)} file(s) ...")
                 for filename in files:
                     files_list.append(os.path.join(root, filename))
         # check if input is concatenated file locations separated by , (comma)
@@ -73,7 +71,6 @@ def get_files_as_list(files_or_directories, max_depth_level):
             LOG.info(f"Total of {len(files_list)} files identified with a valid path and {missing_inputs_count} are missing ...")           
         # a path to a file is specified
         else:
-            LOG.info("Checking existence of file " + file_or_directory)
             input_abs_file_path = os.path.abspath(file_or_directory)
             if os.path.exists(input_abs_file_path):
                 files_list.append(os.path.abspath(input_abs_file_path))
@@ -82,9 +79,9 @@ def get_files_as_list(files_or_directories, max_depth_level):
 
 
     if not files_list:
-        LOG.critical("No files were found for the ectyper run")
+        LOG.critical("No files were found for the ectyper to run on")
         raise FileNotFoundError("No files were found to run on")
-    LOG.info(f"Overall identified {len(files_list)} file(s) to process ...");
+    LOG.info(f"Overall identified {len(files_list)} file(s) ({','.join([os.path.basename(f) for f in files_list])}) to process ...")
     sorted_files = sorted(list(set(files_list)))
     LOG.debug(sorted_files)
     return sorted_files
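
The `get_files_as_list` hunks above limit how deep a directory input is searched via `max_depth_level`. The sketch below illustrates the general idea of a depth-limited `os.walk`. It is not the function from the diff: the name `walk_with_max_depth` is hypothetical, and it prunes `dirs` in place rather than skipping deeper levels with `continue` as the original does.

```python
import os
import tempfile


def walk_with_max_depth(top, max_depth=0):
    """Return files under `top`, descending at most `max_depth` levels."""
    top = os.path.abspath(top)
    base_level = top.count(os.sep)
    found = []
    for root, dirs, files in os.walk(top):
        level = root.count(os.sep) - base_level
        if level >= max_depth:
            dirs[:] = []  # prune in place so os.walk does not descend further
        found.extend(os.path.join(root, name) for name in files)
    return sorted(found)


with tempfile.TemporaryDirectory() as top:
    # Build top/a.fasta, top/sub/b.fasta and top/sub/deeper/c.fasta.
    os.makedirs(os.path.join(top, "sub", "deeper"))
    for rel in (
        "a.fasta",
        os.path.join("sub", "b.fasta"),
        os.path.join("sub", "deeper", "c.fasta"),
    ):
        open(os.path.join(top, rel), "w").close()

    names_by_depth = {
        depth: [os.path.basename(p) for p in walk_with_max_depth(top, depth)]
        for depth in (0, 1, 2)
    }

print(names_by_depth)
```

Pruning avoids visiting directories that would be skipped anyway, which can matter for large input trees. A path containing `*` is typically expanded by the shell into individual file paths before the program sees it, which would explain why the depth option has no effect on such inputs.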
@@ -402,7 +399,7 @@ def create_combined_alleles_and_markers_file(alleles_fasta, temp_dir, pathotype)
     """
 
     combined_file = os.path.join(temp_dir, 'combined_ident_serotype.fasta')
-    LOG.info("Creating combined serotype and identification fasta file")
+    LOG.info(f"Creating combined reference database fasta file at {combined_file} ...")
 
     with open(combined_file, 'w') as ofh:
         #with open(definitions.ECOLI_MARKERS, 'r') as mfh: