New SNP methods

- Remove kSNP3 as it was creating too much conflicts for the conda environment creation. Please use v0.1 if you want to be able to produce a pan-SNP tree. Consequently, only core-SNP methods remain. - Addition of snippy v4.6.0 as alternative method to find SNPs. Now default method. Use FastTree to make tree. - Addition of phame v1.0.3 as alternative method to find SNPs. Seems to have lower resolution than snippy. Use FastTree to make tree. - README.md updated accordingly.
duceppemo · Mar 21, 2023 · 3a46085 · 3a46085
1 parent e0ea693
commit 3a46085
Show file tree

Hide file tree

Showing 4 changed files with 181 additions and 223 deletions.
diff --git a/README.md b/README.md
@@ -8,7 +8,7 @@ BACoN follows these steps:
 1. Extract Nanopore reads matching reference with `minimap2` or `bbduk` (ideally using the same reference used for targeted sequencing).
 2. Remove Nanopore adapters with `porechop` and filter small (min 500bp) and lower quality (discard bottom 5% or keep top 100X) reads with `filtlong`.
 3. Assemble reads with `flye`, `shasta` or `rebaler`. Note that the "ont-hq" flag is used for Flye, assuming that Guppy5+ or Q20+ chemistry (<5% error) was used for basecalling. Flye and Shasta are *de novo* assemblers while rebaler is a reference-guided assembler. Flye seems to work better with circular genomes. Shasta works well too with circular references (like full organelle genomes), but will also work with linear references (like ribosomal DNA). Rebaler is more experimental (not fully validated for this type of application, but I thought it would be good to have a "different" type of assembler to try in case the *de novo* methods don't work well).
-4. Compare samples (make tree) with `parsnp` (core) or `kSNP3` (pan and core). Boostrap trees are also created using `RAxML` and `FastTree` from SNPs from parsnp.
+4. Extract core SNPs with `PhaME`, `parsnp` or `Snippy`. Boostrap trees (n=100) are also created using `RAxML` or `FastTree`.
 
 ## Installation
 Requires conda. See [here](https://docs.conda.io/en/latest/miniconda.html#latest-miniconda-installer-links) for instructions if needed.
@@ -26,21 +26,27 @@ git clone https://github.com/duceppemo/BACoN
 cd BACoN
 conda create -n BACoN -y -c bioconda -c etetoolkit -c hcc --file requirements.txt
 
+# Alternative environment creation method 
+conda create -n BACoN -y -c bioconda -c etetoolkit -c hcc -c conda-forge porechop=0.2.4 filtlong=0.2.1 minimap2=2.24 \
+    samtools=1.16 bcftools=1.16 flye=2.9.1 shasta=0.11.1 bbmap=39.01 git ete3=3.1.2 pysam=0.20.0 bandage=0.8.1 \
+    parsnp=1.7.4 harvesttools=1.2 raxml=8.2.12 fasttree=2.1.11 psutil=5.9.4 pandas=1.5.3 rebaler=0.2.0 \
+    snippy=4.6.0 phame=1.0.3 mummer=3.23 bowtie2=2.5.1 bwa=0.7.17 perl-bioperl=1.7.8 perl=5.32.1
+
 # Activate enviroment:
 conda activate BACoN
 
-# Test program. No errors should be displayed:
+# Test program. No errors should promp:
 python bacon.py -h
 ```
 
 ## Usage
 ```commandline
 usage: python bacon.py [-h] -r /path/to/reference_organelle/genome.fasta -i /path/to/input/folder/ or /path/to/my_fastq.gz -o /path/to/output/folder/ [-b {minimap2,bbduk}] [-a {flye,shasta,rebaler}] [--min-size 3000] [-t 16]
-                [-p 2] [-m 57] [-s 150000] [-k 99] [--keep-bam] [-snp {parsnp,ksnp3}]
+                       [-p 2] [-m 57] [-s 150000] [-k 99] [--keep-bam] [-snp {parsnp,snippy,phame}] [-v]
 
 Extract, assemble and compare Nanopore reads matching a reference sequence.
 
-optional arguments:
+options:
   -h, --help            show this help message and exit
   -r /path/to/reference_organelle/genome.fasta, --reference /path/to/reference_organelle/genome.fasta
                         Reference genome for read mapping. Mandatory.
@@ -61,8 +67,9 @@ optional arguments:
   -k 99, --kmer-size 99
                         Kmer size for baiting. Only used if "--baiting-method" is "bbduk". Optional.
   --keep-bam            Do not delete BAM files. Only used if "--baiting-method" is "minimap2". Optional.
-  -snp {parsnp,ksnp3}, --snp-method {parsnp,ksnp3}
-                        SNP calling method. Default "parsnp". Optional.
+  -snp {parsnp,snippy,phame}, --snp-method {parsnp,snippy,phame}
+                        SNP calling method. Default "phame". Optional.
+  -v, --version         show program's version number and exit
 ```
 
 ## Example
@@ -149,3 +156,7 @@ bacon
 - `4_assembled` contains the assembly sub-folders for each sample. All assemblies are located in the `all_assemblies` folder. Assembly quality can be assessed quickly from the `assembly_graphs` sub-folder.
 - `5_compared` contains the SNPs and tree files from `parsnp`, `RAxML` and `FastTree`. RAxML and FastTree are provided to add bootstrap values to the tree.
 - The `done` files are checkpoint files.
+
+Note that the output files and their structure within each folder will vary depending on method used.
+## Tips
+If you want to compare different SNP detection methods, you can rename the `5_compared` folder, remove the `done_comparing` file, change the method for the `-snp` option and rerun the command. Only the last step will be executed.
diff --git a/bacon.py b/bacon.py
@@ -9,29 +9,16 @@
 
 
 __author__ = 'duceppemo'
-__version__ = '0.1'
-
-
-"""requirements
-porechop=0.2.4
-filtlong=0.2.1
-minimap2=2.24
-samtools=1.15.1
-flye=2.9.1
-shasta=0.8.0
-bbmap=38.98
-git=2.37.2
-ete3=3.1.2
-pysam=0.19.1
-bandage=0.8.1
-parsnp=1.7.4
-harvesttools=1.2
-raxml=8.2.12
-fasttree=2.1.11
-psutil=5.9.1
-pandas=1.4.3
-rebaler=0.2.0
-ksnp=3.1
+__version__ = '0.2'
+
+
+#  Requirements
+"""
+mamba create -n BACoN -y -c bioconda -c etetoolkit -c hcc -c conda-forge porechop=0.2.4 filtlong=0.2.1 minimap2=2.24 \
+    samtools=1.16 bcftools=1.16 flye=2.9.1 shasta=0.11.1 bbmap=39.01 git ete3=3.1.2 pysam=0.20.0 bandage=0.8.1 \
+    parsnp=1.7.4 harvesttools=1.2 raxml=8.2.12 fasttree=2.1.11 psutil=5.9.4 pandas=1.5.3 rebaler=0.2.0 \
+    snippy=4.6.0 phame=1.0.3 mummer=3.23 bowtie2=2.5.1 bwa=0.7.17 perl-bioperl=1.7.8 perl=5.32.1
+    # ksnp=3.1
 """
 
 
@@ -230,24 +217,47 @@ def run(self):
                 Methods.convert_xmfa_to_fastq(compared_folder + 'parsnp.xmfa', compared_folder + 'parsnp.fasta')
 
                 # Create ML tree with bootstraps
-                print('Making RAxML tree with 1,000 bootstraps...')
+                print('\tRemaking tree with RAxML using 100 bootstraps...')
                 Methods.make_tree_raxml(compared_folder + 'parsnp.fasta', compared_folder, self.cpu)
                 # Plot tree to PDF, .SVG or .PNG
                 # This command won't work for this tree...
                 # Methods.plot_newick_tree(compared_folder + 'RAxML_bipartitionsBranchLabels.raxml.tree',
                 #                          compared_folder + 'raxml.pdf')
 
                 # Create FastTree
-                print('Making FastTree tree with 1,000 peudo-bootstraps...')
+                print('\tRemaking tree with FastTree using 100 peudo-bootstraps...')
                 Methods.make_tree_fasttree(compared_folder + 'parsnp.fasta', compared_folder)
                 Methods.plot_newick_tree(compared_folder + 'fasttree.tree',
                                          compared_folder + 'fasttree.pdf')
-            else:  # elif self.snp_method == 'ksnp3':
-                print('Making pan-SNP ans core-SNP trees with kSNP3...')
+            # elif self.snp_method == 'ksnp3':
+            #     print('Making pan-SNP and core-SNP trees with kSNP3...')
+            #
+            #     # Create pan-SNP tree
+            #     Methods.run_ksnp3(assembly_list, compared_folder, self.reference, self.cpu)
+            elif self.snp_method == 'snippy':
+                print('Extracting core SNPs with Snippy...')
+
+                # ref, assembly_list, output_folder, cpu, mem, parallel
+                Methods.run_snippy_parallel(self.reference, assembly_list, compared_folder,
+                                            self.cpu, self.mem, self.parallel)
+
+                # Find core SNP
+                Methods.snippy_core(compared_folder, self.reference)
+
+                # Make tree
+                print('\tMaking tree with FastTree using 100 peudo-bootstraps...')
+                Methods.make_tree_fasttree(compared_folder + 'core.full.aln', compared_folder)
+                Methods.plot_newick_tree(compared_folder + 'fasttree.tree',
+                                         compared_folder + 'fasttree.pdf')
+
+
+            else:  # elif self.snp_method == 'phame':
+                print('Making core-SNP tree with PhaME (FastTree with 100 pseudo-bootstraps...')
 
-                # Create pan-SNP tree
-                Methods.run_ksnp3(assembly_list, compared_folder, self.reference, self.cpu)
+                # Run PhaME (Fastree and 100 pseudo-bootstraps)
+                Methods.run_phame(self.reference, assembled_folder + 'all_assemblies/', compared_folder, self.cpu)
 
+            # Create "done" flag
             Methods.flag_done(done_comparing)
         else:
             print('Skipping sample comparison. Already done.')
@@ -308,10 +318,11 @@ def run(self):
                         required=False, action='store_true',
                         help='Do not delete BAM files. Only used if "--baiting-method" is "minimap2". Optional.')
     parser.add_argument('-snp', '--snp-method',
-                        required=False, default='parsnp',
-                        choices=['parsnp', 'ksnp3'],
+                        required=False, default='snippy',
+                        # choices=['parsnp', 'snippy', 'ksnp3', 'phame'],
+                        choices=['parsnp', 'snippy', 'phame'],
                         type=str,
-                        help='SNP calling method. Default "parsnp". Optional.')
+                        help='SNP calling method. Default "phame". Optional.')
     parser.add_argument('-v', '--version', action='version',
                         version=f'{os.path.basename(__file__)}: version {__version__}')
 

diff --git a/bacon_methods.py b/bacon_methods.py
@@ -28,9 +28,7 @@ class Methods(object):
     def check_input(my_input):
         error_message_list = ['Please provide a folder as input.',
                               'The input folder provided does not exist.',
-                              'Make sure files in input folder end with {}'.format(Methods.accepted_extensions),
-
-                              ]
+                              'Make sure files in input folder end with {}'.format(Methods.accepted_extensions)]
 
         if not os.path.exists(my_input):
             raise Exception('Please select an existing file or folder as input.')
@@ -45,6 +43,17 @@ def check_input(my_input):
         if not all([f.endswith(tuple(Methods.accepted_extensions)) for f in file_list]):
             raise Exception('Make sure files in input folder end with {}'.format(Methods.accepted_extensions))
 
+    @staticmethod
+    def check_ref(ref):
+        if not os.path.isfile(ref):
+            raise Exception('The reference file provided does not exist')
+
+        with gzip.open(ref, 'rt') if ref.endswith('.gz') else open(ref, 'r') as f:
+            first_header = f.readline()
+            first_character = split(first_header)[0]
+            if first_character != '>':
+                raise Exception('The reference file provided does not appear to be a valid fasta file.')
+
     @staticmethod
     def check_cpus(requested_cpu, n_proc):
         total_cpu = cpu_count()
@@ -675,7 +684,7 @@ def make_tree_raxml(aligned_fasta, output_folder, cpu):
                '-w', output_folder,
                '-n', 'raxml.tree',
                '-m', 'GTRCAT',
-               '-N', str(1000),
+               '-N', str(100),
                '-d', '-f', 'a', '-T', str(cpu),
                '-x', str(1234), '-p', str(123)]
         subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
@@ -685,6 +694,7 @@ def make_tree_fasttree(aligned_fasta, output_folder):
         cmd = ['FastTree',
                '-nt',
                '-gtr',
+               '-boot', str(100),
                aligned_fasta]
         p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
         with open(output_folder + 'fasttree.tree', 'wb') as f:
@@ -700,3 +710,85 @@ def plot_newick_tree(tree_file, output_file):
             t.render(output_file, w=183, units='mm', tree_style=ts)
         except NewickError:
             print('Could not convert tree to picture file.')
+
+    @staticmethod
+    def run_phame(ref, fasta_folder, output_folder, cpu):
+        Methods.make_folder(output_folder)
+
+        cfg_dict = {
+            'refdir': fasta_folder,
+            'workdir': output_folder,  # # directory where reference (Complete) files are located
+            'reference': '1',  # 0:pick a random ref from refdir; 1:use given ref; 2: use ANI based ref
+            'reffile': os.path.basename(ref),  # reference filename when option 1 is chosen
+            'project': 'phame',  # main alignment file name
+            'cdsSNPS': '0',  # 0:no cds SNPS; 1:cds SNPs, divides SNPs into coding and non-coding sequences, gff file is required
+            'buildSNPdb': '0',  # 0: only align to reference 1: build SNP database of all complete genomes from refdir
+            'FirstTime': '1',  # 1:yes; 2:update existing SNP alignment, only works when buildSNPdb is used first time to build DB
+            'data': '3',  # 0:only complete(F); 1:only contig(C); 2:only reads(R); 3:combination F+C; 4:combination F+R; 5:combination C+R; 6:combination F+C+R; 7:realignment
+            'reads': '2',  # 1: single reads; 2: paired reads; 3: both types present
+            'tree': '1',  # 0:no tree; 1:use FastTree; 2:use RAxML; 3:use both
+            'bootstrap': '1',  # 0:no; 1:yes;  # Run bootstrapping
+            'N': '100',  # Number of bootstraps to run
+            'PosSelect': '0',  # 0:No; 1:use PAML; 2:use HyPhy; 3:use both # these analysis need gff file to parse genomes to genes
+            'code': '0',  # 0:Bacteria; 1:Virus; 2: Eukarya # Bacteria and Virus sets ploidy to haploid
+            'clean': '1',  # 0:no clean; 1:clean # remove intermediate and temp files after analysis is complete
+            'threads': str(cpu),  # Number of threads to use
+            'cutoff': '0.1'  # Linear alignment (LA) coverage against reference - ignores SNPs from organism that have lower cutoff.
+        }
+
+        # Create config file
+        cfg = output_folder + 'phame.cfg'
+        with open(cfg, 'w') as f:
+            for k, v in cfg_dict.items():
+                f.write('{}={}\n'.format(k, v))
+
+        # Add reference to assemblies
+        os.symlink(ref, fasta_folder + '/' + os.path.basename(ref))
+
+        # Run Phame
+        cmd = ['phame', cfg]
+        subprocess.run(cmd)
+
+    @staticmethod
+    def run_snippy(ref, assembly, output_folder, cpu, mem):
+        # # Prep input tab to run "snippy-multi"
+        # input_tab = output_folder + 'input.tsv'
+        # with open(input_tab, 'w') as f:
+        #     for assembly in assembly_list:
+        #         sample = os.path.basename(assembly)
+        #         f.write('{}\t{}\n'.format(sample, assembly))
+
+        sample = '.'.join(os.path.basename(assembly).split('.')[:-1])
+
+        # Create script
+        cmd = ['snippy',
+               '--reference', ref,
+               '--cpus', str(cpu),
+               '--outdir', output_folder + sample,
+               '--prefix', sample,
+               '--ram', str(mem),
+               '--force',
+               '--cleanup',
+               '--report',
+               '--ctgs', assembly]
+        os.chdir(output_folder)
+        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
+
+    @staticmethod
+    def run_snippy_parallel(ref, assembly_list, output_folder, cpu, mem, parallel):
+        Methods.make_folder(output_folder)
+
+        with futures.ThreadPoolExecutor(max_workers=int(parallel)) as executor:
+            args = ((ref, assembly, output_folder, int(cpu / parallel), int(mem / parallel))
+                    for assembly in assembly_list)
+            for results in executor.map(lambda x: Methods.run_snippy(*x), args):
+                pass
+
+    @staticmethod
+    def snippy_core(output_folder, ref):
+        snippy_folder_list = next(os.walk(output_folder))[1]
+
+        cmd = ['snippy-core',
+               '--ref', ref,
+               '--prefix={}'.format(output_folder + 'core')] + snippy_folder_list
+        subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)