Skip to content

Commit

Permalink
New SNP methods
Browse files Browse the repository at this point in the history
- Remove kSNP3 as it was creating too much conflicts for the conda environment creation. Please use v0.1 if you want to be able to produce a pan-SNP tree. Consequently, only core-SNP methods remain.
- Addition of snippy v4.6.0 as alternative method to find SNPs. Now default method. Use FastTree to make tree.
- Addition of phame v1.0.3 as alternative method to find SNPs. Seems to have lower resolution than snippy. Use FastTree to make tree.
- README.md updated accordingly.
  • Loading branch information
duceppemo committed Mar 21, 2023
1 parent e0ea693 commit 3a46085
Show file tree
Hide file tree
Showing 4 changed files with 181 additions and 223 deletions.
23 changes: 17 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@ BACoN follows these steps:
1. Extract Nanopore reads matching reference with `minimap2` or `bbduk` (ideally using the same reference used for targeted sequencing).
2. Remove Nanopore adapters with `porechop` and filter small (min 500bp) and lower quality (discard bottom 5% or keep top 100X) reads with `filtlong`.
3. Assemble reads with `flye`, `shasta` or `rebaler`. Note that the "ont-hq" flag is used for Flye, assuming that Guppy5+ or Q20+ chemistry (<5% error) was used for basecalling. Flye and Shasta are *de novo* assemblers while rebaler is a reference-guided assembler. Flye seems to work better with circular genomes. Shasta works well too with circular references (like full organelle genomes), but will also work with linear references (like ribosomal DNA). Rebaler is more experimental (not fully validated for this type of application, but I thought it would be good to have a "different" type of assembler to try in case the *de novo* methods don't work well).
4. Compare samples (make tree) with `parsnp` (core) or `kSNP3` (pan and core). Boostrap trees are also created using `RAxML` and `FastTree` from SNPs from parsnp.
4. Extract core SNPs with `PhaME`, `parsnp` or `Snippy`. Boostrap trees (n=100) are also created using `RAxML` or `FastTree`.

## Installation
Requires conda. See [here](https://docs.conda.io/en/latest/miniconda.html#latest-miniconda-installer-links) for instructions if needed.
Expand All @@ -26,21 +26,27 @@ git clone https://github.com/duceppemo/BACoN
cd BACoN
conda create -n BACoN -y -c bioconda -c etetoolkit -c hcc --file requirements.txt
# Alternative environment creation method
conda create -n BACoN -y -c bioconda -c etetoolkit -c hcc -c conda-forge porechop=0.2.4 filtlong=0.2.1 minimap2=2.24 \
samtools=1.16 bcftools=1.16 flye=2.9.1 shasta=0.11.1 bbmap=39.01 git ete3=3.1.2 pysam=0.20.0 bandage=0.8.1 \
parsnp=1.7.4 harvesttools=1.2 raxml=8.2.12 fasttree=2.1.11 psutil=5.9.4 pandas=1.5.3 rebaler=0.2.0 \
snippy=4.6.0 phame=1.0.3 mummer=3.23 bowtie2=2.5.1 bwa=0.7.17 perl-bioperl=1.7.8 perl=5.32.1
# Activate enviroment:
conda activate BACoN
# Test program. No errors should be displayed:
# Test program. No errors should promp:
python bacon.py -h
```

## Usage
```commandline
usage: python bacon.py [-h] -r /path/to/reference_organelle/genome.fasta -i /path/to/input/folder/ or /path/to/my_fastq.gz -o /path/to/output/folder/ [-b {minimap2,bbduk}] [-a {flye,shasta,rebaler}] [--min-size 3000] [-t 16]
[-p 2] [-m 57] [-s 150000] [-k 99] [--keep-bam] [-snp {parsnp,ksnp3}]
[-p 2] [-m 57] [-s 150000] [-k 99] [--keep-bam] [-snp {parsnp,snippy,phame}] [-v]
Extract, assemble and compare Nanopore reads matching a reference sequence.
optional arguments:
options:
-h, --help show this help message and exit
-r /path/to/reference_organelle/genome.fasta, --reference /path/to/reference_organelle/genome.fasta
Reference genome for read mapping. Mandatory.
Expand All @@ -61,8 +67,9 @@ optional arguments:
-k 99, --kmer-size 99
Kmer size for baiting. Only used if "--baiting-method" is "bbduk". Optional.
--keep-bam Do not delete BAM files. Only used if "--baiting-method" is "minimap2". Optional.
-snp {parsnp,ksnp3}, --snp-method {parsnp,ksnp3}
SNP calling method. Default "parsnp". Optional.
-snp {parsnp,snippy,phame}, --snp-method {parsnp,snippy,phame}
SNP calling method. Default "phame". Optional.
-v, --version show program's version number and exit
```

## Example
Expand Down Expand Up @@ -149,3 +156,7 @@ bacon
- `4_assembled` contains the assembly sub-folders for each sample. All assemblies are located in the `all_assemblies` folder. Assembly quality can be assessed quickly from the `assembly_graphs` sub-folder.
- `5_compared` contains the SNPs and tree files from `parsnp`, `RAxML` and `FastTree`. RAxML and FastTree are provided to add bootstrap values to the tree.
- The `done` files are checkpoint files.

Note that the output files and their structure within each folder will vary depending on method used.
## Tips
If you want to compare different SNP detection methods, you can rename the `5_compared` folder, remove the `done_comparing` file, change the method for the `-snp` option and rerun the command. Only the last step will be executed.
75 changes: 43 additions & 32 deletions bacon.py
Original file line number Diff line number Diff line change
Expand Up @@ -9,29 +9,16 @@


__author__ = 'duceppemo'
__version__ = '0.1'


"""requirements
porechop=0.2.4
filtlong=0.2.1
minimap2=2.24
samtools=1.15.1
flye=2.9.1
shasta=0.8.0
bbmap=38.98
git=2.37.2
ete3=3.1.2
pysam=0.19.1
bandage=0.8.1
parsnp=1.7.4
harvesttools=1.2
raxml=8.2.12
fasttree=2.1.11
psutil=5.9.1
pandas=1.4.3
rebaler=0.2.0
ksnp=3.1
__version__ = '0.2'


# Requirements
"""
mamba create -n BACoN -y -c bioconda -c etetoolkit -c hcc -c conda-forge porechop=0.2.4 filtlong=0.2.1 minimap2=2.24 \
samtools=1.16 bcftools=1.16 flye=2.9.1 shasta=0.11.1 bbmap=39.01 git ete3=3.1.2 pysam=0.20.0 bandage=0.8.1 \
parsnp=1.7.4 harvesttools=1.2 raxml=8.2.12 fasttree=2.1.11 psutil=5.9.4 pandas=1.5.3 rebaler=0.2.0 \
snippy=4.6.0 phame=1.0.3 mummer=3.23 bowtie2=2.5.1 bwa=0.7.17 perl-bioperl=1.7.8 perl=5.32.1
# ksnp=3.1
"""


Expand Down Expand Up @@ -230,24 +217,47 @@ def run(self):
Methods.convert_xmfa_to_fastq(compared_folder + 'parsnp.xmfa', compared_folder + 'parsnp.fasta')

# Create ML tree with bootstraps
print('Making RAxML tree with 1,000 bootstraps...')
print('\tRemaking tree with RAxML using 100 bootstraps...')
Methods.make_tree_raxml(compared_folder + 'parsnp.fasta', compared_folder, self.cpu)
# Plot tree to PDF, .SVG or .PNG
# This command won't work for this tree...
# Methods.plot_newick_tree(compared_folder + 'RAxML_bipartitionsBranchLabels.raxml.tree',
# compared_folder + 'raxml.pdf')

# Create FastTree
print('Making FastTree tree with 1,000 peudo-bootstraps...')
print('\tRemaking tree with FastTree using 100 peudo-bootstraps...')
Methods.make_tree_fasttree(compared_folder + 'parsnp.fasta', compared_folder)
Methods.plot_newick_tree(compared_folder + 'fasttree.tree',
compared_folder + 'fasttree.pdf')
else: # elif self.snp_method == 'ksnp3':
print('Making pan-SNP ans core-SNP trees with kSNP3...')
# elif self.snp_method == 'ksnp3':
# print('Making pan-SNP and core-SNP trees with kSNP3...')
#
# # Create pan-SNP tree
# Methods.run_ksnp3(assembly_list, compared_folder, self.reference, self.cpu)
elif self.snp_method == 'snippy':
print('Extracting core SNPs with Snippy...')

# ref, assembly_list, output_folder, cpu, mem, parallel
Methods.run_snippy_parallel(self.reference, assembly_list, compared_folder,
self.cpu, self.mem, self.parallel)

# Find core SNP
Methods.snippy_core(compared_folder, self.reference)

# Make tree
print('\tMaking tree with FastTree using 100 peudo-bootstraps...')
Methods.make_tree_fasttree(compared_folder + 'core.full.aln', compared_folder)
Methods.plot_newick_tree(compared_folder + 'fasttree.tree',
compared_folder + 'fasttree.pdf')


else: # elif self.snp_method == 'phame':
print('Making core-SNP tree with PhaME (FastTree with 100 pseudo-bootstraps...')

# Create pan-SNP tree
Methods.run_ksnp3(assembly_list, compared_folder, self.reference, self.cpu)
# Run PhaME (Fastree and 100 pseudo-bootstraps)
Methods.run_phame(self.reference, assembled_folder + 'all_assemblies/', compared_folder, self.cpu)

# Create "done" flag
Methods.flag_done(done_comparing)
else:
print('Skipping sample comparison. Already done.')
Expand Down Expand Up @@ -308,10 +318,11 @@ def run(self):
required=False, action='store_true',
help='Do not delete BAM files. Only used if "--baiting-method" is "minimap2". Optional.')
parser.add_argument('-snp', '--snp-method',
required=False, default='parsnp',
choices=['parsnp', 'ksnp3'],
required=False, default='snippy',
# choices=['parsnp', 'snippy', 'ksnp3', 'phame'],
choices=['parsnp', 'snippy', 'phame'],
type=str,
help='SNP calling method. Default "parsnp". Optional.')
help='SNP calling method. Default "phame". Optional.')
parser.add_argument('-v', '--version', action='version',
version=f'{os.path.basename(__file__)}: version {__version__}')

Expand Down
100 changes: 96 additions & 4 deletions bacon_methods.py
Original file line number Diff line number Diff line change
Expand Up @@ -28,9 +28,7 @@ class Methods(object):
def check_input(my_input):
error_message_list = ['Please provide a folder as input.',
'The input folder provided does not exist.',
'Make sure files in input folder end with {}'.format(Methods.accepted_extensions),

]
'Make sure files in input folder end with {}'.format(Methods.accepted_extensions)]

if not os.path.exists(my_input):
raise Exception('Please select an existing file or folder as input.')
Expand All @@ -45,6 +43,17 @@ def check_input(my_input):
if not all([f.endswith(tuple(Methods.accepted_extensions)) for f in file_list]):
raise Exception('Make sure files in input folder end with {}'.format(Methods.accepted_extensions))

@staticmethod
def check_ref(ref):
if not os.path.isfile(ref):
raise Exception('The reference file provided does not exist')

with gzip.open(ref, 'rt') if ref.endswith('.gz') else open(ref, 'r') as f:
first_header = f.readline()
first_character = split(first_header)[0]
if first_character != '>':
raise Exception('The reference file provided does not appear to be a valid fasta file.')

@staticmethod
def check_cpus(requested_cpu, n_proc):
total_cpu = cpu_count()
Expand Down Expand Up @@ -675,7 +684,7 @@ def make_tree_raxml(aligned_fasta, output_folder, cpu):
'-w', output_folder,
'-n', 'raxml.tree',
'-m', 'GTRCAT',
'-N', str(1000),
'-N', str(100),
'-d', '-f', 'a', '-T', str(cpu),
'-x', str(1234), '-p', str(123)]
subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
Expand All @@ -685,6 +694,7 @@ def make_tree_fasttree(aligned_fasta, output_folder):
cmd = ['FastTree',
'-nt',
'-gtr',
'-boot', str(100),
aligned_fasta]
p = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.DEVNULL)
with open(output_folder + 'fasttree.tree', 'wb') as f:
Expand All @@ -700,3 +710,85 @@ def plot_newick_tree(tree_file, output_file):
t.render(output_file, w=183, units='mm', tree_style=ts)
except NewickError:
print('Could not convert tree to picture file.')

@staticmethod
def run_phame(ref, fasta_folder, output_folder, cpu):
Methods.make_folder(output_folder)

cfg_dict = {
'refdir': fasta_folder,
'workdir': output_folder, # # directory where reference (Complete) files are located
'reference': '1', # 0:pick a random ref from refdir; 1:use given ref; 2: use ANI based ref
'reffile': os.path.basename(ref), # reference filename when option 1 is chosen
'project': 'phame', # main alignment file name
'cdsSNPS': '0', # 0:no cds SNPS; 1:cds SNPs, divides SNPs into coding and non-coding sequences, gff file is required
'buildSNPdb': '0', # 0: only align to reference 1: build SNP database of all complete genomes from refdir
'FirstTime': '1', # 1:yes; 2:update existing SNP alignment, only works when buildSNPdb is used first time to build DB
'data': '3', # 0:only complete(F); 1:only contig(C); 2:only reads(R); 3:combination F+C; 4:combination F+R; 5:combination C+R; 6:combination F+C+R; 7:realignment
'reads': '2', # 1: single reads; 2: paired reads; 3: both types present
'tree': '1', # 0:no tree; 1:use FastTree; 2:use RAxML; 3:use both
'bootstrap': '1', # 0:no; 1:yes; # Run bootstrapping
'N': '100', # Number of bootstraps to run
'PosSelect': '0', # 0:No; 1:use PAML; 2:use HyPhy; 3:use both # these analysis need gff file to parse genomes to genes
'code': '0', # 0:Bacteria; 1:Virus; 2: Eukarya # Bacteria and Virus sets ploidy to haploid
'clean': '1', # 0:no clean; 1:clean # remove intermediate and temp files after analysis is complete
'threads': str(cpu), # Number of threads to use
'cutoff': '0.1' # Linear alignment (LA) coverage against reference - ignores SNPs from organism that have lower cutoff.
}

# Create config file
cfg = output_folder + 'phame.cfg'
with open(cfg, 'w') as f:
for k, v in cfg_dict.items():
f.write('{}={}\n'.format(k, v))

# Add reference to assemblies
os.symlink(ref, fasta_folder + '/' + os.path.basename(ref))

# Run Phame
cmd = ['phame', cfg]
subprocess.run(cmd)

@staticmethod
def run_snippy(ref, assembly, output_folder, cpu, mem):
# # Prep input tab to run "snippy-multi"
# input_tab = output_folder + 'input.tsv'
# with open(input_tab, 'w') as f:
# for assembly in assembly_list:
# sample = os.path.basename(assembly)
# f.write('{}\t{}\n'.format(sample, assembly))

sample = '.'.join(os.path.basename(assembly).split('.')[:-1])

# Create script
cmd = ['snippy',
'--reference', ref,
'--cpus', str(cpu),
'--outdir', output_folder + sample,
'--prefix', sample,
'--ram', str(mem),
'--force',
'--cleanup',
'--report',
'--ctgs', assembly]
os.chdir(output_folder)
subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)

@staticmethod
def run_snippy_parallel(ref, assembly_list, output_folder, cpu, mem, parallel):
Methods.make_folder(output_folder)

with futures.ThreadPoolExecutor(max_workers=int(parallel)) as executor:
args = ((ref, assembly, output_folder, int(cpu / parallel), int(mem / parallel))
for assembly in assembly_list)
for results in executor.map(lambda x: Methods.run_snippy(*x), args):
pass

@staticmethod
def snippy_core(output_folder, ref):
snippy_folder_list = next(os.walk(output_folder))[1]

cmd = ['snippy-core',
'--ref', ref,
'--prefix={}'.format(output_folder + 'core')] + snippy_folder_list
subprocess.run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.STDOUT)
Loading

0 comments on commit 3a46085

Please sign in to comment.